The NSF Internet Archive project aims to coalesce a community of social science researchers focused on tackling next-generation questions of Internet research using the Internet Archive (IA), the largest single record of the history of the World Wide Web, dating from 1995 to the present. In partnership with Northeastern University and the Internet Archive, we aim to: (1) develop the prototype tool, HistoryTracker, as a powerful Internet research tool to support these researchers' work, and (2) create sample databases to demonstrate how to conduct research using archival Web data, particularly the longitudinal studies this historical record makes possible.

The project is currently maintained on HUBZero. The project website provides access to all the source code for the tool and is open to researchers.

Click here for information on a conference I’ll be co-hosting on Internet Archive research at Harvard University.

The long-term aim of this project is to establish a vibrant and self-sustaining community of scholars examining the growth and evolution of communication and interaction in digital spaces. Furthermore, the research team will conduct an exemplar research project under the grant, focusing on the emergence and evolution of political organizations online during national election cycles. In addition to studying the organizations that are the focus of the demonstration project, the researchers will examine and report on their own organization as a community of multidisciplinary scholars learning to work together on data-intensive research.

Internet Archive. In the realm of social science, and across disciplines, archival Internet data represent a vast repository of untapped research potential. For public audiences, the Internet Archive repository has proved immensely popular; the public Wayback Machine interface to the Internet Archive serves 300,000 visitors a day, and more than 200 requests a second. Currently, the Internet Archive contains more than seven petabytes of data and offers a reliable historical record of Web sites dating from 1995 to the present. In terms of data availability, the Internet Archive is by far the largest digital source for historical research pertaining to the Web and its contents over time.

The Problem. Although IA contains billions of Web pages and has tremendous potential to facilitate research, research languishes because the search, crawl, and extraction functions are severely limited. For example, to browse IA resources, one generally needs to know the target uniform resource locator (URL). A researcher cannot run a full-text search of the entire back history of the Web, and the only way to download sets of related Web pages is to fetch them manually. Despite the potential for research using data from the Internet Archive, there has never been an article published using IA data in major academic journals such as Science, Nature, Journal of Communication, Academy of Management Journal, or the many other leading disciplinary journals. This reflects the current barriers to accessing large-scale data from the Internet Archive. Thus, a primary objective of this project is to tear down those barriers and simultaneously bring together a new community of researchers. In recent years, IA has initiated a series of funded projects aimed at making its data more amenable to large-scale research; this effort has included layering metadata into IA files. In combination with recent computing advances, such as the development of Hadoop computing technology, an opportunity presently exists to make significant advances toward fulfilling the potential of IA for social science research.
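
To make the URL-keyed access described above concrete, here is a minimal sketch in Python of how a researcher who already knows a target URL might address its archived captures. It is an illustration under stated assumptions, not part of HistoryTracker: the function names are hypothetical, example.com is a placeholder site, and the sketch only builds request addresses for the public Wayback Machine and its CDX capture index without contacting either service.

```python
from urllib.parse import urlencode

# Public Wayback Machine address prefix and CDX capture-index endpoint.
WAYBACK_PREFIX = "https://web.archive.org/web"
CDX_ENDPOINT = "https://web.archive.org/cdx/search/cdx"


def snapshot_url(target_url: str, timestamp: str) -> str:
    """Build the address of the capture nearest to `timestamp`.

    `timestamp` is a YYYYMMDDhhmmss string or any prefix of it (e.g. "2000").
    Hypothetical helper name; the target URL must be known in advance.
    """
    return f"{WAYBACK_PREFIX}/{timestamp}/{target_url}"


def capture_list_query(target_url: str, from_year: int, to_year: int) -> str:
    """Build a CDX query listing the captures of one known URL in a date range.

    Hypothetical helper name. Note that the query is still keyed on the URL:
    there is no parameter for searching page text.
    """
    params = {
        "url": target_url,
        "from": str(from_year),
        "to": str(to_year),
        "output": "json",
    }
    return f"{CDX_ENDPOINT}?{urlencode(params)}"


if __name__ == "__main__":
    # example.com is a placeholder, not a site studied by the project.
    print(snapshot_url("example.com", "2000"))
    print(capture_list_query("example.com", 1996, 2000))
```

Both helpers take the target URL as their first argument, which reflects the limitation discussed above: longitudinal access is possible one known address at a time, but assembling a set of related pages requires the researcher to supply every address.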

HistoryTracker. The prototype tool will enable analysis of unprecedented amounts of Web-related data in the social sciences. Of particular importance, HistoryTracker will give researchers the opportunity to understand how the Internet is changing over time. Mastering research with large-scale data, both in terms of building usable sets of Web pages and visualizing the networks of topics contained on the Web historically, will be one of the major challenges for scholars in the years to come as the Web continues to grow as a key source of social information. Most efforts to understand the structure and impact of the Web have so far been limited to cross-sectional snapshots. However, with the participation of IA, this research project will design and verify a tool to extract data based on defined inputs, and thus create a virtual observatory of the changing constellations of social information.