I’ve just wrapped up hosting the 2nd Archives Unleashed conference at the Library of Congress. This is the second event I’ve held in collaboration with Jimmy Lin and Ian Milligan at the University of Waterloo, with support from the National Science Foundation and the Social Sciences and Humanities Research Council.
Details on the event can be found here. We were able to bring 50 junior faculty and graduate students together for the 3-day datathon. Students were provided with training and tutorials, and then were given 2 days to work in interdisciplinary teams to build new research projects.
The datathon was followed by a 1-day public symposium on “Saving the Web,” which featured a keynote by Vint Cerf and focused on the need for wide-scale policy on digital content preservation.
As part of my current research on organizational change, my research team has been working to set up a new repository for social scientists to access Internet Archive data. I’m happy to announce that we’re now live with a beta version of our data repository, ArchiveHub. ArchiveHub (archivehub.rutgers.edu) is a community data-sharing site built on the HubZero platform.
In the short run, the site is a work in progress. Under the Resources tab, you’ll find data on congressional websites and Occupy Wall Street. The datasets are currently structured for social network analysis and follow a standard link structure: source, destination, date, frequency, and any associated text that describes the link. Each dataset includes a readme file with additional details. In the coming months we’ll be releasing more datasets; a file under the FAQ section describes the general datasets that are available.
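To make the link structure concrete, here is a minimal sketch of how a dataset in that format might be read and aggregated into weighted edges for network analysis. The sample rows and column names are hypothetical illustrations of the layout described above, not actual ArchiveHub data; check each dataset’s readme for the real schema.

```python
import csv
import io
from collections import defaultdict

# Hypothetical sample rows in the described layout:
# source, destination, date, frequency, and text describing the link.
SAMPLE = """source,destination,date,frequency,text
house.gov,senate.gov,2012-03-01,4,budget vote
occupywallst.org,nycga.net,2011-10-15,9,general assembly
house.gov,senate.gov,2012-04-02,2,committee report
"""

def edge_weights(fileobj):
    """Sum link frequencies into weighted edges: (source, destination) -> total."""
    weights = defaultdict(int)
    for row in csv.DictReader(fileobj):
        weights[(row["source"], row["destination"])] += int(row["frequency"])
    return dict(weights)

weights = edge_weights(io.StringIO(SAMPLE))
print(weights[("house.gov", "senate.gov")])  # -> 6
```

The resulting edge list drops the per-link dates and text, keeping only aggregate tie strength; for a temporal or content-aware analysis you would carry those columns through instead.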
The long-term vision is for the site to serve as a repository where social scientists interested in archival Internet data can both (a) access hosted datasets and (b) post their own versions of the data as they manipulate and modify it to fit their research needs.
In the meantime, we’re always looking for suggestions about what would be useful to others. I’m working on getting our Media dataset live very soon; that collection contains 20TB of data, so it’s taking a while to work through it.