From the Library of Congress

I’ve just wrapped up hosting the 2nd Archives Unleashed conference at the Library of Congress. This is the second event I’ve held, working with Jimmy Lin and Ian Milligan at the University of Waterloo, with support from the National Science Foundation and the Social Sciences and Humanities Research Council.

Details on the event can be found here. We were able to bring 50 junior faculty and graduate students together for the 3-day datathon. Students were provided with training and tutorials, and then were given 2 days to work in interdisciplinary teams to build new research projects.

The datathon was followed by a 1-day public symposium on “Saving the Web.” The symposium featured a keynote by Vint Cerf, and focused on the need for wide-scale policy regarding digital content preservation.

Hacking at the Archives

Live from the Robarts Library at University of Toronto, Toronto, CA, we’ve just kicked off the first of the Archives Unleashed datathon series. This is a series of events that I’m co-hosting with Ian Milligan and Jimmy Lin at University of Waterloo, Nathalie Casemajor at UQO and Nich Worby at University of Toronto.

We’re supported by generous funding from the Social Sciences and Humanities Research Council, as well as the National Science Foundation (Award #1624067).

Archives Unleashed is a datathon event intended to educate graduate students and faculty on the various ways in which new advances in computing can be used to conduct research using Web archive data. Details on this first event can be found here, and we’ll be hosting a second event in June at the Library of Congress – details on that event are here. It has been amazing to see the Web archive research community grow, and this event is just the next step in that progression – I’ll be sharing the outcomes once we wrap!

Culpability when publishing goes awry

From the dregs of academia… a recent study published in Science Magazine claimed to show that with regard to gay marriage, voters could be influenced by something as simple as a conversation with a gay vote canvasser (quick summary – but the full study is here).

The study is certainly interesting, but it appears that some of the data may have been falsified. This is still an unfolding story, so I don’t want to rush to judgment.

My issue here is with some of the commentary regarding the story. I was listening to NPR last night, and Kenneth Prewitt, the incoming head of the American Academy of Political and Social Science, was interviewed. During the interview, he was asked if the faculty member (and 2nd author) on the study was culpable given that the graduate student (and 1st author) had been responsible for the data. Prewitt stated that the faculty member was NOT culpable, and was not responsible for the data.

The idea that a faculty member – or any author – is not responsible for knowing the data in a journal article strikes me as unacceptable. I believe that if you’re working on a journal article that is going to be published – particularly with graduate students – you have an obligation to understand all aspects of the data. Yes – that sets a high bar – but the bar should be high.

Faculty members who work with graduate students are responsible for teaching those graduate students about proper research protocol, the nuances of research and data analysis, and the core tenets of research publication. Data are often messy, and a lot can go wrong in the process of moving from raw data to publication. Moreover, the method of data collection matters as well, and can have a significant impact on the resulting analysis. To trust a co-author to take responsibility for these choices is setting the stage for problems further down the road.

All of this is not to say that Donald Green, the 2nd author, behaved in this way. My issue here is with Prewitt’s statement that Green is not culpable. He is an author on the study. He is culpable. What Green’s culpability means in the long run is a subject for debate and investigation, and I’m not passing judgment beyond stating that I believe he should be held responsible.

As I step off my soap box, I’ll close by reinforcing the growing call for transparency in all aspects of research. I’m starting to work more and more with large datasets and algorithmic data collection. My team will be publishing all of our code and raw datasets to GitHub and a public data repository so that others can replicate our work. I hope others will increasingly follow suit.

New research at NCA; updates from an island.

I’ve updated this page to include upcoming presentations at the National Communication Association (NCA). I’ll be presenting work on the Occupy Wall Street movement, which leverages website tracking data extracted from the NSF Internet Archive project. This is new research, and looks at the organizational structure created by the connections between websites. In addition, Heewon Kim, a graduate student at Rutgers, will be presenting our research on virtuality in a large technology consulting organization.

Looking to next steps, what better way to percolate new grant ideas than to lock up a bunch of academics on an island for a few days? I’ve had the privilege of spending the past few days on Bainbridge Island just outside of Seattle at an NSF-sponsored workshop on large-scale data, politics and organizations (among other topics). The location is beautiful, and it’s a wonderful opportunity to reflect on research and develop new ideas. Among other things, it’s given me a chance to think about upcoming planned research on the impact of algorithms on the operation of newsrooms; we’ve had a lot of discussions about the impact that algorithms have had on automating the production and distribution of news content in many major newsrooms. NYT is one example of a newsroom where this has been occurring, but it also happens on a much smaller scale (see, for instance, the NJ Star Ledger).

At organizations like the Star Ledger, the majority of decisions regarding coverage of news are still made in the page one meetings, and by the editors – this generally hasn’t changed across the board. However, during the day the order of articles throughout the website is somewhat fluid. Stories change in terms of prominence on a web page based on the topic, the number of clicks, the length of the article, and a number of other factors; the automatic organization of the website is algorithmically determined. In my opinion, the most important aspect of this trend is that the algorithm is designed and implemented by the engineering team, and the editorial board is generally not directly involved in the process. Thus, the process of news distribution is automated without editorial oversight.
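To make the idea concrete, here is a minimal sketch of how a site might reorder stories from the kinds of signals mentioned above (topic, clicks, article length). Everything in it is an assumption for illustration: the `Story` fields, the topic weights, and the scoring formula are hypothetical, and none of it reflects how the Star Ledger or any other newsroom actually ranks content.

```python
from dataclasses import dataclass


@dataclass
class Story:
    headline: str
    topic: str
    clicks: int
    word_count: int


# Hypothetical topic weights chosen by engineers, not editors.
TOPIC_WEIGHT = {"politics": 1.5, "sports": 1.2}


def score(story: Story) -> float:
    """Combine clicks, topic, and length into a single prominence score."""
    topic_boost = TOPIC_WEIGHT.get(story.topic, 1.0)
    # Cap the length bonus so very long articles are not favored endlessly.
    length_bonus = min(story.word_count, 1000) / 1000
    return story.clicks * topic_boost + 100 * length_bonus


def rank(stories: list[Story]) -> list[Story]:
    """Return stories ordered from most to least prominent."""
    return sorted(stories, key=score, reverse=True)


stories = [
    Story("Weather update", "local", clicks=600, word_count=200),
    Story("Game recap", "sports", clicks=500, word_count=500),
    Story("Budget vote", "politics", clicks=400, word_count=800),
]
ranked_headlines = [s.headline for s in rank(stories)]
```

In this toy example the story with the fewest clicks ends up on top because of its topic weight and length, which is exactly the kind of choice that gets encoded in software without passing through an editorial meeting.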

I’ll start to post more on this as I continue to develop this work.

Photo: The morning view from Blakely Harbor, Bainbridge Island, WA.

ArchiveHub is Live (Beta release)!

As part of my current research on organizational change, my research team has been working to set up a new repository for social scientists to access Internet Archive data. I’m happy to announce that we’re now live with a Beta version of our data repository, ArchiveHub. ArchiveHub is a community data sharing site built on the HUBzero platform.

In the short run, the site is a work in progress. Under the resources tab, you’ll find data on congressional websites and Occupy Wall Street. The datasets are currently structured for social network analysis, and contain a standard link structure: source, destination, date, frequency, and any associated text that describes the link. Each dataset contains a README file that provides additional information. In the coming months, we’ll be releasing more and more datasets; there is a file under the FAQ section that describes the general datasets that are available, and provides additional information.
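For anyone who wants to start working with a link file like this, here is a rough sketch of reading the five fields and computing a simple network measure. It assumes the data are delimited as CSV with a header row using exactly those field names; the actual file layout may differ, so check each dataset’s README. The sample rows are made up for illustration and are not taken from ArchiveHub.

```python
import csv
import io
from collections import defaultdict

# Hypothetical sample rows; the column names and values are assumptions,
# not copied from an actual ArchiveHub file.
SAMPLE = """source,destination,date,frequency,text
a.example.org,b.example.org,2011-10-01,3,partner site
a.example.org,c.example.org,2011-10-01,1,news
b.example.org,c.example.org,2011-11-01,2,blogroll
"""


def load_edges(handle):
    """Read link records into a list of dicts, converting frequency to int."""
    edges = []
    for row in csv.DictReader(handle):
        row["frequency"] = int(row["frequency"])
        edges.append(row)
    return edges


def weighted_in_degree(edges):
    """Sum link frequency per destination site."""
    totals = defaultdict(int)
    for edge in edges:
        totals[edge["destination"]] += edge["frequency"]
    return dict(totals)


edges = load_edges(io.StringIO(SAMPLE))
in_degree = weighted_in_degree(edges)
```

To run this on a downloaded file, replace `io.StringIO(SAMPLE)` with an opened file handle. The same list of edge records can then be passed to whichever network analysis package you prefer.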

The long-term vision is that this site will provide a repository for social scientists who are interested in archival Internet data to both (a) access hosted datasets and (b) post their own versions of data as researchers continue to manipulate data and modify them to fit their needs.

In the meantime, we’re always looking for suggestions as to what would be useful for others. I’m working on getting data from our Media dataset live very soon; that repository contains 20TB of data, so it’s taking a while to plow on through it.

Summer, research, & good things to come.

It’s the end of another academic year, and the end of my third year at Rutgers University. This year was marked by my third year review, and provided a great opportunity to reflect on past accomplishments and to look forward to the next three years and the road to tenure. I’ve been hard at work on the Internet Archive data, and the team is starting to see the results of the past 1.5 years of hard work. On June 17 and 18 I’ll be co-hosting a conference at Harvard University to showcase the work we’ve been doing, and to bring together a community of researchers interested in this space.

In the meantime, I’ll be speaking on Big Data and social science research at the Annual Meeting of the International Communication Association in Seattle, WA, from May 25 – May 28, 2014. Later in the summer, I’ll also be speaking at the Annual Meeting of the Association for Education in Journalism and Mass Communication, and I’ll be a respondent at the Academy of Management’s Annual Meeting.

In the midst of the usual conference travel, I’m excited to begin a new series of on-site data collection at a major Fortune 500 corporation. In collaboration with a few other colleagues, I’ll be spending 5 weeks on-site at the organization collecting data about collaboration and knowledge sharing through the use of enterprise social media. As the research progresses, I’ll be posting updates here and on Twitter.