Using Wayback Machine for Research

Published by Nicholas Taylor on 26 October 2012

Prompted by questions from Library of Congress staff on how to more effectively use web archives to answer research questions, I recently gave a presentation on "Using Wayback Machine for Research" (PDF). I thought that readers of The Signal might be interested in this topic as well. This post covers the outline of the presentation.

The Wayback Machine that many people are familiar with is the Internet Archive Wayback Machine. The Internet Archive is an NDIIPP partner and a Founding Member of the International Internet Preservation Consortium. Its mission includes creating an archive of the entire public web; the Wayback Machine is the interface for accessing it.

While the Internet Archive has been primarily responsible for the development of Wayback Machine, it is an open source project. Internet Archive also devised the name "Wayback Machine"; it is a reference to The Rocky & Bullwinkle Show's homophonous "WABAC" Machine, a time machine itself named in the convention of mid-century mainframe computers (e.g., ENIAC, UNIVAC, MANIAC). The contemporary Wayback Machine thus appropriately evokes both the idea of traveling back in time and powerful computing technology (necessary for web archiving).

Internet Archive's Wayback Machine is just one among many, however; over half of the web archiving initiatives listed on Wikipedia provide access via Wayback Machine. It is the most common software used to "replay" the contents of ISO-standard Web ARChive (WARC) file containers.

Wayback Machine performs this feat by dynamically rewriting the links it encounters on archived webpages to point to other resources in the archive. It does an admirable job at this, but because websites vary so widely in how they are built, it may have trouble replaying particular webpages or webpage elements. JavaScript-driven features, for example, are especially problematic.
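To make the idea of link rewriting concrete, here is a minimal sketch in Python. It is not Wayback Machine's actual implementation; the `ARCHIVE_PREFIX` value, the `rewrite_links` function, and the regular expression are all assumptions made for illustration, and the sketch only handles quoted `href` and `src` attributes.

```python
import re
from urllib.parse import urljoin

# Assumed replay prefix, for illustration only.
ARCHIVE_PREFIX = "https://web.archive.org/web"

# Matches quoted href="..." or src="..." attributes (a deliberate simplification).
LINK_ATTR = re.compile(
    r'''(?P<attr>\b(?:href|src))=(?P<q>["'])(?P<link>.*?)(?P=q)''',
    re.IGNORECASE,
)

def rewrite_links(html, page_url, timestamp):
    """Point every http(s) link in html at the archived copy for timestamp."""
    def repl(match):
        # Resolve relative links against the page's original URL.
        absolute = urljoin(page_url, match.group("link"))
        if not absolute.startswith(("http://", "https://")):
            return match.group(0)  # leave mailto:, javascript:, etc. untouched
        quote = match.group("q")
        return f'{match.group("attr")}={quote}{ARCHIVE_PREFIX}/{timestamp}/{absolute}{quote}'
    return LINK_ATTR.sub(repl, html)

page = '<a href="/about/">About</a><img src="logo.gif">'
print(rewrite_links(page, "http://www.loc.gov/index.html", "20060101000000"))
```

A static rewrite like this only sees links that are written out in the page's markup. Links that a script assembles at runtime never appear as plain attributes, which is one way to picture why JavaScript-driven features are harder to replay.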

Understanding the basic mechanics of Wayback Machine makes it easier to navigate around within a web archive. For example, the URL can be modified to request particular resources, show the time coverage for particular resources in the archive, or show all archived resources from a particular domain. Since Wayback Machine can only replay specifically requested URLs, it is difficult to access past versions of a webpage if that webpage changed URLs at some point and there was no redirect in place.
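The three kinds of URL modification described above can be sketched as small helper functions. The URL patterns below are the ones used by the Internet Archive's public Wayback Machine as I understand them; other Wayback Machine installations may use a different prefix, and the function names are hypothetical.

```python
# Assumed prefix for the Internet Archive's Wayback Machine; other
# installations will differ.
WAYBACK_PREFIX = "https://web.archive.org/web"

def capture_url(url, timestamp):
    """Request a particular resource as captured near timestamp.

    timestamp is a string of digits in YYYYMMDDhhmmss order; it can be
    shortened (e.g. "2006") to ask for a capture near that period.
    """
    return f"{WAYBACK_PREFIX}/{timestamp}/{url}"

def coverage_url(url):
    """Show the time coverage (all capture dates) for one resource."""
    return f"{WAYBACK_PREFIX}/*/{url}"

def domain_listing_url(domain):
    """List archived resources whose URLs begin with the given domain."""
    return f"{WAYBACK_PREFIX}/*/{domain}/*"

print(capture_url("http://www.loc.gov/", "2006"))
print(coverage_url("http://www.loc.gov/"))
print(domain_listing_url("loc.gov"))
```

Because each helper only builds a string, the results can be pasted directly into a browser's address bar to try them out.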

2006 Library of Congress website displayed in the Internet Archive Wayback Machine

The presentation offers a couple of examples of how these basic techniques could be used to find specific information in a web archive. The first example explores a strategy for finding a webpage whose historical URL is unknown by navigating to another webpage in the archive that is likely to link to it. The second example demonstrates that the conceptual organization of websites persists longer than their precise URL structure. This trend can be used to access content that was previously publicly available but has since been moved to a private section of a website.

Of course, it may not even be necessary to consult web archives in the first place. Recent research (PDF) suggests that ostensibly missing resources on the live web have more often been moved than removed. The Synchronicity Firefox add-on, based on technology from the NDIIPP-funded Memento project, leverages web archives to help find the resource's new location. If that fails, the MementoFox Firefox add-on can help to find the web archive with the best coverage for the desired resource and time range.
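For readers curious about what the Memento approach looks like on the wire, here is a rough sketch of how a client might ask for a past version of a resource. It assumes the Memento convention of sending the desired date in an `Accept-Datetime` request header to a "TimeGate" URL; the `timegate_request` function and the `timegate.example.org` address are hypothetical, and the sketch only builds the request rather than sending it.

```python
from datetime import datetime, timezone
from email.utils import format_datetime

def timegate_request(timegate_base, original_url, when):
    """Build the URL and headers for a Memento-style datetime negotiation.

    timegate_base is a placeholder for whichever TimeGate is being queried.
    when must be a timezone-aware datetime; it is converted to GMT.
    """
    http_date = format_datetime(when.astimezone(timezone.utc), usegmt=True)
    headers = {"Accept-Datetime": http_date}
    return timegate_base + original_url, headers

url, headers = timegate_request(
    "https://timegate.example.org/",  # hypothetical TimeGate address
    "http://www.loc.gov/",
    datetime(2012, 10, 26, tzinfo=timezone.utc),
)
print(url)
print(headers)
```

The appeal of this design is that the original URL stays the same; only the requested date changes, so a tool can query several archives in the same way and compare their coverage.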

Click on the following links to learn more about Web Archiving at the Library of Congress or to view the Library of Congress Web Archives, displayed (naturally) via Wayback Machine.

Crossposted to The Signal