Harvesting and Preserving the Future Web: Replay and Scale Challenges

Published by Nicholas Taylor on 18 June 2012

This is the second part of a two-post recap of the "Harvesting and Preserving the Future Web" workshop at the recent International Internet Preservation Consortium General Assembly.

The session was divided into three topics:

Capture: challenges in acquiring web content;
Replay: challenges in recreating the user experience from the archived content; and
Scale: challenges in doing both of these at web scale.

Having covered the topic of capture previously, this post addresses replay and scale.

Replay

Andy Jackson from the British Library's UK Web Archive explained how he enhanced the Wayback Machine's archival replay to allow for in-page video playback using FlowPlayer and re-enabled dynamic map services using OpenStreetMap. These enhancements unquestionably provide a better user experience than the alternatives (video split off into a separate archive and static Google Maps image tiles, respectively). Still, reintroducing technologies that were technically absent from the archive prompted questions about what it means to "recreate the user experience."

In the ensuing conversation, David Rosenthal from the LOCKSS Program questioned whether in the context of long-term preservation it would be useful to conceptualize the "user experience" as extending beyond the boundaries of the website itself; perhaps the Wayback Machine should be re-engineered to serve the archived website within an emulated contemporaneous browser, for example.

Responding to this suggestion, Bjarne Andersen from Netarchive.dk, the Danish national web archiving program, noted that SCAlable Preservation Environments (SCAPE) will be built to be sensitive to the differences in viewing a website in different generations of browsers; SCAPE will compare screenshots to determine whether the contemporary rendering remains faithful to the website's historical appearance.
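
To make the idea of comparing screenshots concrete, here is a minimal sketch that measures the fraction of pixels that differ between two renderings. It is an illustration only, not a description of how SCAPE works: the function name, the representation of a screenshot as a grid of pixel values, and the 5% tolerance are all assumptions made for the example.

```python
def pixel_difference_ratio(reference, candidate):
    """Return the fraction of pixels that differ between two screenshots.

    Each screenshot is a list of rows, and each row is a list of pixel
    values (for example, (r, g, b) tuples). This representation is an
    assumption made for the sketch.
    """
    if len(reference) != len(candidate) or any(
        len(ref_row) != len(cand_row)
        for ref_row, cand_row in zip(reference, candidate)
    ):
        raise ValueError("screenshots must have the same dimensions")

    total = sum(len(row) for row in reference)
    differing = sum(
        ref_pixel != cand_pixel
        for ref_row, cand_row in zip(reference, candidate)
        for ref_pixel, cand_pixel in zip(ref_row, cand_row)
    )
    return differing / total if total else 0.0


# Hypothetical 2x2 renderings of the same page in two browsers.
WHITE, BLACK = (255, 255, 255), (0, 0, 0)
historical = [[WHITE, WHITE], [BLACK, BLACK]]
contemporary = [[WHITE, WHITE], [BLACK, WHITE]]

ratio = pixel_difference_ratio(historical, contemporary)
is_faithful = ratio <= 0.05  # the tolerance is an arbitrary assumption
```

A real comparison would likely need to tolerate harmless differences such as font anti-aliasing, which an exact per-pixel test like this one does not.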

My takeaway from the session was that robust preservation of the user experience may require more than simply replaying whatever happens to be in the archive.

Scale

Readers of The Signal will be familiar with some of our past discussions of the challenges of scale.

In the first panelist presentation, Gordon Mohr, lead architect of Heritrix, noted that the dilemma of scaling storage has been "solved" with the maturation of cloud services. The caveat was that funding has thus become the more fundamental scale limitation. Aaron Binns of the Internet Archive argued that the long-term financial sustainability of digital preservation often doesn't receive as much attention as the infrastructure challenges, and the Blue Ribbon Task Force on Sustainable Digital Preservation and Access was commissioned for this reason.

Returning to the discussion of infrastructure-related scale challenges, Rob Sanderson from the Los Alamos National Laboratory Research Library cited the difficulty of maintaining synchronization of resources located at multiple network endpoints. The dominant protocol of the web, HTTP, is poorly suited to synchronizing resources that are either large (because failed transfers have to be re-initiated from the beginning) or rapidly changing (because requests can't necessarily be submitted quickly enough to keep pace with the rate of change). To address these challenges, the Research Library is working on a new framework called ResourceSync.
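
The rapid-change half of that problem can be illustrated with a toy model. The sketch below assumes a resource that produces a new version at a fixed interval and a client that polls at a fixed interval, both in whole seconds; the function name and the numbers are hypothetical, and the model says nothing about how ResourceSync itself works.

```python
def observed_fraction(change_interval, poll_interval, duration):
    """Fraction of a resource's versions that a polling client ever sees.

    Toy model (all assumptions): the resource starts at version 0 and
    moves to a new version every `change_interval` seconds; the client
    polls at t = 0, poll_interval, 2 * poll_interval, ... up to
    `duration` seconds, and each poll observes the current version.
    """
    total_versions = duration // change_interval + 1
    seen = {t // change_interval for t in range(0, duration + 1, poll_interval)}
    return len(seen) / total_versions


# Resource changes every second, client polls every 5 seconds.
too_slow = observed_fraction(change_interval=1, poll_interval=5, duration=100)

# Resource changes every 5 seconds, client polls every second.
fast_enough = observed_fraction(change_interval=5, poll_interval=1, duration=100)
```

In this model, a client polling five times more slowly than the resource changes sees only about a fifth of the versions, while a client polling faster than the rate of change sees all of them.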

As a striking aside, Youssef El Dakar from the Bibliotheca Alexandrina noted that simply regenerating checksums for their 100 TB of web archives would take an entire year with their current infrastructure.
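
For a sense of what that figure implies, the sketch below works out the sustained throughput at which 100 TB takes one year, and shows one common way to checksum a large file in fixed-size chunks. The decimal terabytes, the 365-day year, the choice of SHA-256, and the function names are all assumptions for illustration; the presentation did not specify the checksum algorithm or the setup.

```python
import hashlib
import io

SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # assumes a 365-day year


def implied_throughput_mb_per_s(total_bytes, seconds):
    """Sustained rate (in decimal MB/s) needed to process total_bytes in seconds."""
    return total_bytes / seconds / 1e6


def sha256_of_stream(stream, chunk_size=1024 * 1024):
    """Checksum a binary stream in chunks so it never has to fit in memory."""
    digest = hashlib.sha256()
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        digest.update(chunk)
    return digest.hexdigest()


# 100 TB, taken here as 100 * 10**12 bytes, spread over one year.
rate = implied_throughput_mb_per_s(100 * 10**12, SECONDS_PER_YEAR)
print(f"{rate:.2f} MB/s")  # prints "3.17 MB/s"

# Usage on an in-memory stream standing in for an archive file.
example_checksum = sha256_of_stream(io.BytesIO(b"archived bytes"))
```

Under these assumptions, a year-long pass corresponds to an effective end-to-end rate of roughly 3 MB/s.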

My takeaway from the session was that scale is the least tractable of the three sets of challenges; it's easier to imagine that technical breakthroughs might make a significant difference for either capture or replay, but resource scarcity is a more fundamental problem.

Crossposted to The Signal