Harvesting and Preserving the Future Web: Content Capture Challenges

Following our earlier summary of the recent International Internet Preservation Consortium General Assembly, I thought I'd share some of the insights from the workshop, "Harvesting and Preserving the Future Web".

The workshop was divided into three topics:
  1. capture: challenges in acquiring web content;
  2. replay: challenges in recreating the user experience from the archived content; and
  3. scale: challenges in doing both of these at web scale.
I'll be talking about capture here, leaving replay and scale for a second post.

Capture

Kris Carpenter Negulescu from the Internet Archive cued up the session with an overview (PDF) of challenges to capturing web content. She noted that the web-as-a-collection-of-documents is rapidly becoming something much more akin to a programming environment, marked by desktop-like interactive applications, complex service and content mashups, social networking services, streaming media, and immersive worlds.

Kris also provided an overview of current and prospective strategies for tackling these challenges: making the traditional web crawler behave more like a browser; integrating diverse approaches into unified workflows; designing and coding new custom tools; recording screenshots to capture look-and-feel; recording video of user interactions; and depositing web content into archives.
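
To make the screenshot strategy concrete: with a headless browser such as PhantomJS (which comes up again below), recording a page's rendered look-and-feel amounts to loading the page and rendering it to an image. The sketch below is illustrative only; it is written as TypeScript that would be compiled down to the plain JavaScript PhantomJS actually runs, and the URL, viewport size, and output filename are arbitrary choices rather than anything from the workshop.

```typescript
// Illustrative only: load a page in a headless browser and save a screenshot that
// approximates its rendered look-and-feel.
declare const phantom: { exit(code?: number): void };
declare function require(id: string): any;

const page = require("webpage").create();
page.viewportSize = { width: 1280, height: 1024 }; // arbitrary desktop-sized viewport

page.open("http://example.org/", (status: string) => {
  if (status === "success") {
    page.render("example-org.png"); // writes a PNG of the page as rendered
  }
  phantom.exit();
});
```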

Adam Miller, also from the Internet Archive, explained his use of PhantomJS, a headless browser, to identify JavaScript links and trigger AJAX content that might otherwise be opaque to Heritrix. Herbert van de Sompel from the Los Alamos National Laboratory Research Library followed, presenting (PDF) a non-traditional web archiving paradigm called transactional web archiving. Instead of dispatching a client crawler, a transactional archive-enabled web server would "archive itself" by mirroring its HTTP responses to an archive.
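
Adam's talk described the approach rather than sharing code, but the general pattern is easy to sketch. The example below is a hypothetical illustration, not his script: TypeScript that would be compiled to the plain JavaScript PhantomJS runs, with an arbitrary three-second wait for client-side scripts and a simple logging format. URLs surfaced this way could then be fed back into a conventional crawler such as Heritrix.

```typescript
// Illustrative sketch: load a page headlessly, let its scripts run, then report both
// the links present in the final DOM and every URL the page requested (including AJAX).
declare const phantom: { exit(code?: number): void };
declare function require(id: string): any;

const page = require("webpage").create();
const system = require("system");
const url: string = system.args[1] || "http://example.org/";

// Log every subresource the page fetches; AJAX endpoints discovered here could be
// handed to a conventional crawler's frontier.
page.onResourceRequested = (requestData: { url: string }) => {
  console.log("requested: " + requestData.url);
};

page.open(url, (status: string) => {
  if (status !== "success") {
    console.log("failed to load " + url);
    phantom.exit(1);
    return;
  }
  // Give client-side scripts a moment to insert content before inspecting the DOM.
  setTimeout(() => {
    const links: string[] = page.evaluate(function () {
      // Runs inside the page: collect every anchor href, including JavaScript-added ones.
      var anchors = document.querySelectorAll("a[href]");
      var hrefs: string[] = [];
      for (var i = 0; i < anchors.length; i++) {
        hrefs.push((anchors[i] as HTMLAnchorElement).href);
      }
      return hrefs;
    });
    links.forEach((link) => console.log("link: " + link));
    phantom.exit(0);
  }, 3000);
});
```

Herbert's transactional model was likewise described at the architecture level. Purely to illustrate the idea of a server that "archives itself", here is a hypothetical Node/TypeScript origin server that mirrors a copy of each response it serves to an archive endpoint; the endpoint URL, the JSON record format, and the fire-and-forget error handling are assumptions made for the sketch, not details of any real system.

```typescript
// Illustrative sketch only: an origin server that mirrors each HTTP response it
// serves to an archive. The endpoint and record format are invented for this example.
import * as http from "http";

const ARCHIVE_ENDPOINT = "http://archive.example.org/record"; // hypothetical archive

function mirrorToArchive(url: string, status: number,
                         headers: http.OutgoingHttpHeaders, body: string): void {
  const record = JSON.stringify({
    url,
    status,
    headers,
    body,
    capturedAt: new Date().toISOString(),
  });
  const req = http.request(ARCHIVE_ENDPOINT, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
  });
  req.on("error", () => { /* archiving failures must never break normal serving */ });
  req.end(record);
}

http.createServer((req, res) => {
  // Serve the response as usual...
  const body = "<html><body>Hello from the origin server</body></html>";
  const headers: http.OutgoingHttpHeaders = { "Content-Type": "text/html" };
  res.writeHead(200, headers);
  res.end(body);

  // ...then asynchronously send the archive a copy of exactly what the client received.
  mirrorToArchive(`http://${req.headers.host}${req.url}`, 200, headers, body);
}).listen(8080);
```

In a real deployment the mirrored records would more plausibly be batched and written as WARC records, but the essential point is the same: capture happens at response time, on the server, rather than through a later crawl.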

Even where the technical challenges of capturing content can be overcome, there was consensus that increasing personalization and integration of third-party services was eroding the notion that any sort of canonical user experience could be archived. David Rosenthal from the LOCKSS Program expressed this sentiment most eloquently with the comment, "we may have to settle for capturing 'a' user experience rather than 'the' user experience."

On the other hand, it could be said that what has been archived of the web so far hasn't been as generic as supposed, given existing customization based on geography, user-agent, and cookies. Kris took this point further, arguing that we should be wary of assuming that archives have ever been representative of a "universalized" social experience; some of the oldest preserved documents reflect only the upper classes of ancient Egyptian society.

My takeaway from the session was that the field needs not only innovative technical approaches to capturing content but also an evolution in our understanding of what it means to "archive the web."

For another account of the session, please see also David Rosenthal's write-up.

Crossposted to The Signal. Licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.