Having worked in web archiving and also in and around the law and technology, I'm interested in seeing that courts and legal professionals better understand and make appropriate determinations regarding Internet Archive Wayback Machine (IAWM) evidence.
To that end, it belatedly occurred to me that authoritative best practice guides might already exist, such as from The Sedona Conference. Indeed, I found their most substantial guidance in The Sedona Conference Commentary on ESI Evidence & Admissibility, Second Edition, pending publication sometime this year.
I read the section on IAWM with interest. It does a good job of introducing IAWM, discussing how it has been treated by some courts, and raising some reliability concerns. That said, I think there's room for improvement.
The Electronic Document Retention and Production Working Group evidently solicited public feedback on the draft, but unfortunately I've well missed the deadline to submit comments.
Looking ahead to whenever the next update will be, and to perhaps stimulate discussion in the meantime, I went ahead and submitted feedback to the Working Group anyway. I figured I might as well post an expanded version of those comments here, too.
Picking up on a few specific excerpts:
"Launched in 2001 by the nonprofit Internet Archive, the Wayback Machine is a digital archive of the web." p. 117
This is a good, succinct description of IAWM. For those not already familiar with it, it may also be worth noting its typical utility in legal proceedings: as a substitute for historical public web content that is no longer otherwise accessible, and as a means of establishing when information was publicly available on the web.
"Courts have occasionally taken judicial notice of the contents of these archived sites." p. 117
The incidence of courts taking judicial notice of IAWM evidence has definitely increased over time, though it is somewhat jurisdiction-specific. I think "occasionally" understates judicial notice of IAWM evidence as a trend; there are at least thirty-odd U.S. District Court cases going back a decade where IAWM evidence has been explicitly admitted via judicial notice.
"Now, the reliability of the Wayback Machine process may be established by a certificate of an Internet Archive official under Rule 902(13)." p. 117
This isn't just a contemporary development; an IA affidavit was successfully used for admission of IAWM evidence as early as 2004, in one of the earliest cases to rely upon IAWM evidence, Telewizja Polska U.S.A., Inc. v. Echostar Satellite Corp., 02 C 3293 (N.D. Ill. Oct. 15, 2004). There is a good track record for IAWM evidence being admitted when accompanied by an IA affidavit.
Apart from judicial notice and an IA affidavit, IAWM evidence is also sometimes authenticated by a witness with personal knowledge of the historical web content, under Federal Rule of Evidence 901(b)(1) – e.g., United States v. Bansal, 663 F.3d 634 (3d Cir. 2011); Foreword Magazine, Inc. v. Overdrive, Inc., 1:10-cv-1144 (W.D. Mich. Oct. 31, 2011) – or by an expert witness, under Federal Rule of Evidence 702 – e.g., Khoday v. Symantec Corp., 93 F. Supp. 3d 1067 (D. Minn. 2015); Marten Transp., Ltd. v. Plattform Advertising, Inc., 184 F. Supp. 3d 1006 (D. Kan. 2016).
"Although the Wayback Machine captures information..." p. 117
IAWM is a replay engine for archived web content. The web archives themselves are principally generated by archival web crawlers, not IAWM. The minor exception is IAWM's Save Page Now feature.
"...what it actually memorializes is inconsistent." p. 117
This is partly correct.
Given the size of the web, its rate of change, and the relative levels of funding for cultural heritage institutions, web archiving tends to be a best-effort enterprise. Constraints of system resources, crawl run time, uncooperative web servers, co-incident web traffic, fleeting network interruptions, selective quality assurance checks, shifting robot exclusion rules, and so on mean that no recurring crawl results in the archiving of all of its hypothetical candidate content, or even the same content, for any given run.
There are, however, many distinct subsets of archived web content populating IAWM that are collected using more consistent approaches. Examples include the U.S. Presidential End of Term crawls, national domain crawls, and Archive-It collections. The more focused and curated nature of these collections means that their content tends to be more systematically archived.
"Moreover, users can ask that the archive delete or change information..." p. 118
This is incorrect or, at best, highly imprecise.
According to IA's current help documentation, site owners may request that their site's pages be "excluded or removed" from IAWM. This is more than likely a de-indexing process, which leaves the archived content intact on IA's servers but disables access to it in IAWM, and which also excludes the indicated domain(s) from prospective crawls.
Neither process alters the stored contents of IAWM; the only modification made is through the insertion of IAWM code at time of access, in the user's browser, to produce the IAWM banner and to rewrite links to point back into the archive.
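To make the replay-time rewriting concrete, here is an illustrative sketch (not IA's actual code) of the URL scheme it produces. Every IAWM replay URL embeds a 14-digit capture timestamp, and links found in an archived page are rewritten to resolve within the archive rather than against the live web:

```python
# Illustrative sketch of IAWM's replay-time link rewriting: a link in
# an archived page is redirected back into the archive at the capture
# timestamp being viewed. (Function name is mine, not IA's.)

WAYBACK_PREFIX = "https://web.archive.org/web"

def to_archive_url(original_url: str, timestamp: str) -> str:
    """Build a Wayback Machine replay URL.

    `timestamp` is the 14-digit capture time (YYYYMMDDhhmmss) that
    appears in every IAWM URL.
    """
    return f"{WAYBACK_PREFIX}/{timestamp}/{original_url}"

# A link found in an archived page...
live_link = "https://example.com/about.html"
# ...is rewritten at access time to resolve within the archive:
print(to_archive_url(live_link, "20200101000000"))
# https://web.archive.org/web/20200101000000/https://example.com/about.html
```

The underlying archived bytes are untouched; the rewriting happens on delivery, which is part of why replayed pages can differ from what a crawler originally stored.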
Part of the historical basis for this was clearer in a previous version of IA's FAQs: it at least used to be the case that a site's contemporary robot exclusion rules disabled retroactive access to its archived content in IAWM.
In recent years, IA has been exercising more discretion with respect to robots.txt directives that would otherwise prevent archiving and archival access. The robots.txt standard is, in any case, voluntary, and the foundational policy for IAWM also heavily emphasized discretion in considering exclusion requests.
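For readers unfamiliar with the mechanism, a robots exclusion is just a plain-text file at a site's root. A directive like the following, addressed to IA's historically documented crawler user-agent, is the kind of "toggle" at issue, and its observance is entirely voluntary on the crawler's part:

```
# robots.txt at https://example.com/robots.txt
# Asks IA's crawler (user-agent "ia_archiver") not to fetch any path.
User-agent: ia_archiver
Disallow: /
```

Under IA's older policy, adding such a rule would also have hidden the site's previously archived captures from IAWM users, and removing it would have made them visible again, without the stored archive itself ever changing.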
It's otherwise inconsistent with IA's archiving mission, and technically impractical, for it to be regularly deleting or changing the contents of the web archive with every robots exclusion toggle it encounters. If IA were amenable to deleting or changing content based on user requests, it's hard to imagine that IAWM evidence would be as generally accepted by the courts as it is.
"...This led at least one court to find that a party could not show that data from the archive was 'reliable, complete, and admissible in court.' As a result, the Wayback Machine is not accepted as a forensic evidence collection method." p. 118
My understanding of the cited case, Leidig v. BuzzFeed, Inc., 16 Civ. 542 (VM) (GWG) (S.D.N.Y. Dec. 19, 2017), is that plaintiffs were under threat of sanction for having failed to preserve responsive historical web content and argued that the same content being available via IAWM should be a sufficient substitute. They did not appear, however, to provide any basis for the court to conclude that the IAWM evidence was "reliable, complete, and admissible".
That is to say, the court didn't admit the IAWM evidence because no attempt was made at establishing an appropriate foundation, which has been an unremarkable outcome for many other cases in which parties likewise assumed that IAWM evidence was self-authenticating or where they were just tacitly counting on judicial notice – e.g., Novak v. Tucows, Inc., 06-CV-1909 (JFB) (ARL) (E.D.N.Y. Mar. 26, 2007); St. Luke's Cataract v. Sanderson, 573 F.3d 1186 (11th Cir. 2009); Keystone Retaining Wall Sys. Inc. v. Basalite Concrete Prods. LLC, 10-CV-4085 (PJS/JJK) (D. Minn. Dec. 19, 2011); Setai Hotel Acquisition, LLC v. Miami Beach Luxury Rentals, Inc., 16-21296-Civ-Scola (S.D. Fla. Aug. 15, 2017).
As a counterpoint, in Rutherford v. Evans Hotels, LLC, 18-CV-435 JLS (MSB) (S.D. Cal. Sep. 3, 2020), the judge denied admission of web content archived by a litigation / records management web archiving company in part based upon its operation being relatively less transparent and more interested in case outcomes than IA: "Coupled with the differences between PageFreezer and the Internet Archive—including that the Internet Archive offers a free, publicly available service whereas PageFreezer offers a paid service in anticipation of litigation—the Court declines to admit the PageFreezer archives as analogous to those from the Wayback Machine admitted in other litigation." p. 22
"The ISO 28500 WARC (Web ARChive) standard, established by the International Internet Preservation Consortium, addresses authentication issues by making it possible to obtain an exact native file of the collected content of a website." p. 118
As the industry standard developed and adopted by all of the major cultural heritage web archiving players (i.e., IA, national libraries, and major research university libraries), it doesn't hurt that the WARC format is mentioned. In the case of litigation / records management-focused web archiving companies, I could see how it would provide for more authoritative attestations regarding web archive evidence.
That said, the technologies employed to capture and replay web content, their configuration, their interaction with specific websites, etc. have at least as much bearing on admissibility concerns as the mere fact of using WARC as a storage format.
For example, if a crawl is configured such that the date of capture of individual webpages on a given site is staggered by multiple days (this is not uncommon with large-scale crawls using the Heritrix archival web crawler, given how the crawler queues discovered resources for capture in its default configuration), there is a greater chance that content across an individual site may have changed in the intervening time. WARC doesn't itself assure that the content captured for a given site represents a temporally-coherent snapshot; it just provides the metadata to perform an assessment and justify assertions for the authentication of specific evidence.
Though the contents of IAWM are stored in WARC files, analysis and interpretation of that content is dependent on the affordances of IAWM. Given the predominant use of IAWM web archives in litigation, it'd be worth adding more detail about how IAWM works and some of the reliability concerns it presents (e.g., incompleteness, mixed provenance, temporal incoherence, live site "leakage", canonicality, among other reasons that what is presented in IAWM can rarely be taken at face value for evidentiary purposes).
The fact that web content presented in IAWM invariably enters the legal record as flattened screenshots makes it all the more important that additional, relevant context provided by IAWM and fundamentally enabled by WARC accompany them, by way of explanation and in service of proper authentication.
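One concrete source of such context is IA's public CDX API, which exposes per-capture metadata (timestamp, HTTP status, content digest) behind what IAWM replays. A hedged sketch of constructing such a query follows; the URL is only built here, not fetched:

```python
# Sketch of building a query against IA's CDX API, which returns
# capture-level metadata for a URL -- the kind of provenance detail
# that rarely survives into a flattened screenshot.
from urllib.parse import urlencode

CDX_ENDPOINT = "https://web.archive.org/cdx/search/cdx"

def cdx_query(url: str, from_ts: str, to_ts: str) -> str:
    """Build (but do not send) a CDX API query URL.

    `from_ts`/`to_ts` are timestamp prefixes (e.g. "2019" or a full
    14-digit YYYYMMDDhhmmss value).
    """
    params = {
        "url": url,
        "from": from_ts,
        "to": to_ts,
        "output": "json",
    }
    return f"{CDX_ENDPOINT}?{urlencode(params)}"

print(cdx_query("example.com/", "2019", "2020"))
```

Attaching this sort of capture metadata to a screenshot is one practical way the WARC-derived record can support, rather than merely accompany, an authentication argument.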
"The saved data is an identical replica of the website, with working links, graphics, and other dynamic content." p. 118
This is slightly misleading.
It would be more accurate to say that WARC stores the contents of a website as presented to and archived by a specific web capture agent. Many contemporary websites run on a database-backed content management system (e.g., WordPress). In these cases, archival crawlers can't create a replica of the website as it exists on the server – i.e., a content management system database. They instead archive the static representation of the website as presented – i.e., as a collection of webpages, images, scripts, stylesheets, etc.
Websites or web applications that have an even heavier server-side component may demand a more sophisticated suite of technologies than crawlers and WARC files alone to enable the encapsulation and re-presentation of something resembling the original website. That is to say that archiving all but the simplest websites necessarily involves abstracting them from their original context and server-side dependencies. Web archiving such as conducted by IA aims to preserve and re-present the website as it was presented, but not necessarily to produce an identical replica of how it was set up on a particular web server.
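A toy sketch can illustrate what an archival crawler actually captures from a database-backed site: not the CMS database, but the rendered HTML as served plus whatever embedded resources (images, scripts, stylesheets) it can discover in that HTML. This uses only the Python standard library; the page content is invented for illustration:

```python
# Toy illustration: from a rendered page, a crawler discovers and
# enqueues the embedded static resources. The server-side database
# that generated the page is never part of the archive.
from html.parser import HTMLParser

class ResourceExtractor(HTMLParser):
    """Collect URLs of embedded resources a crawler would enqueue."""
    def __init__(self):
        super().__init__()
        self.resources = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag in ("img", "script") and "src" in attrs:
            self.resources.append(attrs["src"])
        elif tag == "link" and attrs.get("rel") == "stylesheet" and "href" in attrs:
            self.resources.append(attrs["href"])

# The page as served (what the crawler sees), not the CMS behind it:
rendered_html = """
<html><head><link rel="stylesheet" href="/theme/style.css"></head>
<body><img src="/uploads/logo.png"><script src="/js/app.js"></script></body></html>
"""

parser = ResourceExtractor()
parser.feed(rendered_html)
print(parser.resources)
# ['/theme/style.css', '/uploads/logo.png', '/js/app.js']
```

The archived "website" is the union of such static representations, which is why replay can reproduce appearance and links but not server-side behavior.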