AI for Temporal Web Forensics


Large language models (LLMs) have notably advanced beyond interpreting and generating text. Critically for our purposes here, they can now also "reason", interpret images, and responsively augment their own knowledge by retrieving new information (i.e., act agentically). Given these new capabilities, as well as my own desire to understand how artificial intelligence may complement (or perhaps eventually supplant) my expertise, it seemed timely to benchmark how well they can currently make sense of the temporal attributes of open web content. To that end, I conducted experiments with some of my go-to example websites and pages:
Created using Bing Image Creator (GPT-4o) with prompt, "an AI that is the master of time" on 16 August 2025.
I chose these specifically because their ostensible publication dates range from obviously self-declared to entirely opaque or dependent on consulting third-party sources. Several of them additionally offer multiple, contradictory date signals, affording an opportunity to assess which signals are legible, or relatively more salient, to LLMs.

I ran three types of tests:
For this exercise, I used historical versions of the webpages in question to ensure better alignment between the actual (i.e., ground-truth) publication dates and the (closer-to-) contemporaneous website designs. As I've emphasized previously, the Internet Archive Wayback Machine (IAWM) was indispensable for establishing ground-truth publication or, at least, earliest-available dates.
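The IAWM lookup step can be scripted against the public Wayback Machine CDX API. The sketch below only builds the query and parses the 14-digit capture timestamp; the endpoint and parameter names reflect the documented CDX interface, but treat the specifics as assumptions to verify before relying on them:

```python
from urllib.parse import urlencode
from datetime import datetime

CDX_ENDPOINT = "https://web.archive.org/cdx/search/cdx"

def earliest_capture_query(url):
    """Build a CDX API query asking for the single earliest capture of `url`."""
    params = {
        "url": url,
        "fl": "timestamp",  # return only the 14-digit capture timestamp
        "limit": "1",       # CDX results are oldest-first; take the first row
        "output": "text",
    }
    return f"{CDX_ENDPOINT}?{urlencode(params)}"

def parse_cdx_timestamp(ts):
    """Convert a 14-digit CDX timestamp (YYYYMMDDhhmmss) to a datetime."""
    return datetime.strptime(ts, "%Y%m%d%H%M%S")
```

Fetching the query URL (e.g., with urllib.request) should return one line whose timestamp parses with parse_cdx_timestamp; a capture stamped 20000520... resolves to 20 May 2000.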

I tested four freely available frontier models (in late May):
with memorization of previous chats disabled, to the extent possible, and previous chats manually deleted before each new chat session.

With that preamble, I'll cover the individual tests in sequence and then summarize my observations of the models' capabilities.

SCOTUSBlog post

The SCOTUSBlog post has the most straightforward publication date. The body text timestamp says that it was published on 2 June 2021. This date is corroborated by a dateCreated metadata attribute in the source code, a contemporaneous IAWM capture, and the Google bylineDate.
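A signal like the dateCreated attribute can be pulled out of a page's source with a few lines of standard-library Python. This is only a sketch: it handles date attributes expressed as <meta> elements (via either itemprop or name), while real pages often embed them in JSON-LD instead, which would need separate handling:

```python
from html.parser import HTMLParser

# Date-bearing attribute names to look for (lowercased for comparison)
DATE_ATTRS = {"datecreated", "datepublished", "datemodified"}

class DateMetaExtractor(HTMLParser):
    """Collect date-bearing <meta> attributes from a page's source."""

    def __init__(self):
        super().__init__()
        self.dates = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attrs = dict(attrs)
        name = (attrs.get("itemprop") or attrs.get("name") or "").lower()
        if name in DATE_ATTRS and "content" in attrs:
            self.dates[name] = attrs["content"]

def extract_date_meta(html):
    """Return a dict mapping date attribute names to their content values."""
    parser = DateMetaExtractor()
    parser.feed(html)
    return parser.dates
```

Feeding it a saved copy of a page's source surfaces whichever of the three attributes are present, which can then be cross-checked against the body text and IAWM.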

Not all models passed all tests, however: Gemini provided only a gross range on the screenshot interpretation test, and Grok was off by several weeks on both zero-shot and iterative prompting. All of the other model-test combinations yielded the correct answer.

Here is a summary view of the models' performance:

| Model | Version | Query type | Date in response |
| --- | --- | --- | --- |
| ChatGPT | GPT-4o | Zero-shot | 2021-06-02 |
| ChatGPT | GPT-4o | Screenshot interpretation | 2021-06-02 |
| Gemini | 2.5 Flash | Zero-shot | 2021-06-02 |
| Gemini | 2.5 Flash | Screenshot interpretation | Late 2020 / early 2021 |
| Grok | 3 | Zero-shot | 2021-06-29 |
| Grok | 3 | Iterative | 2021-06-29 |
| Grok | 3 | Screenshot interpretation | 2021-06-02 |
| Sonnet | 4 | Zero-shot | 2021-06-02 |
| Sonnet | 4 | Screenshot interpretation | 2021-06-02 |

U.S. Supreme Court website

The initial publication date of the U.S. Supreme Court website was considerably more ambiguous than that of the SCOTUSBlog post. From previous investigations, I determined that the original domain of the U.S. Supreme Court website had been supremecourtus.gov. A WHOIS lookup for this domain shows that it was registered on 1 December 1997, establishing the absolute earliest date that a website could have gone live. The earliest capture in IAWM dates to 20 May 2000. Examining the x-archive-orig-last-modified HTTP header pushed the date back to 27 April 2000, making this the earliest date on which I can substantiate that the website had been published.
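The header check can itself be scripted. The sketch below assumes you have already fetched a Wayback capture and hold its response headers as a dict; it normalizes header names (real HTTP headers are case-insensitive) and parses the origin server's timestamp with the standard library:

```python
from email.utils import parsedate_to_datetime

def original_last_modified(headers):
    """Pull the origin server's Last-Modified date out of a Wayback
    capture's response headers, if the x-archive-orig-* echo is present.

    `headers` is a mapping of header names to values; returns a
    timezone-aware datetime, or None when the header is absent.
    """
    normalized = {k.lower(): v for k, v in headers.items()}
    value = normalized.get("x-archive-orig-last-modified")
    return parsedate_to_datetime(value) if value else None
```

Applied to the capture discussed above, a header value of the form "Thu, 27 Apr 2000 ..." resolves to the 27 April 2000 date.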

The models' guesses varied widely and were generally much less precise than for the SCOTUSBlog post. Sonnet performed the worst, on both the zero-shot and iterative tests; it suggested that the website had been published as early as the mid-1990s, implying that the U.S. Supreme Court was implausibly part of the early web's technological vanguard.

However, two different models (ChatGPT and Gemini), each on one of the tests (zero-shot and iterative, respectively), were able to come up with a plausible date — 17 April 2000 — that I hadn't otherwise discovered. It turns out that the go-live date of the U.S. Supreme Court website had been publicized and recorded contemporaneously on other websites (e.g., Researching Constitutional Law on the Internet: World Constitutions/Comparative Constitutional Law).

Casual follow-up searching on DuckDuckGo and Google with reasonable search terms didn't readily turn up this or related references. Of course, such statements alone would provide only circumstantial evidence that the website had been published earlier, but the date is plausible given the earliest capture date in IAWM.

Here is a summary view of the models' performance:

| Model | Version | Query type | Date in response |
| --- | --- | --- | --- |
| ChatGPT | GPT-4o | Zero-shot | 2000-04-17 |
| ChatGPT | GPT-4o | Screenshot interpretation | Late 1990s / early 2000s |
| Gemini | 2.5 Flash | Zero-shot | Can't say |
| Gemini | 2.5 Flash | Iterative | 2000-04-17 |
| Gemini | 2.5 Flash | Screenshot interpretation | Late 1990s / early 2000s |
| Grok | 3 | Zero-shot | 2020-04 |
| Grok | 3 | Screenshot interpretation | 1997 – 2002 |
| Sonnet | 4 | Zero-shot | Mid-to-late 1990s |
| Sonnet | 4 | Iterative | 1996 – 1998 |
| Sonnet | 4 | Screenshot interpretation | Late 1990s / early 2000s |

Speech by Chief Justice Rehnquist

The webpage for Chief Justice Rehnquist's speech had many potential dates to work with, many of them contradictory. The body text of the webpage itself indicated 3 May 2000 both as the date that the speech was given and as the embargo release date. The source code of the webpage had <meta> elements with created and revised attributes, both bearing the (ever-so-slightly) implausible date of 31 December 1600. The Google bylineDate was 5 March 2000. The first IAWM capture for the web address of the live webpage was dated 30 April 2010, but the version of the webpage that evidently existed on the previous domain, supremecourtus.gov, has an IAWM capture as old as 16 August 2000. That capture carries an x-archive-orig-last-modified HTTP header taking the date back to 16 May 2000.
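Faced with a pile of contradictory signals like this, a simple triage step helps: discard dates that are physically impossible for the web and rank the rest oldest-first. A minimal sketch (the 1991 cutoff, the optional upper bound, and the signal labels are my own assumptions, not part of any established workflow):

```python
from datetime import date

# The first public website went live in August 1991; any earlier
# "publication" date for a webpage is an artifact, not a signal.
WEB_EPOCH = date(1991, 8, 6)

def triage_signals(signals, observed_by=None):
    """Split candidate publication-date signals into plausible and
    implausible buckets, returning the plausible ones oldest-first.

    `signals` maps a label (e.g., "body text") to a datetime.date;
    `observed_by` is an optional upper bound, such as a capture date.
    """
    plausible, implausible = [], {}
    for label, d in signals.items():
        if d < WEB_EPOCH or (observed_by and d > observed_by):
            implausible[label] = d
        else:
            plausible.append((label, d))
    plausible.sort(key=lambda item: item[1])
    return plausible, implausible
```

For the Rehnquist page, this immediately quarantines the 31 December 1600 metadata and orders the remaining candidates from the 3 May 2000 body text onward.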

Given the IAWM dates, I think we could therefore say that the most credible dates of publication of the speech webpage are:
Based on the models' responses, they appear to have largely interpreted the tests as asking the latter question. For those responses that were closest to the ground-truth date, the models appear to have just taken for granted that the 3 May 2000 date of the speech and embargo release date meant that the webpage was also published then, which we don't in fact know. Some of the remaining responses featured a range of years that technically encompassed the ground-truth date, but others were wildly off.

Here is a summary view of the models' performance:

| Model | Version | Query type | Date in response |
| --- | --- | --- | --- |
| ChatGPT | GPT-4o | Zero-shot | 2000-05-03 |
| ChatGPT | GPT-4o | Screenshot interpretation | 2000-05 |
| Gemini | 2.5 Flash | Zero-shot | 2002-03-21 |
| Gemini | 2.5 Flash | Iterative | 2002-03-21 |
| Gemini | 2.5 Flash (w/ Deep Research) | Zero-shot | 2000-05-03 |
| Gemini | 2.5 Flash | Screenshot interpretation | 2000-05-03 |
| Grok | 3 | Zero-shot | 2000-05-03 |
| Grok | 3 | Screenshot interpretation | 2020-06-08 |
| Sonnet | 4 | Zero-shot | 2000-05-03 |
| Sonnet | 4 | Screenshot interpretation | 1995 – 1999 |

Space Jam website

In contrast to Sonnet's confabulations regarding the U.S. Supreme Court website, the Space Jam website actually was part of the early web's technological vanguard. It is one of the oldest websites to have remained continuously available in its original form, albeit at slightly shifting web addresses. A WHOIS query indicates that the domain was registered on 14 March 1996, establishing the earliest date that the website could have been published. The first IAWM capture for the root domain dates to 27 December 1996, the earliest date on which I can substantiate that the website was available. With the debut of the sequel film, the original website moved to the path where it is currently accessible; the first IAWM capture of that version of the website dates to 2 April 2021.
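The WHOIS step can likewise be scripted. Registry output formats vary widely, so the field labels and date formats below are assumptions covering common cases rather than a complete parser; the sample in the test mirrors the 14 March 1996 registration noted above:

```python
import re
from datetime import datetime

# Field labels vary by registry; these cover common cases (an assumption,
# not an exhaustive list).
CREATION_FIELDS = r"(?:Creation Date|Created(?: On)?|Registered(?: On)?)"

def parse_whois_creation(whois_text):
    """Extract a registration date from raw WHOIS output, if one is found.

    Returns a datetime.date, or None when no recognizable field appears.
    """
    match = re.search(CREATION_FIELDS + r":\s*(\S+)", whois_text, re.IGNORECASE)
    if not match:
        return None
    raw = match.group(1)
    # Try a few date formats commonly seen in WHOIS responses.
    for fmt in ("%Y-%m-%dT%H:%M:%S%z", "%Y-%m-%d", "%d-%b-%Y"):
        try:
            return datetime.strptime(raw, fmt).date()
        except ValueError:
            continue
    return None
```

Piping the raw output of a whois command-line query (or a port-43 socket query) through this function yields the registration date, the hard lower bound on when the site could have gone live.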

The models seemed reluctant to propose dates much more precise than the year, but almost all of the tests guessed it correctly. Interestingly and appropriately, on the screenshot interpretation tests, they all also cited web design conventions supporting a mid-1990s vintage. Under iterative prompting, Gemini went so far as to suggest that the website had likely been published in November 1996, coincident with the release of the film. As with the third-party websites publicizing the launch date of the U.S. Supreme Court website, it's useful to have the models integrate this supplementary information into their reasoning. And while the film's release date doesn't definitively establish the website's launch date, it makes it much more plausible that the website was in fact available earlier than what IAWM demonstrated.

Here is a summary view of the models' performance:

| Model | Version | Query type | Date in response |
| --- | --- | --- | --- |
| ChatGPT | GPT-4o | Zero-shot | 1996 |
| ChatGPT | GPT-4o | Screenshot interpretation | 1996 |
| Gemini | 2.5 Flash | Zero-shot | 1996 |
| Gemini | 2.5 Flash | Iterative | 1996-11 |
| Gemini | 2.5 Flash | Screenshot interpretation | 1996 |
| Grok | 3 | Zero-shot | 1996 |
| Grok | 3 | Screenshot interpretation | 1996 |
| Sonnet | 4 | Zero-shot | 1996 |
| Sonnet | 4 | Screenshot interpretation | 1996 – 1998 |

nullhandle.org blog post

The publication date of my blog post is deceptively self-evident; the body text and web address both clearly point to 11 July 2011. This is, in fact, the date of publication of the original blog post from the Library of Congress Signal blog, from which the blog post on nullhandle.org was cross-posted. A WHOIS query for nullhandle.org shows that it was registered on 17 September 2017. A datePublished attribute in the source code indicates 11 July 2011 but a dateModified attribute indicates 29 September 2017.

So, setting aside my insider knowledge of this particular example, a reasonable inference from public information would be that the version of the blog post on nullhandle.org was published no later than 29 September 2017, with the original post on the Signal blog having in all likelihood been published on 11 July 2011. The publication date of the blog post thus again hinges on how the question is interpreted, and seeing what answer the models come back with will tell us which date signals are most salient.

For the most part, the models picked up on and reported the original date of publication — 11 July 2011 — which isn't that surprising. What was surprising was that Gemini struggled in both the zero-shot and screenshot interpretation tests, declining to hazard a guess for the former and reporting the more general date range 2009 – 2012 for the latter.

Here is a summary view of the models' performance:

| Model | Version | Query type | Date in response |
| --- | --- | --- | --- |
| ChatGPT | GPT-4o | Zero-shot | 2011-07-11 |
| ChatGPT | GPT-4o | Screenshot interpretation | 2011-07-11 |
| Gemini | 2.5 Flash | Zero-shot | Can't say |
| Gemini | 2.5 Flash | Iterative | 2011-07-11 |
| Gemini | 2.5 Flash | Screenshot interpretation | 2009 – 2012 |
| Grok | 3 | Zero-shot | 2011-07-11 |
| Grok | 3 | Screenshot interpretation | 2011-07-11 |
| Sonnet | 4 | Zero-shot | 2011-07-11 |
| Sonnet | 4 | Screenshot interpretation | 2011-07-11 |

Roll-up observations

So what conclusions might we draw from this small set of experiments?

When the date is self-evident — e.g., the body text publication date in the SCOTUSBlog post — the models can usually (though not always) be relied upon to provide the correct answer. For ambiguous cases, where there are multiple date signals, the models tend to be parsimonious; they seem inclined towards the most readily discoverable dates, which are those in the body text. Where they are most useful is in bringing to bear unexpected, supplementary references, such as the other webpages announcing the U.S. Supreme Court website launch date or the fact that Space Jam debuted in November 1996.

If I'm looking for that kind of integrated and expanded discovery, I could see using these tools in the early stages of an investigation. Given current (i.e., now several-month-old) capabilities, though, I'm disinclined to rely on them for much more than that: the accuracy of the responses is highly variable, and, anecdotally, repeating the same prompt with the same model often yielded materially different answers. Even when responses are correct, they're often less precise than what I could otherwise discern using approaches such as IAWM or source code analysis.

In summary, while frontier models can be a useful place to start or a complement to other approaches, they don't yet appear reliable enough to displace the more manual, more proven techniques of temporal web forensics.

Creative Commons Attribution-ShareAlike 4.0 International License