Large language models (LLMs) have advanced well beyond interpreting and generating text. Critically for our purposes here, they can now also "reason", interpret images, and augment their own knowledge by retrieving new information as needed (i.e., act agentically). Given these new capabilities, as well as my own desire to understand how artificial intelligence may complement (or perhaps, eventually, supplant) my expertise, it seemed timely to benchmark how well they can currently make sense of the temporal attributes of open web content. To that end, I conducted experiments with some of my go-to example websites and pages:
Created using Bing Image Creator (GPT-4o) with prompt, "an AI that is the master of time" on 16 August 2025.
I chose these specifically because their ostensible publication dates range from obviously self-declared to entirely opaque or dependent on consulting third-party sources. Several of them additionally offer multiple, contradictory date signals and so afford the opportunity to assess which signals are legible, and relatively more salient, to LLMs.
I ran three types of tests:
Zero-shot: asking when a particular website or webpage was first published
Iterative: if the zero-shot prompt yielded a poor answer, I followed up by asking the model either how it came up with the date (i.e., if wildly off in its first response) or to venture a guess (i.e., if it demurred on being specific in its first response)
Screenshot interpretation: asking the model to estimate the publication date of a webpage based on a screenshot
For this exercise, I used historical versions of the webpages in question to ensure better alignment between the actual (i.e., ground-truth) publication dates and the (closer-to-) contemporaneous website designs. As I've emphasized previously, the Internet Archive Wayback Machine (IAWM) was indispensable for establishing ground-truth publication or, at least, earliest-available dates.
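Establishing an earliest-capture date can also be done programmatically against the Wayback Machine's CDX API. Below is a minimal sketch: the query parameters follow the documented CDX server interface, but the sample response row is illustrative rather than a real capture record.

```python
from datetime import datetime
from urllib.parse import urlencode

CDX_ENDPOINT = "https://web.archive.org/cdx/search/cdx"

def earliest_capture_query(url: str) -> str:
    """Build a CDX API query for the earliest successful capture of a URL."""
    params = {
        "url": url,
        "output": "json",
        "limit": "1",                  # captures are returned oldest-first
        "filter": "statuscode:200",    # skip redirects and errors
        "fl": "timestamp,original",    # only the fields we need
    }
    return f"{CDX_ENDPOINT}?{urlencode(params)}"

def parse_cdx_timestamp(ts: str) -> datetime:
    """CDX timestamps are 14-digit YYYYMMDDhhmmss strings."""
    return datetime.strptime(ts, "%Y%m%d%H%M%S")

# Parsing a hypothetical response row (a live request via urllib/requests
# would return rows in this shape):
sample_row = ["20000520101530", "http://www.supremecourtus.gov/"]
print(parse_cdx_timestamp(sample_row[0]).date())  # 2000-05-20
```

The earliest capture only bounds availability from above; as discussed below, other signals can push the substantiated date earlier.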
I tested four freely available frontier models (in late May): ChatGPT (GPT-4o), Gemini 2.5 Flash, Grok 3, and Sonnet 4, with memorization of previous chats disabled, to the extent possible, and previous chats manually deleted before each new chat session.
With that preamble, I'll cover the individual tests in sequence and then summarize my observations of the models' capabilities.
SCOTUSBlog post
The SCOTUSBlog post has the most straightforward publication date. The body text timestamp says that it was published on 2 June 2021. This same date is corroborated by a dateCreated metadata attribute in the source code, a contemporaneous IAWM capture, and the Google bylineDate.
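Pulling a dateCreated attribute out of a page's source can be sketched as below. I'm assuming the metadata lives in a JSON-LD block, which is common for blog platforms but not confirmed for this particular page; the snippet at the bottom is a hypothetical stand-in, not SCOTUSBlog's actual markup.

```python
import json
import re

def find_date_created(html: str):
    """Scan JSON-LD metadata blocks for a dateCreated attribute.

    A minimal sketch: a production version would use a real HTML parser
    and handle @graph nesting; this only checks top-level keys.
    """
    pattern = re.compile(
        r'<script[^>]*type="application/ld\+json"[^>]*>(.*?)</script>',
        re.DOTALL | re.IGNORECASE,
    )
    for block in pattern.findall(html):
        try:
            data = json.loads(block)
        except json.JSONDecodeError:
            continue
        items = data if isinstance(data, list) else [data]
        for item in items:
            if isinstance(item, dict) and "dateCreated" in item:
                return item["dateCreated"]
    return None

# Hypothetical snippet mimicking the post's metadata:
page = '<script type="application/ld+json">{"@type": "BlogPosting", "dateCreated": "2021-06-02"}</script>'
print(find_date_created(page))  # 2021-06-02
```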
Not all models passed all tests, however; Gemini provided only a gross range for the screenshot interpretation test, and Grok was off by several weeks on both the zero-shot and iterative prompts. All of the other model-test combinations yielded the correct answer.
Here is a summary view of the models' performance:
| Model | Version | Query type | Date in response |
| --- | --- | --- | --- |
| ChatGPT | GPT-4o | Zero-shot | 2021-06-02 |
| ChatGPT | GPT-4o | Screenshot interpretation | 2021-06-02 |
| Gemini | 2.5 Flash | Zero-shot | 2021-06-02 |
| Gemini | 2.5 Flash | Screenshot interpretation | Late 2020 / early 2021 |
| Grok | 3 | Zero-shot | 2021-06-29 |
| Grok | 3 | Iterative | 2021-06-29 |
| Grok | 3 | Screenshot interpretation | 2021-06-02 |
| Sonnet | 4 | Zero-shot | 2021-06-02 |
| Sonnet | 4 | Screenshot interpretation | 2021-06-02 |
U.S. Supreme Court website
The initial publication date of the U.S. Supreme Court website was considerably more ambiguous than that of the SCOTUSBlog post. From previous investigations, I determined that the original domain of the U.S. Supreme Court website had been supremecourtus.gov. A WHOIS lookup shows that this domain was registered on 1 December 1997, establishing the absolute earliest date that the website could have gone live. The earliest capture in IAWM dates to 20 May 2000. Examining the x-archive-orig-last-modified HTTP header pushed the date back to 27 April 2000, so that is the earliest date on which I can substantiate that the website had been published.
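Reading the original server's Last-Modified date out of a capture's replayed headers can be sketched as follows. The Wayback Machine replays origin-server response headers with an x-archive-orig- prefix; the header values below are illustrative, not a verbatim response.

```python
from email.utils import parsedate_to_datetime

def original_last_modified(headers: dict):
    """Pull the origin server's Last-Modified date from a Wayback capture.

    Assumes lowercase header keys; real HTTP client libraries typically
    normalize case for you. The value uses the standard HTTP date format.
    """
    value = headers.get("x-archive-orig-last-modified")
    return parsedate_to_datetime(value) if value else None

# Headers as they might be replayed for the May 2000 capture
# (illustrative values, not a verbatim response):
sample_headers = {
    "x-archive-orig-last-modified": "Thu, 27 Apr 2000 14:00:00 GMT",
    "content-type": "text/html",
}
print(original_last_modified(sample_headers).date())  # 2000-04-27
```

A Last-Modified header only shows when the served file last changed, so it bounds publication from above for that file, not for the site as a whole.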
The models' guesses varied widely and were generally much less precise than for the SCOTUSBlog post. Sonnet performed the worst, on both the zero-shot and iterative tests; it suggested that the website had been published as early as the mid-1990s, implying that the U.S. Supreme Court was implausibly part of the early web's technological vanguard.
However, two models were each able to come up with a plausible date — 17 April 2000 — that I hadn't otherwise discovered: ChatGPT on the zero-shot test and Gemini on the iterative test. It turns out that the go-live date for the U.S. Supreme Court website had been publicized and recorded contemporaneously on other websites (e.g., Researching Constitutional Law on the Internet: World Constitutions/Comparative Constitutional Law).
In casual follow-up searching on DuckDuckGo and Google with reasonable search terms, I wasn't able to readily turn up this or related references. Of course, such statements alone would provide only circumstantial evidence that the website had been published earlier, but the date is plausible given the earliest capture date in IAWM.
Here is a summary view of the models' performance:
Given the IAWM dates, we could therefore say that the most credible dates of publication of the speech webpage are:
No later than 30 April 2010, if the question is construed as, "what was the date of publication of the webpage at the current live web address?"
No later than 16 May 2000, if the question is construed as, "what was the date of publication of this webpage, for any of the U.S. Supreme Court domain web addresses where it has been published?"
Based on the models' responses, they appear to have largely interpreted the tests as asking the latter question. For the responses that were closest to the ground-truth date, the models appear to have taken for granted that the speech's 3 May 2000 delivery and embargo-release date meant that the webpage was also published then, which we don't in fact know. Some of the remaining responses featured a range of years that technically encompassed the ground-truth date, but others were wildly off.
Here is a summary view of the models' performance:
Space Jam website
In contrast to Sonnet's confabulations regarding the U.S. Supreme Court website, the Space Jam website actually was part of the early web's technological vanguard. It is one of the oldest websites that has been continuously available in its original form, albeit at slightly shifting web addresses. A WHOIS query indicates that the domain was registered on 14 March 1996, establishing the earliest date that the website could have been published. The first IAWM capture of the root domain dates to 27 December 1996; this is the earliest date on which I can substantiate that the website was available. With the debut of the sequel film, the original website moved to the path where it is currently accessible. The first IAWM capture of that version of the website dates to 2 April 2021.
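Registration dates like this one can be pulled from raw WHOIS output. A minimal sketch, assuming a couple of common field labels (registrars are inconsistent here); the sample record below is illustrative, not a verbatim WHOIS response.

```python
import re
from datetime import datetime

def registration_date(whois_text: str):
    """Extract the registration date from raw WHOIS output.

    Checks two common field labels and several common date formats;
    a production version would handle many more registrar variants.
    """
    match = re.search(r"(?:Creation Date|Registered on):\s*(\S+)",
                      whois_text, re.IGNORECASE)
    if not match:
        return None
    raw = match.group(1).rstrip(".")
    for fmt in ("%Y-%m-%dT%H:%M:%SZ", "%Y-%m-%d", "%d-%b-%Y"):
        try:
            return datetime.strptime(raw, fmt).date()
        except ValueError:
            continue
    return None

# Illustrative record using the registration date cited above:
sample = "Domain Name: SPACEJAM.COM\nCreation Date: 1996-03-14T05:00:00Z\n"
print(registration_date(sample))  # 1996-03-14
```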
The models seemed reluctant to propose dates much more precise than the year, but almost all of the tests got the year right. Interestingly and appropriately, on the screenshot interpretation tests, they all also mentioned the web design conventions supporting a mid-1990s vintage. On iterative prompting, Gemini went so far as to suggest that the website had likely been published in November 1996, coincident with the release of the film. As with the third-party websites publicizing the launch date of the U.S. Supreme Court website, it's useful to have the models integrate this supplementary information into their reasoning. And while the film's release date doesn't definitively establish the website's launch date, it makes it much more plausible that the website had in fact been available earlier than what IAWM demonstrates.
Here is a summary view of the models' performance:
| Model | Version | Query type | Date in response |
| --- | --- | --- | --- |
| ChatGPT | GPT-4o | Zero-shot | 1996 |
| ChatGPT | GPT-4o | Screenshot interpretation | 1996 |
| Gemini | 2.5 Flash | Zero-shot | 1996 |
| Gemini | 2.5 Flash | Iterative | 1996-11 |
| Gemini | 2.5 Flash | Screenshot interpretation | 1996 |
| Grok | 3 | Zero-shot | 1996 |
| Grok | 3 | Screenshot interpretation | 1996 |
| Sonnet | 4 | Zero-shot | 1996 |
| Sonnet | 4 | Screenshot interpretation | 1996 – 1998 |
nullhandle.org blog post
The publication date of my blog post is deceptively self-evident; the body text and web address both clearly point to 11 July 2011. This is, in fact, the publication date of the original blog post on the Library of Congress Signal blog, from which the post on nullhandle.org was cross-posted. A WHOIS query for nullhandle.org shows that the domain was registered on 17 September 2017. A datePublished attribute in the source code indicates 11 July 2011, but a dateModified attribute indicates 29 September 2017.
So, setting aside my insider knowledge of this particular example and relying only on public information, a reasonable inference would be that the version of the blog post on nullhandle.org was published no later than 29 September 2017, with the original post on the Signal blog having been published, with reasonable likelihood, on 11 July 2011. Thus, the publication date of the blog post again hinges on how the question is interpreted. Seeing what answers the models come back with will tell us which date signals are most salient.
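This reconciliation of contradictory signals can be sketched as a small function; the logic simply encodes the inference that a cross-posted copy can't predate its domain's registration, while the stated datePublished is taken as the upstream original's date.

```python
from datetime import date

def reconcile_dates(date_published: date, date_modified: date,
                    domain_registered: date) -> dict:
    """Reconcile contradictory date signals for a cross-posted page.

    A page can't have appeared on a domain before that domain existed,
    so the copy's publication is bounded below by the registration date
    and above by the last-modified signal.
    """
    copy_no_earlier_than = max(date_published, domain_registered)
    return {
        "original_published": date_published,
        "copy_published_between": (copy_no_earlier_than, date_modified),
    }

result = reconcile_dates(
    date_published=date(2011, 7, 11),     # body text / datePublished
    date_modified=date(2017, 9, 29),      # dateModified attribute
    domain_registered=date(2017, 9, 17),  # WHOIS registration
)
print(result["copy_published_between"])
# (datetime.date(2017, 9, 17), datetime.date(2017, 9, 29))
```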
For the most part, the models picked up on and reported the original date of publication — 11 July 2011 — which isn't that surprising. What was surprising was that Gemini struggled in both the zero-shot and screenshot interpretation tests, declining to hazard a guess for the former and reporting the more general date range 2009 – 2012 for the latter.
Here is a summary view of the models' performance:
| Model | Version | Query type | Date in response |
| --- | --- | --- | --- |
| ChatGPT | GPT-4o | Zero-shot | 2011-07-11 |
| ChatGPT | GPT-4o | Screenshot interpretation | 2011-07-11 |
| Gemini | 2.5 Flash | Zero-shot | can't say |
| Gemini | 2.5 Flash | Iterative | 2011-07-11 |
| Gemini | 2.5 Flash | Screenshot interpretation | 2009 – 2012 |
| Grok | 3 | Zero-shot | 2011-07-11 |
| Grok | 3 | Screenshot interpretation | 2011-07-11 |
| Sonnet | 4 | Zero-shot | 2011-07-11 |
| Sonnet | 4 | Screenshot interpretation | 2011-07-11 |
Roll-up observations
So what conclusions might we draw from this small set of experiments?
When the date is self-evident — e.g., the body text publication date in the SCOTUSBlog post — the models can usually (though not always) be relied upon to provide the correct answer. For ambiguous cases, where there are multiple date signals, the models tend to be parsimonious; they seem inclined towards the most readily discoverable dates, which are those in the body text. Where they are most useful is in bringing to bear unexpected, supplementary references, such as the other webpages announcing the U.S. Supreme Court website launch date or the fact that Space Jam debuted in November 1996.
If I'm looking for that kind of integrated and expanded discovery, I could see potentially using these tools in the early stages of an investigation. Given current (i.e., now several-month-old) capabilities, I'm disinclined to rely on them for much more than that, as the accuracy of responses is highly variable — to say nothing of the variability of the responses themselves, even for the same model and prompt. Anecdotally, repeating the same prompts often yielded materially different answers. And to the extent that responses are correct, they're often less precise than what I could otherwise discern using approaches such as IAWM or source code analysis.
In summary, while frontier models could be a useful starting point or a complement to other approaches, they don't yet appear reliable enough to displace more manual, proven techniques for temporal web forensics.