Subtleties of the Google date search operators


Google has for some time permitted temporal filtering of search results, limiting what is presented to those before and/or after a specified date. This can help approximate what search results would have been presented to a user at specific points in time. The Internet Archive Wayback Machine (IAWM) can then fill in more of the hypothetical historical user journey by showing what those webpages might've looked like contemporaneously.

Temporal filtering has long been available via the Tools menu on the search results page as well as through the Google Advanced Search interface. More recently, explicit date operators (before: and after:) were introduced that can be included in the search query itself.

While conceptually intuitive, in my experience, the application of temporal filtering to Google searches frequently yields counterintuitive results. It turns out that this has a lot to do with the challenges of determining the relevant date(s) of a webpage. Here I'll unpack what I've learned about temporal filtering of Google search results, with some examples using the date operators.

How Google explains its date search operators

Let's start by reviewing Google's explanations of how the date operators work. Unfortunately, they don't appear to provide canonical documentation, so we have to resort to feature announcements and other incidental resources.

The original Twitter thread announcing the date operators provides a number of details.
The announcement also most succinctly and plainly describes what the date operators do as, "The before: & after: commands return documents before & after a date."

Danny Sullivan, Google's public search liaison, commenting on the new date operators at the time of their announcement, critically notes that they work most reliably for news articles, as that is the content type for which Google can most confidently determine dates.

The facility to limit search results to those updated within several pre-specified time frames up to present (i.e., akin to the functionality ostensibly provided by the after: operator) is also available in the Google Advanced Search interface.

Implied by all of this guidance is that date (singular) is a property of a web resource in Google's search index, and that this is what the date operators key off of.

Later in the same announcement Twitter thread, there's a discussion of the challenges of reliably determining webpage dates in an automated fashion, which links to a blog post providing some insight into how Google does this. It makes sense that the functionality works best for news articles, since those are published at a point in time and infrequently updated thereafter.

That turns out not to be true, though. While the body content of an old news article typically remains static, the webpage itself more than likely has syndicated elements (e.g., headlines and links for current top articles) that would result in the webpage being updated and, therefore, potentially present as perpetually recent to Google's crawlers.

So Google's official guidance leaves us with the following questions:
  • Does Google store in its index more than one date (e.g., when published, some or all instances when subsequent updates are observed) for any given web resource?
  • Which among possible date sources are favored as the operative one(s)?
  • Are dates only ever parsed out of the HTML or text of a given webpage, or does Google keep track of the dates of changes that it has observed by virtue of crawling it regularly over time?
  • How does more recently published content syndicated onto an older news article affect the operative date(s) for the date operators?

Web content dates in Google's index

Using either of the date operators in a Google search, an immediately apparent difference is that a date is included in the search result snippet. With the site: operator, we can furthermore constrain the results to a given domain or web address path. Utilizing these operators in concert, we can compare the results for any and all content on the Supreme Court of the United States (SCOTUS) website and those for any and all content on the SCOTUS website with dates prior to today.

(Given that the results returned by these queries will vary over time, here also are screenshots for the former and latter examples above. Screenshots will hereafter be provided parenthetically for any other temporally-dependent example queries or webpages.)

Comparing these searches, we also notice a discrepancy between the total number of results returned for what should be functionally equivalent queries — i.e., a domain-constrained search and a domain-constrained search additionally constrained to only those results with dates before today, which should necessarily be the same set. This suggests that not all of the web resources that are responsive to the first query have associated dates in Google's index, and they are therefore not returned for the second, temporalized query.
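For anyone wanting to reproduce the comparison, here is a minimal sketch of how the two queries might be composed. The search endpoint, its q parameter, and the supremecourt.gov domain are my assumptions for illustration; the YYYY-MM-DD date format is the one used with the before: operator later in this post.

```python
from datetime import date
from urllib.parse import urlencode

# Assumption: the public search endpoint and its "q" parameter.
SEARCH_URL = "https://www.google.com/search"


def build_query(keywords="", site=None, before=None, after=None):
    """Compose a query string from keywords plus site:, after: and before: operators."""
    parts = []
    if keywords:
        parts.append(keywords)
    if site:
        parts.append(f"site:{site}")
    if after:
        parts.append(f"after:{after.isoformat()}")
    if before:
        parts.append(f"before:{before.isoformat()}")
    return " ".join(parts)


def search_url(query):
    """Return a search URL with the query percent-encoded."""
    return f"{SEARCH_URL}?{urlencode({'q': query})}"


# Assumed domain for the SCOTUS website.
unfiltered = build_query(site="supremecourt.gov")
dated = build_query(site="supremecourt.gov", before=date.today())

print(search_url(unfiltered))
print(search_url(dated))
```

If every indexed resource had an operative date, these two queries should return the same set of results.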

Drilling down to an individual webpage, we can see from a search constrained to the SCOTUS About page and before today's date (screenshot) that the operative date for the webpage in Google's index is 24 August 2011. What can we understand about the source of this date?

Its provenance is not immediately clear. IAWM suggests that a webpage at that web address existed no later than 23 March 2010. The source code of the capture most closely succeeding the 24 August 2011 snapshot — that is, from 28 August 2011 — includes a couple of timestamps in <meta> elements:
<meta name="created" content="9/30/2010 10:38:54 AM"/>
<meta name="revised" content="4/28/2011 8:12:57 AM"/>
This is the kind of structured information that Google had professed to rely upon, in its blog post on determining dates. However, the key date, 24 August 2011, doesn't appear here or anywhere else in the source code.
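As a rough sketch of how such timestamps could be pulled out of a capture's source code, the following uses only the Python standard library. The class and function names are mine, and the timestamp format is an assumption inferred from the two values above.

```python
from datetime import datetime
from html.parser import HTMLParser

# Assumption: timestamps follow the US-style format seen in the capture.
TIMESTAMP_FORMAT = "%m/%d/%Y %I:%M:%S %p"


class MetaDateParser(HTMLParser):
    """Collect <meta name=... content=...> timestamps for the wanted names."""

    def __init__(self, wanted=("created", "revised")):
        super().__init__()
        self.wanted = set(wanted)
        self.dates = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        name = (attr.get("name") or "").lower()
        content = attr.get("content")
        if name in self.wanted and content:
            try:
                self.dates[name] = datetime.strptime(content, TIMESTAMP_FORMAT)
            except ValueError:
                pass  # Skip values that do not match the assumed format.


def extract_meta_dates(html):
    """Return a dict mapping meta names to parsed datetimes."""
    parser = MetaDateParser()
    parser.feed(html)
    return parser.dates


html = """
<meta name="created" content="9/30/2010 10:38:54 AM"/>
<meta name="revised" content="4/28/2011 8:12:57 AM"/>
"""
dates = extract_meta_dates(html)
for name, value in dates.items():
    print(name, value.isoformat())
```

Neither parsed value falls on 24 August 2011, which is consistent with the observation above.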

The cached original HTTP headers replayed by IAWM (i.e., x-archive-orig-last-modified) can sometimes provide insight as to when a webpage was most recently updated prior to its having been archived. In this case, examining the HTTP headers for the capture most closely succeeding 24 August 2011 provides no indication that this was some sort of special date. Indeed, that capture and the one immediately preceding it by about a month are virtually identical.
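Here is a small sketch of how the replayed original headers might be separated from IAWM's own, given the headers of an archived response as a dictionary. The function names are mine and the sample header values are made up for illustration; they are not the actual headers of the SCOTUS capture.

```python
from email.utils import parsedate_to_datetime

# Prefix under which IAWM replays the origin server's headers.
ORIG_PREFIX = "x-archive-orig-"


def original_headers(replay_headers):
    """Return the origin server's headers with the IAWM prefix stripped."""
    return {
        name.lower()[len(ORIG_PREFIX):]: value
        for name, value in replay_headers.items()
        if name.lower().startswith(ORIG_PREFIX)
    }


def original_last_modified(replay_headers):
    """Parse the archived Last-Modified header, or return None if absent."""
    value = original_headers(replay_headers).get("last-modified")
    return parsedate_to_datetime(value) if value else None


# Made-up header values, for illustration only.
sample = {
    "Content-Type": "text/html",
    "X-Archive-Orig-Last-Modified": "Mon, 01 Aug 2011 00:00:00 GMT",
    "x-archive-orig-server": "ExampleServer",
}
last_modified = original_last_modified(sample)
print(last_modified)
```

Not every origin server sends a Last-Modified header, so the function returns None rather than guessing.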

It's implausible that Google would not have indexed the webpage for more than a year after the webpage was created. The robots.txt rules also remained identical from 19 March 2010 through at least 24 August 2011 and permitted access by Google's crawlers, so neither should that have impeded the indexing of the website.

So let's review what we've learned so far:
  • Web content appears to have no more than one associated operative date in Google's search index, and sometimes may not have one at all.
  • For content lacking such a canonical published date as, say, a news article, the source of the operative date for the date operators may not be easy to discern from openly available information.

What the Google date search operators actually do

There's an important subtlety in how the Google before: operator works; it processes the keyword parts of the query and the before: operator as separate, successive operations, before returning results. This will be easier to explain with an example.

Some time between 24 June and 20 July 2021, the SCOTUS About page was updated to reflect that Patricia McCabe took over from Kathleen Arberg as the Court's Public Information Officer. We can see this update highlighted using the IAWM Changes tool (i.e., by expanding the "Court Officers" menu on both embedded webpages).

From the associated, contemporaneous Press Release, we know that Patricia had not previously served as a Court Officer. There is therefore no reason that her name would have previously appeared on the SCOTUS About page.

Executing a contemporary Google search for her name (screenshot) (and constraining the search to only the page of interest using the site: operator, to simplify presentation), predictably provides the page as a result, since that text is present on the page. Executing a search for "kathleen arberg" (screenshot) also, predictably, does not provide the page as a result, since that text is no longer present on the page.

Now let's try with the before: operator.

We know from IAWM that Kathleen's name was on the webpage at least as late as 24 June 2021, so let's use that as our date. We would intuitively expect that a search for patricia mccabe with the constraint before:2021-06-24 would not return any results. However, it does (screenshot).

Why is that?

It turns out that the query: patricia mccabe before:2021-06-24 is understood as, "return those webpage search results that are relevant to the keywords patricia mccabe (i.e., against the most recent Google search index) boolean-AND that were updated before 24 June 2021." The query is not read (as you might think) as, "return those webpage search results that would've been relevant to the keywords patricia mccabe based on Google's historical search index contents from before 24 June 2021."

The SCOTUS About page is both relevant to the keywords patricia mccabe and was previously updated before 24 June 2021 — the snippet indicates the operative date as 24 August 2011 — so it is correctly returned in the set of results.
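To make the two-step reading concrete, here is a toy model of the behavior as I've inferred it. This is not Google's implementation: the IndexedPage structure, the naive keyword matching, and the strict date comparison are all my assumptions, and the page text is abbreviated for illustration.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class IndexedPage:
    url: str
    current_text: str               # text as of the most recent crawl
    operative_date: Optional[date]  # the single stored date; may be missing


def matches(page, keywords):
    """Naive relevance: every keyword appears in the page's current text."""
    text = page.current_text.lower()
    return all(word in text for word in keywords.lower().split())


def search(index, keywords, before=None, after=None):
    # Step 1: relevance is evaluated against the current text only.
    results = [page for page in index if matches(page, keywords)]
    # Step 2: the date operators filter on the single operative date.
    # Pages without an operative date drop out of dated searches.
    if before is not None or after is not None:
        results = [
            page for page in results
            if page.operative_date is not None
            and (before is None or page.operative_date < before)
            and (after is None or page.operative_date > after)
        ]
    return results


# Placeholder identifier and abbreviated text for the SCOTUS About page.
about = IndexedPage(
    url="scotus-about-page",
    current_text="Court Officers: Patricia McCabe, Public Information Officer",
    operative_date=date(2011, 8, 24),
)
index = [about]

hits = search(index, "patricia mccabe", before=date(2021, 6, 24))
print([page.url for page in hits])
```

Under this model, the page is returned for patricia mccabe before:2021-06-24 because the current text matches and the operative date precedes the cutoff, even though the name was not on the page at that time.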

Let's quickly look at one more example to test the hypothesis that Google only stores one operative date in its index for a given web resource.

IAWM shows that Stephen Breyer's name was replaced with that of Ketanji Brown Jackson in the "Associate Justices" section of the webpage sometime between 30 June and 6 August 2022. However, if we run a Google search constrained to the SCOTUS About page bounded by those dates (screenshot), it returns no results. Google almost certainly crawls the SCOTUS website frequently enough to have noticed the change, but it evidently did not store the date of that observation as an (additional) operative date for the purposes of the date operators.

Let's review what we've learned:
  • Whether or not Google maintains historical search indexes, it does not consult them for the purpose of relevance evaluation for queries using the date operators; relevance evaluation takes place against the contemporary index.
  • Multiple indications seem to point towards Google only storing one operative date in its index for a given web resource.

What dates count for the Google date search operators?

So what about actual news content?

You'll note that the SCOTUS About page didn't have an indicated date in the search results (screenshot).

By contrast, Google offers precise dates for when SCOTUSblog posts were published. Presumably, these dates function as the basis for the date operators.

A SCOTUSblog post on Kathleen Arberg's retirement (screenshot) was published on 2 June 2021. The webpage includes more recent content, incorporated via JavaScript, such as an embedded Twitter widget for the SCOTUSblog account, blog post archives, and featured posts. Technically, any of these should count as the webpage having been updated.

One of the tweets in the embedded Twitter feed is an announcement of the passing of Cecilia Suyat Marshall, the widow of Justice Thurgood Marshall, on 22 November 2022. For obvious reasons, this text couldn't have affected the relevance determination of the webpage for a contemporaneous Google search executed on 2 June 2021. It does, however, affect the relevance determination of the SCOTUSblog post on Kathleen Arberg's retirement (screenshot) for a contemporary Google search.

We can demonstrate this with a few tests:
(Note that the precise search terms used in this demonstration won't yield the same results for long, as the embedded Twitter feed only loads a limited number of tweets by default and the tweet in question will eventually roll off. In that case, check out the screenshots or update the example queries with keywords from more recent tweets.)


What does this all mean?

It's probably the exception rather than the rule at this point that any webpage is truly static. Even published articles, by virtue of recency-biased content management systems, will in some sense look new.

When using the Google date operators to try to approximate what search results may have been served to a user at a specific point in time, it will therefore be essential to evaluate not just the dates that the webpages themselves were published or updated, but also the date that the relevance-matched content on those webpages was added.

News articles are more likely to have been static, once published, so relevance matching on article body text can more reliably be supposed to be contemporaneous. Otherwise, for non-news content, consulting IAWM is likely to be the best bet. Without this additional step, you're likely to have false positives in the search results list.
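As a sketch of that additional step, suppose the text of a page's IAWM captures has already been collected as (capture date, text) pairs; how to fetch them is left out here. The function names are mine, and the capture text below is abbreviated for illustration. Note that a capture date is only an upper bound on when the text first appeared.

```python
from datetime import date


def first_capture_containing(captures, phrase):
    """Return the date of the earliest capture whose text contains the phrase."""
    needle = phrase.lower()
    for captured_on, text in sorted(captures, key=lambda capture: capture[0]):
        if needle in text.lower():
            return captured_on
    return None


def plausibly_contemporaneous(captures, phrase, cutoff):
    """True if the phrase is seen in some capture dated before the cutoff."""
    first_seen = first_capture_containing(captures, phrase)
    return first_seen is not None and first_seen < cutoff


# Abbreviated, illustrative capture text for the SCOTUS About page.
captures = [
    (date(2021, 7, 20), "Court Officers: Patricia McCabe, Public Information Officer"),
    (date(2021, 6, 24), "Court Officers: Kathleen Arberg, Public Information Officer"),
]

cutoff = date(2021, 6, 24)
print(first_capture_containing(captures, "Patricia McCabe"))
print(plausibly_contemporaneous(captures, "Patricia McCabe", cutoff))
```

A search result that fails this check is a likely false positive for the time period in question.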