The Temporal Dimensions of Google Search as Revealed by the Leaked API Documentation

The search engine optimization community has been abuzz for the last couple of months with the inadvertent leak of a bundle of internal Google Search API documentation, revealing unprecedented detail about how the ranking algorithm works. Given my own efforts at deciphering the behavior of the Google Search date operators, I was particularly interested in what insights the documentation might contribute or corroborate on that front.

To that end, I mined (a copy of) the documentation for all references to temporalized web content. Here I'll walk through some of the more noteworthy findings. In summary, there are some nuances to how Google handles temporal context that I likely couldn't have discerned experimentally, but unfortunately the documentation doesn't go far enough to reveal how Google assigns a date to any given piece of web content.

The syntactic date and its provenance

Google calls the key operative date the syntactic date (module NlpSaftDocument, attribute syntacticDate). The syntactic date may evidently be derived from various sources; the two examples provided in the attribute description are from the web address (module CompositeDoc, attribute urlDate) or from the document title. Elsewhere in the documentation, it is suggested that the syntactic date may alternatively be derived from an explicit document publication date (module QualityTimebasedSyntacticDate, attribute date).
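To make the web address case concrete, here is a minimal Python sketch of pulling a date out of a URL path. It is only an illustration of the general idea: the function name date_from_url and the regular expression are my own assumptions, and nothing in the leaked documentation describes how Google actually populates urlDate.

```python
import re
from datetime import date
from typing import Optional
from urllib.parse import urlparse

# Assumed pattern (not from the leak): /YYYY/MM/DD or /YYYY-MM-DD in the path.
_URL_DATE = re.compile(r"/(\d{4})[/-](\d{1,2})[/-](\d{1,2})(?!\d)")


def date_from_url(url: str) -> Optional[date]:
    """Return the first plausible calendar date embedded in the URL path."""
    path = urlparse(url).path
    match = _URL_DATE.search(path)
    if match is None:
        return None
    year, month, day = (int(part) for part in match.groups())
    try:
        return date(year, month, day)
    except ValueError:
        # Digits matched the pattern but are not a real date (e.g. month 13).
        return None
```

A path such as /2024/05/28/some-post/ yields a date, while a path with no date-like segment, or with impossible values, yields None.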

Google tracks many other types of temporal information:
It's reasonable to suppose that these attributes may either inform or serve as potential sources for the syntactic date, but the relationships and order of precedence aren't entirely clear from the documentation.

In Google's blog post on the provenance of webpage dates in their index, both the explicit document publication date (module QualityTimebasedSyntacticDate, attribute date) and the last significant update (module NlpSaftDocument, attribute lastSignificantUpdate) are highlighted as example sources, so we know that at least those are applicable, in addition to the example sources mentioned in the syntactic date attribute description (module NlpSaftDocument, attribute syntacticDate).

Other temporal information unrelated(?) to the syntactic date

Google stores other kinds of temporal information, but the documentation doesn't appear to specify whether or how these bear on the syntactic date. For example:

Search snippet display date provenance

Google calls the date displayed in a search result snippet the byline date (module NlpSaftDocument, attribute bylineDate).

The syntactic date may be used as the byline date (module QualityTimebasedSyntacticDate, attribute useAsBylineDate). This appears to depend on a confidence weighting of the syntactic date as byline date (module QualityTimebasedSyntacticDate, attribute info), which is supported in part by comparison with a measurement of the content's age (module NlpSaftDocument, attribute contentage).

Even when the syntactic date is high-confidence, for whatever reason there is a mechanism to toggle whether it also serves as the byline date (module QualityTimebasedSyntacticDate, attribute trustSyntacticDateInRanking).

Operative date for date filtering

As with the byline date, it appears that the syntactic date serves as the presumptive source for the date restrict date, i.e., the operative date for the before: and after: date operators as well as the last update: options in the Google Advanced Search interface. And, again similarly, there is a setting to toggle whether the syntactic date also serves as the date restrict date (module QualityTimebasedSyntacticDate, attribute syntacticDateNotForRestrict).

Curiously, a date range may be used in place of a pinpoint date restrict date (module QualityTimebasedSyntacticDate, attribute useRangeInsteadOfDateForRestrict). I'm unsure as to how this works in practice, as I don't believe I've observed it. It definitely seems like it could cause confusion with the invariably pinpoint byline date.
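The interplay of these flags is easier to reason about with a toy model. The Python sketch below is purely hypothetical: the field names mirror the leaked attribute names, but their types, the date_range field, the inclusive treatment of before: and after:, and the rule that a range matches whenever it overlaps the filter window are all my assumptions rather than anything the documentation states.

```python
import datetime as dt
from dataclasses import dataclass
from typing import Optional, Tuple

Interval = Tuple[dt.date, dt.date]  # inclusive (start, end)


@dataclass
class SyntacticDate:
    date: dt.date
    use_as_byline_date: bool = True                       # useAsBylineDate
    syntactic_date_not_for_restrict: bool = False         # syntacticDateNotForRestrict
    use_range_instead_of_date_for_restrict: bool = False  # useRangeInsteadOfDateForRestrict
    date_range: Optional[Interval] = None                 # hypothetical field


def byline_date(sd: SyntacticDate) -> Optional[dt.date]:
    """Date shown in the snippet, if the syntactic date is used for it."""
    return sd.date if sd.use_as_byline_date else None


def restrict_interval(sd: SyntacticDate) -> Optional[Interval]:
    """Interval used for date filtering, or None if excluded from it."""
    if sd.syntactic_date_not_for_restrict:
        return None
    if sd.use_range_instead_of_date_for_restrict and sd.date_range is not None:
        return sd.date_range
    return (sd.date, sd.date)  # a pinpoint date is a one-day interval


def matches_filter(
    sd: SyntacticDate,
    after: Optional[dt.date] = None,
    before: Optional[dt.date] = None,
) -> bool:
    """Assumed rule: match if the restrict interval overlaps the window."""
    interval = restrict_interval(sd)
    if interval is None:
        return False
    start, end = interval
    if after is not None and end < after:
        return False
    if before is not None and start > before:
        return False
    return True
```

Under these assumptions, a document with a byline date of 2024-03-15 and a restrict range covering all of 2024 would match an after: filter of 2024-06-01 even though its displayed date falls before that cutoff, which is exactly the kind of confusion a range could cause.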

Interpretation

Taken together, I'm inclined to think that the most common case is that the syntactic date serves as both the byline date and the date restrict date; alternative permutations would yield counterintuitive results. If the byline date differed from the date restrict date, a user could be presented with a date-restricted result with a byline date that was outside of the restricted date range. If the byline date and date restrict date were the same, but they were both different from the syntactic date, then I'm not sure what purpose the syntactic date would even be serving.

I also infer from the documentation (particularly module QualityTimebasedSyntacticDate, attribute date), with corroboration from Google's public explanations of web content date provenance, that an explicit publication date in the body of the document takes relatively high precedence as a source for the syntactic date. However, there are many potential alternative sources for dates, so I don't see why Google shouldn't be able to assign every document a syntactic date, if not byline and date restrict dates.

In terms of adapting my approach, I'll take greater care going forward to mind potential discrepancies between the byline date and date restrict date (or range). If the byline date or (inferred) date restrict date doesn't have an obvious origin in observable features of the document (e.g., web address syntax, structured metadata in the source code, a publication date in the body text, or the HTTP Last-Modified header), it will be more reasonable to suppose that it reflects a crawler-supplied (i.e., not directly observable) source, such as the last significant update, last crawl, or first crawl.
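As a rough aid for that kind of check, the following Python sketch collects a few directly observable date candidates from a page's HTML and response headers using only the standard library. The choice of signals (an article:published_time meta tag, the first time element with a datetime attribute, and the Last-Modified header), the function name observable_dates, and the result keys are my own assumptions; this is not a reconstruction of anything Google does.

```python
from datetime import datetime
from email.utils import parsedate_to_datetime
from html.parser import HTMLParser
from typing import Dict, Optional


class _DateHintParser(HTMLParser):
    """Collects raw date strings from a couple of common markup patterns."""

    def __init__(self) -> None:
        super().__init__()
        self.hints: Dict[str, str] = {}

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        if tag == "meta" and attr.get("property") == "article:published_time":
            self.hints.setdefault("meta_published", attr.get("content") or "")
        elif tag == "time" and attr.get("datetime"):
            self.hints.setdefault("time_element", attr["datetime"])


def _parse_iso(value: str) -> Optional[datetime]:
    # Older Python versions reject a trailing "Z", so normalize it first.
    try:
        return datetime.fromisoformat(value.strip().replace("Z", "+00:00"))
    except ValueError:
        return None


def observable_dates(html: str, headers: Dict[str, str]) -> Dict[str, datetime]:
    """Return the date candidates that could be parsed, keyed by source."""
    parser = _DateHintParser()
    parser.feed(html)
    found: Dict[str, datetime] = {}
    for source, raw in parser.hints.items():
        parsed = _parse_iso(raw)
        if parsed is not None:
            found[source] = parsed
    last_modified = headers.get("Last-Modified")
    if last_modified:
        try:
            found["http_last_modified"] = parsedate_to_datetime(last_modified)
        except (TypeError, ValueError):
            pass  # malformed header; ignore it
    return found
```

If none of the candidates returned here lines up with the byline date shown in a search result, that would be a hint, though not proof, that the date came from a crawler-side source.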

While I'm grateful for the additional insights provided by the API documentation leak, how Google Search makes date determinations and then applies them for any given web content remains largely opaque. It therefore again bears mentioning that the Internet Archive Wayback Machine will continue to serve as a vital authority for validating web content dates.
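For readers who want to run that validation programmatically, here is a minimal Python sketch that asks the Wayback Machine's CDX endpoint for the earliest capture of a URL. It rests on my understanding that the endpoint returns captures in ascending timestamp order and, with output=json, a header row followed by data rows; the function names are my own, and the approach gives only a lower bound on when the page was first archived, not its publication date.

```python
import json
from datetime import datetime
from typing import Optional
from urllib.parse import urlencode
from urllib.request import urlopen

CDX_ENDPOINT = "https://web.archive.org/cdx/search/cdx"


def cdx_query_url(page_url: str) -> str:
    """Build a CDX query asking for a single (assumed earliest) capture."""
    params = {"url": page_url, "output": "json", "limit": "1"}
    return f"{CDX_ENDPOINT}?{urlencode(params)}"


def parse_first_capture(payload: str) -> Optional[datetime]:
    """Extract the first capture time from a CDX JSON payload.

    Expects a header row followed by data rows; returns None when there
    are no captures. Wayback timestamps are UTC, returned here as naive.
    """
    rows = json.loads(payload) if payload.strip() else []
    if len(rows) < 2:
        return None
    header, first = rows[0], rows[1]
    timestamp = first[header.index("timestamp")]
    return datetime.strptime(timestamp, "%Y%m%d%H%M%S")


def first_capture(page_url: str) -> Optional[datetime]:
    """Query the live service (requires network access)."""
    with urlopen(cdx_query_url(page_url), timeout=30) as response:
        return parse_first_capture(response.read().decode("utf-8"))
```

Keeping the parsing separate from the network call makes it possible to check the logic against a saved response before relying on the live service.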

Creative Commons Attribution-ShareAlike 4.0 International License