The search engine optimization community has been abuzz for the last couple of months with the inadvertent leak of a bundle of internal Google Search API documentation, revealing unprecedented detail about how the ranking algorithm works. Given my own efforts at deciphering the behavior of the Google Search date operators, I was particularly interested in what insights the documentation might contribute or corroborate on that front.
To that end, I mined (a copy of) the documentation for all references to temporalized web content. Here I'll walk through some of the more noteworthy findings. In summary, there are some nuances to how Google handles temporal context that I likely couldn't have discerned experimentally, but unfortunately the documentation still doesn't reveal how Google assigns a date to any given web content.
The syntactic date and its provenance
Google calls the key operative date the syntactic date (module NlpSaftDocument, attribute syntacticDate). The syntactic date may evidently be derived from various sources; the two examples provided in the attribute description are from the web address (module CompositeDoc, attribute urlDate) or from the document title. Elsewhere in the documentation, it is suggested that the syntactic date may alternatively be derived from an explicit document publication date (module QualityTimebasedSyntacticDate, attribute date).
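To make the idea of a URL-derived date concrete, here is a minimal sketch of pulling a date out of a web address. It is purely illustrative: the leaked documentation doesn't describe how urlDate is actually computed, so the pattern and the function name date_from_url are my own assumptions.

```python
import re
from datetime import date
from typing import Optional

# Assumed pattern: YYYY/MM/DD or YYYY-MM-DD somewhere in the address.
# This is a guess at a common convention, not Google's actual rule.
_URL_DATE = re.compile(r"(?<!\d)(\d{4})[/-](\d{1,2})[/-](\d{1,2})(?!\d)")


def date_from_url(url: str) -> Optional[date]:
    """Return the first valid calendar date embedded in the URL, if any."""
    for match in _URL_DATE.finditer(url):
        year, month, day = (int(group) for group in match.groups())
        try:
            return date(year, month, day)
        except ValueError:
            # Looks like a date but isn't one (e.g., month 99); keep looking.
            continue
    return None


dated = date_from_url("https://example.com/2024/05/27/leak-analysis")
undated = date_from_url("https://example.com/about")
```

Here dated is 27 May 2024 and undated is None. A real implementation would presumably also need to handle month names, two-digit years, and numbers that merely look like dates.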
Google tracks many other types of temporal information:
- The HTTP last-modified timestamp of a given document (module CrawlerChangerateUrlVersion, attribute lastModified);
- The timestamp of when a given document was (last?) crawled (module CrawlerChangerateUrlVersion, attribute timestamp);
- The age of the content of a given document (module NlpSaftDocument, attribute contentage);
- The timestamp of when a given document was first crawled (module NlpSaftDocument, attribute contentFirstseen);
- The timestamp of the last "significant update" to the document content, though the threshold for significance is not defined (module NlpSaftDocument, attribute lastSignificantUpdate);
- The age of the domain for a given document (module PerDocData, attribute domainAge); and
- The age of the host for a given document (module PerDocData, attribute hostAge).
It's reasonable to suppose that these attributes may either inform or serve as potential sources for the syntactic date, but the relationships and order of precedence aren't entirely clear from the documentation.
In Google's blog post on the provenance of webpage dates in their index, both the explicit document publication date (module QualityTimebasedSyntacticDate, attribute date) and the last significant update (module NlpSaftDocument, attribute lastSignificantUpdate) are highlighted as example sources, so we know that at least those are applicable, in addition to the example sources mentioned in the syntactic date attribute description (module NlpSaftDocument, attribute syntacticDate).
Other temporal information unrelated(?) to the syntactic date
Google stores other kinds of temporal information, but the documentation doesn't appear to specify whether or how these bear on the syntactic date. For example:
- Google stores the last 20 observed changes to a given document (module CompositeDocIndexingInfo, attribute urlHistory). They appear to discard changes beyond the most recent 20. Commentators suggest that the historical versions of the webpage still influence ranking, even though they are not accessible to the user.
- Google stores metadata on the source of its last significant update timestamp (module PerDocData, attribute lastSignificantUpdateInfo and module QualityTimebasedLastSignificantUpdate, attribute source).
- Google specifies a so-called semantic date, which is the estimated date of a given document's content (module PerDocData, attribute semanticDate). Confidence scores for the year, month, and day components of the semantic date are also stored (module PerDocData, attribute semanticDateInfo).
Search snippet display date provenance
Google calls the date displayed in a search result snippet the byline date (module NlpSaftDocument, attribute bylineDate).
The syntactic date may be used as the byline date (module QualityTimebasedSyntacticDate, attribute useAsBylineDate). This appears to depend on a confidence weighting of the syntactic date as byline date (module QualityTimebasedSyntacticDate, attribute info), which is supported in part by comparison with a measurement of the content's age (module NlpSaftDocument, attribute contentage).
Even when the syntactic date is held with high confidence, there is, for whatever reason, a separate mechanism to toggle whether it also serves as the byline date (module QualityTimebasedSyntacticDate, attribute trustSyntacticDateInRanking).
Operative date for date filtering
As with the byline date, it appears that the syntactic date serves as the presumptive source for the date restrict date, i.e., the operative date for the before: and after: date operators as well as the last update: options in the Google Advanced Search interface. And, again similarly, there is a setting to toggle whether the syntactic date also serves as the date restrict date (module QualityTimebasedSyntacticDate, attribute syntacticDateNotForRestrict).
Curiously, a date range may be used in place of a pinpoint date restrict date (module QualityTimebasedSyntacticDate, attribute useRangeInsteadOfDateForRestrict). I'm unsure as to how this works in practice, as I don't believe I've observed it. It definitely seems like it could cause confusion with the invariably pinpoint byline date.
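To illustrate the kind of confusion I have in mind, here is a rough sketch of how a date filter might treat a pinpoint date versus a date range. Everything in it is assumed: the class DatedDoc, the function matches_restrict, the overlap rule, and the inclusive bounds are hypothetical, and the two flags only loosely mirror the leaked attribute names without any claim about their real semantics.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional, Tuple


@dataclass
class DatedDoc:
    syntactic_date: Optional[date]
    # Hypothetical (earliest, latest) range for the document.
    date_range: Optional[Tuple[date, date]] = None
    # Loosely mirrors useRangeInsteadOfDateForRestrict (semantics assumed).
    use_range_for_restrict: bool = False
    # Loosely mirrors syntacticDateNotForRestrict (semantics assumed).
    not_for_restrict: bool = False


def matches_restrict(doc: DatedDoc,
                     after: Optional[date] = None,
                     before: Optional[date] = None) -> bool:
    """Decide whether a document survives an after:/before: style filter."""
    if doc.use_range_for_restrict and doc.date_range:
        earliest, latest = doc.date_range
    elif doc.syntactic_date and not doc.not_for_restrict:
        earliest = latest = doc.syntactic_date
    else:
        # Assumption: a document with no usable date is filtered out.
        return False
    # Assumption: a range matches if any part of it overlaps the window,
    # and both bounds are inclusive.
    if after and latest < after:
        return False
    if before and earliest > before:
        return False
    return True


ranged = DatedDoc(date(2024, 6, 15),
                  date_range=(date(2024, 1, 1), date(2024, 12, 31)),
                  use_range_for_restrict=True)
pinpoint = DatedDoc(date(2024, 6, 15))

ranged_hit = matches_restrict(ranged, after=date(2024, 9, 1))
pinpoint_hit = matches_restrict(pinpoint, after=date(2024, 9, 1))
```

Under these assumptions, ranged_hit is True while pinpoint_hit is False: the ranged document passes an after:2024-09-01 filter even though its pinpoint date of 15 June 2024, which would presumably be what the byline shows, falls outside the requested window.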
Interpretation
Taken together, I'm inclined to think that the most common case is that the syntactic date serves as both the byline date and the date restrict date; alternative permutations would yield counterintuitive results. If the byline date differed from the date restrict date, a user could be presented with a date-restricted result with a byline date that was outside of the restricted date range. If the byline date and date restrict date were the same, but they were both different from the syntactic date, then I'm not sure what purpose the syntactic date would even be serving.
I also infer from the documentation (particularly module QualityTimebasedSyntacticDate, attribute date), as corroborated by Google's public explanations of web content date provenance, that an explicit publication date in the body of the document takes relatively high precedence as a source for the syntactic date. However, there are many potential alternative sources for dates, so I don't see why Google shouldn't be able to assign every document a syntactic date, if not also a byline date and a date restrict date.
In terms of adapting my approach, I'll take greater care going forward to mind potential discrepancies between the byline date and date restrict date (or range). If the byline date or (inferred) date restrict date doesn't have an obvious origin in observable features of the document (e.g., web address syntax, structured metadata in the source code, publication date in the body text, the HTTP last-modified header, etc.), it will be more reasonable to suppose that it reflects a crawler-supplied (i.e., not directly observable) source, such as the last significant update, last crawl, or first crawl.
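As a rough sketch of that check, the following compares a byline date against dates I can observe myself. The function observable_origins, the source labels, and the sample values are all hypothetical; I'm assuming the candidate dates have already been extracted by hand or by other tooling.

```python
from datetime import date
from email.utils import parsedate_to_datetime
from typing import Dict, List, Optional


def observable_origins(target: date,
                       candidates: Dict[str, Optional[date]]) -> List[str]:
    """List the observable sources whose date equals the target date."""
    return [name for name, found in candidates.items() if found == target]


# Made-up sample values for a single hypothetical page.
last_modified = parsedate_to_datetime("Mon, 27 May 2024 14:03:00 GMT").date()
candidates = {
    "url": date(2024, 5, 27),
    "meta_published": None,          # no structured metadata found
    "body_text": date(2024, 5, 28),
    "http_last_modified": last_modified,
}

explained = observable_origins(date(2024, 5, 27), candidates)
unexplained = observable_origins(date(2024, 6, 3), candidates)
```

Here explained is ["url", "http_last_modified"], while unexplained is an empty list. An empty result doesn't prove a crawler-supplied source, but it does make one the more plausible explanation.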
While I'm grateful for the additional insights provided by the API documentation leak, how Google Search makes date determinations and then applies them for any given web content remains largely opaque. It therefore again bears mentioning that the Internet Archive Wayback Machine will continue to serve as a vital authority for validating web content dates.