The Immeasurable Library of Congress

Published by Nicholas Taylor on 6 August 2012

Online photo platforms increasingly support the precise positioning and browsing of user-submitted images in a three-dimensional space. Unsurprisingly, the geographic locations that are most thoroughly blanketed with photos correspond to popular tourist attractions; so many photos are taken as to construct "complete" digital facsimiles. These digitized versions are never really complete, though, because no number of photos could perfectly capture and represent every aspect of the originals. It is for much the same reason that the data stored in the Library of Congress won't fit on a 10 terabyte hard drive.

"Ferry Building" by Kyle McDonald under CC BY-NC-SA 2.0

That may seem like a pretty intuitive conclusion to reach without further argument, but, as documented previously, counterexamples abound. To recap briefly, the popular estimate that the Library of Congress represents 10 terabytes of data came from a 2000 study by two UC Berkeley iSchool professors. Two critical, though often overlooked, caveats were that this number covered only the print collections (PDF) and that the study treated the contents as ASCII text (PDF). In other words, for the purpose of the study, a print book or periodical was counted as the amount of disk space it occupied when reduced to plaintext.
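To make that counting rule concrete, here is a minimal sketch of what "reduced to plaintext" implies: in ASCII every character costs exactly one byte, so a work's size is simply its character count. The function names and the one-megabyte-per-book figure below are my own illustrative assumptions, not numbers taken from the study.

```python
def ascii_size_bytes(text: str) -> int:
    """Size of a text stored as ASCII: one byte per character.

    Characters outside ASCII are replaced with "?", which is itself a small
    illustration of what a plaintext reduction throws away.
    """
    return len(text.encode("ascii", errors="replace"))


def items_within(total_bytes: int, bytes_per_item: int) -> int:
    """How many items of a given size fit in a storage budget."""
    return total_bytes // bytes_per_item


TERABYTE = 10**12  # decimal terabyte

# ASSUMPTION: a hypothetical average of one megabyte of plaintext per book,
# chosen for round arithmetic rather than drawn from the study.
ASSUMED_BYTES_PER_BOOK = 10**6

if __name__ == "__main__":
    opening = "When in the Course of human events"
    print(ascii_size_bytes(opening), "bytes for the opening phrase")
    print(items_within(10 * TERABYTE, ASSUMED_BYTES_PER_BOOK), "books in 10 TB")
```

Under that assumed average, 10 terabytes works out to ten million books' worth of characters. Nothing in the calculation has any way to register an illustration, a binding or a marginal note.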

Science reporters, IT industry pundits, and digital storage and network infrastructure purveyors have gone on to popularize the use of 10 terabytes as a "Library of Congress of data," a stand-in for a large amount of data. The tacit critique is that, because so much more data is being created, transmitted and stored via digital technologies than is housed in the putatively non-digital Library of Congress, the latter is somehow less relevant in the modern information environment. Moreover, it imagines that the size of the collections is the Library of Congress' sole salient characteristic.

What this misses is both the importance of the materiality of information and the role of libraries generally. As Trevor Owens argued in a previous post, an object like a book effectively contains an inexhaustible amount of information – no different than the irreducible tourist attraction whose digital representation is forever incomplete. In making this argument, he refers back to another post by Sarah Werner, who points out that text is only the most conspicuous information that a book might contain. To say nothing of images, the ink colors, the paper quality, the scent, the wear patterns and the artifacts of how "identical" books may differ all represent information that is elided entirely by the notion that an "information-complete" digitized surrogate consists of nothing more than the text itself.

At the end of the day, being able to discretely define the amount of data stored in the Library of Congress isn't much more useful than being able to discretely define the weight of all of the books in its collections – that is, unless your business happens to be building cranes and you want to impress people with how many Libraries of Congress' worth of books your cranes can lift. That's not to say that I think the Library of Congress is going to win a competition with the Internet when it comes to data volume, but it is to say that data volume has little to do with the more important work of stewardship. The Library of Alexandria is reputed to have had extensive collections, too, after all.

The Library of Congress' acquisition and preservation of physical materials reflect the understanding that "thinginess" matters. The Gutenberg Bible is not interchangeable with 4.5 megabytes of text; an ASCII version of the Declaration of Independence does not submit to a palimpsestic analysis of its landmark revisions; and the Library of Congress is more than just a hugely inefficient mechanism for storing a large amount of text data.

Crossposted to The Signal