Transferring "Libraries of Congress" of Data


If science reporters, IT industry pundits and digital storage and network infrastructure purveyors are to be believed, devices are being lab-tested even now that can store all of the data in the Library of Congress or transmit it over a network in mere moments. To this list of improbable claims, I'd like to add another: by the most conservative estimates, I transfer more than a Library of Congress' worth of data to the Library of Congress every month.

"Library of Congress" by Henrik Bennetsen under CC BY-SA 2.0
Clearly, that doesn't make any sense, but allow me to explain. You may have noticed that the "data stored by the Library of Congress" has become a popular, if unusual, unit of measurement for capacity (and the subject of a previous Library of Congress blog post, to boot). More cautious commentators instead employ the "data represented by the digitized print collections of the Library of Congress." My non-exhaustive research (nonetheless corroborated by Wikipedia) suggests that in instances where a specific number is quoted, that number is most frequently 10 terabytes (and, in a curious bit of self-referentiality, the Library of Congress Web Archiving program is referenced in Wikipedia to help illustrate what a "terabyte" is). Whence, then, the 10 terabytes?

The earliest authoritative reference to the 10-terabyte figure comes from an ambitious 2000 study by UC Berkeley iSchool professors Peter Lyman and Hal Varian, which attempted to measure how much information was produced in the world that year. In it, they note with little fanfare that 10 terabytes is the size of the Library of Congress print collections. They elaborate their assumptions in an appendix: the average book has 300 pages, is scanned as a 600 DPI TIFF, and is then compressed, resulting in an estimated size of 8 megabytes per book. At the time of the study's publication, they supposed that the Library of Congress print collections consisted of 26 million books. Even taking these assumptions for granted, the math yields a number much closer to 200 terabytes. Sure enough, the authors note parenthetically elsewhere in the study that the size of the Library of Congress print collections is 208 terabytes; no explanation is offered for the discrepancy between the two numbers.
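For the curious, here is the back-of-envelope arithmetic behind that 208, using only the study's stated assumptions (8 megabytes per scanned book, 26 million books); the snippet is purely illustrative:

```python
# Back-of-envelope check using the study's own assumptions:
# roughly 8 MB per scanned-and-compressed book, and a print
# collection of 26 million books.
MB_PER_BOOK = 8
BOOKS_IN_COLLECTION = 26_000_000

total_mb = MB_PER_BOOK * BOOKS_IN_COLLECTION
total_tb = total_mb / 1_000_000  # decimal units: 1 TB = 1,000,000 MB

print(f"{total_tb:.0f} terabytes")  # prints "208 terabytes", not 10
```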

"DATABASE at Postmasters, March 2009" by Michael Mandiberg under CC BY-SA 2.0
For whatever reason, though, it's the 10-terabyte figure that took hold in the public imagination. To be sure, 10 terabytes is an impressive amount of data, but it's far less impressive than the amount of data that the Library of Congress actually contains (and that, I suspect, even counting just the print collections). While I'm neither clever nor naïve enough to propose what a more realistic number might be, I did, returning to my original provocation, want to further discuss a digital collection I know quite well: the Library of Congress Web Archives.

As explained previously in The Signal, we currently contract with the Internet Archive to perform our large-scale web crawling. One ancillary task that arises from this arrangement is that the generated web archive data (roughly 5 terabytes per month) must be transferred from the West Coast to the Library of Congress. This turns out to be non-trivial; it may take the better part of a month with near-constant transfers over an Internet2 connection to move 10 terabytes of data. For all the optimism about transmitting "Libraries of Congress" of data over networks, putting data on physical storage media and then shipping that media around remains a surprisingly competitive alternative. Case in point: for all of the ethereality and technological sophistication implied by so-called cloud services, at least one of the major providers lets users upload their data in the comparatively mundane manner of mailing a hard drive.
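To give a rough sense of those timescales, here is an illustrative back-of-envelope sketch; the throughput values are assumptions picked for the example (sustained real-world rates depend on contention, protocol overhead and the endpoints themselves), not measurements of our actual Internet2 link:

```python
# Rough timing sketch: how long does it take to move 10 TB at a given
# sustained throughput? The rates below are assumptions chosen for
# illustration, not measurements of any actual connection.
TERABYTES_TO_MOVE = 10
total_megabits = TERABYTES_TO_MOVE * 1_000_000 * 8  # 1 TB = 1,000,000 MB; 8 bits per byte

for mbps in (40, 100, 1000):  # assumed effective rates, in megabits per second
    seconds = total_megabits / mbps
    print(f"at {mbps:>4} Mbps: {seconds / 86_400:.1f} days of uninterrupted transfer")
```

At the lower (and, in practice, not unusual) sustained rates, the transfer time stretches toward weeks, which is why shipping physical media remains competitive.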

Of course, transfer is just the initial stage in our management of the web archive data; the infrastructure demands compound when you consider the requirements for redundant storage on tape and/or spinning disk, internal network bandwidth, and processor cycles for copying, indexing, validation, and so forth. In summary, I doubt that we have the spare capacity to store and process many more "Libraries of Congress" of data than we already do (though perhaps that's self-evident).

Suffice it to say, I look forward to a day when IT hardware manufacturers can legitimately claim to handle magnitudes of data commensurate with what is actually stored within the Library of Congress (whatever that amount may be). In the meantime, however, I suppose I'd settle for the popular adoption of fractional "Library of Congress" units of capacity (e.g., ".000001% of the data stored at the Library of Congress"); likely no more or less realistic than what the actual number might be, but at least it'd more appropriately aggrandize just how much data the Library of Congress has.

Crossposted to The Signal | Creative Commons Attribution-ShareAlike 4.0 International License