Saturday, June 18, 2016

One Library of Congress

One of my favorite units of measurement that gets thrown around is "One Library of Congress."  Particularly on Slashdot, armchair storage engineers and generalists alike reach for this measurement when talking about astronomical amounts of data.  Oftentimes, posters will talk in terms of data volume being "the equivalent of three Libraries of Congress" or data transfers "at the speed of one Library of Congress traveling by station wagon."  So let's just get the real fact of the matter out of the way now: it will probably never be known just how much information is stored in the Library of Congress.  There are just too many variables that are still unknown.  Many estimates exist, some better than others, but at the end of the day they're still just that: estimates.  I'm also not bothering to see just how many CDs I can actually fit into a station wagon, one of my favorite methods of fitting one Library of Congress into the mythical station wagon.  A few cents each still adds up when we're talking about that many discs.

In 2000, UC Berkeley professors Peter Lyman and Hal Varian weighed in with what is believed to be one of the earliest authoritative estimates of how much information was produced in that year.  As a side note to the stated goal of the research, they estimated that the Library of Congress print collection contains 10 TB of data, a figure that is often still cited today.  This number is based on the average book containing 300 pages, which, if scanned at 600 DPI in the TIFF format and then compressed, would come out to an average of 8MB per book.  With the print collections consisting of 26 million books at the time, their math should have put the total closer to 200TB, clearly indicating that 10 TB was just a guess.  This also only accounts for textual data; audio, video, photographs, and other forms of nontextual data would vastly increase that number.
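As a quick sanity check on that estimate, here's the arithmetic as a short Python sketch, using the figures cited above (8 MB per compressed scanned book, 26 million print volumes):

```python
# Back-of-the-envelope check of the Lyman/Varian print-collection estimate.
# Assumptions from the post: compressed 600 DPI TIFF scans averaging
# 8 MB per book, across 26 million books.
MB_PER_BOOK = 8
BOOKS = 26_000_000

total_mb = MB_PER_BOOK * BOOKS
total_tb = total_mb / 1_000_000  # decimal units: 1 TB = 1,000,000 MB
print(f"{total_tb:.0f} TB")  # → 208 TB, i.e. roughly 200 TB, not 10 TB
```

Which is how you get to ~200 TB rather than the oft-repeated 10 TB.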

An area that can be better estimated is the Web Archiving program, which by itself had collected 525 TB of data as of July 2014.  A Library of Congress storage engineer by the name of Carl Watts (now there's a position that'll really let you demonstrate your skills, or more likely your lack of skills, in storage and backup) gave an estimate of 27 petabytes of data in September 2012.  In comparison, it was estimated that global data would grow to 2.7 zettabytes during 2012, up 48% from 2011.  And in 2008, Americans consumed 3.6 zettabytes of information.  So back on the topic at hand: we'll probably never know just how much data is stored in the Library, especially when looking at it in terms of bits and bytes while so much of the data still lives on dead trees.

Since it's a nice large number that comes from an authoritative source, let's just go with 27 petabytes for now.  The largest HDD that I can purchase at NewEgg today is 8TB.  I know there are larger (and no doubt costlier drives in terms of dollars per GB), but I'm just talking about what is commonly available.  By my math, that'll be 3456 of those 8TB models, before taking into account additional drives for parity in RAID sets, space lost to overhead (filesystem use, files not filling their last sectors), etc.  And for now I'm going to ignore the whole 1000 bytes vs. 1024 bytes in a kilobyte argument that the HDD manufacturers have put upon us; that will only lead to more drives being needed.
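For the curious, here's where that drive count comes from. The 3456 figure falls out of the binary conversion (1 PB = 1024 TB); pure decimal units would give a slightly smaller number:

```python
# Drives needed to hold 27 PB on 8 TB disks, before parity and overhead.
PB = 27
DRIVE_TB = 8

decimal_drives = PB * 1000 / DRIVE_TB  # 1 PB = 1000 TB
binary_drives = PB * 1024 / DRIVE_TB   # 1 PB = 1024 TB
print(decimal_drives, binary_drives)   # → 3375.0 3456.0
```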

Since the general consensus is that you shouldn't be building RAID5 sets with drives that big anyway (rebuilding an array will take days, putting unnecessary stress on the other drives, which may in turn cause them to fail as well), we'll take extra drives for parity out of the equation.  That eliminates the need to worry about just how many RAID5 or RAID6 arrays we should be building out of that many drives.  So let's go ahead and bump the count up to a nice even 3500 drives to account for space lost to overhead.  At an average of $300 per drive (I never buy the cheapest, nor the most expensive), we're at $1,050,000 to house one copy.  And of course that's just the drives; I haven't even begun to factor in the servers required to house them, the electricity (utility, generator, and battery backup) required to keep them spinning, or the air conditioning required to keep them from melting down.  Hopefully you can get that down some by purchasing in bulk, but even then it's going to be a pretty big number.  And certainly with that much data, you're going to want a good backup strategy, as in more than one copy.  And no, we're not going to call the physical books the backup.
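The cost math, spelled out (3500 drives and $300 apiece are the assumptions from above):

```python
# Cost of one copy on spinning disk, parity and servers excluded.
DRIVES = 3500     # rounded up from 3456 to cover overhead
PRICE_USD = 300   # average assumed price per 8 TB drive

print(f"${DRIVES * PRICE_USD:,}")  # → $1,050,000
```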

And in case you were curious, you'd have to fit over 38 million CDs into the back of that station wagon in order to calculate a transfer rate.  We can get that down to around 5.7 million if we move up to DVDs, and down to around 540,000 if we move to dual-layer Blu-ray discs.  We'll be better off, at least in terms of sanity, if we load the station wagon up with 3500 external 8TB HDD enclosures.  I still recall backing up data to countless CD-R discs back in the day, and it was not fun sitting there switching discs in and out of the drive every few minutes.  Another idea is to do the transfer with tape, the largest of which I can find is 185TB per cartridge, though the fact that the articles all date from 2014 and there is still no actual product that I can find may indicate that this technology is simply vaporware.  We should be able to get a Library of Congress onto about 150 of those tape cartridges.
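Here's a sketch of those media counts, assuming standard capacities (700 MB CDs, 4.7 GB single-layer DVDs, 50 GB dual-layer Blu-rays, and the announced 185 TB Sony tape), all in decimal units:

```python
# How many discs or tapes does 27 PB fill?
import math

TOTAL_BYTES = 27 * 10**15  # 27 PB, decimal units

capacities = {
    "CD": 700 * 10**6,         # 700 MB
    "DVD": 4.7 * 10**9,        # 4.7 GB single layer
    "Blu-ray": 50 * 10**9,     # 50 GB dual layer
    "tape": 185 * 10**12,      # 185 TB announced cartridge
}
for media, cap in capacities.items():
    print(f"{media}: {math.ceil(TOTAL_BYTES / cap):,}")
# CD: 38,571,429   DVD: 5,744,681   Blu-ray: 540,000   tape: 146
```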

And based on the fastest cross-country trip on record, this station wagon going from New York to Los Angeles would be moving data at approximately 2.2Tbps according to my math, which I backed up with this handy bandwidth calculator.  This is of course assuming that you can get all 27 petabytes into a single station wagon, which may still be a bit of a challenge since those 185TB tape cartridges from Sony don't appear to be a purchasable product just yet.  But even if you have to send out 4 or 5 station wagons, that's still a lot better than what I'm getting from Comcast.
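The sneakernet bandwidth works out like this; I'm assuming the roughly 28-hour-50-minute coast-to-coast record that stood around the time of writing, and decimal units, so the exact figure wobbles a bit with your assumptions:

```python
# Effective bandwidth of a station wagon hauling 27 PB coast to coast.
TOTAL_BITS = 27 * 10**15 * 8        # 27 PB expressed in bits
TRIP_SECONDS = 28 * 3600 + 50 * 60  # assumed ~28 h 50 m record run

tbps = TOTAL_BITS / TRIP_SECONDS / 10**12
print(f"{tbps:.1f} Tbps")  # → 2.1 Tbps, in the ballpark of ~2.2
```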

Leslie Johnston of the Library of Congress has her own take (as well as a lively discussion in the comments) on just how much data is housed by the Library, including a collection of comparisons from around the web.  She also posted a follow-up (along with more great comments) here.  I found the discussion in the comments about just how many books exist (all time) and what exactly constitutes a book to be pretty interesting.  Google offers another take on this book debate.  Matt Raymond, also of the Library of Congress, gives his take on it here.

Contel Bradford, apparently a fellow Detroiter, posted another interesting take on the Library of Congress, as well as the state of libraries in general today, at the StorageCraft Recovery Zone blog.  Contel breaks down the contents of the Library, including some analysis of the audio and video assets.  Recovery Zone is StorageCraft's blog dedicated to "exploring BDR solutions and technologies relevant to MSPs, VARs and IT professionals."

So how much data is actually contained in the Library of Congress?  We're going to have to settle for "a lot" as the final answer.  We'll probably never really know for sure.

