Archiving Digital History at the NARA
val1s writes "This article illustrates how difficult archiving is vs. just 'backing up' data. From the 38 million email messages created by the Clinton administration to proprietary data sets created by NASA, the National Archives and Records Administration is expecting to have as much as 347 petabytes to deal with by 2022. Are we destined for a 'digital dark age'?"
burn, knowledge, burn (Score:3, Interesting)
Really, it's only the great works of artistry that need to be retained and remained, sustained and maintained. Historically, it's interesting to catalogue art, but politics? The everyday communications that lead up to the horrible decisions that lead our politicians to make the mistake of the daily business? We want records of this?
Perhaps the easiest way of keeping this knowledge at all interesting or inspiring is to burn it regularly, let people imagine what happened to allow such blunders or let apologists spin tales of delight explaining elegant solutions to how stupid people stumbled upon genius decisions. Conspiracy theorists or intellectual artistry can probably generate far greater truths than the truth will ever reveal.
It would save a great deal of money too, just having a delete key. If we are going to care so little for the decisions in the here and now, why preserve the information to be twisted by people in the future with their own biases and projects? We seem to care so little for truth nowadays, why should that change in the future?
Re:Usually when I archive... (Score:2, Interesting)
Tanks for the Memories (Score:3, Interesting)
Re:Answer is Compression? (Score:3, Interesting)
For example, if you have a text file with letters of equal probability (every symbol in a 27-symbol alphabet has a probability of 1/27), then the number of bits required to represent a single letter turns out to be log2(27), or ~4.7549 bits. (Indeed, 2^4.7549 = 27.)
This is the upper limit of compression for such a file (the entropy of the source). Methods such as the now 50-year-old Huffman coding [wikipedia.org] do decent work at approaching this limit (it's used in JPEG, for one).
So the answer to your question is: it's not broadly definable for "text" or "information", but depends on the patterns of the English language or of a specific document.
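To make the 4.7549 figure concrete, here is a minimal Python sketch of the calculation. The function names are made up for illustration, and the second function only looks at single-character frequencies, so it ignores patterns between letters that a real compressor could also exploit.

```python
import math
from collections import Counter


def uniform_entropy_bits(alphabet_size: int) -> float:
    """Bits per symbol when every symbol is equally likely."""
    return math.log2(alphabet_size)


def empirical_entropy_bits(text: str) -> float:
    """Shannon entropy (bits per symbol) of the character frequencies in text."""
    counts = Counter(text)
    total = len(text)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


if __name__ == "__main__":
    # 27 equally likely symbols, as in the example above
    print(f"{uniform_entropy_bits(27):.4f}")  # 4.7549
    # Unequal frequencies give a lower figure than the uniform case
    sample = "the quick brown fox jumps over the lazy dog"
    print(f"{empirical_entropy_bits(sample):.4f}")
```

The uniform case is the worst case for a given alphabet; any skew in the letter frequencies lowers the bits-per-letter figure, which is why the answer depends on the document.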
Re:347 petabytes? (Score:1, Interesting)
120GB/2Hr = 60GB/h indexing speed.
347PB = 347 000TB = 347 000 000GB (or use 347 x 1048576 - but HD manufacturers never use that - they like to inflate numbers)
347 000 000GB / (60GB/h) = 5783333 (and 1/3) h.
At 24h/day, 365d/yr, we get 660 years.
Your figure was a little over 12 times too high. Let's just hope you don't write code for a living.
Still bloody too much, but it's not like the indexing is going to be done by a single processor across a single bus. Anything like that has got to be done by means of distributed computing (duh), so this math is completely irrelevant anyway.
And it's not like Spotlight is much of a reference either. Better to compare with big commercial indexing solutions, or with open source implementations that could be scaled...
Comparing distributed indexing of some sort of redundant network storage with Spotlight indexing a local IDE disk is just laughable. Apples and oranges.
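For anyone who wants to check the arithmetic, a small Python sketch follows. The `workers` parameter is a hypothetical addition that assumes perfectly linear scaling across machines, which real distributed indexing would not achieve; it is only there to show how quickly the 660 years shrinks.

```python
HOURS_PER_YEAR = 24 * 365


def indexing_years(total_gb: float, rate_gb_per_hour: float, workers: int = 1) -> float:
    """Years needed to index total_gb at the given per-worker rate.

    Assumes (unrealistically) that throughput scales linearly with workers.
    """
    hours = total_gb / (rate_gb_per_hour * workers)
    return hours / HOURS_PER_YEAR


RATE_GB_PER_HOUR = 120 / 2             # 120 GB in 2 hours, from the comment above
DECIMAL_GB = 347 * 1000 * 1000         # 347 PB in decimal (drive-maker) units
BINARY_GB = 347 * 1024 * 1024          # 347 PB in binary units

single_decimal = indexing_years(DECIMAL_GB, RATE_GB_PER_HOUR)
single_binary = indexing_years(BINARY_GB, RATE_GB_PER_HOUR)
thousand_workers = indexing_years(DECIMAL_GB, RATE_GB_PER_HOUR, workers=1000)

if __name__ == "__main__":
    print(f"one machine, decimal units: {single_decimal:.1f} years")
    print(f"one machine, binary units:  {single_binary:.1f} years")
    print(f"1000 machines (idealized):  {thousand_workers:.2f} years")
```

Under the idealized scaling assumption, a thousand machines bring the job down to well under a year, which is the point about distributed computing.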
Re:burn, knowledge, burn (Score:2, Interesting)
Moore's Law saves the day (Score:3, Interesting)
Assuming we continue the current rate of advance in storage density and price, future archivists should be able to buy a 0.64 PB drive for under $500 in 2021. A mere quarter of a million dollars will provide enough space for a copy of all that stuff.
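A quick sketch of where the quarter-million figure comes from, taking the 0.64 PB and $500 numbers above at face value. The helper names are invented for illustration, and the calculation covers a single copy with no redundancy, controllers, or power.

```python
import math


def drives_needed(archive_pb: float, drive_pb: float) -> int:
    """Whole drives required to hold the archive once, with no redundancy."""
    return math.ceil(archive_pb / drive_pb)


def total_cost(archive_pb: float, drive_pb: float, price_per_drive: float) -> float:
    """Cost of the drives alone for one copy of the archive."""
    return drives_needed(archive_pb, drive_pb) * price_per_drive


ARCHIVE_PB = 347        # NARA estimate from the story summary
DRIVE_PB = 0.64         # projected 2021 drive capacity (the poster's assumption)
PRICE_PER_DRIVE = 500   # projected price in dollars (the poster's assumption)

if __name__ == "__main__":
    print(drives_needed(ARCHIVE_PB, DRIVE_PB), "drives")
    print(f"${total_cost(ARCHIVE_PB, DRIVE_PB, PRICE_PER_DRIVE):,.0f}")
```

That works out to 543 drives, or a bit over $270,000 for the raw disks.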
I'm guessing... steady state. (Score:4, Interesting)
I believe the cost of traditional photography in constant dollars dropped enormously between my parents' time and mine. I know we took about ten times as many silver-on-paper and Kodacolor dye-on-paper snapshots as my parents did. Then we got a camcorder. My parents captured about three hours total of 8 mm silent home movies. I have about forty hours of 8mm and digital-8 camcorder tape.
And since my wife and I got digital cameras, we've been taking five to ten times as many pictures as we did when we used film cameras.
Now, YES, I'm on the format treadmill. Got most of the old 8mm movies transferred to VHS. Got most of the VHS transferred to DVD. Got a lot of the old slides scanned. Got most of my digital images burned to CD. In the last five years, I've probably spent a hundred hours, or 0.2% of my life over that period, on nothing but struggling to copy from old formats to new. I've spent a small fortune getting Shutterfly to print pictures, because to tell the truth I have much more faith in the prints surviving than the CDs.
So, I don't see a digital dark age. I see a bizarre situation in which the quantity of material recorded in digital form continues to increase exponentially for quite some time. _Most_ of it will get lost, and the percentage that survives, say, a hundred years will keep going DOWN exponentially with time.
But I'm guessing the total quantity of 21st century material available to historians of the 23rd century will, in absolute numbers, be just about the same as the total quantity of 20th century material.
It's one of those mind-boggling things like personal death that one can never quite come to grips with. The future is unknown, and we can accept that. But the fact that most of the past is unknown is equally true--and very hard to accept.
Cost-of-copy and modes of failure (Score:3, Interesting)
Perhaps, perhaps not. Sure, digital data can be lost easily, but it can also be copied and backed up more easily. Assuming $0.01/page for a paper copy (a gross underestimate of the cost of paper, toner, and labor), 10 kB of data per page (an overestimate), and $10/GB for high-end maintained storage, paper comes to about $1,000/GB, so the cost ratio is at least 100:1 in favor of digital (and probably 1000:1). Inaccessible formats are a concern, but an automated batch process at the time of initial archiving can, at least, convert the data to a standard format with a longer likely lifespan (e.g., plain ASCII, RTF, PDF, or HTML).
Paper has its own single-point-of-failure concerns, and the huge cost of copying makes those concerns real. Digital does add some new modes of failure (e.g., format obsolescence), but I think those are not as burdensome as the physical costs of copies.
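The 100:1 figure can be reproduced with a few lines of Python. This is only a sketch of the estimate above, using the same assumed inputs ($0.01/page, 10 kB/page, $10/GB) and decimal gigabytes; the function names are illustrative.

```python
BYTES_PER_GB = 10**9  # decimal gigabyte


def paper_cost_per_gb(cost_per_page: float, bytes_per_page: int) -> float:
    """Dollars to copy one gigabyte's worth of pages on paper."""
    pages_per_gb = BYTES_PER_GB / bytes_per_page
    return pages_per_gb * cost_per_page


def cost_ratio(cost_per_page: float, bytes_per_page: int, digital_cost_per_gb: float) -> float:
    """How many times more a paper copy costs than a digital copy, per GB."""
    return paper_cost_per_gb(cost_per_page, bytes_per_page) / digital_cost_per_gb


if __name__ == "__main__":
    # Inputs taken from the estimate above
    print(f"paper: ${paper_cost_per_gb(0.01, 10_000):,.0f}/GB")
    print(f"ratio: {cost_ratio(0.01, 10_000, 10.0):.0f}:1 in favor of digital")
```

Raising the per-page cost or lowering the bytes per page, both of which the estimate says are likely, pushes the ratio toward the 1000:1 end.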