Data Storage

Archiving Digital History at the NARA

Posted by timothy
from the sort-and-toss dept.
val1s writes "This article illustrates how difficult archiving is vs. just 'backing up' data. From the 38 million email messages created by the Clinton administration to proprietary data sets created by NASA, the National Archives and Records Administration is expecting to have as much as 347 petabytes to deal with by 2022. Are we destined for a "digital dark age"?"
This discussion has been archived. No new comments can be posted.

  • by Leontes (653331) on Sunday June 26, 2005 @05:54PM (#12916223)
    The ancient, esteemed Great Library of Alexandria [wikipedia.org] was burned to the ground as knowledge literally turned to smoke, lost to mankind forever. Was it barbarians? Motivated by political revenge? Demanded by religious zealots? Accidental byproduct of an act of war?

    Really, it's only the great works of artistry that need to be retained and remained, sustained and maintained. Historically, it's interesting to catalogue art, but politics? The everyday communications that lead up to the horrible decisions that lead our politicians to make the mistake of the daily business? We want records of this?

    Perhaps the easiest way of keeping this knowledge at all interesting or inspiring is to burn it regularly, let people imagine what happened to allow such blunders or let apologists spin tales of delight explaining elegant solutions to how stupid people stumbled upon genius decisions. Conspiracy theorists or intellectual artistry can probably generate far greater truths than the truth will ever reveal.

    It would save a great deal of money too, just having a delete key. If we are going to care so little for the decisions in the here and now, why preserve the information to be twisted by people in the future with their own biases and projects? We seem to care so little for truth nowadays, why should that change in the future?
  • by ArchAngel21x (678202) on Sunday June 26, 2005 @05:56PM (#12916232)
    I guess you didn't see how Mr. Ebbers or the founder of Adelphia are facing prison time. Quit trying to spread that liberal lie that white collar crime pays off. By the way, it is inappropriate to refer to black people with racial slurs. Grow up and learn to be a little more tolerant of diversity.
  • by Doc Ruby (173196) on Sunday June 26, 2005 @06:17PM (#12916329) Homepage Journal
    We need to imprint holographic storage on synthetic diamonds. Even if they're slow and expensive, they'll last even longer than the paper records they replace. We'll have to spend a fortune redigitizing all the polymer (CD/DVD, floppy, tape), celluloid (microfilm/fiche) and rotating (disc) media that will age to illegibility within our lifetimes. Until we get holographic gems, we need to archive everything on paper, including those expiring media, in a format easily digitized to a more permanent medium. But of course the government, and barely accountable bosses, want the public record to disappear down the memory hole. If they could accelerate the process, including newspapers, they'd spend everything we've got (and more) to make it happen.
  • by MasterC (70492) <cmlburnett@gm[ ].com ['ail' in gap]> on Sunday June 26, 2005 @06:27PM (#12916369) Homepage
    The only thing that comes to mind is information entropy [wikipedia.org]. If you're given a text document, you can determine the probability distribution for each letter, letter combinations, for words, or whatever you can think of. Then given the probability distribution, you can determine the information entropy. If, in the sum, you use log with base 2 then H(x) (see formal definitions [wikipedia.org]) gives you the entropy in bits.

    For example, if you have a text file whose symbols are all equally likely (26 letters plus the space, each with a probability of 1/27) then the number of bits required to represent a single symbol turns out to be ~4.7549 bits. (Indeed, 2^4.7549 = 27)

    This is the upper limit of compression. Such methods as the now 50-year-old Huffman coding [wikipedia.org] do decent work at approaching this limit (used in JPEG, for one).

    So the answer to your question is: it's not broadly definable for "text" or "information" but depends on the patterns of the English language or of a specific document.
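The calculation described above can be sketched in a few lines of Python. This is only an illustration under simple assumptions: it measures single-character frequencies and ignores letter combinations and words, and the function name `entropy_bits_per_symbol` is made up for this example.

```python
import math
from collections import Counter

def entropy_bits_per_symbol(text: str) -> float:
    """Shannon entropy H = -sum(p * log2(p)) over the characters of text."""
    counts = Counter(text)
    total = len(text)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

# 27 equally likely symbols (26 letters plus space), as in the example above.
uniform = "abcdefghijklmnopqrstuvwxyz "
print(round(entropy_bits_per_symbol(uniform), 4))  # 4.7549

# A skewed distribution needs fewer bits per symbol on average.
skewed = "aaaaaaab"
print(entropy_bits_per_symbol(skewed) < entropy_bits_per_symbol(uniform))  # True
```

Real English text has far from uniform letter frequencies, which is why its per-character entropy comes out lower than the 4.7549-bit figure.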
  • Re:347 petabytes? (Score:1, Interesting)

    by Anonymous Coward on Sunday June 26, 2005 @06:30PM (#12916396)
    Wow. I didn't know one could mess up such simple math so badly... That's a simple rule of three - basic high school math!

    120GB/2Hr = 60GB/h indexing speed.
    347PB = 347 000TB = 347 000 000GB (or use 347 x 1048576 - but HD manufacturers never use that - they like to inflate numbers)
    347 000 000GB / (60GB/h) = 5783333 (and 1/3) h.
    at 24h/day, 365d/yr, we get 660 years.

    Your figure was a little over 12 times too high. Let's just hope you don't write code for a living :p

    Still bloody too much, but it's not like the indexing is going to be done by a single processor across a single bus. Anything like that has got to be done by means of distributed computing (duh), so this math is completely irrelevant anyways :)

    And it's not like Spotlight is much of a reference either, perhaps make comparisons with big commercial indexing solutions, or open source implementations that could be scaled...

    Comparing distributed indexing of redundant network storage of some sort with Spotlight indexing a local IDE disk is just laughable. Apples and oranges.
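For anyone who wants to check the arithmetic above, here is a rough sketch in Python. The 120 GB in 2 hours rate and the decimal units come from the comment; the `workers` parameter and the assumption that throughput scales linearly with it are hypothetical, added only to illustrate the point about distributed indexing.

```python
HOURS_PER_YEAR = 24 * 365

def indexing_years(total_gb: float, gb_per_hour: float, workers: int = 1) -> float:
    """Years needed to index total_gb, assuming perfectly linear scaling
    across workers (an idealisation, not a measured figure)."""
    hours = total_gb / (gb_per_hour * workers)
    return hours / HOURS_PER_YEAR

total_gb = 347 * 1000 * 1000  # 347 PB expressed in decimal GB
rate = 120 / 2                # 120 GB in 2 hours = 60 GB/h

print(round(indexing_years(total_gb, rate)))           # 660
print(round(indexing_years(total_gb, rate, 1000), 2))  # 0.66
```

With a thousand such indexers running in parallel, the same job would take roughly eight months under these idealised assumptions.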
  • by mrogers (85392) on Sunday June 26, 2005 @06:30PM (#12916408)
    Doesn't it diminish the aura of a great work of art if you know that it can always be restored from a backup?
  • by G4from128k (686170) on Sunday June 26, 2005 @06:40PM (#12916477)
    In 1987, a Mac II came with a 40 MB drive. 17 years later, a PowerMac G5 came with a 160 GB drive. This was at least a 4000X improvement in storage density and price (and 1987's drive was both physically larger and more expensive than 2004's drive).

    Assuming we continue the current rate of advance in storage density and price, a future archivist should be able to buy a 0.64 PB drive for under $500 in 2021. A mere quarter of a million dollars will provide enough space for a copy of all that stuff.
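The extrapolation above can be reproduced with a short Python sketch. It simply assumes that the 4000X improvement seen over 1987 to 2004 repeats over the following 17 years and that a drive still costs $500; the variable names are illustrative only.

```python
import math

mac_ii_gb = 0.040  # 40 MB drive, 1987
g5_gb = 160.0      # 160 GB drive, 2004
years = 17

growth = g5_gb / mac_ii_gb            # about 4000X over 17 years
annual_growth = growth ** (1 / years)
print(round(annual_growth, 2))        # 1.63, i.e. roughly 63% per year

# Assume the same factor applies again from 2004 to 2021.
future_drive_gb = 160 * 4000          # 640,000 GB = 0.64 PB
archive_gb = 347_000_000              # 347 PB in decimal GB

drives_needed = math.ceil(archive_gb / future_drive_gb)
print(drives_needed)                  # 543
print(drives_needed * 500)            # 271500 (dollars, at $500 per drive)
```

That lands a little above the quarter of a million dollars quoted, before any allowance for redundancy or replacement.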
  • by dpbsmith (263124) on Sunday June 26, 2005 @06:41PM (#12916484) Homepage
    The Zapruder film was the beginning. In recent years, I've been dumbfounded by the vast extension in recording and documentation of things like crimes in progress, natural disasters, America's Funniest Home Videos, you name it. A plane crashes, and the next day there are ten different home videos from people in the vicinity who had camcorders.

    I believe the cost of traditional photography in constant dollars dropped enormously between my parents' time and mine. I know we took about ten times as many silver-on-paper and Kodacolor dye-on-paper snapshots as my parents did. Then we got a camcorder. My parents captured about three hours total of 8 mm silent home movies. I have about forty hours of 8mm and digital-8 camcorder tape.

    And since my wife and I got digital cameras, we've been taking five to ten times as many pictures as we did when we used film cameras.

    Now, YES, I'm on the format treadmill. Got most of the old 8mm movies transferred to VHS. Got most of the VHS transferred to DVD. Got a lot of the old slides scanned. Got most of my digital images burned to CD. In the last five years, I've probably spent a hundred hours, or about 0.2% of my life over that time, on nothing but struggling to copy from old formats to new. I've spent a small fortune getting Shutterfly to print pictures, because to tell the truth I have much more faith in the prints surviving than the CDs.

    So, I don't see a digital dark age. I see a bizarre situation in which the quantity of material recorded in digital form continues to increase exponentially for quite some time. _Most_ of it will get lost, and the percentage that survives, say, a hundred years will keep going DOWN exponentially with time.

    But I'm guessing the total quantity of 21st century material available to historians of the 23rd century will, in absolute numbers, be just about the same as the total quantity of 20th century material.

    It's one of those mind-boggling things like personal death that one can never quite come to grips with. The future is unknown, and we can accept that. But the fact that most of the past is unknown is equally true--and very hard to accept.
  • by G4from128k (686170) on Sunday June 26, 2005 @07:34PM (#12916780)
    I think you're missing the point, which is that all that data is now much easier to lose, especially in the short term, if it's not taken care of properly.

    Perhaps, perhaps not. Sure, digital data can be lost easily, but it can also be copied/backed-up more easily. Assuming $0.01/page for a paper copy (a gross underestimate of the cost of paper, toner, and labor for copies), 10 kB of data per page (an overestimate), and $10/GB (for high-end maintained storage), the cost ratio is at least 100:1 in favor of digital (and probably 1000:1). Inaccessible formats are a concern, but an automated batch process at the time of initial archiving can, at least, convert the data to some data format standard with a longer likely lifespan (e.g., plain ASCII, RTF, PDF, HTML, etc.)

    Paper has its own single-point-of-failure concerns, and the huge cost of copying makes those concerns real. Digital does add some new modes of failure (e.g., format obsolescence), but I think those are not as burdensome as the physical costs of copies.
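The 100:1 figure above follows directly from the stated assumptions, as this small Python sketch shows. The inputs are the comment's own estimates ($0.01 per page, 10 kB per page, $10 per GB), decimal units are assumed, and the function name is invented for the example.

```python
def paper_cost_per_gb(cost_per_page: float, kb_per_page: float) -> float:
    """Cost of copying one (decimal) gigabyte's worth of data onto paper."""
    pages_per_gb = 1_000_000 / kb_per_page  # 1 GB = 1,000,000 kB
    return cost_per_page * pages_per_gb

paper = paper_cost_per_gb(cost_per_page=0.01, kb_per_page=10)
digital = 10.0  # dollars per GB of maintained storage

print(round(paper))            # 1000 (dollars per GB on paper)
print(round(paper / digital))  # 100  (paper-to-digital cost ratio)
```

A higher per-page cost or fewer bytes per page pushes the ratio further toward digital; for instance, 1 kB per page at the same page price gives about 1000:1.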
