Archiving Digital Data an Unsolved Problem

Posted by kdawson on Mon Nov 20, 2006 05:54 PM
from the digital-ice-age dept.
mattnyc99 writes, "It's a huge challenge: how to store digital files so future generations can access them, from engineering plans to family photos. The documents of our time are being recorded as bits and bytes with no guarantee of readability down the line. And as technologies change, we may find our files frozen in forgotten formats. Popular Mechanics asks: Will an entire era of human history be lost?" From the article: "[US national archivist] Thibodeau hopes to develop a system that preserves any type of document — created on any application and any computing platform, and delivered on any digital media — for as long as the United States remains a republic. Complicating matters further, the archive needs to be searchable. When Thibodeau told the head of a government research lab about his mission, the man replied, 'Your problem is so big, it's probably stupid to try and solve it.'"
This discussion has been archived. No new comments can be posted.
The Fine Print: The following comments are owned by whoever posted them. We are not responsible for them in any way.
  • by UbuntuDupe (970646) * on Monday November 20 2006, @05:56PM (#16921476) Journal
    I can't wait to hear Microsoft's explanation why the project should use one of their proprietary formats.
    • by RAMMS+EIN (578166) on Monday November 20 2006, @07:22PM (#16922666) Homepage Journal
      Our formats are industry standards. They are backed by Microsoft, a robust company which has withstood vigorous competition, lawsuits, the .com burst, and the Bolshevik revolution brought about by Stallman et al. Where other companies have folded, Microsoft has flourished. With a known track record of backward-compatibility, your documents are safe with us. Trust us. We _invented_ trusted computing.

      And remember: nobody ever got fired for buying Microsoft.
  • by Electrode (255874) on Monday November 20 2006, @05:56PM (#16921482) Homepage
    "for as long as the United States remains a republic."

    So, they're shooting for about 10 years then?

    • Re:Not too long... (Score:5, Interesting)

      by eln (21727) on Monday November 20 2006, @06:01PM (#16921580) Homepage
      Your timeline may be a little off (at least I hope so), but you're right that it's a silly goal. Whether the US has 10 or 1000 years left, history shows us it will most likely fall at some point, and that point will be fairly soon when compared to the entirety of human history.

      Making a format that will survive a thousand years so long as our advanced civilization is still around and still cares is pointless, because as long as there is a continuous line of people that care, they will be willing to transfer at least the more important stuff to new media. The trick is coming up with something that will still be readable when archaeologists dig it up 10, 50, or 100 thousand years from now.
      • Re:Not too long... (Score:5, Insightful)

        by thelost (808451) on Monday November 20 2006, @06:19PM (#16921840) Journal
        the trick is... hoping that in a hundred thousand years people still care at all about their past. The slow realization as I read Isaac Asimov's Foundation saga about the origins of the Galactic Empire chilled me, mostly because the people of the empire had become so numb to their past as to have made it vanish entirely.
  • by zappepcs (820751) on Monday November 20 2006, @05:59PM (#16921550) Journal
    than the previous ages where all information was kept on paper or in spoken words? The problem isn't so much how to invent something that will always be readable, but some way to always have the applications to read it. If it were not for the Rosetta Stone, much of what we know about the ancient world might still be a mystery.

    • by quanticle (843097) on Monday November 20 2006, @06:11PM (#16921736) Homepage

      It's different because of the sheer volume of information being created today. Ancient cultures were not creating millions of pages of information every day.

      Your Rosetta Stone analogy is inappropriate. We have not discovered any sort of Rosetta Stone for the ancient Maya hieroglyphs, but we have had success in deciphering them because we can apply linguistic analysis techniques to figure out what words correspond to what actions/things. It's a little more complicated for abstract concepts, but you can figure out a surprising amount from basic language knowledge.

      • by ThosLives (686517) on Monday November 20 2006, @06:27PM (#16921964) Journal

        It's not so much the Rosetta stone, but the fact that a "Rosetta stone" has a built-in context - it's obviously communication or artwork of some kind. If you have a big pile of digital data, what is it? An image? Compressed text? Audio? Just a sequence of numbers? The thing "printed" information gives you is that the presentation of the data gives you an idea of what it is - we don't yet have any digital data formats for which the presentation of the data gives an idea of the content; in fact, most digital storage mechanisms present all types of information in identical manner.

        That's the real challenge - devising a digital storage format in which presentation can be used to apply context to the data.
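To make that concrete: the nearest thing current formats have to self-describing presentation is a "magic number" in the first few bytes of a file. Below is a minimal sketch of sniffing a format that way. The names `MAGIC_SIGNATURES` and `sniff_format` are made up for illustration, and the table is a small subset of well-known signatures, not a complete registry.

```python
# Hypothetical helper: guess a format from the first bytes of a blob.
# The signature table is an illustrative subset, not a complete registry.
MAGIC_SIGNATURES = [
    (b"\x89PNG\r\n\x1a\n", "PNG image"),
    (b"\xff\xd8\xff", "JPEG image"),
    (b"GIF87a", "GIF image"),
    (b"GIF89a", "GIF image"),
    (b"OggS", "Ogg container"),
    (b"%PDF-", "PDF document"),
    (b"PK\x03\x04", "ZIP archive"),
]


def sniff_format(blob: bytes) -> str:
    """Return a best-guess format name, or 'unknown' if no signature matches."""
    for signature, name in MAGIC_SIGNATURES:
        if blob.startswith(signature):
            return name
    return "unknown"
```

Note what this does not solve: the lookup table is itself outside knowledge. Someone without the table sees only the bytes, which is exactly the missing-context problem described above.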

          • by toddestan (632714) on Monday November 20 2006, @09:49PM (#16924164)
            You're assuming far too much. Remember, there are entire written languages from 2000+ years ago that we barely know how to read. And we have the context of what they were written on, formatting, what the characters look like and things like that. Now, in 2000 years, if someone came upon your hard drive, or flash memory card, or whatever - assuming they could even read it, they aren't going to be able to pop it into a computer and see C:\My Music\ and C:\Documents and Settings\, with the only challenge left being to figure out what the hell an OGG file is. They aren't going to see files. They are going to see 1's and 0's. Lots of them - billions on a memory card and trillions on a hard drive. They won't have a clue how to interpret the file system, even for something relatively simple like FAT16. They may not even know that a byte is 8 bits. They won't have context, and they will be baffled by the fact that most every OS writes files in fragments all over the drive. They likely won't even be able to tell areas that were marked as deleted but not wiped from the actual data, let alone figure out what the swap file is. I seriously doubt that someone in the future, given a working hard disk but nothing else to go on, would be able to pull anything meaningful from the drive. Heck, look at modern day examples - how long did it take Linux to be able to read and write to NTFS, given the number of very smart people working on it who already had a pretty good idea how it functioned?
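For a sense of how much outside knowledge even "relatively simple" FAT16 needs, here is a rough sketch that reads the first fields of a FAT12/16 boot sector. It assumes the commonly documented layout (3-byte jump, 8-byte OEM name, then the BIOS Parameter Block, all little-endian) and a 512-byte sector. `parse_boot_sector` and `make_demo_sector` are hypothetical names, and the demo sector is synthetic, not taken from a real disk.

```python
import struct

# Assumed layout of the start of a FAT12/16 boot sector (little-endian, packed):
# 3-byte jump, 8-byte OEM name, then the first BIOS Parameter Block fields.
BPB_FORMAT = "<3s8sHBHBHHBH"
BPB_FIELDS = (
    "jump", "oem_name", "bytes_per_sector", "sectors_per_cluster",
    "reserved_sectors", "fat_count", "root_entries", "total_sectors_16",
    "media_descriptor", "sectors_per_fat",
)


def parse_boot_sector(sector: bytes) -> dict:
    """Decode the leading boot-sector fields into a name -> value mapping."""
    if len(sector) < 512:
        raise ValueError("need at least one 512-byte sector")
    if sector[510:512] != b"\x55\xaa":
        raise ValueError("missing 0x55AA boot signature")
    values = struct.unpack_from(BPB_FORMAT, sector, 0)
    return dict(zip(BPB_FIELDS, values))


def make_demo_sector() -> bytes:
    """Build a synthetic 512-byte sector with plausible (made-up) values."""
    header = struct.pack(
        BPB_FORMAT, b"\xeb\x3c\x90", b"MSDOS5.0",
        512, 4, 1, 2, 512, 40960, 0xF8, 40,
    )
    return header + bytes(510 - len(header)) + b"\x55\xaa"
```

Every constant in `BPB_FORMAT` (field order, field widths, byte order, even the 8-bit byte) comes from documentation rather than from the disk itself. Without that documentation the same 512 bytes are just 4096 ones and zeros.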
    • by s20451 (410424) on Monday November 20 2006, @06:23PM (#16921918) Journal
      Say western civilization is disrupted for a period of time that is short by historical standards -- 40-50 years would be enough. Electrical power is only sporadically available, and as a result the Internet collapses and PCs become useless. With much more important issues to deal with, such as finding food, people ignore digital data storage.

      The era of restoration comes. However, when people blow the dust off those old DVDs and players, they discover that the DVDs have decayed to the point of unreadability. Massive quantities of archived data and knowledge are irretrievably lost.

      The main problem in our age is thermodynamics -- information is stored so densely that it tends to decay naturally, on its own. By contrast, ancient stone carvings (as well as their keys, such as the Rosetta stone), are sufficiently durable to last (basically) for ever.

    • by Marxist Hacker 42 (638312) * <seebert@aracnet.com> on Monday November 20 2006, @06:56PM (#16922354) Homepage Journal
      Now that's the right problem. What is needed isn't some mysterious Universal Translator Format- it's storing the read hardware, with programs in ROM that understand the format, along with the electronic copy. Hell, store the whole thing in ROM chips with a well documented interface printed on the outside of the chip. Libraries could be made up of whatever reading technology exists at the time the library is built- with this common pin-level interface.
  • by IWantMoreSpamPlease (571972) on Monday November 20 2006, @05:59PM (#16921558) Homepage Journal
    Worked for the Egyptians didn't it?
  • by csoto (220540) on Monday November 20 2006, @06:00PM (#16921568)
    Working at a university, this is not a subject I'm unfamiliar with. We've had lots of discussions about this. Everyone always talks about how many zillions of "pieces of information" are out there. The number of web pages in existence is always bandied about. My point in these discussions is that most of what's out there is crap. Humanity is not lessened by its loss. Good stuff gets reproduced, reviewed, studied, dissected, etc. and survives. It *is* stupid to try to solve this problem, because the problem doesn't need solving.
    • by failedlogic (627314) on Monday November 20 2006, @06:07PM (#16921682)
      Things like music, TV shows, movies, literature, toys, magazines etc are all cultural products. For future generations we need to keep records of these items as much as family trees, great stories, buildings, etc.

      Besides, who's to decide what is 'crap' or not? It might be that to the untrained eye, a clay pot from Egypt might not look interesting. The color, shape, its condition, etc might tell someone who used it, why, and what cultural value (symbology, usefulness, etc) the pot actually had. And culture evolves from culture. Keeping a record of everything we produce allows future generations to inform themselves of who we were and what we did. Quality of the information itself is really unimportant.

      Only thing I'd have to add: I wish future generations all the luck in sorting through our garbage piles and recycling/salvaging what they can. If anything, this amount of waste - or crap - is a record of us as much as anything. I can agree with you on this point about crap in our culture!!! ;)
    • by kfg (145172) on Monday November 20 2006, @06:09PM (#16921722)
      Expanding copyright protection to a term equal to two lifetimes means that now even some of the good stuff is being lost because preserving it is not allowed.

      If preservation is outlawed, only outlaws will be preservationists.

      I believe Ray Bradbury had something to say on this subject.

      KFG
  • by OfNoAccount (906368) on Monday November 20 2006, @06:02PM (#16921598)
    Since I shoot RAW, I also burn a copy of dcraw.c [cybercom.net] onto every disc - so even if the current platforms get lost by the wayside, there will be code to convert them still.

    Storage itself? Currently burning onto Delkin Archival Gold [delkin.com], storing cool and dark, and in two physically distant locations.

    They're also stored on my hard disk, and the best are backed up onto a USB drive.

    If it looks like the DVD-ROM drive is becoming obsolete I'll burn them on to whatever comes along next.

    If you're truly paranoid you can always print them on archival quality paper using pigment based inks ;)
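One thing that pairs well with keeping copies in two places is checking now and then that the copies still match bit for bit. A minimal sketch, assuming both copies are mounted as ordinary directories. The function names (`sha256_of`, `build_manifest`, `compare_copies`) are made up for illustration, and SHA-256 is my choice here, not something the workflow above specifies.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large RAW files do not need to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root: Path) -> dict:
    """Map each file's path (relative to root) to its SHA-256 hex digest."""
    return {
        path.relative_to(root).as_posix(): sha256_of(path)
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }


def compare_copies(master: Path, copy: Path) -> list:
    """Return relative paths that are missing from the copy or differ from the master."""
    master_manifest = build_manifest(master)
    copy_manifest = build_manifest(copy)
    return sorted(
        name for name, digest in master_manifest.items()
        if copy_manifest.get(name) != digest
    )
```

An empty list from `compare_copies` means every file in the master exists in the copy with identical contents. It says nothing about which side is wrong when they differ, so a third copy or a stored manifest is needed to break ties.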
  • by Daniel_Staal (609844) <DStaal@usa.net> on Monday November 20 2006, @06:03PM (#16921606)
    There are only two ways of doing this: keeping a copy of every program used to create these files (and a system to run them on) or converting them to some open and well-supported format.

    For text documents, HTML is probably the best bet. It is so widely used and supported that readers are almost guaranteed to exist as long as computers do in their current form. (And if something ever truly supersedes it, a mass-conversion program will be written anyway.) HTML probably works for basic spreadsheets too. Graphics support for GIF, JPEG, and PNG is probably at that level as well, and MP3 for music.

    As a bonus, most of the native programs for the documents to be preserved have translators to these formats already.

    Beyond that I have no idea.
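As a rough illustration of the conversion route for the simplest case, here is a sketch that wraps plain text in a minimal HTML page. `text_to_html` is a hypothetical helper, and it assumes paragraphs are separated by blank lines and that the output is saved as UTF-8. Real word-processor documents would need the native program's own exporter.

```python
import html


def text_to_html(title: str, body: str) -> str:
    """Wrap plain text in a minimal HTML page, one <p> per blank-line-separated paragraph."""
    paragraphs = [p.strip() for p in body.split("\n\n") if p.strip()]
    rendered = "\n".join(f"<p>{html.escape(p)}</p>" for p in paragraphs)
    return (
        "<!DOCTYPE html>\n"
        '<html>\n<head>\n<meta charset="utf-8">\n'
        f"<title>{html.escape(title)}</title>\n"
        "</head>\n<body>\n"
        f"{rendered}\n"
        "</body>\n</html>\n"
    )
```

The output uses only tags that have been in HTML for a long time, and it stays legible in a plain text viewer even if no browser is available.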
  • by susano_otter (123650) on Monday November 20 2006, @06:05PM (#16921640) Homepage
    From TSA: "Popular Mechanics asks: Will an entire era of human history be lost?"

    Obviously not; Popular Mechanics itself has preserved much of the era in traditional hardcopy formats, making it no less lossy than previous printed-word eras.

    Of course, understanding the era from such incomplete and unreliable records will be a challenge to archaeologists and historians; again, not much different from previous eras.

    In conclusion: doesn't matter, hardly news.
  • by ThatsNotFunny (775189) on Monday November 20 2006, @06:05PM (#16921644) Homepage
    "When Thibodeau told the head of a government research lab about his mission, the man replied, 'Your problem is so big, it's probably stupid to try and solve it.'"

    I'd trust that guy. If there's one thing our government knows, it's stupidity.
  • The solution (Score:4, Interesting)

    by alexwcovington (855979) on Monday November 20 2006, @06:07PM (#16921672) Journal
    In this era of virtualization, the solution for x86 software is as easy as retaining a copy of the primary partition of a computer originally used to work with the desired files. Searchability could be a problem for proprietary data formats, but the move to open standards in the future will mitigate that.

    The real problem is 60 years of archives of antiquated, proprietary, task-specific and mainframe computer data cards and tapes whose original programmers are halfway to cedar boxes; if the government can't get their support in time it may as well call all the early stuff a loss and hand it over to archaeologists.
  • by pclminion (145572) on Monday November 20 2006, @06:09PM (#16921704)

    It really isn't a question WHETHER we will be able to read old digital data in the future. After all, humans invented these formats, flawed as they may be, and humans can decipher them with enough effort. We can crack cryptography -- a deliberate attempt to make it as difficult as possible to decipher certain information. So it's hard to imagine any data format that could not be deciphered in the future with some honest effort.

    Instead it is a question of whether the data is WORTH the effort. From an anthropological standpoint, this is valuable historical data, and its value is not decreased by our inability to interpret it. The benefit of digital data is that it can be copied even if we don't know what it means. It will not erode or decay like other historical artifacts, if we put in the small effort required to preserve it. Assuming humanity doesn't self-destruct, there will be plenty of time in the future for historians to decipher and interpret the data when a need arises for it.