Archiving Digital Data an Unsolved Problem
Posted by kdawson on Mon Nov 20, 2006 04:54 PM
from the digital-ice-age dept.
mattnyc99 writes, "It's a huge challenge: how to store digital files so future generations can access them, from engineering plans to family photos. The documents of our time are being recorded as bits and bytes with no guarantee of readability down the line. And as technologies change, we may find our files frozen in forgotten formats. Popular Mechanics asks: Will an entire era of human history be lost?" From the article: "[US national archivist] Thibodeau hopes to develop a system that preserves any type of document — created on any application and any computing platform, and delivered on any digital media — for as long as the United States remains a republic. Complicating matters further, the archive needs to be searchable. When Thibodeau told the head of a government research lab about his mission, the man replied, 'Your problem is so big, it's probably stupid to try and solve it.'"
Microsoft to help! (Score:5, Funny)
Re:Microsoft to help! (Score:5, Funny)
And remember: nobody ever got fired for buying Microsoft.
Parent
"Plays for Sure" vs Zune for Office? (Score:4, Insightful)
Yep. Microsoft's commitment to their "Plays for Sure" campaign with the Zune really instills confidence in their backwards compatability.
At least with OpenOffice I can legally archive the source code and install images needed to access the data for that period (say, every year or six months.) Sort of like dropping a copy of TrueCrypt on a DVD full of crypto archives.
With the new DRM keys and license enforcement policies, I dread someday trying to resurrect an old image so I can access data archives, only to find it wants to register with a DRM verification service that no longer runs or is no longer compatible with a 4-5 year old install image.
Parent
Re:Microsoft to help! (Score:4, Funny)
That's not a word.
Parent
Re:Who cares? (Score:5, Funny)
Parent
Not too long... (Score:5, Funny)
So, they're shooting for about 10 years then?
Re: (Score:3, Funny)
10 years or the next presidential election - whichever comes first
Re:Not too long... (Score:5, Interesting)
Making a format that will survive a thousand years so long as our advanced civilization is still around and still cares is pointless, because as long as there is a continuous line of people that care, they will be willing to transfer at least the more important stuff to new media. The trick is coming up with something that will still be readable when archaeologists dig it up 10, 50, or 100 thousand years from now.
Parent
Re: (Score:3, Insightful)
Re:Not too long... (Score:4, Interesting)
Archaeology is the search for fact. Not truth. If it's truth you're interested in, Doctor Tyree's Philosophy class is right down the hall. So forget any ideas you've got about lost cities, exotic travel, and digging up the world. We do not follow maps to buried treasure, and 'X' never, ever marks the spot. Seventy percent of all archaeology is done in the library. Research. Reading.
-- Indiana Jones and the Last Crusade
Parent
Re:Not too long... (Score:5, Insightful)
Parent
Re: (Score:3, Insightful)
As much as anything, it seems like we might worry about people rewriting the past. It'd be hard to edit part of one of the original copies of the US Constitution without anyone being able to tell the difference, because we actually have a really old piece of paper that someone would have to get access to, somehow erase some ink, and write over top with identical ink.
But a historical document in the form of a text file on someone's hard drive? That can be edited without a trace.
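One common way to make silent edits detectable is to record a cryptographic digest of the file somewhere the editor cannot reach. The sketch below is only an illustration, using Python's standard `hashlib`; the `fingerprint` helper and the sample strings are made up for the example, and it assumes the recorded digest itself is kept safe.

```python
import hashlib

def fingerprint(document: bytes) -> str:
    """Return the SHA-256 digest of a document as a hex string."""
    return hashlib.sha256(document).hexdigest()

# Hypothetical sample data: an original text and a one-letter edit of it.
original = b"We the People of the United States, in Order to form a more perfect Union"
edited = b"We the People of the United States, in Order to form a more perfect Onion"

# Digest recorded at archiving time (assumed to be stored separately).
recorded = fingerprint(original)

print(fingerprint(original) == recorded)  # True: untouched copy still matches
print(fingerprint(edited) == recorded)    # False: the edit changes the digest
```

This does not stop anyone from editing the file; it only lets a later reader notice, provided they still hold a trusted copy of the digest.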
Re:Not too long... (Score:5, Funny)
Are you trying to say she didn't do that?
Crap, I am so getting an F on my history paper.
Parent
Re:Not too long... (Score:5, Funny)
Parent
Re:Not too long... (Score:4, Interesting)
History is interesting, school makes it suck: "In Year ABC, XYZ happened. Test next week - students who regurgitate well will get an 'A'."
People don't want to be sheep - totalitarian governments need populations to be docile. School is designed to suck the uniqueness out of children so, as adults, they'll take up a spot on a standardized assembly line.
Kinda cruel how the government has encouraged the shipping of assembly line jobs to China... Dumb down the population, then get rid of the reason for the dumbing-down.
See Gatto's Underground History [johntaylorgatto.com], for example.
Parent
Re:Forgotten (Score:4, Funny)
Parent
How is this different (Score:5, Insightful)
Re:How is this different (Score:4, Interesting)
It's different because of the sheer volume of information being created today. Ancient cultures were not creating millions of pages of information every day.
Your Rosetta Stone analogy is inappropriate. We have not discovered any sort of Rosetta Stone for the ancient Maya hieroglyphs, but we have had success in deciphering them because we can apply linguistic analysis techniques to figure out what words correspond to what actions/things. It's a little more complicated for abstract concepts, but you can figure out a surprising amount from basic language knowledge.
Parent
Re:How is this different (Score:5, Insightful)
It's not so much the Rosetta stone, but the fact that a "Rosetta stone" has a built-in context - it's obviously communication or artwork of some kind. If you have a big pile of digital data, what is it? An image? Compressed text? Audio? Just a sequence of numbers? The thing "printed" information gives you is that the presentation of the data gives you an idea of what it is - we don't yet have any digital data formats for which the presentation of the data gives an idea of the content; in fact, most digital storage mechanisms present all types of information in an identical manner.
That's the real challenge - devising a digital storage format in which presentation can be used to apply context to the data.
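For what it's worth, many existing formats do carry a small amount of built-in context: a fixed "magic number" in the first few bytes. A rough sketch of sniffing by signature follows; the `SIGNATURES` table and the `sniff` function are names invented for this example, and the table covers only a handful of formats.

```python
# Leading "magic number" signatures for a few well-known formats.
# The table and the sniff() helper are illustrative names, not a standard API.
SIGNATURES = [
    (b"\x89PNG\r\n\x1a\n", "PNG image"),
    (b"\xff\xd8\xff", "JPEG image"),
    (b"GIF87a", "GIF image"),
    (b"GIF89a", "GIF image"),
    (b"%PDF-", "PDF document"),
    (b"PK\x03\x04", "ZIP archive"),
    (b"\x1f\x8b", "gzip stream"),
]

def sniff(data: bytes) -> str:
    """Guess a format from the first bytes of a blob, or return 'unknown'."""
    for magic, label in SIGNATURES:
        if data.startswith(magic):
            return label
    return "unknown"

print(sniff(b"\x89PNG\r\n\x1a\n" + b"\x00" * 16))  # PNG image
print(sniff(b"GIF89a" + b"\x00" * 16))             # GIF image
print(sniff(b"just some plain text"))              # unknown
```

Note that this only helps a reader who already has the signature table, which is rather the point being made above: the bytes do not explain themselves.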
Parent
Re:How is this different (Score:4, Funny)
That's what MIME types are for. Duh.
Parent
Re:How is this different (Score:5, Insightful)
Parent
Re:How is this different (Score:4, Interesting)
They might not know that a byte is 8 bits, but with a little analysis, it shouldn't be hard to figure out. There are numerous statistical properties that can be exploited to figure this out relatively easily. For example, with most types of data, the higher-order bits (in any size byte) are more likely to be 0 than the lower-order bits are. Think about how booleans are stored in most systems. Think about the characters in this message: 100% of them have a zero high-order bit. To put it a little differently, there is more entropy in the lower-order bits.
So, to figure out how many bits there are in a byte, you take your data, and for all reasonable sizes of bytes (say, from 4 bit bytes up to 36 bit bytes), you compute the function that maps bit position (low- or high-order) to an entropy value for that bit. Then you can tell by the shape of that curve which guess about bits per byte was the right guess. Heck, it should be such a strong trend that you can probably automate it!
Remember that future civilizations will probably use digital data as well, at least ones sophisticated enough to try to read the optical and magnetic media. They may not know the FAT32 filesystem, but they will have invented statistics and information theory, and they will be able to make some awfully good guesses at things. And yeah, it might take them 10 or 20 years to be able to read a FAT32 volume correctly if some poor college student of the distant future has to do it on a shoestring budget of grant money, but if they're reading 10,000 year old data, how much does that matter?
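The per-bit entropy idea above can be sketched in a few lines. This is a simplified illustration, not a tested forensic tool: the function names, the candidate range of 4 to 16 bits, the tolerance, and the scoring rule (pick the smallest width that has one almost perfectly predictable bit position) are all assumptions made for the example, and the sample is plain ASCII text, where the high-order bit is always zero.

```python
import math

def bits_of(data: bytes) -> list[int]:
    """Flatten bytes into a list of bits, most significant bit first."""
    return [(byte >> shift) & 1 for byte in data for shift in range(7, -1, -1)]

def entropy(p: float) -> float:
    """Shannon entropy in bits of a single bit that is 1 with probability p."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

def entropy_profile(bits: list[int], width: int) -> list[float]:
    """Entropy of each bit position when the stream is cut into width-bit words."""
    n_words = len(bits) // width
    profile = []
    for pos in range(width):
        ones = sum(bits[w * width + pos] for w in range(n_words))
        profile.append(entropy(ones / n_words))
    return profile

def guess_word_width(data: bytes, candidates=range(4, 17), tolerance=0.05) -> int:
    """Pick the smallest width whose most predictable bit is close to the best seen.

    Multiples of the true width show the same constant bit, so ties are
    broken in favour of the smallest candidate.
    """
    bits = bits_of(data)
    scores = {w: min(entropy_profile(bits, w)) for w in candidates}
    best = min(scores.values())
    return min(w for w, s in scores.items() if s <= best + tolerance)

SAMPLE = (
    "They might not know that a byte is 8 bits, but with a little analysis, "
    "it should not be hard to figure out. There are numerous statistical "
    "properties that can be exploited to figure this out relatively easily."
).encode("ascii")

print(guess_word_width(SAMPLE))
```

On ASCII input the curve for 8-bit words has a position with exactly zero entropy, while misaligned widths smear that constant bit across every position. Compressed or encrypted data would show no such structure, so the heuristic says nothing useful there.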
Parent
Re: (Score:3, Insightful)
How is this different than the previous ages where all information was kept on paper or in spoken words?
Paper actually holds up rather well as an archival medium. Plus, you don't need specialized technology to read it.
Re:How is this different (Score:5, Interesting)
The era of restoration comes. However, when people blow the dust off those old DVDs and players, they discover that the DVDs have decayed to the point of unreadability. Massive quantities of archived data and knowledge are irretrievably lost.
The main problem in our age is thermodynamics -- information is stored so densely that it tends to decay naturally, on its own. By contrast, ancient stone carvings (as well as their keys, such as the Rosetta stone) are sufficiently durable to last (basically) forever.
Parent
Re:How is this different (Score:4, Interesting)
Parent
Re:How is this different (Score:5, Interesting)
Parent
hieroglyphics (Score:5, Funny)
I've heard this problem over and over (Score:5, Interesting)
Re:I've heard this problem over and over (Score:4, Insightful)
Besides, who's to decide what is 'crap' or not? To the untrained eye, a clay pot from Egypt might not look interesting. The color, shape, its condition, etc. might tell someone who used it, why, and what cultural value (symbology, usefulness, etc.) the pot actually had. And culture evolves from culture. Keeping a record of everything we produce allows future generations to inform themselves of who we were and what we did. Quality of the information itself is really unimportant.
Only thing I'd have to add: I wish future generations all the luck in sorting through our garbage piles and recycling/salvaging what they can. If anything, this amount of waste - or crap - is a record of us as much as anything. I can agree with you on this point about crap in our culture!!!
Parent
Re: (Score:3, Insightful)
I'll wager you could reconstruct far more about the culture of early 21st century from the contents of a convenience store than that of the White House. There's a big gulf between who a people are and the mask they present to the world.
Speaking of trash... (Score:3, Funny)
Re:I've heard this problem over and over (Score:4, Insightful)
If preservation is outlawed, only outlaws will be preservationists.
I believe Ray Bradbury had something to say on this subject.
KFG
Parent
Extra irony points. (Score:5, Insightful)
Perhaps more ironic -- it's a pretty good bet that whatever he wrote on the subject, it's not available online due to copyright restrictions imposed by his publisher or "estate."
Parent
Re:Extra irony points. (Score:4, Funny)
KFG
Parent
Re: (Score:3, Insightful)
Huh. So the FSF will win by default. You gotta hand it to somebody who is willing to play the long game.
Re:I've heard this problem over and over (Score:4, Funny)
Working at a University, this is not a subject I'm not unfamiliar with. We've had lots of discussions about this. Everyone always talks about how many zillions of "pieces of information" are out there. The number of web pages in existence is always brandied about.
Where can I attend these meetings, where people speak in triple negatives and much brandy is available?
Parent
My solution for digital photos? (Score:4, Informative)
Storage itself? Currently burning onto Delkin Archival Gold [delkin.com], storing cool and dark, and in two physically distant locations.
They're also stored on my hard disk, and the best are backed up onto a USB drive.
If it looks like the DVD-ROM drive is becoming obsolete I'll burn them on to whatever comes along next.
If you're truly paranoid you can always print them on archival quality paper using pigment-based inks.
Re: (Score:3, Insightful)
Open, well-used, file formats. (Score:5, Insightful)
For text documents, HTML is probably the best bet. It is so widely used and supported that readers are almost guaranteed to exist as long as computers do in their current form. (And if something ever truly supersedes it, a mass-conversion program will be written anyway.) HTML probably works for basic spreadsheets too. Graphics support for GIF, JPEG, and PNG is probably at that level as well, and MP3 for music.
As a bonus, most of the native programs for the documents to be preserved have translators to these formats already.
Beyond that I have no idea.
Re: (Score:3, Interesting)
Popular Mechanics asks... (Score:4, Insightful)
Obviously not; Popular Mechanics itself has preserved much of the era in traditional hardcopy formats, making it no less lossy than previous printed-word eras.
Of course, understanding the era from such incomplete and unreliable records will be a challenge to archaeologists and historians; again, not much different from previous eras.
In conclusion: doesn't matter, hardly news.
Government Area of Expertise (Score:5, Funny)
I'd trust that guy. If there's one thing our government knows, it's stupidity.
The solution (Score:4, Interesting)
The real problem is 60 years of archives of antiquated, proprietary, task-specific and mainframe computer data cards and tapes whose original programmers are halfway to cedar boxes; if the government can't get their support in time it may as well call all the early stuff a loss and hand it over to archaeologists.
It's whether it's WORTH it (Score:5, Insightful)
It really isn't a question WHETHER we will be able to read old digital data in the future. After all, humans invented these formats, flawed as they may be, and humans can decipher them with enough effort. We can crack cryptography -- a deliberate attempt to make it as difficult as possible to decipher certain information. So it's hard to imagine any data format that could not be deciphered in the future with some honest effort.
Instead it is a question of whether the data is WORTH the effort. From an anthropological standpoint, this is valuable historical data, and its value is not decreased by our inability to interpret it. The benefit of digital data is that it can be copied even if we don't know what it means. It will not erode or decay like other historical artifacts, if we put in the small effort required to preserve it. Assuming humanity doesn't self-destruct, there will be plenty of time in the future for historians to decipher and interpret the data when a need arises for it.
Stuff I can't read (Score:3, Interesting)
UK/BBC Domesday book (Score:3, Interesting)
Well, 15 years on, it was useless. The then-proprietary format was not readable on anything modern, and there was not much of the old hardware around either. You can google for it ("UK domesday bbc data" should do it), the first link I saw was on the Guardian Online [guardian.co.uk].
I've still got stuff on floppies, but no-one builds PCs with them anymore. I've got two old laptops with floppy drives, the other three computers have none. (OK, I also have two corpses with floppy drives, and the controllers on two of the new PCs will accept floppy drives, but, please take my point - they're going out of fashion.)
In 20 years' time, there will probably be no CD/DVD drives, we'll all be using a new more portable, more backupable, lighter, faster, probably online-only storage medium. Kids won't recognize laserdisks, floppies, or USB ports. They might not recognise keyboards either - who knows?
Reverse engineering (Score:3, Insightful)
Obligatory quote ;) (Score:3, Funny)
Long Time Gone (Score:4, Insightful)
I ask: has this ever happened before?
Not necessarily in electronic bits and bytes. Not the "Alexandria Library" that was mostly duplicated in other libraries or private collections. Maybe like the Inca quipu, mats of knotted strings that recorded all their empire's operational records, other than the ceremonial records in statues and murals. But some quipu survive, despite Spaniards destroying most of them in the mid-1500s. Enough that we can at least recognize that they did have records of lots of transactions.
No, something more transient, as transient as our bits, read/written by something more transient than our metal/plastic/glass machines. Maybe songs or other performed stories, like tribal Australians. Maybe woven in more degradable material, like uncured plant matter. Maybe both, like the Pacific star navigation lore taught in temporary woven sticks, but carried in the mind. Maybe patterns in some other loseable medium, like animal pelt patterns no longer readable now that the code has been lost, or interbred back into "blankness".
If it can happen to us, it could have happened before. Our civilization rose from meager beginnings only about 12K years ago, after the last Ice Age that lasted about 12Ky. There was another one before that, with people accumulating knowledge between. And probably a half-dozen or so others since we became as genetically developed as we are today, between 7Mya and 200Kya. We don't even have many records from the first half of the last 12Ky. Could we be reinventing the wheel, literally, every 25 thousand years?
The Waste Isolation Pilot Plant site marking (Score:4, Interesting)
While the WIPP site won't have the benefit of constant updating of the media (it's designed to survive on its own for 10,000 years), it does address some of the same points: longevity of the media, a format that will be usable into the future, and the ability of future civilizations to understand the message.
Off-topic perhaps but an interesting read.