Data Storage

Archiving Digital History at the NARA

val1s writes "This article illustrates how difficult archiving is vs. just 'backing up' data. From the 38 million email messages created by the Clinton administration to proprietary data sets created by NASA, the National Archives and Records Administration is expecting to have as much as 347 petabytes to deal with by 2022. Are we destined for a 'digital dark age'?"
This discussion has been archived. No new comments can be posted.

  • 347 petabytes? (Score:5, Insightful)

    by ravenspear ( 756059 ) on Sunday June 26, 2005 @05:33PM (#12916109)
    Ok, I was tempted to make a pr0n joke about this, but I think the bigger question is what kind of indexing system will this use?

    I haven't seen any software system that can reliably scale to that level and still make any kind of sense for someone who wants to find a piece of data in that haystack, err, haybarn.
  • by divide overflow ( 599608 ) on Sunday June 26, 2005 @05:37PM (#12916130)
    It happened with the Great Library of Alexandria, with pagan libraries throughout the Christian era, and more recently has happened with antiquities in Afghanistan and Iraq. The only thing that can reliably preserve data is large scale, geographically widespread distribution of copies.
  • Retain it all. (Score:2, Insightful)

    by d3m057h3n35 ( 695460 ) on Sunday June 26, 2005 @05:42PM (#12916156)
    Perhaps it would be best to keep it all, even the stuff that now may seem totally useless, like Clinton administration emails from Janet Reno to Madeleine Albright asking what she thinks about Norman Mineta and his "hot Asian vibe." With search technology improving constantly, it would probably be better than throwing stuff away which could potentially be of interest, or spending time developing the AI to make the task less time-consuming. And besides, we can't make future historians' jobs too easy. They've gotta earn their pay, reminding us of the banalities of this age.
  • by HermanAB ( 661181 ) on Sunday June 26, 2005 @05:47PM (#12916187)
    In the age of pen and paper, only important stuff was written down. Nowadays all crap is preserved. This is useless. There is a big difference between data and information.
  • Dark Ages (Score:5, Insightful)

    by TimeTraveler1884 ( 832874 ) on Sunday June 26, 2005 @05:50PM (#12916198)
    Are we destined for a "digital dark age"?
    If by "dark age" you mean a time in human history where more information is recorded than ever, yes I suppose we are.

    I think more accurately, we are headed towards an age of super-saturation of information. I have no doubt we can store all the data we are currently generating and will generate. The question is how do we process it into something meaningful? Just because we have the ability to archive everything does not mean it will be useful to the [insert personally welcomed overlord] of the future.

    Maybe historians of the future will be fascinated that Clinton's instant-message signoff was "l8ter d00d", but I doubt it. We'll want to save everything now of course, because we can. But the majority of the information I suspect will just be filtered out when actually searched.

    Personally, I take the "you never know" ideology and save everything.
  • by G4from128k ( 686170 ) on Sunday June 26, 2005 @05:51PM (#12916206)
    Digital technologies mean that archivists now enjoy orders of magnitude more information than they had in the past. Consider all the hallway and phone conversations or jotted notes lost in a paper-based organization versus having an archive of e-mail, IM, and sticky-note digital files.

    Digital technologies mean that archivists now enjoy orders of magnitude more potential accessibility than in the past. Even if paper has greater innate archival lifespan, its physical form makes it inaccessible to all but a select monkish class of archivists colocated with their paper archives. Even the select few archivists who are allowed access to paper archives can only effectively process at best a dozen documents per minute (and only a dozen per hour if they must wander the files to find randomly dispersed documents).

    By contrast, digital technologies radically expand access on two dimensions. First, technology expands the number of people that can access an archive in terms of distance -- a remote researcher can have full access, including access to documents in use by other archivists. A low cost to copy documents means a wealth of information. Second, search tools provide prodigious access to the files -- searching/accessing/reading thousands or millions of documents per second.

    To say we face a dark age is to presume that paper documents provided far more enlightenment and comprehensiveness of documentation than they ever actually did.
  • by gus goose ( 306978 ) on Sunday June 26, 2005 @05:54PM (#12916217) Journal
    People should think outside the box.

    The answer to archiving the required volumes is producing less volume. Case in point... we recently spent a week or so at work optimising a process that was I/O bound. The bugger took 10 hours to run. Although purchasing faster disks, converting to RAID0, and other techniques did whittle down the execution time to about 5 hours, the final solution was to redefine the process to reduce the actual I/O (removed a COBOL sorting stage in the process), and the process now takes 2 hours.

    Bottom line: with the 100 + 38 million dollars (FTFA) assigned to the project I am sure I could eliminate a number of redundant positions, optimise some communication channels, retire voluminous individuals, replace inefficient protocols/people, and basically reduce the sources of data. Hell, if the US were to actually have peace instead of demand it, there would be a much reduced need for military intelligence, political rhetoric, and other civil responsibilities. The military could be half the size, and what do you know, we could not only reduce the requirement for archiving, but could actually save money in the process.

    Remember, government is a self-supporting process.

    Go ahead, mark me a troll.

  • So? (Score:3, Insightful)

    by ArchAngel21x ( 678202 ) on Sunday June 26, 2005 @05:58PM (#12916238)
    By the time the government comes up with a half-assed solution, archive.org will already have it all organized, online, indexed, and backed up.
  • by tabdelgawad ( 590061 ) on Sunday June 26, 2005 @06:03PM (#12916261)
    Actually, it's more like 'inevitable'. I'll bet almost everyone has unintentionally lost digital data permanently and will do so again in the future.

    The key, I think, is prioritization. We all do it individually (important stuff gets backed up many times and often, unimportant stuff perhaps never backed up), and NARA will have to do it too. I don't think backing up a president's email and backing up some minor White House aide's email should have equal importance. The trick will be to come up with a reasonable prioritization scheme that will make the probability of losing the most important stuff very small.
  • by mrogers ( 85392 ) on Sunday June 26, 2005 @06:12PM (#12916298)
    Are we currently experiencing a dark age because we don't have access to every letter, memo, bank statement and laundry ticket created in the 20th century? Archiving everything is an attractively simple approach, but if it turns out to be impractical we can always fall back on common sense and restrict ourselves to archiving the maybe 10% of things that have even a remote chance of being interesting in 100 years' time.
  • by kfg ( 145172 ) on Sunday June 26, 2005 @06:21PM (#12916341)
    Every mail is great
    If a mail is wasted
    The gods get quite irate

    Every mail is wanted
    Every mail is good
    Every mail is needed
    In your network neighborhood

    Really, the idea of not being able to record and save every post-it note being equated with those times and places where writing itself was denigrated into virtual nonexistence is a bit silly.

  • Actually, one of the main complaints historians have is incomplete information about the past. Not having every little tidbit makes it impossible to figure out how people actually lived. History _should_ be more than just names, dates, and events. If we can properly preserve and index items that seem really mundane to us, future generations have a _much_ better chance of having some real understanding of how we developed as a society.
  • by 1nv4d3r ( 642775 ) on Sunday June 26, 2005 @06:30PM (#12916397) Homepage
    I'm not sure most of this stuff is worth preserving digitally, at least not enough to justify the cost. Just print 'em out, and put them in a Raiders of the Lost Ark-style warehouse. The few people who want to see all of the Clinton administration's emails can travel to it and search.

    I'd much rather see those hundreds of millions of dollars invested in, for instance, making all out of print recordings and books available on-line. It's a smaller problem (sounds like), but would benefit the world much more than online copies of every government employee's timecard records.

  • by Anonymous Coward on Sunday June 26, 2005 @06:30PM (#12916407)
    The problem is that it can be hard to know where the boundary between important and useless is...

    Things that previous generations considered unworthy of preservation are things that are greatly treasured in today's age - look at all the old manuscripts of which we only have a few pages (because scribes reused the parchment). Look at the masterpieces that were painted over to save canvas.

    As soon as we start to put hard limits down on what to preserve, and what to leave alone, we risk losing information that the next generations will value.

    Besides - in many cases, it could just be easier to save everything. It seems that trying to enforce standards and judging what should and shouldn't be preserved might be more labour-intensive than the alternative. Considering the rate at which information is generated it might make sense to have a trade-off between conserving storage versus conserving labour... storage is easier/cheaper/more available :)
  • ...other techniques did whittle down the execution time to about 5 hours, the final solution ...is now 2 hours.

    That's only a 60% reduction. A 60% reduction of 347 PB is still 138.8 PB...still a huge archival task.

    Keeping 1% of the data still leaves you with 3.47 PB. Not impossible, but still a daunting task.
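
    For anyone who wants to check the arithmetic above, here is a minimal Python sketch. The helper name `remaining_pb` is made up for illustration, and 347 PB is the estimate from the story summary. Note that 60% matches the 5-hour-to-2-hour step in the quoted post; measured against the original 10 hours, the cut would be 80%.

```python
def remaining_pb(total_pb: float, reduction: float) -> float:
    """Return the volume left after cutting a fraction `reduction` (0..1) of `total_pb`."""
    if not 0.0 <= reduction <= 1.0:
        raise ValueError("reduction must be between 0 and 1")
    return total_pb * (1.0 - reduction)


TOTAL_PB = 347.0  # NARA estimate quoted in the story summary

# The two scenarios discussed in the comment: a 60% cut, and keeping only 1%.
for label, reduction in [("60% reduction", 0.60), ("keep 1%", 0.99)]:
    print(f"{label}: {remaining_pb(TOTAL_PB, reduction):.2f} PB left")
```

    Run as-is, this reports 138.80 PB and 3.47 PB, matching the figures in the comment.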
  • by mcrbids ( 148650 ) on Sunday June 26, 2005 @06:57PM (#12916591) Journal
    Really, it's only the great works of artistry that need to be retained and remained, sustained and maintained. Historically, it's interesting to catalogue art, but politics? The everyday communications that lead up to the horrible decisions that lead our politicians to make the mistake of the daily business? We want records of this?

    Absolutely, yes!

    History is often taught as "Charlemagne took over Constantinople in the year 12xx" as though military feats really mattered to the average Joe. But, the truth is, America was colonized by people who thought that, however bad it might be in a virgin land, it was BETTER than their lives in Europe.

    One of the key failures in public education today is to communicate the understanding that history is comprised mostly of PEOPLE doing ORDINARY things in their time to make life better for themselves and their families. They loved, worked, got bored, and cracked jokes at the expense of their leaders, just like we do today.

    History doesn't consist of battles, any more than history consists of artworks. Capturing more detail in the average, everyday lives of people gives a much better understanding of the cultural norms, and the ideals to which people aspired.

    The pyramids of ancient Egypt provide a clear, artistic monument to their culture, yet we have only a modest understanding of their day-to-day culture. Similarly, we have Stonehenge as a clear monument to the grooved-ware people of the English isles, but almost NO understanding of who they were and what they felt was important. How much would a true historian give to understand the day-to-day culture of these mysterious "grooved-ware" people of antiquity?

    Those memos and IMs comprise that understanding of people today.
  • by fmaxwell ( 249001 ) on Sunday June 26, 2005 @07:16PM (#12916691) Homepage Journal
    I think that he's being absurdly pessimistic. Those future historians will have no more difficulty reading our media than we have playing the sounds from a wax cylinder for an Edison phonograph. Running archived computer software will be no different than any of us using a software emulator to run a game for a long-dead gaming console.

    Sure, optical and magnetic media decay, but there's nothing stopping people from "refreshing" the media before it decays too far. If you have a stack of CD-R discs that are starting to show an increase in correctable errors, then you back them up to new CD-R discs or to DVD-R. You don't have to sit idly by and watch them decay. I've got a CP/M computer that's over 20 years old and it can still boot from its 10MB (yes, megabyte) hard drive. So it's not like data just disappears five years after it's recorded.

    It's also a problem which is being addressed by the industry. There are companies offering long-life CD-R media designed for archival [hi-space-pro.com]. Other companies offer data storage for archival data, much like the climate-controlled vaults where countless audio master tapes and films have been stored for decades.

    In closing, I think that 95%+ of archived data will still be able to be accessed in a century -- provided that it is properly stored and cared for.
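
    As a rough illustration of the "refresh before it decays" idea above, here is a minimal Python sketch. It does not read a drive's correctable-error counters; it swaps in a simpler technique, comparing SHA-256 checksums against a manifest recorded when the archive was written. The function names (`build_manifest`, `files_needing_refresh`) are hypothetical, chosen only for this example.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large archive files need not fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root: Path) -> dict[str, str]:
    """Record a checksum for every file under `root` (done once, at archive time)."""
    return {
        p.relative_to(root).as_posix(): sha256_of(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def files_needing_refresh(root: Path, manifest: dict[str, str]) -> list[str]:
    """Return files that are missing or whose contents no longer match the manifest."""
    damaged = []
    for rel, expected in manifest.items():
        candidate = root / rel
        if not candidate.is_file() or sha256_of(candidate) != expected:
            damaged.append(rel)
    return damaged
```

    Anything the periodic check flags would be restored from another copy and rewritten to fresh media. A checksum only detects damage after it has happened, so this assumes at least one other good copy exists somewhere.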

  • You were just a little over 12 times too much. Let's just hope you don't write code for a living :p [...]

    To you and the countless others on /. who offer their corrections in a similar tone: Yes, we get it, the parent poster goofed and you supplied a correction. Given the trivial context here, it's hardly a big deal and doesn't warrant sarcasm. Everyone makes mistakes and plenty of people make mistakes in their work every day, including people who do work where lives are at stake. That's one reason why it is good to work with other people. In life it's far more important to be forgiving, keep things in perspective, and help other people without the wiseacre commentary and then move on.
