Data Storage

Archiving Digital History at the NARA

val1s writes "This article illustrates how difficult archiving is vs. just 'backing up' data. From the 38 million email messages created by the Clinton administration to proprietary data sets created by NASA, the National Archives and Records Administration is expecting to have as much as 347 petabytes to deal with by 2022. Are we destined for a 'digital dark age'?"
This discussion has been archived. No new comments can be posted.

  • by gardyloo ( 512791 ) on Sunday June 26, 2005 @04:30PM (#12916098)
    Hm. This sounds like a job for OpenOffice...
  • 347 petabytes? (Score:5, Insightful)

    by ravenspear ( 756059 ) on Sunday June 26, 2005 @04:33PM (#12916109)
    Ok, I was tempted to make a pr0n joke about this, but I think the bigger question is what kind of indexing system will this use?

    I haven't seen any software system that can reliably scale to that level and still make any kind of sense for someone that wants to find a piece of data in that haystack, err. haybarn.
    • I haven't seen any software system that can reliably scale to that level and still make any kind of sense for someone that wants to find a piece of data in that haystack,

      Haven't you? Have you ever worked with real archiving before? IBM has some nice solutions that let us store data on disk and on a WORM library (Tivoli Storage Manager) and index it in a (large) Oracle DB - they work and scale just fine (our experience covers a couple of hundred terabytes). You probably wouldn't want all that data in a single archive…
    • Re:347 petabytes? (Score:4, Informative)

      by CodeBuster ( 516420 ) on Sunday June 26, 2005 @05:00PM (#12916247)
      The most common structure used to index large amounts of data stored on magnetic or other large capacity media is the B-Tree and its variants. The article linked here [bluerwhite.org] explains the basic idea of the balanced multiway tree or B-Tree. The advantage of this type of index is that the index can be stored entirely on the collection of tapes, cartridges, disks or whatever else, while only the portion of the tree which is currently being operated on needs to be read into volatile or main memory. The B-Tree allows for efficient access to massive amounts of data while minimizing disk reads and writes. Theoretically, the B-Tree and its variants could be scaled up to address an unlimited amount of data in logarithmic time.
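      To make the logarithmic descent concrete, here is a toy sketch in Python (the node layout and names are invented for illustration; in a real system each step down the tree would be one disk read):

      ```python
      from bisect import bisect_right

      class BTreeNode:
          """Toy B-Tree node: sorted keys with child pointers between them."""
          def __init__(self, keys, children=None):
              self.keys = keys                  # sorted list of keys
              self.children = children or []    # empty for leaf nodes

      def btree_search(node, key):
          """Visit one node per level: O(log n) node reads for n keys."""
          while True:
              i = bisect_right(node.keys, key)
              if i > 0 and node.keys[i - 1] == key:
                  return True                   # found in this node
              if not node.children:
                  return False                  # reached a leaf; not present
              node = node.children[i]           # descend (one disk read)

      # Tiny two-level example
      root = BTreeNode([10, 20], [BTreeNode([1, 5]),
                                  BTreeNode([12, 17]),
                                  BTreeNode([25, 30])])
      print(btree_search(root, 17), btree_search(root, 13))  # True False
      ```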
    • How many Libraries of Congress is that?
    • Ok, I was tempted to make a pr0n joke about this

      Note that they don't say which mailbox in the Clinton administration...
  • Can't they get more storage capacity out of their system by (more) aggressively compressing old information? That shouldn't matter too much to the indexing mechanism. Also, it might make sense to tag the importance of different documents so that each document's compression/archiving treatment can depend on that.
  • by divide overflow ( 599608 ) on Sunday June 26, 2005 @04:37PM (#12916130)
    It happened with the Great Library of Alexandria, with pagan libraries throughout the Christian era, and more recently has happened with antiquities in Afghanistan and Iraq. The only thing that can reliably preserve data is large scale, geographically widespread distribution of copies.
    • by tabdelgawad ( 590061 ) on Sunday June 26, 2005 @05:03PM (#12916261)
      Actually, it's more like 'inevitable'. I'll bet almost everyone has unintentionally lost digital data permanently and will do so again in the future.

      The key, I think, is prioritization. We all do it individually (important stuff gets backed up many times and often; unimportant stuff is perhaps never backed up), and NARA will have to do it too. I don't think backing up a president's email and backing up some minor White House aide's email should have equal importance. The trick will be to come up with a reasonable prioritization scheme that makes the probability of losing the most important stuff very small.
      • I don't think backing up a president's email and backing up some minor whitehouse aide's email should have equal importance.

        I agree really, but I also find the problem with data is you never know until it's too late. The aide's email could be an international "smoking gun" lost forever vs. an eternally archived Presidential request for diet soda on Air Force One.

        I feel that if you can't completely automate backups then the best thing is to give users easy access to backup resources for their own material
        • The aide's email could be an international "smoking gun" lost forever vs. an eternally archived Presidential request for diet soda on Air Force One.

          I agree with this completely.
          The article mentioned the selective retention of information as one possibility for coping with the massive amounts of data that need to be preserved.
          I think that it would be a mistake to do this.
          IMO, all data should be archived in bulk as soon as possible, and then scholars can work on indexing those portions that they deem important…

      • I think it also has to do with the fact that the media in which we store information are increasingly less durable (compare stone engraved millennia ago, writings on paper from past centuries still readable today, and the relatively short life expectancy of magnetic and optical media).

        Now I'm not saying we should all go back to Stone Age, but it does make you think about the irony of progress...
    • It happened with the Great Library of Alexandria, with pagan libraries throughout the Christian era, and more recently has happened with antiquities in Afghanistan and Iraq. The only thing that can reliably preserve data is large scale, geographically widespread distribution of copies.

      True. But I hardly think Alexandria was lost to the tap of the Y key, a pregnant pause, then an "oops."
      • No, but it could have been lost to the strike of flint, a pregnant pause, then a "glukús theométôr" (Sweet Mother of God, for you people that suck). (Note: I spent like 20 minutes transliterating that to Latin just so I could post it on /. because it hated the Greek charset. I have no life.)
  • by reporter ( 666905 ) on Sunday June 26, 2005 @04:39PM (#12916140) Homepage
    National Archives and Records Administration is expecting to have as much as 347 petabytes to deal with by 2022. Are we destined for a "digital dark age"?

    Perhaps, the answer is compression.

    Does anyone know whether there is an upper limit to text compression?

    In signal processing, there is a limit called the Shannon Capacity theorem, which gives the maximum amount of information that can be transmitted on a channel. In text compression, is there a similar limit?

    Note that the Shannon Capacity theorem does not tell you how to reach that limit. The theorem merely tells you what the limit is. For decades, we knew that the maximum limit on a normal telephone twisted pair is about 56,000 bits per second, according to the theorem. However, we did not know how to reach it until Trellis coding was discovered, according to an electronic communications colleague at the institute where I work.

    If we can calculate a similar limit for text compression, then we can know whether further research to find better text compression algorithms would be potentially fruitful. If we are already at the limit, then we should spend the money on finding denser storage media.
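    For what it's worth, the limit in question is usually written as the Shannon-Hartley form of the capacity theorem, C = B log2(1 + S/N). A quick back-of-the-envelope in Python (the bandwidth and SNR figures below are assumptions for a voice-band line, not measurements):

    ```python
    import math

    def channel_capacity(bandwidth_hz, snr_db):
        """Shannon-Hartley capacity: C = B * log2(1 + S/N), in bits/s."""
        return bandwidth_hz * math.log2(1 + 10 ** (snr_db / 10))

    # Assumed voice-band phone channel: ~3.1 kHz wide, ~40 dB SNR
    print(round(channel_capacity(3100, 40)))  # on the order of 40k bits/s
    ```

    (Roughly speaking, the familiar 56k figure assumes a cleaner, digitally terminated channel on one end.)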

    • There is no theoretical upper limit on text compression as far as I know (and I'd be rather surprised if there was [1]), but there *is* a (very basic) theorem from Kolmogorov complexity that says that there's always data that can't be compressed for any compression algorithm you devise (for a proof, simply consider the number of strings of length ≤ n for a given n).

      1. Well, I'd be surprised as long as you don't make any assumptions about the statistical distribution of bits in the text you want to compress.
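      The counting argument is easy to make concrete in a couple of lines of Python (a sketch of the pigeonhole step, not of any particular compressor):

      ```python
      # There are 2**n strings of length exactly n, but only 2**n - 1
      # strings of length strictly less than n, so no lossless compressor
      # can map every n-bit input to a distinct shorter output.
      n = 8
      inputs = 2 ** n                                  # 256
      shorter_outputs = sum(2 ** k for k in range(n))  # 255
      print(inputs, shorter_outputs)  # at least one input can't shrink
      ```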
      • There is no theoretical upper limit on text compression as far as I know

        Which is obviously some hot gas coming from your posterior. Otherwise: 1 (the Holy Bible, heavily compressed)

        The amount of compression possible in a given string of numbers is inversely proportional to the amount of randomness in the input.
        • The problem with your glib answer is that "1" being the Holy Bible as compressed is completely legitimate. It'd be incredibly useful, assuming the entire contents of the Bible occur often in whatever it is you are compressing. It's essentially the concept behind Huffman encoding (it's not exactly the same, but picking the most common letters from your symbol set and encoding them as very short binary strings is the basic principle).

          Depending on how specialized your data is, it might be a net win to do

          • 1 if by land... that's not a compression scheme, that's an indexing scheme.

            Speaking of which, don't we have to consider indexing this megalith? And if things haven't changed *that* much since I was a DBA, you can easily have indexing that takes ten times the storage of the raw data itself. Better factor that in, too.

            • Toe-mato, Tha-mato (that works better when spoken). It is in fact a compression scheme, and an indexing scheme.

              I can easily think of it as a compression scheme. If they wanted to have it communicate all of that information, they could have devised "Morse Code" and actually spelled it out. This is obviously much shorter. The code they specially designed for this single use was exactly as described.

              You can think of it as an indexing scheme if you feel like it, but that doesn't mean it's any less legitimate…

        • That's perfectly legitimate compression, if in your scheme "1" is actually equivalent to the Bible. 11 would then be a nice shorthand, highly compressed way of writing the Bible twice, back to back. 12 might mean the Bible, then the Koran.

          Such a scheme wouldn't be very useful for general use, of course ...
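          For what it's worth, such a shared-dictionary codec is trivial to write down (a deliberately silly Python sketch; the dictionary contents are placeholders, and both sides must agree on the table in advance):

          ```python
          # Shared-dictionary "compression": the information hasn't vanished,
          # it has moved into the dictionary both parties carry around.
          DICTIONARY = {"1": "<full text of the Bible>",
                        "2": "<full text of the Koran>"}

          def decompress(message):
              return "".join(DICTIONARY[symbol] for symbol in message)

          print(decompress("112"))  # Bible twice, then the Koran, from 3 chars
          ```

          The dictionary size counts against you, of course, which is why it only pays off when the entries really do recur in the data.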
    • Sounds a bit like 42, it'll tell us the answer, but we need something else to find the question.
    • The only thing that comes to mind is information entropy [wikipedia.org]. If you're given a text document, you can determine the probability distribution for each letter, for letter combinations, for words, or whatever you can think of. Then, given the probability distribution, you can determine the information entropy. If, in the sum, you use log with base 2, then H(x) (see formal definitions [wikipedia.org]) gives you the entropy in bits.

      For example, if you have a text file with letters of equal probability (all letters have a probability of 1/26), the entropy works out to log2(26) ≈ 4.7 bits per letter…
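      Computing a first-order H(x) for a real document only takes a few lines of Python (a letter-frequency sketch, so it ignores correlations between letters and therefore overestimates the true entropy):

      ```python
      import math
      from collections import Counter

      def entropy_bits_per_char(text):
          """First-order Shannon entropy: H = -sum(p * log2(p))."""
          counts = Counter(text)
          n = len(text)
          return -sum((c / n) * math.log2(c / n) for c in counts.values())

      text = "the quick brown fox jumps over the lazy dog"
      h = entropy_bits_per_char(text)
      print(f"{h:.2f} bits/char; lower bound ~{h * len(text) / 8:.0f} bytes")
      ```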
    • Actually there is an upper limit...
      It is some of Shannon's work on Information Theory.
      Basically, information has entropy associated with it. Entropy being the randomness of information. Truly 100% random information cannot be compressed.
      The central idea has to do with the probability of something occurring.
      Text compresses quite well because certain letters are more common than others and there are a limited number of symbols ('e', for example).
      If I encode 'e' using 1 bit instead of 8, that saves 7 bits.

      This is the…
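      That "1 bit for 'e'" idea is exactly what a Huffman code automates; a compact recipe-style sketch in Python (it ignores the cost of transmitting the code table itself):

      ```python
      import heapq
      from collections import Counter

      def huffman_code(text):
          """Frequent symbols get short codes; returns {symbol: bitstring}."""
          heap = [[freq, [sym, ""]] for sym, freq in Counter(text).items()]
          heapq.heapify(heap)
          while len(heap) > 1:
              lo, hi = heapq.heappop(heap), heapq.heappop(heap)
              for pair in lo[1:]:
                  pair[1] = "0" + pair[1]   # prefix the left branch
              for pair in hi[1:]:
                  pair[1] = "1" + pair[1]   # prefix the right branch
              heapq.heappush(heap, [lo[0] + hi[0]] + lo[1:] + hi[1:])
          return {sym: code for sym, code in heap[0][1:]}

      text = "the national archives and records administration"
      codes = huffman_code(text)
      bits = sum(len(codes[ch]) for ch in text)
      print(bits, "bits vs", 8 * len(text), "bits uncompressed")
      ```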
    • entropy (Score:2, Informative)

      You can calculate the amount of entropy in a document (text or no) and that is a limit to how small you could possibly make it.

      I don't recall how close modern methods like arithmetic coding come to that limit, but I know it's close enough that we couldn't double the compression ratio of text documents from the current state of the art.

      Trellis coding is a system for dealing with induced errors in modem signalling. It allows you to cancel some of them out. It doesn't actually increase the throughput in…
    • Just make sure you have a portable and open compression format that you will be able to dig up in 50 years. I have a ton of old data that I backed up in MacOS System 7ish using an old version of Stuffit that did automatic compression in the background (I think it was called Space Saver or the like). Well, it was a really dumb idea of me to install that, because that data is inaccessible to me without running an old copy of the MacOS (though perhaps Classic would work) and digging up that particular version of Stuffit.
    • Does anyone know whether there is an upper limit to text compression?

      That, of course, strongly depends on the entropy of the text to be compressed. When you're talking about the current president's email, well, there can't possibly be a whole lot of entropy in there, so it should be really easy to compress.
  • ha (Score:3, Funny)

    by The Big Ugly ( 738455 ) on Sunday June 26, 2005 @04:40PM (#12916145) Homepage
    "Archiving Digital History at the NARA"

    You'll have to pry it from my cold, dead hands!

    Ohhhh, NARA, not NRA....
  • Retain it all. (Score:2, Insightful)

    by d3m057h3n35 ( 695460 )
    Perhaps it would be best to keep it all, even the stuff that now may seem totally useless, like Clinton administration emails from Janet Reno to Madeleine Albright asking what she thinks about Norman Mineta and his "hot Asian vibe." With search technology improving constantly, keeping everything would probably be better than throwing away stuff which could potentially be of interest, or spending time developing the AI to make the task less time-consuming. And besides, we can't make future historians' jobs too easy. They've…
  • by feloneous cat ( 564318 ) on Sunday June 26, 2005 @04:45PM (#12916172)
    With the new GoogleNARA...

    nara.google.com

    Oh, wait... I'm getting ahead of myself...
  • by HermanAB ( 661181 ) on Sunday June 26, 2005 @04:47PM (#12916187)
    In the age of pen and paper, only important stuff was written down. Nowadays all crap is preserved. This is useless. There is a big difference between data and information.
    • The trick is to get your data infrastructure organised to start with. Because I have a predetermined system for organising my class notes (Microsoft OneNote, so shoot me) I can reliably pick out notes from a specific class based on date, or topic based on exam questions, or I can take the Google approach and just go "Find me anything to do with this".

      The information I need is preserved in an easily accessible form because I made a decision to make all my class notes organised, and as a result I've replaced
    • Hmm... Assuming that google/yahoo save all of the queries anyone ever does (over the years), just index the -entire- NARA database using google, and then run it against -all- the queries anyone has bothered to run in the last 5 years. Whatever files do -not- come up in the first 1000 results, can be safely deleted :-)

      Just an idea...
    • Of course there is a difference between data and information, but it seems quite clear that only important information is being preserved.

      In the story they talk about multiple revisions of Word documents written by leaders, and photos of the effects of Agent Orange. Do you consider those things "crap"?

      The fact is, the government is huge, and there is a hell of a lot of important information to be saved over the years.
      • for example: Multiple revisions...
        • The changes made in the process of writing a document are almost as important as the end product. Just look up the drafts of the founding documents, and see all that changed from the start to the final draft. A significant amount of historical information would have been lost if we did not have those revisions.

          Besides that, revisions are very, very small, so it's not as if storage is a real problem. When your 500GB hard drive is full, you don't go through and delete all your unneeded text files first; you go after the big files.
  • Dark Ages (Score:5, Insightful)

    by TimeTraveler1884 ( 832874 ) on Sunday June 26, 2005 @04:50PM (#12916198)
    Are we destined for a "digital dark age"?
    If by "dark age" you mean a time in human history where more information is recorded than ever, then yes, I suppose we are.

    I think more accurately, we are headed towards an age of super-saturation of information. I have no doubt we can store all the data we are currently generating and will be generating. The question is how do we process it into something meaningful? Just because we have the ability to archive everything does not mean it will be useful to the [insert personally welcomed overlord] of the future.

    Maybe historians of the future will be fascinated that Clinton's instant-message signoff was "l8ter d00d", but I doubt it. We'll want to save everything now of course, because we can. But the majority of the information I suspect will just be filtered out when actually searched.

    Personally, I take the "you never know" ideology and save everything.
    • I think it may be worse than that - that there will be a huge proliferation of false information, sensationalistic 'infotainment,' advertising, propaganda, etc... Why, historians of the future may be depending on /. as their main source of information! Think of what a tragedy that would be!
      • You jest, but it's possible something like Wikipedia or (shudder) everything2 will be on some future historian's list of sources.

        So historians in 2100 will have to wade through various trolls and defacement attempts to try to get what people thought about in 2005 - but at least they'll know not to click on Goatse links [wikipedia.org].
    • "Personally, I take the "you never know" ideology and save everything."
      That's a good ideology, because I'm sure we'll develop an AI that would be more than happy to deep search this data someday and shed light on some history we never knew about. It could be very interesting.
  • by G4from128k ( 686170 ) on Sunday June 26, 2005 @04:51PM (#12916206)
    Digital technologies mean that archivists now enjoy orders of magnitude more information than they had in the past. Consider all the hallway and phone conversations or jotted notes lost in a paper-based organization versus having an archives of e-mail, IM, and sticky-note digital files.

    Digital technologies mean that archivists now enjoy orders of magnitude more potential accessibility than in the past. Even if paper has greater innate archival lifespan, its physical form makes it inaccessible to all but a select monkish class of archivists colocated with their paper archives. Even the select few archivists who are allowed access to paper archives can only effectively process at best a dozen documents per minute (and only a dozen per hour if they must wander the files to find randomly dispersed documents).

    By contrast, digital technologies radically expand access on two dimensions. First, technology expands the number of people that can access an archive in terms of distance -- a remote researcher can have full access, including access to documents in use by other archivists. A low cost to copy documents means a wealth of information. Second, search tools provide prodigious access to the files -- searching/accessing/reading thousands or millions of documents per second.

    To say we face a dark age is to presume that paper documents provided far more enlightenment and comprehensiveness of documentation than they ever actually did.
    • I think you're missing the point, which is that all that data is now much easier to lose, especially in the short term, if it's not taken care of properly.
      • I think you're missing the point, which is that all that data is now much easier to lose, especially in the short term, if it's not taken care of properly.

        Perhaps, perhaps not. Sure, digital data can be lost easily, but it can also be copied/backed-up more easily. Assuming $0.01/page for paper copy (a gross underestimate of the cost of paper, toner, and labor for copies), assuming 10 kB data/page (an overestimate), and $10/GB (for high-end maintained storage), then the cost ratio is at least 100:1 in favor of digital.
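        The parent's arithmetic checks out; spelled out in Python (all the dollar figures are the parent's stated assumptions):

        ```python
        page_cost = 0.01            # dollars per paper copy of a page
        page_bytes = 10 * 1024      # assumed data per page (an overestimate)
        disk_cost_per_gb = 10.0     # high-end maintained storage, $/GB

        paper_cost_per_gb = (1024 ** 3 / page_bytes) * page_cost
        print(paper_cost_per_gb)                     # ~$1049 per GB on paper
        print(paper_cost_per_gb / disk_cost_per_gb)  # ~105:1 in digital's favor
        ```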
  • by gus goose ( 306978 ) on Sunday June 26, 2005 @04:54PM (#12916217) Journal
    People should think outside the box.

    The answer to archiving the required volumes is producing fewer volumes. Case in point... we recently spent a week or so at work optimising a process that was I/O bound. The bugger took 10 hours to run. Although purchasing faster disks, converting to RAID0, and other techniques did whittle the execution time down to about 5 hours, the final solution was to redefine the process to reduce the actual I/O (we removed a COBOL sorting stage in the process), and the process now takes 2 hours.

    Bottom line: with the 100 + 38 million dollars (FTFA) assigned to the project I am sure I could eliminate a number of redundant positions, optimise some communication channels, retire voluminous individuals, replace inefficient protocols/people, and basically reduce the sources of data. Hell, if the US were to actually have peace instead of demand it, there would be a much reduced need for military intelligence, political rhetoric, and other civil responsibilities. The military could be half the size, and what do you know, we could not only reduce the requirement for archiving, but could actually save money in the process.

    Remember, government is a self-supporting process.

    Go ahead, mark me a troll.

    gus
    • ...other techniques did whittle down the execution time to about 5 hours, the final solution ...is now 2 hours.

      That's only a 60% reduction. A 60% reduction of 347 PB is still 138.8 PB...still a huge archival task.

      Keeping 1% of the data still leaves you with 3.47 PB. Not impossible, but still a daunting task.
    • The answer to archiving the required volumes is producing fewer volumes. Case in point... we recently spent a week or so at work optimising a process that was I/O bound. The bugger took 10 hours to run. Although purchasing faster disks, converting to RAID0, and other techniques did whittle the execution time down to about 5 hours, the final solution was to redefine the process to reduce the actual I/O (we removed a COBOL sorting stage in the process), and the process now takes 2 hours.

      I'm sure I could do that in
  • by Leontes ( 653331 ) on Sunday June 26, 2005 @04:54PM (#12916223)
    The ancient, esteemed Great Library of Alexandria [wikipedia.org] was burned to the ground as knowledge literally turned to smoke, lost to mankind forever. Was it barbarians? Motivated by political revenge? Demanded by religious zealots? An accidental byproduct of an act of war?

    Really, it's only the great works of artistry that need to be retained and remained, sustained and maintained. Historically, it's interesting to catalogue art, but politics? The everyday communications that led up to the horrible decisions our politicians made in the daily course of business? We want records of this?

    Perhaps the easiest way of keeping this knowledge at all interesting or inspiring is to burn it regularly, let people imagine what happened to allow such blunders or let apologists spin tales of delight explaining elegant solutions to how stupid people stumbled upon genius decisions. Conspiracy theorists or intellectual artistry can probably generate far greater truths than the truth will ever reveal.

    It would save a great deal of money too, just having a delete key. If we are going to care so little for the decisions in the here and now, why preserve the information to be twisted by people in the future with their own biases and projects? We seem to care so little for truth nowadays, why should that change in the future?
    • Doesn't it diminish the aura of a great work of art if you know that it can always be restored from a backup?
    • by mcrbids ( 148650 ) on Sunday June 26, 2005 @05:57PM (#12916591) Journal
      Really, it's only the great works of artistry that need to be retained and remained, sustained and maintained. Historically, it's interesting to catalogue art, but politics? The everyday communications that led up to the horrible decisions our politicians made in the daily course of business? We want records of this?

      Absolutely, yes!

      History is often taught as "Charlemagne took over Constantinople in the year 12xx," as though military feats really mattered to the average Joe. But the truth is, America was colonized by people who thought that, however bad it might be in a virgin land, it was BETTER than their lives in Europe.

      One of the key failures in public education today is not communicating the understanding that history is composed mostly of PEOPLE doing ORDINARY things in their time to make life better for themselves and their families. They loved, worked, got bored, and cracked jokes at the expense of their leaders, just like we do today.

      History doesn't consist of battles any more than history consists of artworks. Capturing more detail in the average, everyday lives of people gives a much better understanding of the cultural norms, and the ideals to which people aspired.

      The pyramids of ancient Egypt provide a clear, artistic monument to their culture, yet we have only a modest understanding of their day-to-day culture. Similarly, we have Stonehenge as a clear monument to the grooved-ware people of the English isles, but almost NO understanding of who they were and what they felt was important. How much would a true historian give to understand the day-to-day culture of these mysterious "grooved-ware" people of ancient times?

      Those memos and IMs comprise that understanding of the people of today.
      • I'm not convinced that people want to know the truth of what happened if it doesn't speak to their own specific zeitgeist. Why does it matter what some joe w. schmuck in office does when there is a john q. public living the prototypical existence outside the hallowed halls of policy?

        It is only through examining the artistry, the great works, the monuments that withstand the test of time, possibly, for those are the things which were that culture's attempts to enrich itself. Perhaps in
    • Hello, History? You are going to judge these people, aren't you?
  • So? (Score:3, Insightful)

    by ArchAngel21x ( 678202 ) on Sunday June 26, 2005 @04:58PM (#12916238)
    By the time the government comes up with a half-assed solution, archive.org will already have it all organized, online, indexed, and backed up.
  • anybody know what the government has spec'ed TFA's archiving system to do? It says it will need to read 16,000 file formats, and be impervious to terrorist attack (?), but not much else...

    I wonder what kind of searches and cross-linking will be done, for instance. What kinds of access control there will be? I'd also just like to see what the 16,000 formats are, out of curiosity. Sounds like a project waaaay larger than the $136 million they've allotted for it so far.

    Stupid name.... i'm guessing they w
  • by pangloss ( 25315 ) on Sunday June 26, 2005 @05:06PM (#12916273) Journal
    http://www.fedora.info/ [fedora.info]
    (Not to be confused with the Linux distribution)

    From the website, Fedora is "a general purpose repository service...devoted to...providing open-source repository software that can serve as the foundation for many types of information management systems".

    Problem for some is that Fedora can be a little hard to grok. It's not an out-of-the-box repository to install and run, like the repository application mentioned in the article (DSpace). It's an architecture for building repository software. Once you understand the potential for building applications on top of Fedora, you start to see some light at the end of the tunnel for just the sort of issues the article raises.

    • I work on a digital humanities project (and I also work down the hall from the Fedora folks). We are in the process of ingesting our 20,000+ object repository into Fedora. Most of it involves XSL acrobatics, but I'll spare the details.

      Fedora is oriented toward digital library work, which I suspect has some carry over with archival work at NARA. They would be wise to look at it, but I'll say from our personal experience, it is a major task to get our materials into Fedora. I don't mean this in any way to…
  • by Council ( 514577 ) <rmunroe AT gmail DOT com> on Sunday June 26, 2005 @05:11PM (#12916290) Homepage
    Here is a relevant post by Ralph Spoilsport [slashdot.org] on an earlier article, which can be found here [slashdot.org]. I am reproducing it here in full because it is very interesting and highly relevant.

    this is actually a BIG question

    And one that I have railed about for many years.
    I have been in the same position the Author discussed, and I have come to ONLY negative conclusions. In a few words, and I hate to say this, but buddy:

    WE'RE FUCKED.

    Digital is a loser's proposition. Backing up to analogue, or even to digital data on analogue substrates (such as DV tape), fails. Simply and purely.

    The *only* thing that comes close is some kind of RAID, and those, even with the plummeting price of storage, are still too expensive given the needs.

    Also, a RAID assumes a continuity of several things that are not likely to be continuous:

    With Video:
    Framerate, number of lines, colour depth, aspect ratio, file format, compression format, Operating system compatibility, etc etc etc. All of these things are variables.

    With Audio:
    sample rate, compression format, bit depth, file format, etc.

    Basically all of it points to very bad places.

    I am fairly well convinced that our age will simply disappear. They will find our garbage, the few books not pressed on acidic paper, our paintings (fat lot of good the abstract stuff will mean to them) and drawings, and that's about it. The rest will just be shiny little bits of crap in the landfill.

    Since we will have used up all the dense energy forms, they will be appalled at the energy requirements just to get the few remaining museum-piece devices to work. Archiving the 21st century will be impossible. To the 25th century, the 21st century will be seen as a dark age - not only for the holocaust of the die-off caused by the failure of the petroleum-based economy, but from the simple fact that very little of the information formats we are totally geared into will survive, including this note on /.

    His problem of saving personal video is just the tip of the iceberg. His problem is the problem of our very civilisation, writ small.

    That's why I am abandoning video and going back to painting. In 500 years, my painting CAN survive. The video simply won't.

    RS


    And don't give me shit about my karma or whatever. My karma's fine, I don't care about it. I'm copying this because it's interesting and contributes to the discussion.

    What do you think about Ralph's thoughts?
  • by mrogers ( 85392 )
    Are we currently experiencing a dark age because we don't have access to every letter, memo, bank statement and laundry ticket created in the 20th century? Archiving everything is an attractively simple approach, but if it turns out to be impractical we can always fall back on common sense and restrict ourselves to archiving the maybe 10% of things that have even a remote chance of being interesting in 100 years' time.
  • by Doc Ruby ( 173196 ) on Sunday June 26, 2005 @05:17PM (#12916329) Homepage Journal
    We need to imprint holographic storage on synthetic diamonds. Even if they're slow and expensive, they'll last even longer than the paper records they replace. We'll have to spend a fortune redigitizing all the polymer (CD/DVD, floppy, tape), celluloid (microfilm/fiche) and rotating (disc) media that will age to illegibility within our lifetimes. Until we get holographic gems, we need to archive everything on paper, including those expiring media, in a format easily digitized to a more permanent medium. But of course the government, and barely accountable bosses, want the public record to disappear down the memory hole. If they could accelerate the process, including newspapers, they'd spend everything we've got (and more) to make it happen.
  • Records (Score:3, Informative)

    by Big Sean O ( 317186 ) on Sunday June 26, 2005 @05:17PM (#12916330)
    NARA makes a distinction between a document and a record. Any old piece of paper or email is a document, but a record is something which shows how the US government did business.

    For example, the email to my supervisor asking when I can take a week's vacation isn't a record. The leave request form I get him to sign is a record. An email about lunch plans: not a record. An email to a coworker about a grant application probably is.

    Besides obvious records (eg: financial and legal records), there are many documents that may or may not be records. For the most part, it's up to each program to decide which documents are records and archive them appropriately.
  • by 1nv4d3r ( 642775 ) on Sunday June 26, 2005 @05:30PM (#12916397)
    I'm not sure most of this stuff is worth preserving digitally enough to justify the cost. Just print 'em out, and put them in a Raiders of the Lost Ark-style warehouse. The few people who want to see all of the Clinton administration's emails can travel to it and search.

    I'd much rather see those hundreds of millions of dollars invested in, for instance, making all out-of-print recordings and books available online. It's a smaller problem (it sounds like), but it would benefit the world much more than online copies of every government employee's timecard records.

  • by rduke15 ( 721841 ) <(rduke15) (at) (gmail.com)> on Sunday June 26, 2005 @05:31PM (#12916418)
    I don't know about the NASA data sets, but they could certainly save a few petabytes by stripping the stupid HTML part of all Outlook emails...
  • by G4from128k ( 686170 ) on Sunday June 26, 2005 @05:40PM (#12916477)
    In 1987, a Mac II came with a 40 MB drive. 17 years later, a PowerMac G5 came with a 160 GB drive. That is at least a 4000X improvement in storage density and price (and 1987's drive was both physically larger and more expensive than 2004's drive).

    Assuming the current rate of advance in storage density and price continues, future archivists should be able to buy a 0.64 PB drive for under $500 in 2021. A mere quarter of a million dollars will provide enough space for a copy of all that stuff.
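    The extrapolation is simple compound growth; in Python (a back-of-the-envelope restatement of the parent's figures, not a forecast):

    ```python
    mac_ii_1987 = 40e6          # bytes: 40 MB drive
    g5_2004 = 160e9             # bytes: 160 GB drive
    growth_per_17yr = g5_2004 / mac_ii_1987   # 4000x

    drive_2021 = g5_2004 * growth_per_17yr    # 640e12 bytes = 0.64 PB
    archive = 347e15                          # 347 PB target
    drives_needed = archive / drive_2021      # ~542 drives
    print(drives_needed, drives_needed * 500) # ~$271,000 at $500 per drive
    ```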
    • First, Moore's Law is about transistor density, which has nothing to do with hard drives. Secondly, hard drives haven't been getting any more reliable. That means all these hard drives have to be replaced every few years. It's a nightmare for long-term storage.
      • First, Moore's Law is about transistor density, which has nothing to do with hard drives. Secondly, hard drives haven't been getting any more reliable. That means all these hard drives have to be replaced every few years. It's a nightmare for long-term storage.

        You are right -- Gordon Moore spoke only of trends in the number of transistors/IC. Yet his law was, if anything, about advances in the technologies of miniaturization. This miniaturization has had profound, indirect effects on storage. The same
  • by dpbsmith ( 263124 ) on Sunday June 26, 2005 @05:41PM (#12916484) Homepage
    The Zapruder film was the beginning. In recent years, I've been dumbfounded by the vast extension in recording and documentation of things like crimes in progress, natural disasters, America's Funniest Home Videos, you name it. A plane crashes, and the next day there are ten different home videos from people in the vicinity who had camcorders.

    I believe the cost of traditional photography in constant dollars dropped enormously between my parents' time and mine. I know we took about ten times as many silver-on-paper and Kodacolor dye-on-paper snapshots as my parents did. Then we got a camcorder. My parents captured about three hours total of 8 mm silent home movies. I have about forty hours of 8mm and Digital8 camcorder tape.

    And since my wife and I got digital cameras, we've been taking five to ten times as many pictures as we did when we used film cameras.

    Now, YES, I'm on the format treadmill. Got most of the old 8mm movies transferred to VHS. Got most of the VHS transferred to DVD. Got a lot of the old slides scanned. Got most of my digital images burned to CD. In the last five years, I've probably spent a hundred hours, or 0.2% of my life, on nothing but struggling to copy from old formats to new. I've spent a small fortune getting Shutterfly to print pictures, because to tell the truth I have much more faith in the prints surviving than in the CDs.

    So, I don't see a digital dark age. I see a bizarre situation in which the quantity of material recorded in digital form continues to increase exponentially for quite some time. _Most_ of it will get lost, and the percentage that survives, say, a hundred years will keep going DOWN exponentially with time.

    But I'm guessing the total quantity of 21st century material available to historians of the 23rd century will, in absolute numbers, be just about the same as the total quantity of 20th century material.

    It's one of those mind-boggling things like personal death that one can never quite come to grips with. The future is unknown, and we can accept that. But the fact that most of the past is unknown is equally true--and very hard to accept.
  • In 2022, we'll probably have terabyte capacity in our mobile phones. Seriously. In the early 90s, 80 GB of drive space ran about $80,000 according to this archived historical document. [wired.com] Nowadays, I can get an 80 GB drive for about $65 according to Froogle, [google.com] and that's without considering inflation. Sure, at a conservative $1/GB we're looking at $347 million today, but in 17 years' time that'll probably look more like two or three hundred thousand bucks. No biggie for our bloated government.
  • by DarkEdgeX ( 212110 ) on Sunday June 26, 2005 @06:20PM (#12916705) Journal
    NARA needs to open up tons and tons of GMail accounts. Where do I send my invites so I can contribute?
  • It doesn't matter whether it is on paper or digital media. If someone isn't willing to spend the money to preserve it, it will be lost. I've seen decades' worth of project records and file libraries end up in the landfill because there was no budget or requirement for preserving them. It's sad to see the products of many years of work by talented people discarded like so much trash.

    To add insult to injury, slime-sucking lawyers now advise their clients to destroy records, like email, as soon as possible to…

    • heh, it's funny cuz it's true... i've "burned" many an email message once a contract was complete and the money was in my pocket... it's safer for me to put the money in my pocket and then deny all contact with a company than to actually archive all transactions if someone decides they do not like me any more. i.e. i remember one client who threatened legal action if i did not update his website after actually making said website for a certain fee... after asking him to prove it and due to a HD crash it was bet
  • or how many Volkswagen Beetles filled with DAT tapes?

    or how many beowulf clusters are needed to search it? sort it? :^)
  • How expensive is data storage, really? I'll design a ten petabyte (10PB) storage system. You'll see how much it costs. To build this monster machine, I'll be using commercial off-the-shelf hardware organised as a massive Linux cluster.

    You may ask "why do you want to build the most powerful Beowulf cluster on Earth when storage companies have all these amazing storage systems?" Well, this system needs to be an open solution. The system will need to grow and evolve as the needs change. Vendor lock-in is si
  • here:

    http://slashdot.org/comments.pl?sid=154005&cid=12917603 [slashdot.org]

    My opinion?

    The 21st century will disappear from history. In 500 years' time they will know more about the Italy of 1505 than the USA of 2005. Why? The records of Italy will still exist.

    The entire digital info system is based on the free ride of petroleum. Petroleum will basically disappear from society fairly soon, (either it will simply deplete, or will become too expensive to drill it out) and everything made of plastic and anything re

  • The people at the LHC have been planning for large data rates and storage requirements for quite a few years.

    The computational and data-storage requirements for the LHC experiments will be staggering, according to Jamie Shiers, leader of the Database Group in CERN's IT division. "We project 5 to 8 petabytes [PB] of data will be generated each year, the analysis of which will require some 100PB of storage [of which a large fraction will hopefully be online] and more computing power than that supplied by the…

  • If the current administration has its way, we have no business archiving anything.

    One of GWB's first acts was to lock down the Reagan administration's (and all subsequent administrations') data forever. The 12-year release cycle that the Ford Administration approved was revoked within weeks of Jan 2001 (some cynics say, to prevent data about Iran-Contra and GHWB's involvement becoming public - but that's just crazy talk).

    The only data less available than old parchment in a vault is random magnetic domains…
  • FTFA:
    "A new avalanche of records from the Bush administration--the most electronic presidency yet--will descend in three and a half years, when the president leaves office."
    The Bush Administration is also the most secretive presidency yet. It would certainly be interesting to be on the IT staff "archiving" that set of data. The IT boss would be amazed at how much free overtime his staffers were willing to do in the middle of the night...
  • I suspect 99.99% of this information is multiply redundant. With a good compression algorithm, it would fit onto a DVD or a CD or perhaps even a floppy.
  • I guess the first question is: why are we even keeping this data around? Give the historians something to argue about and delete some stuff.

"Imitation is the sincerest form of television." -- The New Mighty Mouse

Working...