Data Storage

Archiving Digital History at the NARA

Posted by timothy
from the sort-and-toss dept.
val1s writes "This article illustrates how difficult archiving is vs. just 'backing up' data. From the 38 million email messages created by the Clinton administration to proprietary data sets created by NASA, the National Archives and Records Administration is expecting to have as much as 347 petabytes to deal with by 2022. Are we destined for a 'digital dark age'?"

  • by reporter (666905) on Sunday June 26, 2005 @04:39PM (#12916140) Homepage
    National Archives and Records Administration is expecting to have as much as 347 petabytes to deal with by 2022. Are we destined for a "digital dark age"?

    Perhaps the answer is compression.

    Does anyone know whether there is an upper limit to text compression?

    In signal processing, there is a limit called the Shannon Capacity theorem, which gives the maximum amount of information that can be transmitted on a channel. In text compression, is there a similar limit?

    Note that the Shannon Capacity theorem does not tell you how to reach that limit; it merely tells you what the limit is. For decades, we knew from the theorem that the maximum rate on a normal twisted-pair telephone line is about 56,000 bits per second. However, we did not know how to reach it until trellis coding was discovered, according to an electronic communications colleague at the institute where I work.

    If we can calculate a similar limit for text compression, then we can know whether further research to find better text compression algorithms would be potentially fruitful. If we are already at the limit, then we should spend the money on finding denser storage media.
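For the channel side of this comparison, the limit in question is usually written as C = B * log2(1 + S/N). The sketch below just evaluates that formula. The 3,100 Hz bandwidth and 35 dB signal-to-noise ratio are assumed, illustrative values rather than figures from this thread, and the answer moves a lot when they change.

```python
import math

def shannon_capacity_bps(bandwidth_hz: float, snr_db: float) -> float:
    """Shannon-Hartley limit C = B * log2(1 + S/N) for a noisy analog channel."""
    snr_linear = 10 ** (snr_db / 10)  # convert decibels to a power ratio
    return bandwidth_hz * math.log2(1 + snr_linear)

# Hypothetical voice-line figures, chosen only for illustration.
capacity = shannon_capacity_bps(bandwidth_hz=3100, snr_db=35)
print(f"{capacity:,.0f} bits per second")
```

The formula gives only the ceiling; as the post says, it is silent on which coding scheme gets there.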

  • Re:347 petabytes? (Score:3, Informative)

    by OrangeSpyderMan (589635) on Sunday June 26, 2005 @04:54PM (#12916219)
    I haven't seen any software system that can reliably scale to that level and still make any kind of sense for someone that wants to find a piece of data in that haystack,

    Haven't you? Have you ever worked with real archiving before? IBM has some nice solutions that allow us to store data on disk and in a WORM library (Tivoli Storage Manager) and index it in a (large) Oracle DB. They work and scale just fine (in our experience, over a couple of hundred terabytes). You probably wouldn't want all that data in a single archive anyway, but I'd guess you'd know that if you'd ever archived anything....
  • Re:347 petabytes? (Score:3, Informative)

    by ravenspear (756059) on Sunday June 26, 2005 @04:56PM (#12916231)
    Well, considering that Spotlight took about 2 hours to index my 120 GB drive, that would be (347 * 1024^2 / 120) * 2 = about 6,064,000 hours = roughly 690 years to index that much data.

    Now I'm sure the gov would use a faster system than my laptop, but still!
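A quick check of that scaling, dividing the archive size by the 120 GB drive before multiplying by the 2 hours per drive. It assumes binary units (1 PB = 1024^2 GB) and that indexing time grows linearly with data size, which is a simplification. The function name is made up for this sketch. Under those assumptions the result comes out at roughly 690 years for a single laptop.

```python
# Scale a measured indexing rate up to the full archive.
ARCHIVE_PB = 347       # projected archive size from the article summary
DRIVE_GB = 120         # size of the laptop drive in the post
HOURS_PER_DRIVE = 2    # measured Spotlight indexing time in the post

def indexing_hours(archive_pb: float, drive_gb: float, hours_per_drive: float) -> float:
    """Hours needed if indexing time scales linearly with the amount of data."""
    archive_gb = archive_pb * 1024 ** 2   # assumed binary units: 1 PB = 1024^2 GB
    drives = archive_gb / drive_gb        # how many laptop-sized drives fit in the archive
    return drives * hours_per_drive

hours = indexing_hours(ARCHIVE_PB, DRIVE_GB, HOURS_PER_DRIVE)
years = hours / (24 * 365)
print(f"{hours:,.0f} hours, about {years:,.0f} years on one laptop")
```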
  • Re:347 petabytes? (Score:4, Informative)

    by CodeBuster (516420) on Sunday June 26, 2005 @05:00PM (#12916247)
    The most common structure used to index large amounts of data stored on magnetic or other large capacity media is the B-Tree and its variants. The article linked here [bluerwhite.org] explains the basic idea of the balanced multiway tree or B-Tree. The advantage of this type of index is that the index can be stored entirely on the collection of tapes, cartridges, disks or whatever else, while only the portion of the tree which is currently being operated on need be read into volatile or main memory. The B-Tree allows for efficient access to massive amounts of data while minimizing disk reads and writes. Theoretically, the B-Tree and its variants could be scaled up to address an unlimited amount of data in logarithmic time.
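To put a number on the logarithmic claim, here is a rough sketch that counts how many levels a tree needs when every node fans out to a fixed number of children. The function name and the fanout of 1,000 are assumptions for illustration. Real B-Trees keep nodes only partly full, so actual trees run somewhat deeper.

```python
def btree_levels(num_keys: int, fanout: int) -> int:
    """Smallest number of levels h such that fanout**h >= num_keys."""
    if num_keys < 1 or fanout < 2:
        raise ValueError("need at least one key and a fanout of 2 or more")
    levels, reachable = 1, fanout
    while reachable < num_keys:
        reachable *= fanout   # each extra level multiplies the addressable keys
        levels += 1
    return levels

# Hypothetical sizes: a trillion and a quadrillion indexed records.
for keys in (10**12, 10**15):
    print(keys, "keys ->", btree_levels(keys, fanout=1000), "node reads per lookup")
```

Each level costs about one read from the storage medium, which is why a wide fanout keeps lookups cheap even for very large collections.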
  • Records (Score:3, Informative)

    by Big Sean O (317186) on Sunday June 26, 2005 @05:17PM (#12916330)
    NARA makes a distinction between a document and a record. Any old piece of paper or email is a document, but a record is something which shows how the US government did business.

    For example, the email to my supervisor asking when I can take a week's vacation isn't a record. The leave request form I get him to sign is a record. An email about lunch plans: not a record. An email to a coworker about a grant application probably is.

    Besides obvious records (eg: financial and legal records), there are many documents that may or may not be records. For the most part, it's up to each program to decide which documents are records and archive them appropriately.
  • by zysus (123604) on Sunday June 26, 2005 @07:40PM (#12917052) Homepage
    Actually there is an upper limit...
    It comes from Shannon's work on information theory.
    Basically, information has entropy associated with it, entropy being a measure of the randomness of the information. Truly 100% random information cannot be compressed.
    The central idea has to do with the probability of each symbol occurring.
    Text compresses quite well because there are a limited number of symbols and certain letters (e, for example) are more common than others.
    If I encode e using 1 bit instead of 8, that saves 7 bits every time it appears.

    This is the idea behind Huffman Coding.

    Binary data... well, depends on the data.
    I ran into this at work... basically, I was trying to reformat some data to save space on disk and eventually figured out that bzip would accomplish the same thing.
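The Huffman idea mentioned above can be sketched in a few lines: repeatedly merge the two least frequent groups of symbols, so that rare symbols end up deeper in the tree and common ones get short codes. This is a minimal illustration that returns only code lengths, not the bit patterns. The function name and sample sentence are made up for the example.

```python
import heapq
from collections import Counter

def huffman_code_lengths(text: str) -> dict:
    """Return the Huffman code length in bits for each symbol in text."""
    counts = Counter(text)
    if len(counts) == 1:
        # A lone symbol still needs one bit per occurrence.
        return {next(iter(counts)): 1}
    # Heap entries: (weight, unique tie-breaker, {symbol: depth so far}).
    heap = [(w, i, {sym: 0}) for i, (sym, w) in enumerate(counts.items())]
    heapq.heapify(heap)
    next_id = len(heap)
    while len(heap) > 1:
        w1, _, a = heapq.heappop(heap)   # two least frequent groups
        w2, _, b = heapq.heappop(heap)
        merged = {s: d + 1 for s, d in {**a, **b}.items()}  # one level deeper
        heapq.heappush(heap, (w1 + w2, next_id, merged))
        next_id += 1
    return heap[0][2]

sample = "certain letters are more common than others"
lengths = huffman_code_lengths(sample)
total_bits = sum(lengths[ch] for ch in sample)
print(sorted(lengths.items(), key=lambda kv: kv[1]))
print(total_bits, "bits versus", 8 * len(sample), "bits at 8 bits per character")
```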
  • entropy (Score:2, Informative)

    by YesIAmAScript (886271) on Sunday June 26, 2005 @08:01PM (#12917133)
    You can calculate the amount of entropy in a document (text or not), and that is a limit to how small you could possibly make it.

    I don't recall how close modern methods like arithmetic encoding get to that limit, but I know it's close enough that we couldn't double the compression ratio of text documents from the current state of the art.
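As a rough illustration of that calculation, the sketch below measures entropy under the simplest possible model, where every byte is treated as independent of its neighbours. That modelling choice is an assumption of the example: compressors that exploit context can beat this order-0 figure, so it is a bound only for coders that look at one byte at a time. The function name is hypothetical.

```python
import math
from collections import Counter

def order0_entropy_bits(data: bytes) -> float:
    """Shannon entropy in bits per byte, treating each byte as independent."""
    counts = Counter(data)
    total = len(data)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

sample = b"An email about lunch plans: not a record."
bits_per_byte = order0_entropy_bits(sample)
print(f"{bits_per_byte:.2f} bits per byte under an order-0 model")
print(f"about {bits_per_byte * len(sample) / 8:.0f} bytes minimum versus {len(sample)} bytes raw")
```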

    Trellis coding is a system for dealing with induced errors in modem signalling. It allows you to cancel some of them out. It doesn't actually increase the throughput in an ideal situation.

    The thing that allowed us to reach the limit for a phone line is combined amplitude-phase coding, or the creation of the "constellation diagram" for modem encoding.

    The constellation defines certain combinations of phase and amplitude, each of which represents a group of bits (one symbol; the baud rate counts symbols per second). Trellis coding simply defines additional combinations that are not sent. If you see any of these on the receiving end, then you realize that the constellation is either being twisted (phase error) or shrunk/grown (amplitude error) and you can try to compensate for it.

    The name comes from a trellis, like the kind you grow plants on. The legal signals sent should go through the holes in the trellis. If you receive a signal that falls on the trellis itself, you adjust it so that it goes through a hole, and you assume the same adjustment factor can be applied to other, valid hits to determine more accurately the data that was sent.
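The constellation idea can be sketched with a plain 16-point square grid, where each complex number stands for one amplitude and phase combination carrying four bits. The grid, the bit-to-point mapping, and the function names are assumptions made for this example; it is not the constellation of any particular modem standard and it includes no trellis coding.

```python
import itertools

LEVELS = (-3, -1, 1, 3)  # assumed amplitude steps on each axis

# Map every 4-bit group to one point: two bits pick the real part,
# two bits pick the imaginary part.
CONSTELLATION = {
    bits: complex(LEVELS[bits[0] * 2 + bits[1]], LEVELS[bits[2] * 2 + bits[3]])
    for bits in itertools.product((0, 1), repeat=4)
}

def modulate(bits):
    """Return the constellation point for a group of four bits."""
    return CONSTELLATION[tuple(bits)]

def demodulate(received: complex):
    """Pick the bit group whose point lies closest to the received signal."""
    return min(CONSTELLATION, key=lambda b: abs(CONSTELLATION[b] - received))

sent = (1, 0, 1, 1)
noisy = modulate(sent) + complex(0.3, -0.4)  # small amplitude/phase disturbance
print(sent, "->", modulate(sent), "-> received", noisy, "->", demodulate(noisy))
```

Decoding to the nearest point is what lets small phase and amplitude errors be absorbed without corrupting the bits.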
