Archiving Digital History at the NARA 202
val1s writes "This article illustrates how difficult archiving is vs. just 'backing up' data. From the 38 million email messages created by the Clinton administration to proprietary data sets created by NASA, the National Archives and Records Administration is expecting to have as much as 347 petabytes to deal with by 2022. Are we destined for a "digital dark age"?"
16000 formats?!? (Score:4, Funny)
347 petabytes? (Score:5, Insightful)
I haven't seen any software system that can reliably scale to that level and still make any kind of sense for someone that wants to find a piece of data in that haystack, err. haybarn.
Re:347 petabytes? (Score:3, Informative)
Haven't you? Have you ever worked with real archiving before? IBM have some nice solutions that allow us to stock on disk and a WORM library (Tivoli Storage Manager) and index in a (large) Oracle DB - they work and scale just fine (our experience over a couple of hundred teras). You probably wouldn't want all that data in a single archi
Re: (Score:2)
Re:347 petabytes? (Score:4, Informative)
Re:347 petabytes? (Score:2)
Re:347 petabytes? (Score:2)
Note that they don't say which mailbox in the Clinton administration...
Re:347 petabytes? (Score:3, Informative)
Now I'm sure the gov would use a faster system than my laptop, but still!
Re:347 petabytes? (Score:2)
The funny thing is I got an A in Calc III last semester.
Re:347 petabytes? (Score:2)
also you're using modern processors and hard drives, by 2022 347 petabytes won't be anything when we all have terabyte hard drives... think about it, that's 17 years, how big/fast was your hard drive 17 years ago? Let's see... 1988... I didn't even have a hard drive, still all floppy.
By 2022 we'll all have hundreds-of-terabyte drives and be measuring transfer rates in GB/sec, if not larger/faster. Sor
Re:347 petabytes? (Score:2)
One thing though, wouldn't it still be linear for the entire process? I mean I understand what you are saying as far as the algorithm goes. It's not necessarily going to take twice as long for the algorithm that creates the index to run createIndex(a,b,c,d) compared to createIndex(a,b).
But you still have to scan twice as many files to derive the inputs. How could that part not be linear?
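The point about scanning can be sketched in Python. This is only an illustration under assumptions: createIndex is never defined in the thread, so build_index and the two sample memos below are hypothetical stand-ins. Building the index reads every document once, so that part grows linearly with the amount of text; what the index buys you is that later lookups no longer need a scan.

```python
from collections import defaultdict

def build_index(docs: dict[str, str]) -> dict[str, set[str]]:
    """Map each word to the set of document names that contain it."""
    index: dict[str, set[str]] = defaultdict(set)
    for name, text in docs.items():        # every document is read exactly once
        for word in text.lower().split():  # work is proportional to total words
            index[word].add(name)
    return index

# Hypothetical sample data, not taken from the article.
docs = {
    "memo1": "budget request for archive",
    "memo2": "archive storage budget",
}
index = build_index(docs)

# A lookup is a single dictionary access, independent of how many files exist.
print(sorted(index["archive"]))  # ['memo1', 'memo2']
```

Doubling the number of files doubles the build time, but the cost of each search afterwards stays roughly flat.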
Try to help correct others' math sans sarcasm. (Score:5, Insightful)
To you and the countless others on /. who offer their corrections in a similar tone: Yes, we get it, the parent poster goofed and you supplied a correction. Given the trivial context here, it's hardly a big deal and doesn't warrant sarcasm. Everyone makes mistakes, and plenty of people make mistakes in their work every day, including people who do work where lives are at stake. That's one reason why it is good to work with other people. In life it's far more important to be forgiving, keep things in perspective, and help other people without the wiseacre commentary and then move on.
Compression and moderation? (Score:2)
Data loss will always be a possibility (Score:5, Insightful)
Re:Data loss will always be a possibility (Score:4, Insightful)
The key, I think, is prioritization. We all do it individually (important stuff gets backed up many times and often, unimportant stuff perhaps never backed up), and NARA will have to do it too. I don't think backing up a president's email and backing up some minor whitehouse aide's email should have equal importance. The trick will be to come up with a reasonable prioritization scheme that will make the probability of losing the most important stuff very small.
True but... (Score:2)
I agree really, but I also find the problem with data is you never know until it's too late. The aide's email could be an international "smoking gun" lost forever vs. an eternally archived Presidential request for diet soda on Air Force One.
I feel that if you can't completely automate backups then the best thing is to give users easy access to backup resources for their own material
Re:True but... (Score:2)
I agree with this completely.
The article mentioned the selective retention of information as one possibility for coping with the massive amounts of data that need to be preserved.
I think that it would be a mistake to do this.
IMO, all data should be archived in bulk as soon as possible, and then scholars can work on indexing those portions that they deem impor
Re:Data loss will always be a possibility (Score:2)
Now I'm not saying we should all go back to Stone Age, but it does make you think about the irony of progress...
Re:Data loss will always be a possibility (Score:3, Funny)
True. But I hardly think Alexandria was lost to the tap of the Y key, a pregnant pause, then an "oops."
Re:Data loss will always be a possibility (Score:3, Funny)
Re:Data loss will always be a possibility (Score:2)
Thanks for your selflessness. Or perhaps your Obsessive-Compulsive Disorder.
Answer is Compression? (Score:5, Informative)
Perhaps, the answer is compression.
Does anyone know whether there is an upper limit to text compression?
In signal processing, there is a limit called the Shannon Capacity theorem, which gives the maximum amount of information that can be transmitted on a channel. In text compression, is there a similar limit?
Note that the Shannon Capacity theorem does not tell you how to reach that limit. The theorem merely tells you what the limit is. For decades, we knew that maximum limit on a normal telephone twisted pair is about 56,000 bits per second, according to the theorem. However, we did not know how to reach it until Trellis coding was discovered, according to an electronic communications colleague at the institute where I work.
If we can calculate a similar limit for text compression, then we can know whether further research to find better text compression algorithms would be potentially fruitful. If we are already at the limit, then we should spend the money on finding denser storage media.
Re:Answer is Compression? (Score:2)
1. Well, I'd be surprised as long as you don't make any assumptions about the statistical distribution of bits in the text you want to compress.
Re:Answer is Compression? (Score:2)
Which is obviously some hot gas coming from your posterior. Otherwise: 1 (the Holy bible, heavily compressed)
The amount of compression possible in a given string of numbers is inversely proportional to the amount of randomness in the input.
Re:Answer is Compression? (Score:2)
Depending on how specialized your data is, it might be a net win to do
Re:Answer is Compression? (Score:2)
Speaking of which, don't we have to consider indexing this megalith? And if things haven't changed *that* much since I was a DBA, you can easily have indexing that takes ten times the storage of the raw data itself. Better factor that in, too.
Re:Answer is Compression? (Score:2)
I can easily think of it as a compression scheme. If they wanted to have it communicate all of that information they could have devised "Morse Code", and actually spelled it out. This is obviously much shorter. The code they specially designed for this single use was exactly as described.
You can think of it as an indexing scheme if you feel like it, but that doesn't mean it's any less legi
Re:Answer is Compression? (Score:2)
Such a scheme wouldn't be very useful for general use, of course
Re:Answer is Compression? (Score:2)
Re:Answer is Compression? (Score:3, Interesting)
For example, if you have a text file with letters of equal probability (all letters have a probability
Re:Answer is Compression? (Score:2, Informative)
It is some of Shannon's work on Information Theory.
Basically, information has entropy associated with it. Entropy being the randomness of information. Truly 100% random information cannot be compressed.
The central idea has to do with the probability of something occurring.
Text compresses quite well because certain letters are more common than others and there are a limited number of symbols. (e for example)
If I encode e using 1 bit instead of 8 that saves 7 bits.
This is th
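The entropy idea in this comment can be sketched in a few lines of Python. This is a minimal illustration, not anything from the article: it measures order-0 (single-character) Shannon entropy, which ignores context between letters, so compressors that model context can beat this figure on real text. The function name entropy_bits_per_char and both sample strings are made up for the example.

```python
import math
from collections import Counter

def entropy_bits_per_char(text: str) -> float:
    """Order-0 Shannon entropy: average bits needed per character
    when each character is coded independently of its neighbours."""
    counts = Counter(text)
    total = len(text)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

# Hypothetical samples: skewed letter frequencies vs. a uniform spread.
skewed = "e" * 8 + "t" * 4 + "a" * 2 + "zq"   # 16 characters, 'e' dominates
uniform = "abcdefghijklmnop"                   # 16 characters, all equally likely

print(entropy_bits_per_char(skewed))   # 1.875 bits per character
print(entropy_bits_per_char(uniform))  # 4.0 bits per character
```

The skewed sample needs fewer than half the bits per character of the uniform one, which is the sense in which common letters make text compressible and uniformly random data is not.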
entropy (Score:2, Informative)
I don't recall how close modern methods like arithmetic encoding make it to that limit, but I know it's close enough that we couldn't double the compression ratio of text documents from the current state of the art.
Trellis coding is a system for dealing with induced errors in modem signalling. It allows you to cancel some of them out. It doesn't actually increase the throughput in
Re:Answer is Compression? (Score:2)
Re:Answer is Compression? (Score:2)
That, of course, strongly depends on the entropy of the text to be compressed. When you're talking about the current president's email, well, there can't possibly be a whole lot of entropy in there, so it should be really easy to compress.
Re:They use TIFF? (Score:2)
ha (Score:3, Funny)
You'll have to pry it from my cold, dead hands!
Ohhhh, NARA, not NRA....
Retain it all. (Score:2, Insightful)
Google to the rescue!!! (Score:4, Funny)
nara.google.com
Oh, wait... I'm getting ahead of myself...
Difference between data and trash (Score:5, Insightful)
Re:Difference between data and trash (Score:2)
The information I need is preserved in an easily accessible form because I made a decision to make all my class notes organised, and as a result I've replaced
Re:Difference between data and trash (Score:2)
Just an idea...
Re:Difference between data and trash (Score:2)
In the story they talk about multiple revisions of Word documents written by leaders, and photos of the effects of Agent Orange. Do you consider those things "crap"?
The fact is, the government is huge, and there is a hell of a lot of important information to be saved over the years.
Re:Difference between data and trash (Score:2)
Re:Difference between data and trash (Score:2)
Besides that, revisions are very, very small, so it's not as if storage is a real problem. When your 500GB hard drive is full, you don't go through and delete all your unneeded text files first,
Maybe it should have been 45 million e-mails (Score:2)
http://archives.cnn.com/2000/ALLPOLITICS/stories/
Dark Ages (Score:5, Insightful)
I think more accurately, we are headed towards an age of super-saturation of information. I have no doubt we can store all the data we are currently generating and will generate. The question is how do we process it into something meaningful? Just because we have the ability to archive everything does not mean it will be useful to the [insert personally welcomed overlord] of the future.
Maybe historians of the future will be fascinated that Clinton's instant-message signoff was "l8ter d00d", but I doubt it. We'll want to save everything now of course, because we can. But the majority of the information I suspect will just be filtered out when actually searched.
Personally, I take the "you never know" ideology and save everything.
Re:Dark Ages are ahead! All aboard (Score:2, Funny)
Re:Dark Ages are ahead! All aboard (Score:2)
So historians in 2100 will have to wade through various trolls and defacement attempts to try to get what people thought about in 2005 - but at least they'll know not to click on Goatse links [wikipedia.org].
Re:Dark Ages (Score:2)
Not a dark age... was the past so bright? (Score:5, Insightful)
Digital technologies mean that archivists now enjoy orders of magnitude more potential accessibility than in the past. Even if paper has greater innate archival lifespan, its physical form makes it inaccessible to all but a select monkish class of archivists colocated with their paper archives. Even the select few archivists who are allowed access to paper archives can only effectively process at best a dozen documents per minute (and only a dozen per hour if they must wander the files to find randomly dispersed documents).
By contrast, digital technologies radically expand access on two dimensions. First, technology expands the number of people that can access an archive in terms of distance -- a remote researcher can have full access, including access to documents in use by other archivists. A low cost to copy documents means a wealth of information. Second, search tools provide prodigious access to the files -- searching/accessing/reading thousands or millions of documents per second.
To say we face a dark age is to presume that paper documents provided far more enlightenment and comprehensiveness of documentation than paper ever actually did.
Re:Not a dark age... was the past so bright? (Score:2)
Cost-of-copy and modes of failure (Score:3, Interesting)
Perhaps, perhaps not. Sure, digital data can be lost easily, but it can also be copied/backed-up more easily. Assuming $0.01/page for paper copy (a gross underestimate of the cost of paper, toner, and labor for copies) and assuming 10 kB data/page (an overestimate), $10/GB (for high-end maintained storage), then cost ratio is at least 100:1 in favor
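The arithmetic behind that 100:1 figure can be checked with a short Python sketch. The three prices and sizes are the ones quoted in the comment; the variable names and the choice of a decimal gigabyte (10^9 bytes) are assumptions made for the example.

```python
# Figures quoted in the comment above; names are illustrative only.
PAPER_COST_PER_PAGE = 0.01      # dollars per photocopied page
BYTES_PER_PAGE = 10_000         # 10 kB of data per page
DISK_COST_PER_GB = 10.0         # dollars per GB of maintained storage
BYTES_PER_GB = 1_000_000_000    # decimal gigabyte (an assumption)

pages_per_gb = BYTES_PER_GB / BYTES_PER_PAGE            # pages that fit in 1 GB
paper_cost_per_gb = pages_per_gb * PAPER_COST_PER_PAGE  # cost to copy them on paper
ratio = paper_cost_per_gb / DISK_COST_PER_GB

print(f"{pages_per_gb:,.0f} pages per GB")
print(f"paper: ${paper_cost_per_gb:,.2f} per GB, disk: ${DISK_COST_PER_GB:,.2f} per GB")
print(f"paper-to-digital cost ratio is about {ratio:.0f}:1")
```

Since the comment calls the paper price an underestimate and the page size an overestimate, the real ratio would be larger than what this prints.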
Answer is not compression, it's less data. (Score:3, Insightful)
The answer to archiving the required volumes is producing less volume. Case in point... we recently spent a week or so at work optimising a process that was I/O bound. The bugger took 10 hours to run. Although purchasing faster disks, converting to RAID0, and other techniques did whittle down the execution time to about 5 hours, the final solution was to redefine the process to reduce the actual IO (removed a COBOL sorting stage in the process), and the process is now 2 hours.
Bottom line: with the 100 + 38 million dollars (FTFA) assigned to the project I am sure I could eliminate a number of redundant positions, optimise some communication channels, retire voluminous individuals, replace inefficient protocols/people, and basically reduce the sources of data. Hell, if the US were to actually have peace instead of demand it, there would be a much reduced need for military intelligence, political rhetoric, and other civil responsibilities. The military could be half the size, and what do you know, we could not only reduce the requirement for archiving, but could actually save money in the process.
Remember, government is a self-supporting process.
Go ahead, mark me a troll.
gus
Re:Answer is not compression, it's less data. (Score:2, Insightful)
That's only a 60% reduction. A 60% reduction of 347 PB is still 138.8 PB...still a huge archival task.
Keeping just 1% of the data still leaves you with 3.47 PB. Not impossible, but still a daunting task.
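Both figures are easy to verify. A minimal Python sketch, assuming only the 347 PB projection from the summary; the helper name remaining_pb is made up for the example.

```python
TOTAL_PB = 347  # projected archive size from the summary

def remaining_pb(total_pb: float, reduction: float) -> float:
    """Data left over after cutting the given fraction (0.0 to 1.0)."""
    return total_pb * (1 - reduction)

print(f"{remaining_pb(TOTAL_PB, 0.60):.1f} PB left after a 60% reduction")
print(f"{remaining_pb(TOTAL_PB, 0.99):.2f} PB left when keeping only 1%")
```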
Re:Answer is not compression, it's less data. (Score:2)
I'm sure I could do that in
burn, knowledge, burn (Score:3, Interesting)
Really, it's only the great works of artistry that need to be retained and remained, sustained and maintained. Historically, it's interesting to catalogue art, but politics? The everyday communications that lead up to the horrible decisions that lead our politicians to make the mistake of the daily business? We want records of this?
Perhaps the easiest way of keeping this knowledge at all interesting or inspiring is to burn it regularly, let people imagine what happened to allow such blunders or let apologists spin tales of delight explaining elegant solutions to how stupid people stumbled upon genius decisions. Conspiracy theorists or intellectual artistry can probably generate far greater truths than the truth will ever reveal.
It would save a great deal of money too, just having a delete key. If we are going to care so little for the decisions in the here and now, why preserve the information to be twisted by people in the future with their own biases and projects? We seem to care so little for truth nowadays, why should that change in the future?
Re:burn, knowledge, burn (Score:2, Interesting)
Re:burn, knowledge, burn (Score:5, Insightful)
Absolutely, yes!
History is often taught as "Charlemagne took over Constantinople in the year 12xx" as though military feats really mattered to the average Joe. But, the truth is, America was colonized by people who thought that, however bad it might be in a virgin land, it was BETTER than their lives in Europe.
One of the key failures in public education today is to communicate the understanding that history is comprised mostly of PEOPLE doing ORDINARY things in their time to make life better for themselves and their families. They loved, worked, got bored, and cracked jokes at the expense of their leaders, just like we do today.
History doesn't consist of battles, anymore than history consists of artworks. Capturing more detail in the average, everyday lives of people gives a much better understanding to the cultural norms, and the ideals to which people aspired.
The pyramids of ancient Egypt provide a clear, artistic monument to their culture, yet we have only a modest understanding of their day-to-day cultures. Similarly, we have Stonehenge as a clear monument to the grooved-ware people of the English isles, but almost NO understanding of who they were and what they felt was important. How much would a true historian give to understand the day-to-day culture of these mysterious "grooved-ware" people of ancient times?
Those memos and IMs comprise that understanding of people today.
Re:burn, knowledge, burn (Score:2)
It is only through examining the artistry, the great works, the monuments that withstand the test of time possibly, for those are the things which were the attempts of that culture to enrich themselves. Perhaps in
Re:burn, knowledge, burn (Score:2)
Re:burn, knowledge, burn (Score:2)
Re:burn, knowledge, burn (Score:2)
So? (Score:3, Insightful)
contract for archiving system (Score:2)
I wonder what kind of searches and cross-linking will be done, for instance. What kinds of access control will there be? I'd also just like to see what the 16,000 formats are, out of curiosity. Sounds like a project waaaay larger than the $136 million they've allotted for it so far.
Stupid name.... i'm guessing they w
Have a look at the Fedora Project (Score:4, Funny)
(Not to be confused with the Linux distribution)
From the website, Fedora is "a general purpose repository service...devoted to...providing open-source repository software that can serve as the foundation for many types of information management systems".
Problem for some is that Fedora can be a little hard to grok. It's not an out-of-the-box repository to install and run, like the repository application mentioned in the article (DSpace). It's an architecture for building repository software. Once you understand the potential for building applications on top of Fedora, you start to see some light at the end of the tunnel for just the sort of issues the article raises.
Re:Have a look at the Fedora Project (Score:2)
Fedora is oriented toward digital library work, which I suspect has some carry over with archival work at NARA. They would be wise to look at it, but I'll say from our personal experience, it is a major task to get our materials into Fedora. I don't mean this in any way t
Relevant, interesting post (Score:5, Funny)
And don't give me shit about my karma or whatever. My karma's fine, I don't care about it. I'm copying this because it's interesting and contributes to the discussion.
What do you think about Ralph's thoughts?
Re: (Score:3, Insightful)
Re:Relevant, interesting post (Score:2)
I think that must be a bad sign.
And the last stupid joke I tried to make (a goddamn PUN) got modded "interesting".
Sigh.
Re:Relevant, interesting post (Score:2)
And the last joke I tried to make got modded "interesting". It was a goddamn PUN, people! There was nothing interesting about it!
Sigh.
Watch this be modded "anti-semitic" or something.
Re:Relevant, interesting post (Score:2)
"My cat's breath smells like cat food." - Ralph
I think he means it.
Slightly overdramatic? (Score:2, Insightful)
Tanks for the Memories (Score:3, Interesting)
industrial espionage would be sillier (Score:2, Funny)
"What do you want??"
"That Gem...and the Holograms."
Re:Tanks for the Memories (Score:2)
Records (Score:3, Informative)
For example, the email to my supervisor asking when I can take a week's vacation isn't a record. The leave request form I get him to sign is a record. An email about lunch plans: not a record. An email to a coworker about a grant application probably is.
Besides obvious records (eg: financial and legal records), there are many documents that may or may not be records. For the most part, it's up to each program to decide which documents are records and archive them appropriately.
the more I think about it... (Score:3, Insightful)
I'd much rather see those hundreds of millions of dollars invested in, for instance, making all out of print recordings and books available on-line. It's a smaller problem (sounds like), but would benefit the world much more than online copies of every government employee's timecard records.
.
strip MS HTML from Outlook mails (Score:5, Funny)
Moore's Law saves the day (Score:3, Interesting)
Assuming we continue the current rate of advance in storage density and price, future archivists should be able to buy a 0.64 PB drive for under $500 in 2021. A mere quarter of a million dollars will provide enough space for a copy of all that stuff.
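The quarter-million figure follows directly from the numbers in the comment. A rough Python sketch, taking the projected 0.64 PB drive and $500 price at face value; it covers a single raw copy only and ignores redundancy, enclosures, power and staff, none of which the comment prices.

```python
import math

ARCHIVE_PB = 347     # projected archive size from the summary
DRIVE_PB = 0.64      # the comment's projected drive capacity for 2021
DRIVE_PRICE = 500    # dollars per drive, the comment's upper bound

drives_needed = math.ceil(ARCHIVE_PB / DRIVE_PB)  # round up: no partial drives
total_cost = drives_needed * DRIVE_PRICE

print(f"{drives_needed} drives, ${total_cost:,} for one copy")
```

That works out to 543 drives and $271,500, which is where "a mere quarter of a million dollars" comes from.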
Re:Moore's Law saves the day (Score:2)
Moore's Law and storage (Score:2)
You are right -- Gordon Moore spoke only of trends in the number of transistors/IC. Yet his law was, if anything, about advances in the technologies of miniaturization. This miniaturization has had profound, indirect effects on storage. The same
I'm guessing... steady state. (Score:4, Interesting)
I believe the cost of traditional photography in constant dollars dropped enormously between my parents' time and mine. I know we took about ten times as many silver-on-paper and Kodacolor dye-on-paper snapshots as my parents did. Then we got a camcorder. My parents captured about three hours total of 8 mm silent home movies. I have about forty hours of 8mm and digital-8 camcorder tape.
And since my wife and I got digital cameras, we've been taking five to ten times as many pictures as we did when we used film cameras.
Now, YES, I'm on the format treadmill. Got most of the old 8mm movies transferred to VHS. Got most of the VHS transferred to DVD. Got a lot of the old slides scanned. Got most of my digital images burned to CD. In the last five years, I've probably spent a hundred hours, or 0.2% of my life, on nothing but struggling to copy from old formats to new. I've spent a small fortune getting Shutterfly to print pictures, because to tell the truth I have much more faith in the prints surviving than the CD's.
So, I don't see a digital dark age. I see a bizarre situation in which the quantity of material recorded in digital form continues to increase exponentially for quite some time. _Most_ of it will get lost, and the percentage that survives, say, a hundred years will keep going DOWN exponentially with time.
But I'm guessing the total quantity of 21st century material available to historians of the 23rd century will, in absolute numbers, be just about the same as the total quantity of 20th century material.
It's one of those mind-boggling things like personal death that one can never quite come to grips with. The future is unknown, and we can accept that. But the fact that most of the past is unknown is equally true--and very hard to accept.
Yeah, but that's 17 years away. (Score:2)
The Solution: (Score:3, Funny)
Money (Score:2)
To add insult to injury, slime-sucking lawyers now advise their clients to destroy records, like email, as soon as possible t
Re:Money (Score:2)
347 petabytes = ? Libraries of Congress (Score:2)
or how many beowulf clusters are needed to search it? sort it?
Very Large Storage Array (Score:2)
You may ask "why do you want to build the most powerful Beowulf cluster on Earth when storage companies have all these amazing storage systems ?" Well, this system needs to be an open solution. The system will need to grow and evolve as the needs change. Vendor lock-in is si
I was talking about that in another arena: (Score:2)
http://slashdot.org/comments.pl?sid=154005&cid=12917603 [slashdot.org]
My opinion?
The 21st century will disappear from history. In 500 years' time they will know more about Italy of 1505 than the USA of 2005. Why? The records of Italy will still exist.
The entire digital info system is based on the free ride of petroleum. Petroleum will basically disappear from society fairly soon, (either it will simply deplete, or will become too expensive to drill it out) and everything made of plastic and anything re
Re:I was talking about that in another arena: oops (Score:2)
the link should have been:
http://www.amazon.com/exec/obidos/tg/sim-explorer/explore-items/-/0670033375/0/101/1/none/purchase/ref%3Dpd_sxp_r0/103-5019446-5179842 [amazon.com]
RS
Please contact CERN (Score:2)
The Goal is Data Loss (Score:2)
One of GWB's first acts was to lock down the Reagan administration's (and all subsequent administrations') data forever. The 12 year release cycle that the Ford Administration approved was revoked within weeks of Jan 2001 (some cynics say, to prevent data about Iran-Contra and GHWB's involvement becoming public - but that's just crazy talk).
The only data less available than old parchment in a vault is random magnetic domain
Electronic Presidency (Score:2)
Redundant (Score:2)
Who really needs all of this data? (Score:2)
Re:Usually when I archive... (Score:2, Interesting)
Re:Usually when I archive... (Score:2)
Every mail is sacred (Score:3, Insightful)
If a mail is wasted
The gods get quite irate
Every mail is wanted
Every mail is good
Every mail is needed
In your network neighborhood
Really, the idea of not being able to record and save every post-it note being equated with those times and places where writing itself was denigrated into virtual nonexistence is a bit silly.
KFG
Agreed (Score:2)
Governmental psychosis is costly.
.
-shpoffo
Re:Why do we need to archive everything? (Score:4, Insightful)