Data Archiving Standards Need To Be Future-Proofed
storagedude writes: Imagine in the not-too-distant future, your entire genome is on archival storage and accessed by your doctors for critical medical decisions. You'd want that data to be safe from hackers and data corruption, wouldn't you? Oh, and it would need to be error-free and accessible for about a hundred years too. The problem is, we currently don't have the data integrity, security and format migration standards to ensure that, according to Henry Newman at Enterprise Storage Forum. Newman calls for standards groups to add new features like collision-proof hashes to archive interfaces and software.
'It will not be long until your genome is tracked from birth to death. I am sure we do not want to have genome objects hacked or changed via silent corruption, yet this data will need to be kept maybe a hundred or more years through a huge number of technology changes. The big problem with archiving data today is not really the media, though that too is a problem. The big problem is the software that is needed and the standards that do not yet exist to manage and control long-term data,' writes Newman.
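In practice, a "collision-proof hash" boils down to recording a collision-resistant digest for every archived object and re-checking it on every read and migration. A minimal sketch of that idea, assuming SHA-256 as the hash (the article doesn't mandate a particular algorithm or interface):

import hashlib

def fingerprint(path, algo="sha256", chunk_size=1 << 20):
    """Return a collision-resistant digest of a file's contents."""
    h = hashlib.new(algo)
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

# Record the digest alongside the archived object; re-hash on every read
# or migration and compare, to detect silent corruption.

Re-hashing on access catches silent corruption; catching deliberate tampering additionally requires storing the digests somewhere an attacker can't rewrite them.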
Nope (Score:2, Offtopic)
While there certainly is an issue with data integrity and retention, it is unlikely that anyone will need their entire DNA sequence "stored" for future use. It's becoming clear that the DNA you're born with isn't the same as the DNA you have when they recycle you. Further, medicine doesn't need your entire genome. Just the part that the doctor (or whatever they're called at that point in time) is interested in.
It is far more likely that you will be resequenced as needed.
Besides, you won't be able to afford it anyway.
Re: (Score:1)
Besides, you won't be able to afford it anyway.
Why not? Whole genome sequencing is already down to a few thousand dollars. Within the next decade it will almost certainly have dropped below a thousand. And there will be standard analysis pipelines (hopefully some of which are freely available and open source) to check for the most common pathogenic mutations. Now, paying an expert to do a custom analysis could easily reach into the hundreds of thousands of dollars. But I'm just not seeing why the sequencing itself would be unaffordable.
Many other reasons to store data (Score:2)
While you may be right about the current use we have for DNA, it's very likely that medicine will have many more uses for it in the future. Prices on genome sampling are going down rapidly too, so it's a reasonable example of why we might want to store data error-free for at least a century.
There will be many more things we want to store. Remember all those old city records and paper books? The newspaper archives? Early 20th-century cellulose film? All those data sources have their problems a
FITS standard (Score:2)
Keep your important data on current storage. (Score:5, Insightful)
Keep your important data on current mainstream storage. This is the only way to preserve it - copy data from old disks to new disks whenever you upgrade.
Of course, at each upgrade you can also discard a lot of data that isn't necessary, but pictures and similar things should be preserved. Data formats for images have been stable for decades. Even though some improvements have occurred, a 25-year-old JPEG is still viewable.
However, some document formats have to be upgraded to the latest version, since Microsoft in particular has a tendency to "forget" its old versions. You may still lose some formatting, but the content of the documents is the important part.
Re: (Score:2, Informative)
JPEG wasn't standardised until 1992. There are no 25-year-old JPEG files. Things have changed a lot since 1989.
Re: (Score:2)
But seriously, JPG is everywhere now but how long will it last? Could you read pre-JPEG image formats? Do you have software that will open PhotoCD, or PBM, or XBM or IFF in all their variants? I expect some formats like DNG will be around for a while, but the XMP processing instructions contain a "process version", and how long will software continue to support the process versions we use today? Data security really isn't straightforward when you don't know what the future holds.
Re: (Score:2)
Yep, out of the box my Windows 7 laptop could read GIF89a and Targa formats. Pretty sure yours could too.
Re: (Score:1)
You're picking easy formats. What about Macintosh PICT, including vector information (not just a bitmap PICT)? What about WMF? I could see both those formats becoming effectively unusable within a decade, as they depend on the drawing API/environment of ancient operating systems (classic MacOS and Windows 3).
Re: (Score:2)
ImageMagick doesn't support PICT with vector information. I'm just trying to make the point that even if a format seems to be widespread now, it may become effectively useless in the future. Believe me, PICT files were everywhere in the classic Mac days.
Re: (Score:1)
Which goes to show that you'd better not use proprietary formats that only one software vendor supports for archiving purposes, no matter how "everywhere" they are at a specific point in time.
Re: (Score:1)
And LBM files were everywhere in the DOS/Amiga days due to artists using Deluxe Paint for graphics. Still, your point is about implementation details and is barking up the wrong tree. At that point in time when (I believe you) PICT files were everywhere in the classic Mac days, were there not ANY open source libraries for reading them? If there were none, you are screwed anyway due to storing your documents in a closed format; nothing can save your soul (maybe a VM with that crap closed-source software). Yo
Re: (Score:2)
PICT is a legacy Mac format, precursor to PDF. WMF is tangentially similar in that it uses function calls (PICT uses opcodes) to "draw" a scalable image; however, the WMF specification continues to be updated to this day (last update was in February?).
You couldn't use either on a RISC box running on RISC OS 3.1 (without plugins and/or serious hacking), so for me they're both useless for archiving right out the gate.
You want an open vector standard such as SVG (for the simple reason that future systems will b
Re: (Score:3)
Yeah, SVG renderers have more chance of being around in 50 years than WMF or PICT. But you still need to actively go through your data archives, find things in "endangered" formats, and migrate them to more future-proof formats. This requires substantial effort that increases as the collection grows. Then there's verifying that nothing was lost in the conversion to consider.
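For raster formats, one hedged way to verify a lossless migration is to compare decoded pixels before and after conversion, for example with Pillow. A rough sketch; it only works while a decoder for the old format still exists (which is exactly the window you're trying not to miss), and it doesn't apply to lossy or vector conversions:

from PIL import Image  # Pillow; assumes decoders for both formats are installed

def pixels_match(original_path, converted_path):
    """Return True if both files decode to identical pixel data."""
    a = Image.open(original_path).convert("RGBA")
    b = Image.open(converted_path).convert("RGBA")
    return a.size == b.size and list(a.getdata()) == list(b.getdata())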
Re: (Score:2)
Archiving serious amounts of data does require careful forethought. Actually, I would say that archiving your photo collection requires as much forethought. Futureproofing is but one facet of the problem; you've also got disaster preparedness among a great many other things to consider. Storage media, not just the file format, is another. Will a floppy disk drive be available in fifty years? How about a five inch optical disc reader? Quarter inch tape? DAT? Vinyl? Etched steel plate? How resilient is your s
Re: (Score:2)
Not WMF. I had to write a WMF module to generate graphics commands for a laser printer ages ago. It wasn't that hard.
Actually, I'm reasonably sure that WMF or a descendant of it is used for the device-independent spool format on modern versions of Windows, since it's basically a recording of the GDI commands.
Still, a better bet would be to convert those WMFs to Postscript format if you want real longevity.
My votes for things most likely to still be decodable 1000 years from now are PDF/Postscript, JPEG
Re: (Score:2)
it's basically a recording of the GDI commands.
There were a number of WMF exploits just because of this - because the WMF parser had insufficient bounds checking and you could pass malformed input directly to the Win32 API just by sending someone a picture.
This is also part of the reason that Microsoft Office Open XML isn't an implementable standard - because it contains a bunch of stuff that boils down to "call the Windows API".
Re: (Score:2)
Re: (Score:3)
JPEG wasn't standardised until 1992. THere are no 25-year-old JPEG files. Things have changed a lot since 1989.
So what's your point? I have GIF images that predate 1989. They still render just fine. I could convert them if I felt the need. I don't, because the formats are indeed "stable".
Re: (Score:2)
What I want to know is, whatever happened to fuse-based PROMs, and why can't we use similar technology to store important data? I have to believe that, with current technology, we could create PROMs with a density at least as high as current USB keys, and since they're just microscopic wires in a hermetically sealed package, they'd last basically forever.
Re: (Score:1)
Well, maybe fuse based PROMs with dies the size of a 12" record album jacket.
Re: (Score:1)
BSOD DNA=TMNJ
Re: (Score:3)
Seriously, what's wrong with the MS Word .doc format? Feature complete, stable, lots of free implementations.
Because it's not feature complete (otherwise Microsoft wouldn't keep adding features), it's not stable, and the free implementations aren't completely compatible.
data archiving format in 500 years; but wouldn't be surprised if a good old-fashioned .doc works just fine.
You can have trouble opening a .doc from a few years ago......
Re: (Score:1)
I really really really hope you're just trying to troll.
Re: (Score:2)
Even MS can't say exactly what that spec is. Sure, there's an alleged standard but Word never actually followed it and in spite of over 1000 pages of documentation, it's incomplete.
Re: (Score:1)
uh... because the MS Word .doc format is a proprietary binary format that's closed up tighter than a spinster's snizz? MS Word is not, never has been, and never will be a legitimate document exchange format, and it's so far away from an archival format it's not funny.
Future proofing a document in my experience has involved the following:
removing unnecessary formatting;
removing unnecessary whitespace;
if images are absolutely essential, supply them in uncompressed and/or lossless format (i.e. TIFF, GIF89a (although th
Punch cards (Score:2)
Re: (Score:2)
Glass master CDs? Anything that's sufficiently shielded, and the shielding isn't actually all that hard to make?
Re: (Score:2)
It's the end of the world, how will you save your data?
Re:Punch cards (Score:5, Insightful)
The ultimate strategy is to duplicate it in so many different areas that at least one of them survives. Preferably multiple ones.
The more critical the data, the more spots you duplicate it in.
Though you have to realize that eventually everything will be lost.
Re: (Score:1)
Re: (Score:1)
Will a CD-ROM survive at 400 degrees Fahrenheit? Punch cards and rocks will.
Re: (Score:2)
Re: (Score:3)
What is the high temperature limit for optical media?
Will a CD-ROM survive at 400 degrees Fahrenheit? Punch cards and rocks will.
But what about 451 degrees Fahrenheit? You're down to rocks at that point.
Re: (Score:3)
Stamped/punched stainless steel sheets would probably be about the best option, if you want something to really stick around. Less brittle than rock carvings too.
Re: (Score:3)
What other storage medium, besides rock carving, can survive an EMP blast?
Nearly all of them. Flash media, including SD cards, SSDs, etc., should survive. An HDD that is powered off should survive. The biggest threat is to anything that is connected to mains power. The power supply in your desktop computer may die, but a powered-off laptop should be fine.
More than just data (Score:2)
Re: More than just data (Score:3)
We store genomic variation data in VCF files - it's just tab-delimited text.
Re: (Score:1)
2075: "A tab? How quaint"
Re: (Score:2)
Re: (Score:3)
More importantly: it's a regular, repeating sequence that would visibly separate variable data.
Even with no knowledge of what a tab is, it would be obvious in analysing the data that it was doing something special. Anyone with some knowledge of DNA's structure would be able to infer the rest.
Re: More than just data (Score:2)
Yep - it'll be way easier to view genomic data in the future than an excel document. Bioinformaticists are lazy, so we store everything as text :)
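For what it's worth, that's why VCF stays readable with nothing but a standard library. A rough sketch of a reader, with the column layout (CHROM, POS, ID, REF, ALT, ...) taken from the VCF spec:

import gzip

def read_vcf(path):
    """Yield (chrom, pos, ref, alt) from a plain or gzipped VCF file."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as handle:
        for line in handle:
            if line.startswith("#"):              # meta-information and header lines
                continue
            chrom, pos, _id, ref, alt = line.rstrip("\n").split("\t")[:5]
            yield chrom, int(pos), ref, alt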
There is a lot we need for long term archiving (Score:5, Informative)
The problem is that we do have formats that work for long term archiving, but they are limited to a platform and are not open, so decoding them in the future may be problematic.
WinRAR is one example. It has the ability to do error detection and correction with recovery records. However, it is a commercial product.
PAR records are another way, but they are a relatively clunky mechanism for long term storage.
Even medium term storage on disk/tape can be problematic:
There is one standard for backup programs for tape, and that is tar. Very useful format, but zero error correction or detection, other than reading and looking for hard errors. There are tons of backup programs that work with tapes. Networker, TSM, NetBackup, and many others come to mind, all using a different format. Of course, once you get the program, there is still finding the registration key, and some programs require online activation (which means that when the activation servers get shut off, you can never do a restore from scratch again). We need one archive-grade standard for tape, perhaps with a standard facility for encryption as well.
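Until such a standard exists, a common stopgap is a sidecar checksum manifest next to the tar image. A sketch of that convention (ad hoc, not part of the tar format; the manifest happens to be readable by sha256sum -c):

import hashlib
import tarfile

def archive_with_manifest(paths, tar_path):
    """Write a tar archive plus a sidecar list of per-file SHA-256 digests."""
    with tarfile.open(tar_path, "w") as tar, open(tar_path + ".sha256", "w") as manifest:
        for path in paths:
            tar.add(path)
            digest = hashlib.sha256(open(path, "rb").read()).hexdigest()
            manifest.write(f"{digest}  {path}\n")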
Same with disks. It wasn't until recently that there was any bit rot detection in filesystems at all. Now with ReFS, Storage Spaces, ZFS, and btrfs, we can tell if a file is damaged... but none of the filesystems have the ability to store ECC for an entire file (other than ZFS with its ditto blocks). It would be nice to have part of a filesystem be a large area for ECC on a block basis. It would take some optimization for performance, but adding ECC in the filesystem is geared more toward long term storage than day to day file I/O.
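As a toy illustration of the difference between detection and correction at the block level (a real filesystem would use Reed-Solomon codes or redundant copies, not a single parity block):

import hashlib

BLOCK = 4096

def protect(data):
    """Per-block SHA-256 checksums for detection, plus one XOR parity block for repair."""
    blocks = [data[i:i + BLOCK] for i in range(0, len(data), BLOCK)]
    checksums = [hashlib.sha256(b).hexdigest() for b in blocks]
    parity = bytearray(BLOCK)
    for block in blocks:
        for i, byte in enumerate(block):
            parity[i] ^= byte
    return blocks, checksums, bytes(parity)

A block whose checksum no longer matches can be rebuilt by XOR-ing the parity block with every other intact block - RAID-5 style, so it only survives a single damaged block per stripe.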
Finally there is paper. Other than limited stuff with QR codes, there isn't any real way to print a document onto paper, then scan it to get it back. There was a utility called Paperbak that purported to do this, offering encryption, error correction, various DPI codes, and so on. It printed well, but could never scan and read any of the documents printed, so it is worthless. What is needed is something like the Paperbak utility, but with much more robust error detection (like checking whether blocks are at an angle, similar to how QR codes can be scanned from any direction). This utility would have to be completely open for it to have any use at all. However, if it could print small documents to paper, it would help greatly in some situations, such as recovering encryption keys, archived tax documents, and so on.
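A crude, text-only cousin of that idea is to print numbered, individually checksummed lines so that a misread line can be rejected and re-scanned or retyped. This is detection only; a real tool would add Reed-Solomon parity lines for correction. Something like:

import base64
import zlib

CHUNK = 48  # payload bytes per printed line; small so a bad line costs little

def to_printable_lines(data):
    """Encode data as numbered base32 lines, each carrying its own CRC32."""
    lines = []
    for number, offset in enumerate(range(0, len(data), CHUNK)):
        chunk = data[offset:offset + CHUNK]
        crc = zlib.crc32(chunk) & 0xFFFFFFFF
        lines.append(f"{number:06d} {crc:08x} {base64.b32encode(chunk).decode()}")
    return lines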
Ironically, in general, we have the formats for long term storage. We just don't have any that are open.
Hardware is an issue too. Hard drives are not archival media. Tapes are, but one with a reasonable capacity is expensive, well out of reach for all but the enterprise customers. It would be a viable niche for a company to make a relatively low cost tape drive that could work on USB 3, has a large buffer (combined with variable tape speeds to prevent shoe-shining), and has backup software with it that is usable and open, where the formats can be re-engineered years down the road for decoding.
Re: (Score:2)
As far as long term media, we have mdisc. [mdisc.com] Whether or not we'll have anything that can read the intact medium is another issue.
It's sad how we're still able to print from photographic plates shot a century ago, yet I have to worry about bit rot on digital pics stored for five years.
Re: (Score:2)
There was an IBM computer made in the 1970s which stored data on black and white negatives. It would "write" to them via exposing light, then pass the negatives through the usual developer, stop, and fixer baths, finally into a storage area. Reading was done by having them scanned in, similar to punchcards.
It definitely is a nonstandard way of doing things, but I'm sure film chemistry has advanced quite well since then, so storing information as colored dots might be a long term archiving solution, provid
Re: (Score:2)
Images are a sparse data set though. See the preponderance of techniques which rebuild a nearly complete image from 1% of the pixels.
If you took those negatives and tried to write densely packed information to them, how recoverable would it be then?
Re: (Score:2)
I should have been clearer -- Paperbak is a way to not just print a document, but encode one onto paper, so a 100 page Word document fits on a single page (in theory), rather than needing 100 pages.
Absolutely not (Score:1)
You won't need to archive your genome. It will be re-sequenced in 5 seconds each time you go to the doctor. Because it will be cheap, and because it may evolve over time. The same way blood samples are not archived for life, or teeth X-rays are taken periodically, they're just taken when needed.
Re: (Score:1)
Hacked by Hell (Score:1)
Wakes up, "WTF? I have a....Vagina!? Hoooneeeyyy!"
My proposal (Score:1)
I propose storing it in a new medium. A "molecular chain", which should withstand the effects of EMP, right?
A name for it. Hmmm. How about the Destroy-Not Archive, or D.N.A. for short.
Re: (Score:2)
I propose storing it in a new medium. A "molecular chain", which should withstand the effects of EMP, right?
A name for it. Hmmm. How about the Destroy-Not Archive, or D.N.A. for short.
But then cosmic rays and ionizing radiation and other things will still introduce errors.
So we would further need a method to reliably store the chains themselves and that could replicate the data to ensure there was a high chance of accurate data surviving. Little cartridges with all of the necessary environment and materials to power the reading system and maintain the chain and that could, as needed, replicate the data into new cartridges. The second versions, Contained Environment II (CE-II) work decent
Re: (Score:2)
Who cares? I will still have about 30 trillion intact copies.
Re: (Score:2)
Re: (Score:2)
You can't (Score:2)
Technology is always changing. Whatever is today's commodity storage device will be tomorrow's rare anachronism.
There's always the main backup. (Score:2)
You!
just a reminder (Score:1)
Re: just a reminder (Score:1)
Gimmie tape. (Score:1)
Paper tape (Score:2)
Re: (Score:2)
Get the acid-free paper. Will last forever
Or until it gets wet.
ZFS (Score:1)
Live data lives (Score:2)
Your bank records exist despite changing hardware and software because the data is kept in use. It's kept alive. It is added to, modified... active. Your genetic records could be kept active. Keep them part of a patient record and they'll be copied, migrated, translated, from one system to the next to the next to the next for as long as you live.
Only when the data goes dormant can it rot. By all means... have long term storage media for long term data archiving. But the best means of keeping data current is
Too bad your DNA is useless to most MDs (Score:3)
... or for that matter any of your medical history. MDs do spot-diagnosis in 5 minutes or less based exclusively on what they've memorized or else they do no diagnosis at all.
My wife has a major genetic defect (MTHFR C677T), which causes severe nutritional problems. We haven't yet met an MD who has a clue about nutrition. Moreover, we had to diagnose this problem ourselves through genetic testing, with no doctors involved. We've shown the results to doctors, and they don't entirely disbelieve us, but they also have no clue what to do about it and still are dubious of the symptoms. (Who has symptoms of Beriberi these days? Someone whose general ability to absorb nutrients is severely compromised.)
What makes anyone think that this will change if your doctor has access to your DNA, even with detailed analysis? They won't take the time to actually read any of it. In fact a lot of what we know about genetic defects pertains to problems in generating certain kinds of enzymes, a lot of which participate in nutrient absorption. (So obviously RESEARCHERS know something about nutrition.) These nutritional problems require supplementation that MDs don't know about. Do you think the typical MD knows that Folic Acid is poison to those with C677T? Nope. They don't know the differences between folic acid, folinic acid, and methylfolate and still push folic acid on all pregnant women (they should be pushing methylfolate). They also don't know the differences between the various forms of B12 and always prescribe cyanocobalamin even for people who need the methyl and hydroxy forms.
Another way in which MDs are useless is caused by their training. Basically, they're trained to be skeptical and dismissive. Many nutritional and autoimmune disorders manifest with a constellation of symptoms, along with severe brain fog. Someone with one of these problems will generally want to write down the symptoms when talking to a doctor, because they can't think clearly. The thing is, in med school, doctors are specifically trained to look out for patients with constellations of symptoms and written lists, and they are told to recognize this as a condition that is entirely within the mind of the patient. Of course, a lot of doctors, even if not trained to dismiss things as "all in their head", are terrible at diagnosis anyway. They'll have no clue where to start and won't have the patience to do extensive testing. It's too INCONVENIENT and time-consuming. They won't make enough money off patients like this, so they get patients like this out the door as fast as possible.
I've had some good experiences with surgeons. But for any other kind of medical treatment, MDs have been mostly useless to me and my family. In general, if we go to one NOW, we've already diagnosed the problem (correctly) and possibly need advice on exactly which medicine is required, although when it comes to antibiotics, it's easy enough to find out which ones to use. (Medical diagnosis based on stuff you look up on the internet is really hard and requires a very well-trained bullshit filter, and you also have to know how to use the more authoritative sources properly. However, it's not impossible for people with training in things like law, information science, and biology. It just requires really good critical thinking skills. BTW, most MDs don't have that.)
MDs are technicians. Most of them are like those B-average CS grads from low-ranked schools who can barely manage to write Java applications. If you know how to deal with a low-level technician, guide them properly, and stroke their ego in the right way, you can deal with an MD.
Re: (Score:3)
Paraphrased:
I forgot that doctors are people, and that the bottom half are generally worthless, and the average ones are average. Also, diagnosing a rare problem is hard because it is unlikely to be a rare problem.
I also forgot that doctors are the people who didn't tire of medical school shenanigans and change studies.
And I bear a grudge because I didn't find that top notch House like genius who, despite being wrong every show, succeeds in the end.
Finally, I have no idea why and how insurance, both medical
Re: (Score:1)
Re: (Score:2)
We seriously considered chronic Lyme as a possibility and even got testing. The test came back negative, although there can be false negatives. We ultimately ruled it out on the basis of certain key symptoms being absent. Basically, we considered a LOT of things and did our best to rank the chances of each illness that might explain the symptoms. We were open to the idea of more than one cause but considered it a remote possibility; fortunately we were right.
Anyhow, homozygous MTHFR C677T can be serious
Archiving vs backups (Score:2)
One of the big differences between archiving and backup is that with archiving I want to keep this exact version intact - if it changes on me, it's an error - while a backup takes a copy of whatever is there now; maybe I wanted to edit that file. Unlike backups, I think it's not about versioning, it's about maintaining one logical instance of the archive across different physical copies. Here's what I'm thinking: you create a system with three folders:
archived
to_archive
to_trash
The archive acts like a CD/DVD/BluRay and
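A minimal sketch of that workflow, assuming the three folders live under an archive_root/ directory and that a JSON index of SHA-256 digests is an acceptable (hypothetical) way to pin each file's contents:

import hashlib
import json
import pathlib
import shutil

ROOT = pathlib.Path("archive_root")   # contains archived/, to_archive/, to_trash/

def ingest():
    """Move files from to_archive/ into archived/ and record their digests."""
    index_path = ROOT / "archived" / "index.json"
    index = json.loads(index_path.read_text()) if index_path.exists() else {}
    for src in list((ROOT / "to_archive").iterdir()):
        if src.is_file():
            index[src.name] = hashlib.sha256(src.read_bytes()).hexdigest()
            shutil.move(str(src), str(ROOT / "archived" / src.name))
    index_path.write_text(json.dumps(index, indent=2))

def verify():
    """Re-hash everything in archived/; any mismatch is corruption, not an edit."""
    index = json.loads((ROOT / "archived" / "index.json").read_text())
    for name, digest in index.items():
        if hashlib.sha256((ROOT / "archived" / name).read_bytes()).hexdigest() != digest:
            print(f"CORRUPTED: {name}")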
What we need is viable storage and maintenance, (Score:2)
for the huge and growing number of people on this planet. I get how wonderful it is that genetic medicine might allow us all to live to the age of 150, eliminate birth defects, and cure Aunt Millie's cancer. But really, just where are we going to put all the people whose lives we save and extend while at the same time the birth rate keeps climbing? How will we feed them? How will we maintain a viable biosphere in an era of rapidly accelerating extinctions?
All that long term data will be meaningless if human
You didn't say the magic word (Score:2)
Something about securing genomes, coming from a guy called Newman? And not a single Jurassic Park joke after 79 posts?
What a shame.
Pragmatic: continual, active refresh (Score:2)
One can whine and wax poetic all one wants, but since we don't have a good archival format, the practical solution today is continual refresh of data: periodically copying data to fresh, technologically up-to-date media. It's not sexy, but it does address three of the four points at the end of the linked piece (end-to-end data integrity, format migration and secondary media formats). The unaddressed point, access audit trails, makes no sense given the premise stated at the beginning of the piece that "N
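In script form, a refresh is just copy-then-verify. A sketch under assumed paths and layout; a real job would also compare against the previous generation's stored checksums rather than only the source it just read:

import hashlib
import pathlib
import shutil

def digest(path):
    return hashlib.sha256(pathlib.Path(path).read_bytes()).hexdigest()

def refresh(old_root, new_root):
    """Copy the archive tree onto fresh media, then verify every file on the new copy."""
    shutil.copytree(old_root, new_root)
    for old_file in pathlib.Path(old_root).rglob("*"):
        if old_file.is_file():
            new_file = pathlib.Path(new_root) / old_file.relative_to(old_root)
            if digest(old_file) != digest(new_file):
                raise RuntimeError(f"refresh corrupted {old_file}")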
Why not store the DNA itself? (Score:2)
Re: (Score:2)
Nobody ever needs to know their complete genome and nobody ever will need to. Instead, you'll go to a doctor with a complaint and if they suspect a genetic component, they'll do a cheek swab and a quick test tuned to look for the particular genetic condition you might have. Or if something really exciting and common is discovered, you'll be offered an opportunity to get a new test to see if you're at risk for living to be 200. (You need to be warned because you probably won't have saved enough for near-p
My article about it in Communications of the ACM (Score:2)
Digital libraries have been doing this for decades (Score:1)
Arvados: the open source solution (Score:1)
(Disclaimer: I am an Arvados developer)
The Arvados project [arvados.org] is a free and open source (AGPLv3 and Apache v2) bioinformatics platform for genomic and biomedical data, designed to address precisely the issues raised in this article. Arvados features 1) a content-addressed filesystem (blocks are addressed by a hash of their actual content rather than some arbitrarily assigned identifier) which performs end-to-end data integrity checks, 2) fine-grained access controls, 3) a cluster scheduling system that tracks the
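Content addressing is the same trick in miniature: a block's name is derived from its bytes, so an integrity check falls out of every read. A toy, in-memory sketch, with SHA-256 standing in for whatever locator scheme Arvados actually uses:

import hashlib

class ContentStore:
    """Toy content-addressed block store: a block's key is the hash of its bytes."""

    def __init__(self):
        self.blocks = {}

    def put(self, data):
        key = hashlib.sha256(data).hexdigest()
        self.blocks[key] = data
        return key

    def get(self, key):
        data = self.blocks[key]
        if hashlib.sha256(data).hexdigest() != key:   # detect silent corruption
            raise IOError(f"block {key} failed its integrity check")
        return data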
Need to borrow a ladder (Score:1)