Data Storage Security

Data Archiving Standards Need To Be Future-Proofed 113

storagedude writes Imagine in the not-too-distant future, your entire genome is on archival storage and accessed by your doctors for critical medical decisions. You'd want that data to be safe from hackers and data corruption, wouldn't you? Oh, and it would need to be error-free and accessible for about a hundred years too. The problem is, we currently don't have the data integrity, security and format migration standards to ensure that, according to Henry Newman at Enterprise Storage Forum. Newman calls for standards groups to add new features like collision-proof hash to archive interfaces and software.

'It will not be long until your genome is tracked from birth to death. I am sure we do not want to have genome objects hacked or changed via silent corruption, yet this data will need to be kept maybe a hundred or more years through a huge number of technology changes. The big problem with archiving data today is not really the media, though that too is a problem. The big problem is the software that is needed and the standards that do not yet exist to manage and control long-term data,' writes Newman.
This discussion has been archived. No new comments can be posted.

  • Nope (Score:2, Offtopic)

    by ColdWetDog ( 752185 )

    While there certainly is an issue with data integrity and retention, it is unlikely that anyone will need their entire DNA sequence "stored" for future use. It's becoming clear that the DNA you're born with isn't the same as the DNA you have when they recycle you. Further, medicine doesn't need your entire genome. Just the part that the doctor (or whatever they're called at that point in time) is interested in.

    It is far more likely that you will be resequenced as needed.

    Besides, you won't be able to affo

    • by Anonymous Coward

      Besides, you won't be able to afford it anyway.

      Why not? Whole genome sequencing is already down to a few thousand dollars. Within the next decade it will almost certainly have dropped below a thousand. And there will be standard analysis pipelines (hopefully some of which are freely available and open source) to check for the most common pathogenic mutations. Now, paying an expert to do a custom analysis could easily reach into the hundreds of thousands of dollars. But I'm just not seeing why the sequencing itself would be unaffordable.

    • While you may be right about the current use we have for DNA, it's very likely that medicine will have many more uses for it in the future. Prices on genome sampling are going down rapidly too, so it's reasonable to use this as an example of why we might want to store data error-free for at least a century.

      There will be many more things we want to store. Remember all those old city records and paper books? The newspaper archives? Early 20th-century cellulose film? All those data sources have their problems a

  • by Z00L00K ( 682162 ) on Saturday September 20, 2014 @12:43AM (#47952227) Homepage Journal

    Keep your important data on current mainstream storage. This is the only way to preserve it - copy data from old disks to new disks whenever you upgrade.

    Of course, at each upgrade you can also discard a lot of data that isn't necessary, but pictures and similar material should be preserved. Data formats for images have been stable for decades. Even though some improvements have occurred, a 25-year-old jpg is still viewable.

    However, some document formats have to be upgraded to the latest version, since Microsoft especially has a tendency to "forget" its old versions. You may still lose some formatting, but the content of the documents is the important part.

    • Re: (Score:2, Informative)

      by _merlin ( 160982 )

      JPEG wasn't standardised until 1992. There are no 25-year-old JPEG files. Things have changed a lot since 1989.

      • by Jawnn ( 445279 )

        JPEG wasn't standardised until 1992. There are no 25-year-old JPEG files. Things have changed a lot since 1989.

        So what's your point? I have GIF images that predate 1989. They still render just fine. I could convert them if I felt the need. I don't, because the formats are indeed "stable".

    • by Dadoo ( 899435 )

      What I want to know is, what ever happened to fuse-based proms, and why we can't use similar technology to store important data? I have to believe that, with current technology, we could create proms with a density at least as high as current usb keys, and since they're just microscopic wires in a hermetically sealed package, they'd last basically forever.

  • What other storage medium, besides rock carving, can survive an EMP blast?
    • Glass master CDs? Anything that's sufficiently shielded, and the shielding isn't actually all that hard to make?

      • Don't forget temperature survival. Yeah, I mentioned EMP, but there are also other environmental attacks that must be diverted, such as temperature, and water. Shielding won't prevent something from melting.

        .
        It's the end of the world, how will you save your data?

    • by mirix ( 1649853 )

      Stamped/punched stainless steel sheets would probably be about the best option, if you want something to really stick around. Less brittle than rock carvings too.

    • What other storage medium, besides rock carving, can survive an EMP blast?

      Nearly all of them. Flash media, including SD-cards, SSD, etc. should survive. A HDD that is powered off, should survive. The biggest threat is to anything that is connected to mains power. The power supply in your desktop computer may die, but a powered off laptop should be fine.

  • Preserving the bits accurately is only a small part of the problem. Knowing what the bits mean is critical. Having a bunch of .xlsx spreadsheet files in the year 2050 will be useless unless you also have Excel 2050, and it knows how to read them. Unless you want to basically just 'print' all your data to a format like .pdf (or just plain old .txt) programs to access data are as critical as the data.
  • by mlts ( 1038732 ) on Saturday September 20, 2014 @12:53AM (#47952261)

    The problem is that we do have formats that do work for long term archiving, but are limited to a platform and are not open, so decoding them in the future may be problematic.

    WinRAR is one example. It has the ability to do error detection and correction with recovery records. However, it is a commercial product.

    PAR records are another way, but it is a relatively clunky mechanism for long term storage.

    Even medium term storage on disk/tape can be problematic:

    There is one standard for backup programs for tape, and that is tar. Very useful format, but zero error correction or detection, other than reading and looking for hard errors. There are tons of backup programs that work with tapes. Networker, TSM, NetBackup, and many others come to mind, all using a different format. Of course, once you get the program, there is still finding the registration key, and some programs require online activation (which means when the activation servers get shut off, you can never do a restore from scratch again.) We need one archive grade standard for tape, perhaps with a standard facility for encryption as well.
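    To make the gap concrete: a tar header carries a checksum for the header fields only, so damage inside a member's content goes unnoticed on read. A minimal sketch of one workaround, assuming a SHA-256 manifest kept alongside the archive (the helper names `build_archive` and `verify_archive` are made up for this illustration, not part of any backup product):

```python
import hashlib
import io
import tarfile


def build_archive(files):
    """Return (tar_bytes, manifest) for a dict mapping member name -> bytes.

    The manifest maps each member name to the SHA-256 hex digest of its content.
    """
    buffer = io.BytesIO()
    manifest = {}
    with tarfile.open(fileobj=buffer, mode="w") as tar:
        for name, data in sorted(files.items()):
            info = tarfile.TarInfo(name=name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
            manifest[name] = hashlib.sha256(data).hexdigest()
    return buffer.getvalue(), manifest


def verify_archive(tar_bytes, manifest):
    """Return the sorted names of members that are missing or no longer match."""
    damaged = []
    with tarfile.open(fileobj=io.BytesIO(tar_bytes), mode="r") as tar:
        for name, expected in manifest.items():
            try:
                member = tar.extractfile(name)
            except KeyError:
                # The member is not in the archive at all.
                damaged.append(name)
                continue
            if member is None:
                # Present, but not a regular file.
                damaged.append(name)
                continue
            if hashlib.sha256(member.read()).hexdigest() != expected:
                damaged.append(name)
    return sorted(damaged)
```

    This only detects damage; it cannot repair it. The manifest itself would also need to be stored redundantly, since losing it loses the ability to verify.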

    Same with disks. It wasn't until recently that there was any bit rot detection in filesystems at all. Now with ReFS, Storage Spaces, ZFS, and btrfs, we now can tell if a file is damaged... but none of the filesystems have the ability to store ECC on an entire (other than ZFS and ditto blocks.) It would be nice to have part of a filesystem be a large area for ECC on a block basis. It would take some optimization for performance, but adding ECC in the filesystem is more geared for long term storage than day to day file I/O.
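    As a rough illustration of block-level redundancy, here is a sketch of single-parity XOR (the RAID-5 idea), which is a much weaker scheme than the ECC a real filesystem would use. It assumes equal-sized blocks and can rebuild exactly one block, and only when a separate checksum has already identified which block is bad. The function names are invented for this example.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR a list of equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def make_parity(blocks):
    """Compute one parity block that protects all the given data blocks."""
    return xor_blocks(blocks)


def rebuild_block(blocks, parity, lost_index):
    """Reconstruct the block at lost_index from the other blocks plus parity."""
    survivors = [block for i, block in enumerate(blocks) if i != lost_index]
    return xor_blocks(survivors + [parity])


if __name__ == "__main__":
    data = [b"AAAA", b"CCCC", b"GGGG", b"TTTT"]
    parity = make_parity(data)
    # Pretend block 2 was found to be unreadable and rebuild it.
    print(rebuild_block(data, parity, 2))
```

    The cost here is one extra block per group. Surviving more than one bad block per group takes a stronger code such as Reed-Solomon, which is the kind of thing PAR-style recovery records rely on.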

    Finally there is paper. Other than limited stuff on QR codes, there isn't any real way to print a document onto paper, then scan it to get it back. There was a utility called Paperbak that purported to do this, offering encryption, error correction, various DPI codes, and so on. It printed well, but could never scan and read any of the documents printed, so it is worthless. What is needed is something like the Paperbak utility, but with a lot more robust error detection (like checking if blocks are at an angle, similar to how QR codes can be scanned from any direction.) This utility would have to be completely open for it to have any use at all. However, if it could be done to print small documents to paper, it would help greatly in some situations, such as recovering encryption keys, archived tax documents, and so on.

    Ironically, in general, we have the formats for long term storage. We just don't have any that are open.

    Hardware is an issue too. Hard drives are not archival media. Tapes are, but one with a reasonable capacity is expensive, well out of reach for all but the enterprise customers. It would be a viable niche for a company to make a relatively low cost tape drive that could work on USB 3, has a large buffer (combined with variable tape speeds to prevent shoe-shining), and has backup software with it that is usable and open, where the formats can be re-engineered years down the road for decoding.

    • by bugnuts ( 94678 )

      As far as long term media, we have mdisc. [mdisc.com] Whether or not we'll have anything that can read the intact medium is another issue.

      It's sad how we're still able to print from photographic plates shot a century ago, but I'm worrying about bit rot on my digital pics stored for 5 years.

      • by mlts ( 1038732 )

        There was an IBM computer made in the 1970s which stored data on black and white negatives. It would "write" to them via exposing light, then pass the negatives through the usual developer, stop, and fixer baths, finally into a storage area. Reading was done by having them scanned in, similar to punchcards.

        It definitely is a nonstandard way of doing things, but I'm sure film chemistry has advanced quite well since then, so storing information as colored dots might be a long term archiving solution, provid

        • Images are a sparse data set though. See the preponderance of techniques which rebuild a nearly complete image from 1% of the pixels.

          If you took those negatives and tried to write densely packed information to them, how recoverable would it be then?

  • by Anonymous Coward

    You won't need to archive your genome. It will be re-sequenced in 5 seconds each time you go to the doctor. Because it will be cheap, and because it may evolve over time. The same way blood samples are not archived for life, or teeth X-rays are taken periodically, they're just taken when needed.

  • not be long until your genome is tracked from birth to death. I am sure we do not want to have genome objects hacked or changed via silent corruption

    Wakes up, "WTF? I have a....Vagina!? Hoooneeeyyy!"

  • by Anonymous Coward

    I propose storing it in a new medium. A "molecular chain", which should withstand the effects of EMP, right?

    A name for it. Hmmm. How about the Destroy-Not Archive, or D.N.A. for short.

    • by KitFox ( 712780 )

      I propose storing it in a new medium. A "molecular chain", which should withstand the effects of EMP, right?
      A name for it. Hmmm. How about the Destroy-Not Archive, or D.N.A. for short.

      But then cosmic rays and ionizing radiation and other things will still introduce errors.

      So we would further need a method to reliably store the chains themselves and that could replicate the data to ensure there was a high chance of accurate data surviving. Little cartridges with all of the necessary environment and materials to power the reading system and maintain the chain and that could, as needed, replicate the data into new cartridges. The second versions, Contained Environment II (CE-II) work decent

  • Technology is always changing. Whatever is today's commodity storage device will be tomorrow's rare anachronism.

  • We already have the technology to preserve the data: http://www.pcworld.com/article... [pcworld.com]
  • Just scrape off the rust and you're good to go. Now, where did I put my M14G and FR3010?
  • Get the acid-free paper. Will last forever
  • nuff sed
  • Your bank records exist despite changing hardware and software because the data is kept in use. It's kept alive. It is added to, modified... active. Your genetic records could be kept active. Keep them part of a patient record and they'll be copied, migrated, translated, from one system to the next to the next to the next for as long as you live.

    Only when the data goes dormant can it rot. By all means... have long term storage media for long term data archiving. But the best means of keeping data current is

  • by Theovon ( 109752 ) on Saturday September 20, 2014 @06:12AM (#47952929)

    ... or for that matter any of your medical history. MDs do spot-diagnosis in 5 minutes or less based exclusively on what they've memorized or else they do no diagnosis at all.

    My wife has a major genetic defect (MTHFR C677T), which causes severe nutritional problems. We haven't yet met an MD who has a clue about nutrition. Moreover, we had to diagnose this problem ourselves through genetic testing, with no doctors involved. We've shown the results to doctors, and they don't entirely disbelieve us, but they also have no clue what to do about it and still are dubious of the symptoms. (Who has symptoms of Beriberi these days? Someone whose general ability to absorb nutrients is severely compromised.)

    What makes anyone think that this will change if your doctor has access to your DNA, even with detailed analysis? They won't take the time to actually read any of it. In fact a lot of what we know about genetic defects pertains to problems in generating certain kinds of enzymes, a lot of which participate in nutrient absorption. (So obviously RESEARCHERS know something about nutrition.) These nutritional problems require supplementation that MDs don't know about. Do you think the typical MD knows that Folic Acid is poison to those with C677T? Nope. They don't know the differences between folic acid, folinic acid, and methylfolate and still push folic acid on all pregnant women (they should be pushing methylfolate). They also don't know the differences between the various forms of B12 and always prescribe cyanocobalamin even for people who need the methyl and hydroxy forms.

    Another way in which MDs are useless is caused by their training. Basically, they're trained to be skeptical and dismissive. Many nutritional and autoimmune disorders manifest with a constellation of symptoms, along with severe brainfog. Someone with one of these problems will generally want to write down the symptoms when talking to a doctor, because they can't think clearly. The thing is, in med school, doctors are specifically trained to look out for patients with constellations of symptoms and written lists, and they are told to recognize this as a condition that is entirely within the mind of the patient. Of course, a lot of doctors, even if not trained to dismiss things as "all in their head", are terrible at diagnosis anyway. They'll have no clue where to start and won't have the patience to do extensive testing. It's too INCONVENIENT and time-consuming. They won't make enough money off patients like this, so they get patients like this out the door as fast as possible.

    I've had some good experiences with surgeons. But for any other kind of medical treatment, MDs have been mostly useless to me and my family. In general, if we go to one NOW, we've already diagnosed the problem (correctly) and possibly need advice on exactly which medicine is required, although when it comes to antibiotics, it's easy enough to find out which ones to use. (Medical diagnosis based on stuff you look up on the internet is really hard and requires a very well-trained bullshit filter, and you also have to know how to use the more authoritative sources properly. However, it's not impossible for people with training in things like law, information science, and biology. It just requires really good critical thinking skills. BTW, most MDs don't have that.)

    MDs are technicians. Most of them are like those B-average CS grads from low-ranked schools who can barely manage to write Java applications. If you know how to deal with a low-level technician, guide them properly, and stroke their ego in the right way, you can deal with an MD.

    • Paraphrased:

      I forgot that doctors are people, and that the bottom half are generally worthless, and the average ones are average. Also, diagnosing a rare problem is hard because it is unlikely to be a rare problem.

      I also forgot that doctors are the people who didn't tire of medical school shenanigans and change studies.

      And I bear a grudge because I didn't find that top-notch, House-like genius who, despite being wrong every show, succeeds in the end.

      Finally, I have no idea why and how insurance, both medical

    • Dissolve the vitamin in question in DiMethylSulfOxide, apply topically. Or in water, and inject, snort or inhale. HTH
  • One of the big differences between archiving and backup is that in archiving I want to keep this exact version intact; if it changes on me, it's an error. A backup, by contrast, takes a copy of whatever exists now, since maybe I wanted to edit that file. Unlike backups, I think archiving is not about versioning; it's about maintaining one logical instance of the archive across different physical copies. Here's what I'm thinking: you create a system with three folders:

    archived
    to_archive
    to_trash

    The archive acts like a CD/DVD/BluRay and

  • for the huge and growing number of people on this planet. I get how wonderful it is that genetic medicine might allow us all to live to the age of 150, eliminate birth defects, and cure Aunt Millie's cancer. But really, just where are we going to put all the people whose lives we save and extend while at the same time the birth rate keeps climbing? How will we feed them? How will we maintain a viable biosphere in an era of rapidly accelerating extinctions?

    All that long term data will be meaningless if human

  • Something about securing genomes, coming from a guy called Newman? And not a single Jurassic Park joke after 79 posts?

    What a shame.

  • One can whine and wax poetic all one wants, but since we don't have a good archival format, the practical solution today is continual refresh of data: periodically copying data to fresh, and technologically up-to-date media. It's not sexy, but it does address three of the four points at the end of the linked piece (end-to-end data integrity, format migration and secondary media formats). The unaddressed point, access audit trails, makes no sense given the premise stated at the beginning of the piece that "N

  • Your body produces tons of it, and it can be stored and sequenced considerably longer than human lifespans, especially if care is taken to preserve it.
    • Nobody ever needs to know their complete genome and nobody ever will need to. Instead, you'll go to a doctor with a complaint and if they suspect a genetic component, they'll do a cheek swab and a quick test tuned to look for the particular genetic condition you might have. Or if something really exciting and common is discovered, you'll be offered an opportunity to get a new test to see if you're at risk for living to be 200. (You need to be warned because you probably won't have saved enough for near-p

  • I wrote an article about long-term storage *hardware* in CACM -- "The Forever Disc" [acm.org]. My favorite musing had to do with writing the data into a population's genetics, and letting redundancy correct errors/mutations..
  • There's a lot of work in this space from digital libraries for preservation of cultural heritage, state/official archives etc. Start with Open Archival Information Systems Reference Model (ISO-OAIS, an international standard originally from space agencies). PREMIS. Preservation metadata standard by US Library of Congress, but used around the world for digital assets. It works well with METS encoding standard and MIX technical metadata standard. PRONOM and DROID for format policy registries, monitoring
  • (Disclaimer: I am an Arvados developer)

    The Arvados project [arvados.org] is a free and open source (AGPLv3 and Apache v2) bioinformatics platform for genomic and biomedical data, designed to address precisely the issues raised in this article. Arvados features a 1) content addressed filesystem (blocks are addressed by a hash of their actual content rather than some arbitrarily assigned identifier) which performs end-to-end data integrity checks, 2) fine-grained access controls, 3) a cluster scheduling system that tracks the
