Researchers Achieve Storage Density of 2.2 Petabytes Per Gram of DNA
A reader sends news of researchers who encoded an MP3, a PDF, a JPG, and a TXT file into DNA, along with another file that explains the encoding. The researchers estimate the storage density of this technique at 2.2 petabytes per gram (abstract). "We knew we needed to make a code using only short strings of DNA, and to do it in such a way that creating a run of the same letter would be impossible. So we figured, let's break up the code into lots of overlapping fragments going in both directions, with indexing information showing where each fragment belongs in the overall code, and make a coding scheme that doesn't allow repeats. That way, you would have to have the same error on four different fragments for it to fail – and that would be very rare," said one of the study's authors. "We've created a code that's error tolerant using a molecular form we know will last in the right conditions for 10 000 years, or possibly longer," said another.
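For a flavor of how a "no repeated letters" code can work, here is a minimal sketch in Python. It is not the authors' actual scheme; the base-3 mapping, fragment sizes and function names are illustrative assumptions. It just shows the two ideas from the quote: pick each next base from the three bases that differ from the previous one (so runs are impossible by construction), and split the result into overlapping, indexed fragments.

# Minimal sketch of a homopolymer-free DNA code plus overlapping fragments.
# NOT the paper's actual encoding, only an illustration of the ideas quoted above.

BASES = "ACGT"

def trits_from_bytes(data: bytes):
    """Turn bytes into base-3 digits (0, 1, 2)."""
    n = int.from_bytes(data, "big")
    trits = []
    while n:
        n, r = divmod(n, 3)
        trits.append(r)
    return trits[::-1] or [0]

def encode(trits, prev="A"):
    """Each trit picks one of the three bases that differ from the previous
    base, so the same letter can never appear twice in a row."""
    out = []
    for t in trits:
        choices = [b for b in BASES if b != prev]  # always exactly 3 options
        prev = choices[t]
        out.append(prev)
    return "".join(out)

def fragment(seq, length=100, step=75):
    """Split into overlapping fragments, each tagged with its index so the
    pieces can be placed (and cross-checked against each other) on readback."""
    return [(i // step, seq[i:i + length]) for i in range(0, len(seq), step)]

dna = encode(trits_from_bytes(b"hello, helix"))
print(dna)  # a DNA string with no identical adjacent bases
for idx, frag in fragment(dna, length=25, step=20):
    print(idx, frag)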
Please use a real unit of measure (Score:1)
How many Libraries of Congress is that?
Re:Please use a real unit of measure (Score:5, Interesting)
Re: (Score:3, Insightful)
Please wait until you sober up before posting again.
Re: (Score:1)
Re: (Score:2)
1/225.28 grams.
Re:Please use a real unit of measure (Score:5, Funny)
We should redefine the gram to match the amount of DNA it takes to store a LOC. Then people would have an easier time switching to metric.
Re: (Score:1)
Re: (Score:2)
Keanu Reeves movie "Johnny Mnemonic" (Score:1)
Re: (Score:1)
Review I read of that movie: "Keanu Reeves is miscast as someone with too much information in his head."
Re: (Score:2)
Does it run exFAT 1.0? (Score:2)
I was actually trying to come up with a ReiserFS gag.
Re: (Score:2)
Re: (Score:1)
Latency and bandwidth? (Score:2, Insightful)
It's useless unless it's reasonably fast.
Re:Latency and bandwidth? (Score:5, Informative)
Huge latency and low bandwidth. From the abstract:
DNA-based storage scheme could be scaled far beyond current global information volumes and offers a realistic technology for large-scale, long-term and infrequently accessed digital archiving
Re: (Score:2)
Until someone mistakes it for a snack and pops it into the microwave :P
Re: (Score:2)
I think a hard drive would fare at least as badly in that scenario.
Re: (Score:3)
Yet surely the hard drive is less likely to be mistaken for a tasty snack?
Re:Latency and bandwidth? (Score:5, Interesting)
Not if it is for archival purposes, like Amazon storage.
Re: (Score:1)
Re: (Score:2)
The latency must be horrible though!
Re: (Score:1)
Write performance is awesome. I suppose I'd better check read performance again. ... ... ...
Hmm, that disk array doesn't seem to help much here. I want a refund!!
Re:Latency and bandwidth? (Score:4, Interesting)
It's not useless. One interesting part is how long it holds up in storage. There isn't any effective storage medium available today that lasts for 10k+ years. Another is how high the information density is.
Re: (Score:2)
Re:Latency and bandwidth? (Score:5, Insightful)
No, it's only useless for the specific application you're imagining, not "useless" in general. A jet airliner may be really, really fast in comparison to my car, but is useless if my task is to get to the grocery store for milk and eggs. That doesn't invalidate the usefulness of jet airliners.
Re: (Score:2)
If it takes 1 day per byte, then sorry, it's too slow for any use.
Re: (Score:2)
If it takes 1 day per byte, then sorry, it's too slow for any use.
Not quite. You could, for example, store a daily temperature reading in one byte per day.
Re: (Score:2)
And how is that cost-effective compared to other more flexible solutions?
Re: (Score:3)
Re: (Score:1)
If I used RAID and built a giant striped array weighing in at only 1 million grams (1,000 kg or 2,200 lb, less than your average car)...
2.2 million petabytes = 2.2 zettabytes
With a write speed of 1 MiB per day. (I hate that they started doing the MiB; MB no longer means powers of 2.)
Think of what we used to put on 5.25in floppies at 360k per disk.
Re: (Score:2)
Re: (Score:2)
Really? A very large storage medium that does not degrade (theoretically) for 10K years... I see no use here... you are right.
Re: (Score:2)
If it takes more than 10k years to actually write stuff to it, surely you can see the problem?
Re: (Score:2)
Don't confuse the tech used to read and write the data with the tech used to store the data... the benefits of the storage medium are promising enough that we should invest in the research needed to read and write it efficiently.
Memory upgrade of the future (Score:2, Funny)
Re: (Score:1)
"Hold on, mum, the internet hasn't quite finished downloading into my hair yet"
Oh yeah, I can't wait :)
Re:Memory upgrade of the future (Score:5, Funny)
So my "thumb drive" will really be my thumb?
Re: (Score:2)
Check out the character QiRia in "The Hydrogen Sonata" by Iain Banks. The character is 10,000 years old and has converted much of his body into additional storage for memories.
Re: (Score:1)
5 blades used in an array.
Where's the important information? (Score:4, Funny)
Re: (Score:2)
Re: (Score:2)
"It's people. WD Green [wdc.com] is made out of people."
If anyone from Western Digital or MGM/UA is listening, it's PARODY. Thank you.
New error correction scheme? (Score:5, Insightful)
I understand they wanted the overall system to be fault tolerant, but it might be better to leave that part to established computer science. Granted, DNA might be uniquely prone to certain types of errors or reading problems, but there's a lot of computer science theory (and practice) in this area that would likely make the overall system more robust than what looks like a fairly simple redundancy scheme.
Or maybe not (Score:2)
It could be that they are already using a fancier scheme; it's hard to tell which parts are real details of their method and which are pop-sci "summary". So I apologize if I'm not giving them the credit they deserve here.
Re: (Score:2)
same error on four different fragments for it to fail
Swap "Usenet article" for "DNA fragment" and right there, they've done a crappy job of reinventing the PAR2 file.
There are probably some analogies from the tape sort/merge era, although that's slightly before my time.
SSSS, Shamir's (aka the S in RSA) secret sharing scheme: just tell it how many slices you want and how many slices must be present and error-free to decrypt, and you're done. Using it for redundancy in this case rather than security (rough sketch below).
ECC is a pretty well-worn path in CS.
A real hack would be writing DNA
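For anyone who hasn't run into Shamir's scheme, here's roughly what k-of-n sharing looks like when used for redundancy. A toy sketch only: the prime, the function names and the share layout are made up for illustration, and real tools (ssss, PAR2) use their own formats and proper Reed-Solomon-style codes.

# Toy k-of-n secret sharing over GF(p), Shamir-style. Illustrative only.
import random

P = 2**61 - 1  # a Mersenne prime; all arithmetic happens modulo P

def split(secret: int, n: int, k: int):
    """Produce n shares; any k of them recover the secret."""
    coeffs = [secret] + [random.randrange(P) for _ in range(k - 1)]
    def poly(x):
        return sum(c * pow(x, i, P) for i, c in enumerate(coeffs)) % P
    return [(x, poly(x)) for x in range(1, n + 1)]

def combine(shares):
    """Lagrange interpolation at x = 0 recovers the constant term (the secret)."""
    secret = 0
    for i, (xi, yi) in enumerate(shares):
        num, den = 1, 1
        for j, (xj, _) in enumerate(shares):
            if i != j:
                num = num * (-xj) % P
                den = den * (xi - xj) % P
        secret = (secret + yi * num * pow(den, P - 2, P)) % P
    return secret

shares = split(secret=424242, n=5, k=3)
print(combine(random.sample(shares, 3)))  # any 3 of the 5 shares recover 424242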
Call me when they can encode video... (Score:5, Funny)
I can't wait to see what happens when a video stored on DNA goes viral...
*ducks*
Re:Call me when they can encode video... (Score:4, Informative)
Well, this smbc comic [smbc-comics.com] addresses that, except that it's stored in bacterial DNA.
Re: (Score:1)
*ducks*
*geese*
another flap (Score:2)
ok, those were both really fowl.
0.0002% of potential storage (Score:5, Informative)
Each DNA nucleotide has a molecular weight of about 150. So a gram of DNA should contain about 6e23/150 = 4e21 bases. At two bits per base, that is 1e21 bytes. These guys are getting 2.2e15. So, in theory, they are getting only about one part in half a million of the potential storage, or roughly 0.0002%.
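Spelling out the parent's arithmetic (taking the ~150 g/mol per nucleotide assumption at face value; the reply below gives a better per-base-pair figure):

# Back-of-the-envelope, using the parent's own assumptions.
N_A = 6.022e23           # Avogadro's number
mw_per_base = 150.0      # g/mol per nucleotide, the parent's rough figure

bases_per_gram = N_A / mw_per_base       # ~4.0e21 bases per gram
bytes_per_gram = bases_per_gram * 2 / 8  # 2 bits per base -> ~1.0e21 bytes

achieved = 2.2e15                        # bytes per gram, from the abstract
print(achieved / bytes_per_gram)         # ~2.2e-06, i.e. roughly one part in
                                         # half a million, or about 0.0002%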
Re: (Score:2)
Re: (Score:1)
Re:0.0002% of potential storage (Score:4, Informative)
These are artificial DNA oligos, so there shouldn't be any of those sorts of modifications. However, a figure of MW 150 per base leaves out the sugar-phosphate backbone, and doesn't account for this being double-stranded DNA. Molecular weight per base pair should be around 700 g/mol.
Of course, that's really nitpicking. What really accounts for the low ratio of achieved versus theoretical is that they made "~1.2x10^7 copies of each DNA string."
They go on to explain in the supplementary materials that "With the latest platform, up to 244,000 unique sequences are synthesized in parallel and delivered as ~1-10 pmol pools of oligos... In our experiment, three runs were used to synthesize 153,335 designs, leading to the higher figure of ~12-120x10^6 (= 3-30 x 10^-12 x6.02x10^23/153,335)." A more accurate assessment of their coding scheme is that they used 153,335 strings of 117 nucleotides (17,940,195 nt total) to encode 5,165,800 bits of Shannon information, or about 0.29 bits per nucleotide.
The fact that they made ten million copies of each string is more a current technical limitation of DNA oligo synthesis and automated DNA sequencing than a limit on the efficiency of the encoding itself. With the appropriate technology, you could make a few thousand copies (for appropriate error correction) instead of ten million, and your mass of DNA would be in the femtograms instead of hundreds of picograms.
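Putting the quoted figures together (numbers taken from the comment above; the last line is only a ratio of copy counts, not an absolute mass estimate):

# The figures quoted above, spelled out.
designs = 153_335          # unique oligo designs synthesized
nt_per_design = 117        # nucleotides per design
bits_encoded = 5_165_800   # Shannon information encoded

total_nt = designs * nt_per_design
print(total_nt)                  # 17,940,195 nucleotides across the unique designs
print(bits_encoded / total_nt)   # ~0.288 bits per nucleotide

# Cutting from ~1.2e7 copies of each string to a few thousand shrinks the
# amount of synthesized DNA by the same factor, with the payload unchanged.
print(1.2e7 / 3000)              # ~4000x less material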
Where does it all end? (Score:2)
This seems like an amazing development, but just today we've had a story about Monsanto and how well their error correction is going despite having the best in Western thinking available to them. Why should we trust that IBM's procedures are any better?
Re:Where does it all end? (Score:5, Insightful)
Hard to say whether we should or shouldn't. But it's worth noting that there are at least two possible important differences between IBM's experiments and Monsanto's:
1) Monsanto's experiments are often self replicating.
2) IBM isn't trying to sell us MP3 files as food.
Re: (Score:2)
Okay, but imagine if they did encode MP3 files as food. And then people started sharing that (self replicating?) data as food.
Just think: Coming soon to a courthouse in East Texas: Monsanto vs. the RIAA...
Re: (Score:1)
Redundancy (Score:5, Insightful)
It's 2.2 petabytes per gram, but only if you don't mind that it contains a billion copies of the same 2.2 megabytes. Making lots of copies of a short DNA sequence is easy. Making a whole gram of unique DNA sequences is much, much harder. What's the non-redundant storage density of this process?
Re: (Score:2)
Re: (Score:2)
Something tells me you don't understand how RAID levels are designated.
Hint: No one in their right mind would run something named "RAID 1000000000", unless they didn't care in the least whether their data was retrievable.
Hint 2: It has an array failure rate of ~14% over 3 years, assuming a standard drive failure rate of 5% over 3 years: (1 - 0.95^9)^2
Re: (Score:2)
Correction: Was assuming a 9-disk RAID 0. I think the actual failure rate for 9-levels of nested RAID0, RAID1'd, would be
99.99999999921383102043346279157%.
Re: (Score:2)
Re: (Score:1)
Re: (Score:2)
It's perfectly acceptable to store multiple copies of the same data. You just have to divide your quoted storage density by the number of copies. You don't say a RAID1 array made of 2 3TB drives is a 6TB array, and you shouldn't say this is 2.2PB/gram either.
How do you... (Score:1)
Re: (Score:1)
DNA isn't "alive," it's a really big molecule.
Re: (Score:2)
Synergy?
"very rare"? (Score:2, Insightful)
How rare is "very rare"? If they have that 2.2-petabyte gram of storage, and "rare" means 0.0001% of the time, that's still on the order of two billion failures in your archived data.
Re: (Score:2)
So uhhh....parity?
Human data carriers? (Score:1)
Ok.. digital data on DNA.... (Score:2)
So, while I realize that the intent here is not to put it inside a living organism... some part of me wants to know what would happen if the data for various Windows malware packages were encoded and injected into bacterial hosts.
Think of all the new diseases that could come about from pure happenstance, coincidence, and Murphy's law!
Kind of a "throw stuff at the wall and see what sticks" silly side effect of using DNA for data storage.
Re: (Score:2)
You mean like Snow Crash [amazon.com]?
Highly unlikely, but I can't help but to wonder: (Score:2)
I hate it when my .DOC mutates into a .PSD (Score:1)
Where's the de-mutation program for that?
Major challenge: Retrieval and storage (Score:4, Interesting)
Okay, storing is "solved". How about retrieval? Especially random-access retrieval that is simple and fast (relatively speaking), which such a storage medium needs to be practical? Certainly not DNA sequencing, which can take weeks to complete?
The second problem: DNA denatures and fragments at room temperature, and a -80C lab freezer for storage certainly wouldn't be practical.
Third problem: DNA secondary and tertiary structure. The coding scheme must also solve DNA's tendency to form secondary structures (like hairpins) or tertiary structures (like supercoils) that can hamper reading / access to the information. I think this is why the storage uses short sequences, but short DNA sequences like the ones proposed (~100 bp, from the figure) can still form such structures.
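The secondary-structure issue is partly a design-time problem, since candidate oligos can be screened before synthesis. A crude sketch of that kind of check follows; the window test and thresholds are illustrative assumptions, and real pipelines predict folding thermodynamically rather than by string matching.

# Crude pre-synthesis screen for two of the problems mentioned above:
# homopolymer runs and hairpin-prone self-complementarity. Thresholds are
# made up for illustration.

COMP = str.maketrans("ACGT", "TGCA")

def revcomp(seq: str) -> str:
    return seq.translate(COMP)[::-1]

def longest_run(seq: str) -> int:
    """Length of the longest run of a single repeated base."""
    best = run = 1
    for a, b in zip(seq, seq[1:]):
        run = run + 1 if a == b else 1
        best = max(best, run)
    return best

def has_hairpin(seq: str, stem: int = 8) -> bool:
    """True if some window of length `stem` also occurs reverse-complemented
    in the sequence, a cheap proxy for hairpin-forming potential."""
    windows = {seq[i:i + stem] for i in range(len(seq) - stem + 1)}
    return any(revcomp(w) in windows for w in windows)

def acceptable(seq: str, max_run: int = 1, stem: int = 8) -> bool:
    # max_run=1 matches the "no run of the same letter" constraint in the summary
    return longest_run(seq) <= max_run and not has_hairpin(seq, stem)

print(acceptable("ACACACACACACACAC"))        # True: no runs, no inverted repeat
print(acceptable("ACGTAAAACGT", max_run=3))  # False: AAAA run
print(acceptable("ACGATCAGTCTACTGATCGT"))    # False: 8-nt hairpin stem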
Transfers (Score:2)
"That was the best sex ever and BTW, I just gave you copies all my videos".
Re: (Score:1)
That's a 9 month transfer. Those videos are old already...
Uh oh. (Score:3)
You copied an MP3? Expect to be sued by the RIAA and their European buddies.
prior art (Score:2)
The question is: what if others have already used a similar method to send messages to us? How would you find that out? Has anybody tried to find out? Considering the possibility that we are not alone...
what's in our DNA then? (Score:1)
Should we start checking our own DNA for encoded files from our deep ancestors?
But DNA has a Half-Life of 521 years (Score:1)
Slashdot told me so [slashdot.org]
10,000 years my ass....
Great. (Score:2)
So now I can store my entire porn collection in one spurt.
Science!
Prior art (Score:1)
DNA reading/writing rather slow (Score:2)
At Last! (Score:2)
This is Fucking Awesome (Score:1)
Sorry, this is just the best news I've read in ages. Fucking AWESOME!!!
Re: (Score:2)
Porn is already encoded in DNA. Sometimes a bit of silicone is added.