
Data Deduplication Comparative Review

snydeq writes "InfoWorld's Keith Schultz provides an in-depth comparative review of four data deduplication appliances to vet how well the technology stacks up against the rising glut of information in today's datacenters. 'Data deduplication is the process of analyzing blocks or segments of data on a storage medium and finding duplicate patterns. By removing the duplicate patterns and replacing them with much smaller placeholders, overall storage needs can be greatly reduced. This becomes very important when IT has to plan for backup and disaster recovery needs or when simply determining online storage requirements for the coming year,' Schultz writes. 'If admins can increase storage usage 20, 40, or 60 percent by removing duplicate data, that allows current storage investments to go that much further.' Under review are dedupe boxes from FalconStor, NetApp, and SpectraLogic."
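For readers who want to see the idea in miniature, here is a rough sketch of fixed-block deduplication in Python. It is a toy model, not any of the reviewed appliances' actual implementation, and the 4KB block size is just an illustrative choice: each block is hashed, unique blocks are stored once, and files become lists of small placeholders (the hashes).

```python
import hashlib
import os

BLOCK_SIZE = 4096  # illustrative block size; real appliances vary


class DedupStore:
    """Toy fixed-block dedup store: unique blocks are kept once, keyed by hash."""

    def __init__(self):
        self.blocks = {}   # digest -> block bytes (each stored once)
        self.files = {}    # filename -> list of digests (the "placeholders")

    def write(self, name, data):
        digests = []
        for i in range(0, len(data), BLOCK_SIZE):
            block = data[i:i + BLOCK_SIZE]
            digest = hashlib.sha256(block).hexdigest()
            self.blocks.setdefault(digest, block)   # store only if not seen before
            digests.append(digest)
        self.files[name] = digests

    def read(self, name):
        return b"".join(self.blocks[d] for d in self.files[name])

    def physical_bytes(self):
        return sum(len(b) for b in self.blocks.values())


store = DedupStore()
payload = os.urandom(1 << 20)            # 1 MB of data
store.write("backup_monday", payload)
store.write("backup_tuesday", payload)   # identical copy: adds no new blocks
assert store.read("backup_tuesday") == payload
print(store.physical_bytes())            # ~1 MB physical for 2 MB logical
```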
  • Not enough products (Score:3, Interesting)

    by ischorr ( 657205 ) on Wednesday September 15, 2010 @07:24PM (#33594178)

    Odd that, in reviewing this class of products, they didn't include the most common deduping NAS/SAN appliance: the EMC NS-series (particularly the NS20).

  • Re:Wrong layer (Score:3, Interesting)

    by bersl2 ( 689221 ) on Wednesday September 15, 2010 @07:24PM (#33594184) Journal

    No, deduplication has quite a bit of policy attached to it. Sometimes you want multiple independent copies of a file (well, maybe not in a data center, but why should the filesystem know that?). The filesystem should store the data it's told to; leave the deduplication to higher layers of a system.

  • Re:Wrong layer (Score:3, Interesting)

    by PCM2 ( 4486 ) on Wednesday September 15, 2010 @07:31PM (#33594236) Homepage

    The filesystem should store the data it's told to; leave the deduplication to higher layers of a system.

    But if that's the kind of deduplication you're talking about, does it really make sense to try to do it at the block level, as these boxes seem to be doing? Seems like you'd want to analyze files or databases in a more intelligent fashion.

  • Re:Wrong layer (Score:3, Interesting)

    by dougmc ( 70836 ) <dougmc+slashdot@frenzied.us> on Wednesday September 15, 2010 @07:45PM (#33594360) Homepage

    But if that's the kind of deduplication you're talking about, does it really make sense to try to do it at the block level, as these boxes seem to be doing? Seems like you'd want to analyze files or databases in a more intelligent fashion.

    This isn't a new thing -- it's a tried and true backup strategy, and it's quite effective at making your backup tapes go further. It increases the complexity of the backup setup, but it's mostly transparent to the user beyond the saved space.

    As for doing it at the file level rather than the block level, yes, that makes sense, but the block level does too. Think of a massive database file where only a few rows in a table changed, or a massive log file that only had some stuff appended to the end.
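    (A quick sketch of that point, assuming a toy fixed 4KB block size: when a log file only grows at the end, hashing its blocks shows that only the appended blocks are new and need storing.)

```python
import hashlib
import os

BLOCK = 4096  # assumed fixed block size for this toy comparison


def block_digests(data):
    """Hash each fixed-size block of a byte string."""
    return [hashlib.sha256(data[i:i + BLOCK]).hexdigest()
            for i in range(0, len(data), BLOCK)]


log_v1 = os.urandom(10 * BLOCK)              # yesterday's log (10 blocks)
log_v2 = log_v1 + os.urandom(2 * BLOCK)      # today's log: data only appended

old = set(block_digests(log_v1))
new = [d for d in block_digests(log_v2) if d not in old]
print(f"{len(new)} of {len(block_digests(log_v2))} blocks are new")  # 2 of 12
```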

  • This is new? (Score:3, Interesting)

    by Angst Badger ( 8636 ) on Wednesday September 15, 2010 @07:48PM (#33594384)

    Didn't Plan 9's filesystem combine journaling and block-level de-duplication years ago?

  • by Anonymous Coward on Wednesday September 15, 2010 @07:57PM (#33594460)
    http://www.opendedup.org/ [opendedup.org]
  • Re:Wrong layer (Score:1, Interesting)

    by Anonymous Coward on Wednesday September 15, 2010 @08:17PM (#33594624)

    This technology is not just deduplicated backups; this is deduplicated STORAGE. Big difference. Combine it with a SAN that has thin provisioning and automatic on-the-fly tiering between cache, SSD, FC, and ATA disks, and you can have a decent, cost-effective setup. Oddly, the cost per GB is about the same, but you buy less, you get both fast and slow disks, and you get a lot of integrated DR features. I'll know how it all works in a few months; we are about two months away from a rolling upgrade of several Clariion CX3-80s to CX4s. It looked really good in the lab ;)

    Although we will have to increase our MPLS bandwidth, we will also be getting rid of tapes. I know people claim tapes are cheap, but even with a great backup software setup that is as automated as possible, you still have people on the ground loading and unloading, you pay for the Iron Mountain or Recall trucks, and you pay dearly for the tape hardware. We have older StorageTek SL500s of various sizes. Those bitches can cost like 250K with a support contract, and you are pushing data over your network, or best case over your fiber network, every night. Need to do a recovery? Call Iron Mountain and wait a few hours for the tapes to arrive. Blah.

    I guess every situation is different, but for us, getting rid of tape, retiring our older CX3-80s, and migrating to a CX4 with more features was a sound decision over keeping our existing setup. The ROI is less than 2 years, and we can use the additional features immediately.

    Kind of unrelated, but I'd like to get rid of FC and move to 10Gb iSCSI or FCoE. For now, though, I'm happy with the intermediate steps.

  • Re:Wrong layer (Score:3, Interesting)

    by hoggoth ( 414195 ) on Wednesday September 15, 2010 @08:51PM (#33594980) Journal

    > Getting it onto my linux box, now.. there's the rub

    So don't put it on Linux. Set up a Solaris or Nexenta box. I just did it. I installed a Nexenta server with 1TB of mirrored, checksummed storage in 15 minutes. I wrote it up here http://petertheobald.blogspot.com/ [blogspot.com] - it was extremely easy. Now all of my computers back up to the Nexenta server. All of my media is on it. I have daily snapshots of everything at almost no cost in disk storage.

  • Re:Wrong layer (Score:4, Interesting)

    by dgatwood ( 11270 ) on Wednesday September 15, 2010 @09:02PM (#33595078) Homepage Journal

    I think it depends on which scheme you're talking about.

    Basic de-duplication techniques might focus only on blocks being identical. That would work for eliminating actual duplicated files, but would be nearly useless for eliminating portions of files unless those files happen to be block-structured themselves (e.g. two disk images that contain mostly the same files at mostly the same offsets).

    De-duplicating the boilerplate content in two Word documents, however, requires not only discovering that the content is the same, but also dealing with the fact that the content in question likely spans multiple blocks, and more to the point, dealing with the fact that the content will almost always span those blocks differently in different files. Thus, I would expect the better de-duplication schemes to treat files as glorified streams, and to de-duplicate stream fragments rather than operating at the block level. Block level de-duplication is at best a good start.

    What de-duplication should ideally not be concerned with (and I think this is what you are asking about) are the actual names of the files or where they came from. That information is a good starting point for rapidly de-duplicating the low hanging fruit (identical files, multiple versions of a single file, etc.), but that doesn't mean that the de-duplication software should necessarily limit itself to files with the same name or whatever.

    Does that answer the question?
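    (To make the "glorified streams" idea concrete, here is a rough sketch of content-defined chunking with a simple rolling hash; the window, mask, and minimum-chunk values are arbitrary illustrative choices, and real products use more sophisticated fingerprinting. Because boundaries depend on the content itself rather than on fixed offsets, the same data typically yields mostly the same chunks even after it shifts within a file.)

```python
import hashlib
import os

WINDOW = 48            # rolling-hash window, illustrative
MASK = (1 << 11) - 1   # cut where the low 11 hash bits are zero (~2 KB avg chunk)
MIN_CHUNK = 256        # avoid pathologically small chunks
BASE, MOD = 257, (1 << 31) - 1


def chunks(data):
    """Content-defined chunking with a simple Rabin-Karp rolling hash.
    Boundaries are chosen by the bytes themselves, so the same content
    produces the same chunks even when it sits at different file offsets."""
    pow_w = pow(BASE, WINDOW - 1, MOD)
    out, start, h = [], 0, 0
    for i in range(len(data)):
        if i >= WINDOW:
            h = (h - data[i - WINDOW] * pow_w) % MOD   # drop byte leaving the window
        h = (h * BASE + data[i]) % MOD                 # add the new byte
        if i - start + 1 >= MIN_CHUNK and (h & MASK) == 0:
            out.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        out.append(data[start:])
    return out


content = os.urandom(64 * 1024)
shifted = b"a small header prepended later" + content   # same data, new offsets

orig_chunks = {hashlib.sha256(c).hexdigest() for c in chunks(content)}
shift_chunks = {hashlib.sha256(c).hexdigest() for c in chunks(shifted)}
print(f"{len(orig_chunks & shift_chunks)} of {len(shift_chunks)} "
      "chunks still match despite the shifted offsets")
```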

  • by immortalpob ( 847008 ) on Wednesday September 15, 2010 @09:49PM (#33595490)
    You are missing his point. On a non-deduplicated system, if one block goes bad you lose one file; on a deduplicated system, you can lose any number of files due to one bad block. It gets worse when you consider the supposed panacea of non-backup deduplication: if all of your servers are VMs residing on the same deduplicated storage, one bad block can take them ALL DOWN. Now, admittedly, any dedupe solution will sit on some type of RAID, but there is still the possibility of something terrible happening, and that is made worse by the likelihood of a URE during a RAID 5 rebuild.
  • Re:Wrong layer (Score:3, Interesting)

    by drsmithy ( 35869 ) <drsmithy@gmail. c o m> on Wednesday September 15, 2010 @11:44PM (#33596192)

    Wouldn't a compressed filesystem already do this? They eliminate duplicate blocks and empty space; you don't just get compression from nowhere.

    No, because compression is limited to a single dataset. Deduplication can act across multiple datasets (assuming they're all on the same underlying block device).

    Consider an example with 4 identical 10MB files in 4 different locations on a drive, each of which can be compressed by 50%.

    "Logical" space used is 40MB.
    Using compression, they will fit into 20MB.
    Using dedupe, they will fit somewhere in between 5MB and 10MB.
    Using dedupe and compression, they will fit into ~5MB (probably a bit less).

    It doesn't really negate the need for good housekeeping routines or good programming. Do you really want 100 copies of record X, or would one suffice?

    Far better to let the computer do the heavy lifting than to try to impose partial order on an inherently chaotic situation.

    Not to mention that the three textbook scenarios where dedupe really shines are backups, email and virtual machines, none of which can really be helped by "better housekeeping".
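    (As a back-of-the-envelope check on the arithmetic above, here is a small Python sketch. Whole-file hashing stands in for block-level dedupe, and the compression ratio is whatever zlib achieves on the sample data rather than the 50% in the example; the point is the relationship: compression alone scales with the number of copies, dedupe collapses the copies, and the combination compresses only the single surviving copy.)

```python
import hashlib
import zlib

# Echoing the 4 x 10MB example above, scaled down to four identical ~1MB
# "files" of compressible text so the numbers are quick to reproduce.
file_data = b"customer_record,2010-09-15,status=ok\n" * 30000
files = {f"copy{i}": file_data for i in range(4)}

logical = sum(len(d) for d in files.values())

# Compression alone works per file, so four copies still cost four times.
compressed = sum(len(zlib.compress(d)) for d in files.values())

# Dedupe alone (whole-file hashing here for brevity) keeps one copy of
# identical content, however many names point at it.
unique = {hashlib.sha256(d).hexdigest(): d for d in files.values()}
deduped = sum(len(d) for d in unique.values())

# Both: dedupe first, then compress the single surviving copy.
both = sum(len(zlib.compress(d)) for d in unique.values())

print(f"logical={logical}  compressed={compressed}  "
      f"deduped={deduped}  dedupe+compress={both}")
```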
