Data Deduplication Comparative Review

Posted by samzenpus
from the a-little-order-please dept.
snydeq writes "InfoWorld's Keith Schultz provides an in-depth comparative review of four data deduplication appliances to vet how well the technology stacks up against the rising glut of information in today's datacenters. 'Data deduplication is the process of analyzing blocks or segments of data on a storage medium and finding duplicate patterns. By removing the duplicate patterns and replacing them with much smaller placeholders, overall storage needs can be greatly reduced. This becomes very important when IT has to plan for backup and disaster recovery needs or when simply determining online storage requirements for the coming year,' Schultz writes. 'If admins can increase storage usage 20, 40, or 60 percent by removing duplicate data, that allows current storage investments to go that much further.' Under review are dedupe boxes from FalconStor, NetApp, and SpectraLogic."

  • Second post (Score:2, Funny)

    by Anonymous Coward

    Same as the first.

  • Wrong layer (Score:5, Insightful)

    by Hatta (162192) on Wednesday September 15, 2010 @07:15PM (#33594088) Journal

    Filesystems should be doing this.

    • Re: (Score:3, Interesting)

      by bersl2 (689221)

      No, deduplication has quite a bit of policy attached to it. Sometimes you want multiple independent copies of a file (well, maybe not in a data center, but why should the filesystem know that?). The filesystem should store the data it's told to; leave the deduplication to higher layers of a system.

      • Re: (Score:3, Interesting)

        by PCM2 (4486)

        The filesystem should store the data it's told to; leave the deduplication to higher layers of a system.

        But if that's the kind of deduplication you're talking about, does it really make sense to try to do it at the block level, as these boxes seem to be doing? Seems like you'd want to analyze files or databases in a more intelligent fashion.

        • Re: (Score:3, Interesting)

          by dougmc (70836)

          But if that's the kind of deduplication you're talking about, does it really make sense to try to do it at the block level, as these boxes seem to be doing? Seems like you'd want to analyze files or databases in a more intelligent fashion

          This isn't a new thing -- it's a tried and true backup strategy, and it's quite effective at making your backup tapes go further. It increases the complexity of the backup setup, but it's mostly transparent to the user beyond the saved space.

          As for doing it at the file level rather than the block level, yes, that makes sense, but the block level does too. Think of a massive database file where only a few rows in a table changed, or a massive log file that only had some stuff appended to the end.
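The appended-log case above can be sketched in a few lines of Python. This is only an illustration, not how any of the reviewed appliances work: the 4096-byte block size, the helper names `block_hashes` and `new_blocks`, and the synthetic log contents are all assumptions made for the example.

```python
import hashlib

BLOCK_SIZE = 4096  # assumed block size, chosen only for illustration


def block_hashes(data: bytes, block_size: int = BLOCK_SIZE) -> list[str]:
    """Split data into fixed-size blocks and return one SHA-256 digest per block."""
    return [
        hashlib.sha256(data[i:i + block_size]).hexdigest()
        for i in range(0, len(data), block_size)
    ]


def new_blocks(old: bytes, new: bytes) -> int:
    """Count blocks in `new` whose digest did not already appear in `old`."""
    seen = set(block_hashes(old))
    return sum(1 for h in block_hashes(new) if h not in seen)


def make_block(i: int) -> bytes:
    """Build one synthetic block whose contents depend on i (hypothetical log data)."""
    return i.to_bytes(4, "big") * (BLOCK_SIZE // 4)


# A hypothetical log of 100 blocks, then the same log with 10 more blocks appended.
old_log = b"".join(make_block(i) for i in range(100))
new_log = old_log + b"".join(make_block(i) for i in range(100, 110))

total = len(block_hashes(new_log))
changed = new_blocks(old_log, new_log)
print(f"{changed} of {total} blocks need to be stored again")
```

Only the appended blocks are new, so a block-level scheme stores those and references the rest. The example appends on a block boundary; if the old file ended mid-block, that last partial block would also count as changed.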

      • by hawguy (1600213)

        No, deduplication has quite a bit of policy attached to it. Sometimes you want multiple independent copies of a file (well, maybe not in a data center, but why should the filesystem know that?). The filesystem should store the data it's told to; leave the deduplication to higher layers of a system.

        Why do you want multiple independent copies of a file? If you're doing it because your disk storage system is so flakey that you aren't sure you can read the file, deduplication policy is not what you need -- you need a more reliable storage system and backups.

        Most disks have a fine line between throwing random unrecoverable read errors and failing completely, so there's little value in having multiple copies of the same file on the same physical disk. (and most storage systems will have automatically repla

    • Filesystems should be doing this.

      The one on your desktop machine, or the primary NAS storage that you access shared data from, or the backup server that ends up getting it all anyway? You see, this is a shared database problem. If your local filesystem does this, then it has to 'share' knowledge of all the unique blocklets with every other server/filesystem that wishes to share in this compressed file space. De-duplication is a means of compression that works across many filesystems - or at least it can be, if it is properly implemented

      • by icebike (68054)

        Well in the end, does not the filesystem running on the device end up controlling the actual reads and writes regardless of whether the file is shared across the network or across the world?

        My take is that there is not much to justify the claim that this should be in the filesystem vs the hardware. If you don't want to de-duplicate some data (for whatever reason) then you don't put it on that type of storage.

        But it seems to me that a hardware approach is a perfectly reasonable layer to do this. It elimin

    • by JWSmythe (446288)

      Wouldn't a compressed filesystem already do this? They don't just get the compression from nowhere. They eliminate duplicate blocks and empty space. You don't just get compression from nowhere.

      Pick your platform. I know in both Linux and Windows, there have been compressed filesystems for quite some time.

      It doesn't really negate the need for good housekeeping routines, nor good programming. Do you really want 100 copies of record X, or would one suffice? S

      • by icebike (68054)

        True, compression does a lot of this.

        But De-duplication does that and goes one step further.

        Multiple copies of the same block of data (either entire files or portions of files) that match even if stored in separate directories can be replaced by a pointer to a single copy of that file or block.

        How many times would, say, the boilerplate at the bottom of a lawyer/doctor/accountant's file systems appear verbatim in every single document filed in every single directory?

        A proper system might allow you to have ju
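The "pointer to a single copy" idea in this comment can be sketched as a toy content-addressed store. Everything here is hypothetical: the `DedupStore` class, the 8-byte block size, and the file paths are invented for the example and do not describe any vendor's implementation.

```python
import hashlib


class DedupStore:
    """Toy content-addressed store: each unique block is kept once; files hold digests."""

    def __init__(self, block_size: int = 8):
        self.block_size = block_size
        self.blocks: dict[str, bytes] = {}     # digest -> block contents
        self.files: dict[str, list[str]] = {}  # path -> ordered list of digests

    def write(self, path: str, data: bytes) -> None:
        digests = []
        for i in range(0, len(data), self.block_size):
            block = data[i:i + self.block_size]
            digest = hashlib.sha256(block).hexdigest()
            self.blocks.setdefault(digest, block)  # keep the block only if unseen
            digests.append(digest)
        self.files[path] = digests

    def read(self, path: str) -> bytes:
        """Reassemble a file by following its digests back to the stored blocks."""
        return b"".join(self.blocks[d] for d in self.files[path])


store = DedupStore()
boilerplate = b"CONFIDENTIAL NOTICE".ljust(32, b".")  # 32 bytes = 4 blocks (made-up text)
store.write("/clients/a/letter.doc", boilerplate)
store.write("/clients/b/letter.doc", boilerplate)

logical_blocks = sum(len(d) for d in store.files.values())
print(f"{logical_blocks} logical blocks, {len(store.blocks)} stored")
```

Both paths read back the full text, but the second write adds only a list of digests. The directory a file lives in plays no part in the lookup, which is the point the comment makes.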

        • by JWSmythe (446288)

          How many times would, say, the boilerplate at the bottom of a lawyer/doctor/accountant's file systems appear verbatim in every single document filed in every single directory?

          I won't argue about that. I'm still shocked to see the bad housekeeping practices on various servers I've worked on. No, really, you don't need site_old, site_back, site_backup, site_backup_1988, and site_backup_y2k. Has anyone even considered getting rid of those? Nope. They're kept "just in case". What "just in cas

      • by dgatwood (11270)

        Yes and no. Compression generally does involve reduction of duplication of information in one form or another, but does so at a finer level of granularity. With a compressed filesystem, you'll generally see compression of the data within a block, maybe across multiple blocks to some degree, but for the most part, you'd expect the lookup tables to be most efficient at compressing when they are employed on a per-file basis. The more data that shares a single compression table, the closer your input gets to

        • by icebike (68054)

          Thus, the deduplication algorithms use various techniques to determine how similar two files are before deciding to try to express one in terms of the other.

          But I understood de-duplication to be not concerned with files at all. Simply blocks of data on the device.

          As such it might de-duplicate the boilerplate out of a couple hundred thousand Word documents scattered across many different directories.

          Is that not the case? Are they not yet that sophisticated?

          • Re:Wrong layer (Score:4, Interesting)

            by dgatwood (11270) on Wednesday September 15, 2010 @09:02PM (#33595078) Journal

            I think it depends on which scheme you're talking about.

            Basic de-duplication techniques might focus only on blocks being identical. That would work for eliminating actual duplicated files, but would be nearly useless for eliminating portions of files unless those files happen to be block-structured themselves (e.g. two disk images that contain mostly the same files at mostly the same offsets).

            De-duplicating the boilerplate content in two Word documents, however, requires not only discovering that the content is the same, but also dealing with the fact that the content in question likely spans multiple blocks, and more to the point, dealing with the fact that the content will almost always span those blocks differently in different files. Thus, I would expect the better de-duplication schemes to treat files as glorified streams, and to de-duplicate stream fragments rather than operating at the block level. Block level de-duplication is at best a good start.

            What de-duplication should ideally not be concerned with (and I think this is what you are asking about) are the actual names of the files or where they came from. That information is a good starting point for rapidly de-duplicating the low hanging fruit (identical files, multiple versions of a single file, etc.), but that doesn't mean that the de-duplication software should necessarily limit itself to files with the same name or whatever.

            Does that answer the question?
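The difference between fixed blocks and "stream fragments" can be sketched with content-defined chunking, where chunk boundaries are chosen from the data itself rather than from byte offsets. This is a simplified illustration under stated assumptions: the window size, the mask, and the function names are invented here, and the window is rehashed with `zlib.crc32` at every position for clarity, where a production system would more likely use a true rolling hash.

```python
import hashlib
import zlib

WINDOW = 16  # bytes examined when deciding on a boundary (assumed value)
MASK = 0x3F  # boundary when the low 6 bits are zero, ~64-byte average chunks (assumed)


def fixed_chunks(data: bytes, size: int = 64) -> list[bytes]:
    """Cut at fixed byte offsets, as a plain block-level scheme would."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def content_chunks(data: bytes) -> list[bytes]:
    """Cut wherever the hash of the preceding WINDOW bytes matches the mask."""
    chunks, start = [], 0
    for i in range(WINDOW, len(data) + 1):
        if zlib.crc32(data[i - WINDOW:i]) & MASK == 0:
            chunks.append(data[start:i])
            start = i
    if start < len(data):
        chunks.append(data[start:])
    return chunks


def shared(a: list[bytes], b: list[bytes]) -> int:
    """Count chunks of b whose digest also occurs among the chunks of a."""
    known = {hashlib.sha256(c).digest() for c in a}
    return sum(1 for c in b if hashlib.sha256(c).digest() in known)


# Deterministic pseudo-random "boilerplate" of 4096 bytes, placed after two
# different-length prefixes so that it sits at different offsets in each file.
body = b"".join(hashlib.sha256(bytes([i])).digest() for i in range(128))
doc_a = b"Dear A, " + body
doc_b = b"Dear Client B, " + body

shared_fixed = shared(fixed_chunks(doc_a), fixed_chunks(doc_b))
cdc_b = content_chunks(doc_b)
shared_cdc = shared(content_chunks(doc_a), cdc_b)
print(f"fixed blocks shared: {shared_fixed}")
print(f"content-defined chunks shared: {shared_cdc} of {len(cdc_b)}")
```

Because the boilerplate starts at a different offset in each document, the fixed-size blocks never line up, while the content-defined boundaries fall at the same places inside the shared text and most chunks match.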

            • Re: (Score:3, Funny)

              by StikyPad (445176)

              Sounds like what we need is a giant table of all possible byte values up to 2^n length, then we can just provide the index to this master table instead of the data itself. I call this proposal the storage-storage tradeoff where, in exchange for requiring large amounts of storage, we require even more storage. I'll even throw in the extra time requirements for free.

          • by drsmithy (35869)

            But I understood de-duplication to be not concerned with files at all. Simply blocks of data on the device.

            It depends.

            Simplistic dedupe schemes only operate at the file level. More advanced schemes operate at the block/cluster level.
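A small sketch of why the two levels behave differently, assuming a toy 4-byte block size and two made-up file versions that differ only in their final block. The function names are hypothetical.

```python
import hashlib

BLOCK = 4  # tiny block size so the example is easy to follow (assumed)


def file_level_bytes(files: list[bytes]) -> int:
    """Bytes stored if only whole identical files are collapsed."""
    unique = {hashlib.sha256(f).digest(): len(f) for f in files}
    return sum(unique.values())


def block_level_bytes(files: list[bytes]) -> int:
    """Bytes stored if every identical block is kept only once."""
    unique = {}
    for f in files:
        for i in range(0, len(f), BLOCK):
            block = f[i:i + BLOCK]
            unique[hashlib.sha256(block).digest()] = len(block)
    return sum(unique.values())


v1 = b"AAAABBBBCCCCDDDD"
v2 = b"AAAABBBBCCCCEEEE"  # same as v1 except for the final block

file_level = file_level_bytes([v1, v2])
block_level = block_level_bytes([v1, v2])
print(f"file-level: {file_level} bytes, block-level: {block_level} bytes")
```

File-level dedupe sees two different files and stores both in full; block-level dedupe stores the three shared blocks once plus the two differing ones.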

      • Re: (Score:3, Interesting)

        by drsmithy (35869)

        Wouldn't a compressed filesystem already do this? They don't just get the compression from nowhere. They eliminate duplicate blocks and empty space. You don't just get compression from nowhere.

        No, because compression is limited to a single dataset. Deduplication can act across multiple datasets (assuming they're all on the same underlying block device).

        Consider an example with 4 identical files of 10MB in 4 different locations on a drive, that can be compressed at 50%.

        "Logical" space used is 40MB.
        Usi
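The comment is cut off here, so what follows is not the commenter's own arithmetic. It is a rough sketch of the numbers the example sets up, assuming dedupe collapses the four identical files to exactly one copy and ignoring any metadata overhead.

```python
FILE_MB = 10       # size of each file, from the example
COPIES = 4         # number of identical copies, from the example
COMPRESSION = 0.5  # each file compresses to 50% of its size, from the example

logical = FILE_MB * COPIES                # what the filesystem reports
compression_only = logical * COMPRESSION  # every copy compressed separately
dedupe_only = FILE_MB                     # one copy kept, uncompressed (assumed ideal)
both = FILE_MB * COMPRESSION              # one copy kept, then compressed

print(f"logical:          {logical} MB")
print(f"compression only: {compression_only} MB")
print(f"dedupe only:      {dedupe_only} MB")
print(f"both:             {both} MB")
```

Under these assumptions compression alone still stores four compressed copies, while dedupe removes the repetition that compression within a single file cannot see.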

    • Data deduplication is mostly being used for virtual servers. So no, this is being done at the right level, the block level.

    • Re:Wrong layer (Score:4, Insightful)

      by drsmithy (35869) <drsmithy@@@gmail...com> on Wednesday September 15, 2010 @11:41PM (#33596174)

      Filesystems should be doing this.

      No, block devices should be doing this. Then you get the benefits regardless of which filesystem you want to layer on top.

  • by leathered (780018) on Wednesday September 15, 2010 @07:19PM (#33594114)

    The shiny new NetApp appliance that my PHB decided to blow the last of our budget on saves around 30% by using de-dupe; however, we could have had 3 times the conventional storage for the same cost.

    NetApp is neat and all but horribly overpriced.
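The comparison above can be put into numbers. This is a back-of-the-envelope sketch that assumes "saves around 30%" means the data occupies 30% less physical space, and that capacity is the only thing being compared (no other features, support, or performance).

```python
DEDUPE_SAVINGS = 0.30              # fraction of physical space saved, from the comment
CONVENTIONAL_RAW_MULTIPLIER = 3.0  # raw capacity the same money would buy, from the comment

# If data shrinks by 30%, each raw terabyte holds 1 / 0.7 terabytes of logical data.
dedupe_effective = 1.0 / (1.0 - DEDUPE_SAVINGS)

# How much more logical data the conventional option holds for the same spend.
advantage = CONVENTIONAL_RAW_MULTIPLIER / dedupe_effective

print(f"dedupe box holds {dedupe_effective:.2f}x its raw capacity")
print(f"conventional storage holds {advantage:.1f}x as much for the same cost")
```

On capacity alone, dedupe would need to save about two thirds of the space to break even with three times the raw disk.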

    • by ischorr (657205)

      I assume they didn't spend the money only for dedupe? That box has a whole lot of features.

    • Re: (Score:3, Informative)

      by hardburn (141468)

      Was it near the end of the fiscal year? Good department managers know that if they use up their full budget, then it's harder to argue for a budget cut next year. Managers will sometimes blow any excess funds at the end of the year on things like this for that very reason.

      • Re: (Score:3, Insightful)

        by TheRaven64 (641858)
        No, good department managers don't know that. Department managers in companies with bad senior management know that. Companies with competent senior management are willing to increase the budgets for departments that have shown that they are fiscally responsible, and cut the budgets or fire the department heads of others.
    • However, they have a ton of features, including extremely high performance and reliability. For example, they monitor your unit and if a drive fails, they'll send you one by next-day air. Sometimes the first you know of the failure is when a disk shows up at your office.

      Don't get me wrong, they aren't the only way to go, we have a much cheaper storage solution for less critical data, but the people who think dropping a bunch of disks in a Linux server gives you the same thing as a NetApp for less cost are fooling them

      • by h4rr4r (612664)

        You can just have Nagios monitoring for errors and even order a drive off Amazon if you really wanted. NetApps have a lot of neat features; mailing you drives is not really one of them.

        • Ya it is (Score:4, Insightful)

          by Sycraft-fu (314770) on Wednesday September 15, 2010 @09:20PM (#33595246)

          Something you start to appreciate when you are called on to do a really high availability, high reliability system is to have features like this. For one thing it reduces the time it takes to get a replacement. Unless a drive fails late at night, you get one the next day. You don't have to rely on someone to notice the alert, place the order, etc. It just happens. Also, like most high end support companies, their shipping time is fairly late so even late in the day it is next day service. What arrives is the drive you need, in its caddy, ready to go.

          Then there's just the fact of having someone else help monitor things. It's easy to say "Oh ya I'll watch everything important and deal with it right away," but harder to do it. I've known more than a few people who are not nearly as good at monitoring their critical system as they ought to be. A backup is not a bad thing.

          You have to remember that the kind of stuff you are talking about for things like NetApps is when no downtime is ok, when no data loss is ok. You can't say "Ya a disk died and before we got a new one in, another died, so sorry, stuff is gone."

          Not saying that your situation needs it, but there are those that do. They offer other features along those lines like redundant units, so if one fails the other continues no problem.

          Basically they are for when data (and performance) is very important and you are willing to spend money for that. You put aside the tech-tough guy attitude of "I can manage it all myself," and accept that the data is that important.

          • Re: (Score:3, Insightful)

            by h4rr4r (612664)

            I mean have the nagios server order the drive without any human intervention.

            Also, if it was really critical, you would keep several disks ready to go on site. You know, for when you can't wait for next day. Also, like NetApp, you too can have many hot spares in the volume.

            If you have problems with people not noticing or reacting to alerts you need to fire them.

            • Re: (Score:2, Insightful)

              by Anonymous Coward
              I'll butt in and say that firing people is a piss-poor way to fix problems unless you've made very sure that the person in question needs to go. What you do is find out what happened if an alert goes unnoticed and make a change that removes the root cause of that failure. That may be that you have to let go of the guy doing drugs in the corner, but it may also be that your hardware issues alerts in a way that is easy to miss. You may also realize that perhaps an alert happens only once a year, and in that
            • Re: (Score:3, Insightful)

              Developing a monitoring system for a complicated piece of storage that reacts properly to every possible failure mode is a massive undertaking. It will take a lot of time just to figure out everything that you need to monitor, and the possible values for them during normal operation; let alone actually test that your system correctly detects and responds to every possibility.

              If your business is providing SAN management/support services, then I can see this as being worthwhile. It's a massive investment in t

          • by h4rr4r (612664)

            You have to remember that the kind of stuff you are talking about for things like NetApps is when no downtime is ok, when no data loss is ok.

            Then what you want is redundancy, because downtime and loss of data are guarantees in life. The real service NetApp provides is letting companies hire MCSEs and be ok with the job they do. They spend money to outsource this part of their IT, which is fine. Just do not pretend that they are doing anything else.

          • by dbIII (701233)

            You put aside the tech-tough guy attitude of "I can manage it all myself,"

            And then you wonder if the guys you outsourced it to care enough to do things as advertised.
            It's the attitude of them losing a small contract versus you losing your job. Unless you are a HUGE customer you have to assume their care factor is zero and have at least something to fall back on if they take too long or don't come through at all. Last week's backup will be missing things but if it's on site and you can get stuff from it NO

        • by drsmithy (35869)

          You can just have Nagios monitoring for errors and even order a drive off Amazon if you really wanted.

          Not even touching on all the things that could go wrong with this (and there are many), the best response time you're going to be looking at for this is ~12 hours, and that's only in ideal circumstances.

          NetApp will have a replacement drive on your doorstep in 4-8 hours, often less.

    • by alen (225700)

      No shit

      We have dedupe and plain old tape. 20 LTO-4 tapes cost $700. That's 20 - 60 terabytes depending on compression.

      We also pay $20,000 a year in support for a dedupe software app. Plus the disk, servers and power to keep it running, and you have to buy at least 2, since if your OS crashes then your data is gone.

      Cheap disk my ass

      The tape backup was a little pricey at first but the tapes hold so much and are so fast that we hardly buy any more tape. We used to blow $25,000 a year or more on DLT tape.

    • by mlts (1038732) *

      The NetApp box does a lot more than deduping:

      1: The newer models come with 512GB-1TB of SSD, and automatically place data either on the SSD, the FC drives, or the SATA platters depending on how much it is used. If the chunk of data is used all the time, it sits on the SSD. This helps a lot with the bottleneck of a lot of machines needing to access the same data block with deduplication. This is different from other disk solutions, as the NetApp chooses the "tier" of disk for you. However, a lot of serv

    • by drsmithy (35869)

      The shiny new NetApp appliance that my PHB decided to blow the last of our budget on saves around 30% by using de-dupe; however, we could have had 3 times the conventional storage for the same cost.

      Where are you going to get three times as much storage for the same cost (well, actually it'd need to be a lot less to pay for all the additional physical and logical infrastructure) that has redundant controllers, FC, iSCSI, NFS, SMB, no-impact snapshotting, dedupe, replication and 24x7x4 support ?

  • Not enough products (Score:3, Interesting)

    by ischorr (657205) on Wednesday September 15, 2010 @07:24PM (#33594178)

    Odd that if they reviewed this class of products they didn't review the most common deduping NAS/SAN appliance - the EMC NS-series (particularly NS20).

    • I found it odd too, though they seem to be reviewing boxes that do dedup on live data, as opposed to backup streams. Appliances like the NS-series claim dedup percentages of 95%+, but they accomplish this seeming miracle when slowly changing datasets are backed up over and over (even differential backup systems usually do a full backup fairly regularly).
      • by ischorr (657205)

        I can't say that I've ever heard of a dedup percentage of 95% related to the NS series, which is very similar to the products in this article (NAS/SAN server that does dedupe on live data that lives on the array). Maybe you're confusing it with products like Data Domain or Avamar or something?

    • Thirded. Data Domain (now part of EMC) really started the commercial use of this...

    • by alen (225700)

      If it's EMC then you need to be a global Fortune 10 company to afford it.

      I used to joke that they are like crack dealers. The initial hardware is not that much, but they get you on the disk upgrades, licenses to go above some storage size, backend bandwidth, etc.

      • by ischorr (657205)

        The NS20 goes head-to-head with that NetApp box, so I'm not sure if that's true in this case (need to be Fortune 10 to afford it). And from what I read a couple of days ago, it's the most commonly sold NAS product in this class...which is why I thought it was weird not to include it in the review. I'm curious what they would have said about it.

  • I can't wait until the Dilbert strip hits where the PHB does this across all their backups and deduplicates them all away, thinking he's just saved a ton of money on backup media.

    Redundancy can be a very good thing!

  • Are there any open-source filesystems that offer deduplication?

    It seems that the FS du jour changes faster than any of the promised 'optional' features ever materialize.

    Instead of working full-bore on The Next Great FS, it would be really nice to have compression, encryption, deduplication, shadow copies, and idle optimization running in EXT4.

    Maybe I'm just jaded, but I've been a Linux user for 12 years now. Sometimes it feels like the names of the technologies are changing, but nothing ever gets 'finished'

    • Instead of working full-bore on The Next Great FS, it would be really nice to have compression, encryption, deduplication, shadow copies, and idle optimization running in EXT4.

      To do all these things, you have to change how data is stored on the disk and what information is present. When you do this, you necessarily create a new file system. These aren't simple features that you can just tack onto an existing file system.

      I suspect that one of these days we will be running the ext10 file system that has most of these features and evolved from ext3 in a methodical way, but it will in no way actually resemble ext3. There will always be other systems being developed to try out new

    • Re: (Score:2, Informative)

      by suutar (1860506)
      There's a few. I've read there's a patchset for ZFS on FUSE that can do deduplication; there's also opendedup [slashdot.org] and lessfs [lessfs.com]. The problem is that none of these has been around long enough to be considered bulletproof yet, and for a filesystem whose job is to play fast and loose with file contents in the name of space savings, that's kinda worrisome.
  • by jgreco (1542031) on Wednesday September 15, 2010 @07:48PM (#33594382)

    ZFS offers dedupe, and is even available in prepackaged NAS distributions such as Nexenta and OpenNAS. You too can have these great features, for much less than NetApp and friends.

    • Re: (Score:3, Informative)

      by lisany (700361)

      Except NexentaStor (3.0.3) has an OpenSolaris upstream (which has gone away, by the way) kernel bug that hung our Nexenta test box. Not a real good first impression.

      • by jgreco (1542031)

        I found a ton of stuff I didn't really care for with Nexenta. They've put some good effort into it, and it'd be a fine way to go if you wanted commercial support, but overall it doesn't really seem to fit our needs here. ZFS itself is a resource pig, but on the other hand, resources have become relatively cheap. It's not unthinkable to jam gigs of RAM in a storage server ... today. Five years ago, though, that would have been much more likely to be a deal-breaker.

  • This is new? (Score:3, Interesting)

    by Angst Badger (8636) on Wednesday September 15, 2010 @07:48PM (#33594384)

    Didn't Plan 9's filesystem combine journaling and block-level de-duplication years ago?

    • by BitZtream (692029)

      Plan 9 could have the cure for cancer too but still no one gives a shit about it.

      Dedup is a good 30 years old at least, if you want to point out that it isn't new.

      Only slashdotters and Linux children get excited at silly things like this.

  • by MyLongNickName (822545) on Wednesday September 15, 2010 @09:57PM (#33595544) Journal

    After an analysis of a 1TB drive, I noticed that roughly 95% were 0's with only 5% being 1's.

    I was then able to compress this dramatically. I just record that there are 950M 0's and 50M 1's. The space taken up drops to around 37 bits. Throw in a few checksum bits, and I am still under eight bytes.

    I am not sure what is so hard about this disaster recovery planning. Heck, I figure I am up for a promotion after I implement this.

  • If the market leader isn't included in the review, I am wondering how worthy this report is.
