Data Deduplication Comparative Review
snydeq writes "InfoWorld's Keith Schultz provides an in-depth comparative review of four data deduplication appliances to vet how well the technology stacks up against the rising glut of information in today's datacenters. 'Data deduplication is the process of analyzing blocks or segments of data on a storage medium and finding duplicate patterns. By removing the duplicate patterns and replacing them with much smaller placeholders, overall storage needs can be greatly reduced. This becomes very important when IT has to plan for backup and disaster recovery needs or when simply determining online storage requirements for the coming year,' Schultz writes. 'If admins can increase storage usage 20, 40, or 60 percent by removing duplicate data, that allows current storage investments to go that much further.' Under review are dedupe boxes from FalconStor, NetApp, and SpectraLogic."
Second post (Score:2, Funny)
Same as the first.
Wrong layer (Score:5, Insightful)
Filesystems should be doing this.
Re: (Score:3, Interesting)
No, deduplication has quite a bit of policy attached to it. Sometimes you want multiple independent copies of a file (well, maybe not in a data center, but why should the filesystem know that?). The filesystem should store the data it's told to; leave the deduplication to higher layers of a system.
Re: (Score:3, Interesting)
The filesystem should store the data it's told to; leave the deduplication to higher layers of a system.
But if that's the kind of deduplication you're talking about, does it really make sense to try to do it at the block level, as these boxes seem to be doing? Seems like you'd want to analyze files or databases in a more intelligent fashion.
Re: (Score:3, Interesting)
But if that's the kind of deduplication you're talking about, does it really make sense to try to do it at the block level, as these boxes seem to be doing? Seems like you'd want to analyze files or databases in a more intelligent fashion
This isn't a new thing -- it's a tried and true backup strategy, and it's quite effective at making your backup tapes go further. It increases the complexity of the backup setup, but it's mostly transparent to the user beyond the saved space.
As for doing it at the file level rather than the block level, yes, that makes sense, but the block level does too. Think of a massive database file where only a few rows in a table changed, or a massive log file that only had some stuff appended to the end.
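The appended-log case can be sketched in a few lines. This is only an illustration under assumptions that are not in the thread: a 4 KiB block size, SHA-256 as the block fingerprint, and helper names (`block_hashes`, `new_blocks`) invented for the example. It shows that when a file only grows at the end, a block-level scheme stores just the new blocks.

```python
import hashlib

BLOCK_SIZE = 4096  # hypothetical block size, chosen only for illustration


def block_hashes(data: bytes, block_size: int = BLOCK_SIZE) -> list[str]:
    """Split data into fixed-size blocks and return one SHA-256 digest per block."""
    return [
        hashlib.sha256(data[i:i + block_size]).hexdigest()
        for i in range(0, len(data), block_size)
    ]


def new_blocks(previous: bytes, current: bytes) -> int:
    """Count blocks in `current` whose digest was not already stored for `previous`."""
    seen = set(block_hashes(previous))
    return sum(1 for h in block_hashes(current) if h not in seen)


# Yesterday's log: 100 distinct full blocks. Today's log: the same data plus 2 appended blocks.
yesterday = b"".join(i.to_bytes(4, "big") * (BLOCK_SIZE // 4) for i in range(100))
today = yesterday + b"".join(
    i.to_bytes(4, "big") * (BLOCK_SIZE // 4) for i in range(100, 102)
)

total = len(block_hashes(today))
added = new_blocks(yesterday, today)
print(f"{added} of {total} blocks need to be stored")  # 2 of 102 blocks need to be stored
```

A file-level scheme would see "today" as a different file and store all 102 blocks again. The sketch only works this neatly because the change is an append; an insertion in the middle of the file would shift every later block.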
Re: (Score:2)
No, deduplication has quite a bit of policy attached to it. Sometimes you want multiple independent copies of a file (well, maybe not in a data center, but why should the filesystem know that?). The filesystem should store the data it's told to; leave the deduplication to higher layers of a system.
Why do you want multiple independent copies of a file? If you're doing it because your disk storage system is so flakey that you aren't sure you can read the file, deduplication policy is not what you need -- you need a more reliable storage system and backups.
Most disks have a fine line between throwing random unrecoverable read errors and failing completely, so there's little value in having multiple copies of the same file on the same physical disk. (and most storage systems will have automatically repla
Which filesystem should be doing this??? (Score:2, Insightful)
Filesystems should be doing this.
The one on your desktop machine, or the primary NAS storage that you access shared data from, or the backup server that ends up getting it all anyway? You see, this is a shared database problem. If your local filesystem does this, then it has to 'share' knowledge of all the unique blocklets with every other server/filesystem that wishes to share in this compressed file space. De-duplication is a means of compression that works across many filesystems - or at least it can be, if it is properly implemented
Re: (Score:2)
Well in the end, does not the filesystem running on the device end up controlling the actual reads and writes regardless of whether the file is shared across the network or across the world?
My take is that there is not much to justify the claim that this should be in the filesystem vs the hardware. If you don't want to de-duplicate some data (for what ever reason) then you don't put it on that type of storage.
But it seems to me that a hardware approach is a perfectly reasonable layer to do this. It elimin
Re: (Score:2)
Rarely is it useful on the same local storage. Keeping live copies offsite or in separate hardware is a good strategy but on the same hardware is just wasteful.
Re: (Score:2)
Ah, so you want to go to other hardware to restore a file that you have a snapshot of on your local hardware? And that fileset happens to be oh say a few hundred gigabytes. Out of curiosity, do you manage production fileservers with end users that are able to do stupid things?
Re: (Score:2)
You and Vancorps are talking about two different things. Deduplication (whether done by the filesystem or the storage system) doesn't preclude having snapshots.
Vancorps was talking about the futility of keeping multiple copies of files on the same storage device as an aid to recovering corrupt data. He was not arguing that regular snapshots should not be made, just that redundant data could be deduped away without sacrificing any real measure of file integrity.
Re: (Score:2)
Wouldn't a compressed filesystem already do this? They don't just get the compression from nowhere. They eliminate duplicate blocks and empty space. You don't just get compression from nowhere.
Pick your platform. I know in both Linux and Windows, there have been compressed filesystems for quite some time.
It doesn't really negate the need for good housekeeping routines, nor good programming. Do you really want 100 copies of record X, or would one suffice? S
Re: (Score:2)
True, compression does a lot of this.
But De-duplication does that and goes one step further.
Multiple copies of the same block of data (either entire files or portions of files) that match even if stored in separate directories can be replaced by a pointer to a single copy of that file or block.
How many times would, say, the boilerplate at the bottom of a lawyer/doctor/accountant's file systems appear verbatim in every single document filed in every single directory?
A proper system might allow you to have ju
Re: (Score:2)
I won't argue about that. I'm still shocked to see the bad housekeeping practices on various servers I've worked on. No, really, you don't need site_old, site_back, site_backup, site_backup_1988, and site_backup_y2k. Has anyone even considered getting rid of those? Nope. They're kept "just in case". What "just in cas
Re: (Score:2)
Yes and no. Compression generally does involve reduction of duplication of information in one form or another, but does so at a finer level of granularity. With a compressed filesystem, you'll generally see compression of the data within a block, maybe across multiple blocks to some degree, but for the most part, you'd expect the lookup tables to be most efficient at compressing when they are employed on a per-file basis. The more data that shares a single compression table, the closer your input gets to
Re: (Score:2)
Thus, the deduplication algorithms use various techniques to determine how similar two files are before deciding to try to express one in terms of the other.
But I understood de-duplication to be not concerned with files at all. Simply blocks of data on the device.
As such it might de-duplicate the boilerplate out of a couple hundred thousand Word documents scattered across many different directories.
Is that not the case? Are they not yet that sophisticated?
Re:Wrong layer (Score:4, Interesting)
I think it depends on which scheme you're talking about.
Basic de-duplication techniques might focus only on blocks being identical. That would work for eliminating actual duplicated files, but would be nearly useless for eliminating portions of files unless those files happen to be block-structured themselves (e.g. two disk images that contain mostly the same files at mostly the same offsets).
De-duplicating the boilerplate content in two Word documents, however, requires not only discovering that the content is the same, but also dealing with the fact that the content in question likely spans multiple blocks, and more to the point, dealing with the fact that the content will almost always span those blocks differently in different files. Thus, I would expect the better de-duplication schemes to treat files as glorified streams, and to de-duplicate stream fragments rather than operating at the block level. Block level de-duplication is at best a good start.
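The difference between cutting at fixed offsets and cutting on content can be shown with a rough sketch. Everything specific here is an assumption made for illustration: the window size, the divisor, the use of `zlib.crc32` recomputed at each position (real systems typically use a true rolling hash such as a Rabin fingerprint), and random bytes standing in for boilerplate text. The point is only that content-defined boundaries move with the data, so the same content is recognized even when it sits at a different offset.

```python
import hashlib
import random
import zlib

WINDOW = 16    # bytes examined when deciding on a boundary (illustrative value)
DIVISOR = 64   # a boundary falls on roughly 1 in 64 positions (illustrative value)
BLOCK = 512    # fixed block size used for comparison (illustrative value)


def fixed_chunks(data: bytes) -> list[bytes]:
    """Cut at fixed offsets, as a plain block-level scheme would."""
    return [data[i:i + BLOCK] for i in range(0, len(data), BLOCK)]


def content_chunks(data: bytes) -> list[bytes]:
    """Cut wherever a checksum of the preceding WINDOW bytes hits a chosen value."""
    chunks, start = [], 0
    for end in range(WINDOW, len(data)):
        if zlib.crc32(data[end - WINDOW:end]) % DIVISOR == 0:
            chunks.append(data[start:end])
            start = end
    chunks.append(data[start:])
    return chunks


def shared(a: list[bytes], b: list[bytes]) -> int:
    """Number of distinct chunks (by SHA-256 digest) present in both lists."""
    def digest(chunk: bytes) -> bytes:
        return hashlib.sha256(chunk).digest()
    return len({digest(c) for c in a} & {digest(c) for c in b})


rng = random.Random(0)
boilerplate = bytes(rng.getrandbits(8) for _ in range(20_000))
shifted = b"X" + boilerplate   # same content, pushed one byte further into the file

print("fixed blocks shared:  ", shared(fixed_chunks(boilerplate), fixed_chunks(shifted)))
print("content chunks shared:", shared(content_chunks(boilerplate), content_chunks(shifted)))
```

With fixed blocks, a one-byte shift changes every block, so nothing matches. With content-defined chunks, only the chunk containing the inserted byte differs and the rest line up again.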
What de-duplication should ideally not be concerned with (and I think this is what you are asking about) are the actual names of the files or where they came from. That information is a good starting point for rapidly de-duplicating the low hanging fruit (identical files, multiple versions of a single file, etc.), but that doesn't mean that the de-duplication software should necessarily limit itself to files with the same name or whatever.
Does that answer the question?
Re: (Score:3, Funny)
Sounds like what we need is a giant table of all possible byte values up to 2^n length, then we can just provide the index to this master table instead of the data itself. I call this proposal the storage-storage tradeoff where, in exchange for requiring large amounts of storage, we require even more storage. I'll even throw in the extra time requirements for free.
Re: (Score:2)
But I understood de-duplication to be not concerned with files at all. Simply blocks of data on the device.
It depends.
Simplistic dedupe schemes only operate at the file level. More advanced schemes operate at the block/cluster level.
Re: (Score:3, Interesting)
Wouldn't a compressed filesystem already do this? They don't just get the compression from nowhere. They eliminate duplicate blocks and empty space. You don't just get compression from nowhere.
No, because compression is limited to a single dataset. Deduplication can act across multiple datasets (assuming they're all on the same underlying block device).
Consider an example with 4 identical files of 10MB in 4 different locations on a drive, that can be compressed at 50%.
"Logical" space used is 40MB.
Usi
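The arithmetic for this kind of comparison can be worked through directly. The figures below are derived here from the stated setup (4 identical 10MB files, 50% compression), not taken from the truncated comment, and they assume pointer and metadata overhead is negligible. The function name `stored_mb` is made up for the example.

```python
def stored_mb(files: int, file_mb: float, ratio: float) -> dict[str, float]:
    """Physical space needed for `files` identical files under each scheme.

    `ratio` is the compressed size as a fraction of the original (0.5 = 50%).
    Pointer and metadata overhead is ignored (an assumption).
    """
    logical = files * file_mb
    return {
        "logical": logical,
        "compression only": logical * ratio,     # each copy compressed separately
        "dedupe only": file_mb,                  # one stored copy, the rest are pointers
        "dedupe + compression": file_mb * ratio, # one stored copy, also compressed
    }


for scheme, mb in stored_mb(files=4, file_mb=10, ratio=0.5).items():
    print(f"{scheme}: {mb:g} MB")
# logical: 40 MB
# compression only: 20 MB
# dedupe only: 10 MB
# dedupe + compression: 5 MB
```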
Re: (Score:2)
Data deduplication is mostly being used for virtual servers. So no, this is being done at the right level, the block level.
Re:Wrong layer (Score:4, Insightful)
Filesystems should be doing this.
No, block devices should be doing this. Then you get the benefits regardless of which filesystem you want to layer on top.
Re: (Score:3, Informative)
It's not fully automatic, I assume? Since that would cause a major slowdown.
For manual dedupes, btrfs can do that as well, and a part of vserver patchset (not related to the main functionality) includes a hack that works for most Unix filesystems.
Re:Wrong layer (Score:5, Informative)
It is fully automatic and it's not that much of a slowdown. The reduced IO might actually provide a performance boost.
Re:Wrong layer (Score:5, Informative)
Re: (Score:3, Interesting)
> Getting it onto my linux box, now.. there's the rub
So don't put it on Linux. Set up a Solaris or Nexenta box. I just did it. I installed a Nexenta server with 1TB of mirrored, checksummed storage in 15 minutes. I wrote it up here http://petertheobald.blogspot.com/ [blogspot.com] - it was extremely easy. Now all of my computers back up to the Nexenta server. All of my media is on it. I have daily snapshots of everything at almost no cost in disk storage.
Re: (Score:3, Insightful)
OpenSolaris is dead, and there are kernel bugs in the latest version, so good luck with that. I looked at doing it at one time and due to fears about OpenSolaris I stayed away. I consider myself lucky.
Re: (Score:2)
Google luck on finding solutions to your problems that are based on logic and rational thinking, I doubt you can pull it off judging by your statements so far.
I dunno, I found it pretty easy. I got some interesting results too:
Critical Thinking - HowTo.Lifehack [google.com]
Virgo free weekly horoscope [google.com]
Actually that's pretty funny.
Maybe you're right, maybe it is hard to google luck on finding solutions to your problems that are based on logic and rational thinking.
Re: (Score:3, Insightful)
Sweet, thanks for the pointer. I was also concerned about the death of OpenSolaris but it sounds like Nexenta may be just what I want.
Nexenta is built off OpenSolaris and is, therefore, also dead - though it may take longer for the thrashing to stop.
Re: (Score:3, Informative)
Re: (Score:3, Informative)
The latest stable version of zfs-fuse, 0.6.9, includes pool version 23 which has dedup support. Haven't tried it out yet, though.
http://zfs-fuse.net/releases/0.6.9 [zfs-fuse.net]
Re: (Score:2)
I thought everything that used FUSE was slow as hell, is this not true?
Re: (Score:2)
Well, I don't think the userspace file system layer is the main slowdown on my file server box (using old hardware + a slower ethernet card; for a background backup system, it works), so I'm not speaking from experience here. I've heard the general idea is that a 30-60% slowdown is standard, depending on the operation.
Re: (Score:2)
Actually, this feature is a recent addition to ZFS, and it's the main reason I'm interested in putting ZFS on my file server.
You'll probably be disappointed. Dedupe savings for the kind of stuff you'd typically find on a home file server are minuscule.
Don't forget to weigh in the cost (Score:3, Informative)
The shiny new NetApp appliance that my PHB decided to blow the last of our budget on saves around 30% by using de-dupe, however we could have had 3 times conventional storage for the same cost.
NetApp is neat and all but horribly overpriced.
Re: (Score:2)
I assume they didn't spend the money only for dedupe? That box has a whole lot of features.
Re: (Score:3, Informative)
Was it near the end of the fiscal year? Good department managers know that if they use up their full budget, then it's harder to argue for a budget cut next year. Managers will sometimes blow any excess funds at the end of the year on things like this for that very reason.
Re: (Score:3, Insightful)
If you get it just for dedupe maybe (Score:2)
However they have a ton of features including extremely high performance and reliability. For example they monitor your unit and if a drive fails, they'll send you one next-day air. Sometimes the first you know of the failure is when a disk shows up at your office.
Don't get me wrong, they aren't the only way to go, we have a much cheaper storage solution for less critical data, but the people who think dropping a bunch of disks in a Linux server gives you the same thing as a NetApp for less cost are fooling them
Re: (Score:2)
You can just have nagios monitoring for errors and even order a drive off amazon if you really wanted. NetApps have a lot of neat features; mailing you drives is not really one of them.
Ya it is (Score:4, Insightful)
Something you start to appreciate when you are called on to do a really high availability, high reliability system is to have features like this. For one thing it reduces the time it takes to get a replacement. Unless a drive fails late at night, you get one the next day. You don't have to rely on someone to notice the alert, place the order, etc. It just happens. Also, like most high end support companies, their shipping time is fairly late so even late in the day it is next day service. What arrives is the drive you need, in its caddy, ready to go.
Then there's just the fact of having someone else help monitor things. It's easy to say "Oh ya I'll watch everything important and deal with it right away," but harder to do it. I've known more than a few people who are not nearly as good at monitoring their critical system as they ought to be. A backup is not a bad thing.
You have to remember that the kind of stuff you are talking about for things like NetApps is when no downtime is ok, when no data loss is ok. You can't say "Ya a disk died and before we got a new one in another died so sorry, stuff is gone."
Not saying that your situation needs it, but there are those that do. They offer other features along those lines like redundant units, so if one fails the other continues no problem.
Basically they are for when data (and performance) is very important and you are willing to spend money for that. You put aside the tech-tough guy attitude of "I can manage it all myself," and accept that the data is that important.
Re: (Score:3, Insightful)
I mean have the nagios server order the drive without any human intervention.
Also, if it was really critical you would keep several disks ready to go on site. You know, for when you can't wait for next day. Also, like NetApp, you too can have many hot spares in the volume.
If you have problems with people not noticing or reacting to alerts you need to fire them.
Re: (Score:2, Insightful)
Re: (Score:3, Insightful)
Developing a monitoring system for a complicated piece of storage that reacts properly to every possible failure mode is a massive undertaking. It will take a lot of time just to figure out everything that you need to monitor, and the possible values for them during normal operation; let alone actually test that your system correctly detects and responds to every possibility.
If your business is providing SAN management/support services, then I can see this as being worthwhile. It's a massive investment in t
Re: (Score:2)
You have to remember that the kind of stuff you are talking about for things like NetApps is when no downtime is ok, when no data loss is ok.
Then what you want is redundancy, because downtime and loss of data are guarantees in life. The real service NetApp provides is letting companies hire MCSEs and be ok with the job they do. They spend money to outsource this part of their IT, which is fine. Just do not pretend that they are doing anything else.
Re: (Score:2)
And then you wonder if the guys you outsourced it to care enough to do things as advertised.
It's the attitude of them losing a small contract versus you losing your job. Unless you are a HUGE customer you have to assume their care factor is zero and have at least something to fall back on if they take too long or don't come through at all. Last weeks backup will be missing things but if it's on site and you can get stuff from it NO
Re: (Score:2)
It's not just small to medium sized businesses; most tech companies, even really huge ones, don't buy this kind of enterprise equipment. You won't find any of it at Google or Amazon, for example, even though they are quite large.
Re: (Score:2)
You won't find any of it at Google or Amazon, for example, even though they are quite large.
Have no illusions, though. The Google and Amazon solutions are neither cheap nor easy to implement. They rely on top-notch engineers being able to build an intelligent storage layer on top of a bunch of dumb commodity disks. Their needs are specialized enough (and they have enough cash) that this makes some sense. But most businesses do not have that kind of talent or that kind of cash.
Like everything in engineering, this comes down to looking at your business requirements and finding a solution that me
Re: (Score:2)
You can just have nagios monitoring for errors and even order a drive off amazon if you really wanted.
Not even touching on all the things that could go wrong with this (and there are many), the best response time you're going to be looking at for this is ~12 hours, and that's only in ideal circumstances.
NetApp will have a replacement drive on your doorstep in 4-8 hours, often less.
Re: (Score:2)
No shit
We have dedupe and plain old tape. 20 LTO-4 tapes cost $700. That's 20 - 60 terabytes depending on compression.
We also pay $20,000 a year in support for a dedupe software app. Plus the disk, servers and power to keep it running, and you have to buy at least 2 since if your OS crashes then your data is gone
Cheap disk my ass
The tape backup was a little pricey at first but the tapes hold so much and are so fast that we hardly buy any more tape. We used to blow $25,000 a year or more on DLT tape
Re: (Score:2)
The Netapp box does a lot more than deduping:
1: The newer models come with 512GB-1TB of SSD, and automatically place data either on the SSD, the FC drives, or the SATA platters depending on how much it is used. If the chunk of data is used all the time, it sits on the SSD. This helps a lot with the bottleneck of a lot of machines needing to access the same data block with deduplication. This is different from other disk solutions, as the NetApp chooses the "tier" of disk for you. However, a lot of serv
Re: (Score:2)
The shiny new NetApp appliance that my PHB decided to blow the last of our budget on saves around 30% by using de-dupe, however we could have had 3 times conventional storage for the same cost.
Where are you going to get three times as much storage for the same cost (well, actually it'd need to be a lot less to pay for all the additional physical and logical infrastructure) that has redundant controllers, FC, iSCSI, NFS, SMB, no-impact snapshotting, dedupe, replication and 24x7x4 support ?
Re: (Score:2)
I don't think it takes a NetApp sales rep to recognize the value of a reliable storage system. I'm sure he would say the same of EMC - it's expensive but worth every penny when you've got hundreds (or thousands) of people relying on your storage.
If you're in a 10 person office, you can get by with less, but when you've got a large corporate environment, you'll recognize the advantage of paying for Netapp or EMC.
Re:Don't forget to weigh in the cost (Score:4, Insightful)
More disk is still so much cheaper it really cannot be justified on that front. More disks also mean more IOPS, so reducing sinning platters can be a bad thing.
There are some reasons to go for it, but even with thousands of clients it may or may not be suitable for what you are doing.
Re: (Score:3, Funny)
Satan, is that you?
Cheers,
Re:Don't forget to weigh in the cost (Score:4, Insightful)
Re: (Score:2)
More disk is still so much cheaper it really cannot be justified on that front.
Sure it can, easily.
If your primary concern is up-front cost, you shouldn't be buying equipment in an enterprise environment. The up-front cost is the _least_ of your concerns.
Re: (Score:2)
More disks don't necessarily mean more IOPS; a better storage system means better IOPS. If all you're looking for is raw IOPS, I'm sure you can build a system from commodity components that outperforms a reasonably sized NetApp or EMC filer. But you wouldn't be able to scale that system to 100TB or more.
And I wouldn't trust that home-brew system to run my company's database and other critical servers that have to run 24x7x365.
Re: (Score:2)
Are they really that superior to the Sun storage products (the ones Sun invented ZFS for) to be worth the big multiple in price? I mean, Sun isn't stuff-you-put-together-at-Frys prices either, but it's a lot cheaper than EMC or NetApp.
Re: (Score:2)
Are they really that superior to the Sun storage products (the ones Sun invented ZFS for) to be worth the big multiple in price? I mean, Sun isn't stuff-you-put-together-at-Frys prices either, but it's a lot cheaper than EMC or NetApp.
The issues that will give most people pause are those of maturity and future. Sun's solution hasn't been around for very long, and some features like FCP target and dedupe are *very* new. There's also something of a question mark over where it's going in the future with Or
Re: (Score:2)
The "big multiple in price" isn't really accurate either. I quickly priced out a rough equivalent to our 3140 in a 7410 with 4x20-spindle shelves, 2 "read accelerators", some 10GbE cards and 4Gb HBAs and it came out at over $250k. Obviously that's before discounting, and it has more/faster CPUs, but it's certainly in the same _ballpark_ as the ~$175k we paid.
I also just realised that $250k is probably not including support, which is likely to be knocking on the door of 6 figures for 3 years of 24x7x4 support
Not enough products (Score:3, Interesting)
Odd that if they reviewed this class of products they didn't review the most common deduping NAS/SAN appliance - the EMC NS-series (particularly NS20).
Re: (Score:2)
Re: (Score:2)
I can't say that I've ever heard of a dedup percentage of 95% related to the NS series, which is very similar to the products in this article (NAS/SAN server that does dedupe on live data that lives on the array). Maybe you're confusing it with products like Data Domain or Avamar or something?
Re: (Score:2)
Thirded. Data Domain (now part of EMC) really started the commercial use of this...
Re: (Score:2)
If it's EMC then you need to be a global Fortune 10 company to afford it
I used to joke that they are like crack dealers. The initial hardware is not that much, but they get you on the disk upgrades, licenses to go above some storage size, backend bandwidth, etc
Re: (Score:2)
The NS20 goes head-to-head with that NetApp box, so I'm not sure if that's true in this case (need to be fortune 10 to afford it). And from what I read a couple of days ago, it's the most commonly sold NAS product in this class...which is why I thought it was weird not to include it in the review. I'm curious what they would have said about it.
Re: (Score:2)
"Personally, I'd be _terrified_ of using dedup for primary storage. What this does is exactly the opposite of RAID - it squeezes every last bit of redundancy out of your data, and makes everything dependent upon the integrity of your blockpool database. Loose a single blocklet and you stand to lose _all_ of your data. "
Dedupe reduces multiple copies of the same data *on the same storage*
I think you're implying that having - probably purely at random - multiple copies of some files on the same FS is somehow
Re: (Score:2, Interesting)
Re: (Score:2)
That's why you have the system store more than one copy, and you have it validate their integrity when reading them. Think of it as sensible RAID. I suggest a quick Google for "zfs data integrity", etc.
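The idea of keeping more than one copy and validating on read can be sketched as below. This is a toy model, not how ZFS or any particular product implements it: the class name `MirroredBlock`, the use of SHA-256, and the in-memory "copies" are all assumptions made for illustration. It shows why a single corrupted copy of a deduplicated block need not take every referencing file with it.

```python
import hashlib


class MirroredBlock:
    """One logical block kept as several physical copies plus a checksum (hypothetical)."""

    def __init__(self, data: bytes, copies: int = 2):
        self.checksum = hashlib.sha256(data).digest()
        self.copies = [bytearray(data) for _ in range(copies)]

    def read(self) -> bytes:
        """Return the first copy whose checksum matches, repairing any bad copies."""
        good = next(
            (bytes(c) for c in self.copies
             if hashlib.sha256(c).digest() == self.checksum),
            None,
        )
        if good is None:
            raise IOError("all copies failed checksum validation")
        for i, copy in enumerate(self.copies):
            if bytes(copy) != good:
                self.copies[i] = bytearray(good)  # rewrite the damaged copy
        return good


block = MirroredBlock(b"shared boilerplate referenced by many files")
block.copies[0][0] ^= 0xFF  # simulate silent corruption of one copy
assert block.read() == b"shared boilerplate referenced by many files"
assert bytes(block.copies[0]) == b"shared boilerplate referenced by many files"
```

The redundancy here is deliberate and sits on separate copies under the storage system's control, which is different from hoping that stray duplicate files happen to exist somewhere on the same filesystem.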
Re: (Score:2)
"You are missing his point. On a non-deduplicated system if one block goes bad you lose one file, on a deduplicated system you can lose any number of files due to one bad block."
This is true, but he was saying "This is the opposite of RAID...it squeezes every bit of redundancy out of your data". Like having random duplicate copies of files scattered around a filesystem was a redundancy mechanism that is somehow on-par with RAID, and so enabling dedupe means that you have eliminated a serious data redundanc
Re: (Score:2)
On a non-deduplicated system if one block goes bad you lose one file, on a deduplicated system you can lose any number of files due to one bad block.
That's why you have RAID and block-level checksumming.
What scenario are you envisaging where this can happen ?
Re: (Score:2)
Personally, I'd be _terrified_ of using dedup for primary storage. What this does is exactly the opposite of RAID - it squeezes every last bit of redundancy out of your data, and makes everything dependent upon the integrity of your blockpool database. Lose a single blocklet and you stand to lose _all_ of your data.
If you're striving for availability by keeping multiple copies of the same data on the same physical device(s), You're Doing It (Very) Wrong.
Foredown your data (Score:2)
I can't wait until the Dilbert strip hits where the PHB does this across all their backups and deduplicates them all away, thinking he's just saved a ton of money on backup media.
Redundancy can be a very good thing!
De-Dupe on Linux? (Score:2)
Are there any open-source filesystems that offer deduplication?
It seems that the FS du jour changes faster than any of the promised 'optional' features ever materialize.
Instead of working full-bore on The Next Great FS, it would be really nice to have compression, encryption, deduplication, shadow copies, and idle optimization running in EXT4.
Maybe I'm just jaded, but I've been a Linux user for 12 years now. Sometimes it feels like the names of the technologies are changing, but nothing ever gets 'finished'
Re: (Score:2)
Instead of working full-bore on The Next Great FS, it would be really nice to have compression, encryption, deduplication, shadow copies, and idle optimization running in EXT4.
To do all these things, you have to change how data is stored on the disk and what information is present. When you do this, you necessarily create a new file system. These aren't simple features that you can just tack onto an existing file system.
I suspect that one of these days we will be running the ext10 file system that has most of these features and evolved from ext3 in a methodical way, but it will in no way actually resemble ext3. There will always be other systems being developed to try out new
Re: (Score:2, Informative)
Use ZFS. It offers dedupe, compression, etc. (Score:4, Informative)
ZFS offers dedupe, and is even available in prepackaged NAS distributions such as Nexenta and OpenNAS. You too can have these great features, for much less than NetApp and friends.
Re: (Score:3, Informative)
Except NexentaStor (3.0.3) has an OpenSolaris upstream (which has gone away, by the way) kernel bug that hanged our Nexenta test box. Not a real good first impression.
Re: (Score:2)
I found a ton of stuff I didn't really care for with Nexenta. They've put some good effort into it, and it'd be a fine way to go if you wanted commercial support, but overall it doesn't really seem to fit our needs here. ZFS itself is a resource pig, but on the other hand, resources have become relatively cheap. It's not unthinkable to jam gigs of RAM in a storage server ... today. Five years ago, though, that would have been much more likely to be a deal-breaker.
This is new? (Score:3, Interesting)
Didn't Plan 9's filesystem combine journaling and block-level de-duplication years ago?
Re: (Score:2)
Plan 9 could have the cure for cancer too but still no one gives a shit about it.
Dedup is a good 30 years old at least, if you want to point out that it isn't new.
Only slashdotters and Linux children get excited at silly things like this.
Re: (Score:2)
Besides, what Plan 9 user needs journaling anyway?
Ah yes, the wonders of logic: making vacuously true statements about the empty set. ;-)
I already do this (Score:4, Funny)
After an analysis of a 1TB drive, I noticed that roughly 95% were 0's with only 5% being 1's.
I was then able to compress this dramatically. I just record that there are 950M 0's and 50M 1's. The space taken up drops to around 37 bits. Throw in a few checksum bits, and I am still under eight bytes.
I am not sure what is so hard about this disaster recovery planning. Heck, I figure I am up for a promotion after I implement this.
No DataDomain/EMC? (Score:2)
If the market leader isn't included in the review, I am wondering how worthy this report is.
Re: (Score:2)
No need to give it a fancy name.
It's much easier for sales if you give it a fancy name, and preferably one that doesn't trigger comparisons with other solutions.
Of course, as deduplication is mainly a solution for enterprises that have been tricked into buying obscenely expensive storage, and who lack any coherent data storage policy and tiering strategy, the fancy name might be superfluous; they're spread wide and lubed up already.
Re: (Score:2)
Of course, as deduplication is mainly a solution for enterprises that have been tricked into buying obscenely expensive storage, [...]
If we were "tricked", what is the cheaper, but equally capable alternative ?
Re: (Score:3, Informative)
AFAIK this is pretty much how every compression algorithm works. No need to give it a fancy name.
The reason it has a different name is to distinguish this from a compressed file system. The blocks of data are not compressed in these systems. Imagine that you have a file system that stores lots of vmware images. In this system, there are lots of files that store the same information because the underlying data is OS system files and applications. Even if you compress each image, you will still have lots of blocks that have duplicate values.
Deduplication says that the file system recognizes and elimi
Re: (Score:2)
That's still a particular type of compression, isn't it? I mean, I can buy giving it a new name, since it has a bunch of infrastructure around it, but it's a perfectly general kind of data-compression algorithm as well, even if not the world's most efficient: break the data into fixed-size blocks, then, for any blocks that appear more than once, replace all occurrences after the first with a pointer to the first. Block-based RLE compression is basically a simpler version of that (where you only deduplicate
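The fixed-size-block scheme described above fits in a few lines of Python. This is only a rough sketch under stated assumptions, not how any of the reviewed appliances work: the 4 KB block size, SHA-256 as the block fingerprint, and the names `dedupe` and `restore` are all made up for illustration.

```python
import hashlib

BLOCK_SIZE = 4096  # assumed block size; real products choose their own


def dedupe(data: bytes, block_size: int = BLOCK_SIZE):
    """Split data into fixed-size blocks and keep one copy of each distinct block.

    Returns (store, recipe): store maps a block fingerprint to the block's bytes,
    recipe is the ordered list of fingerprints needed to rebuild the data.
    """
    store = {}
    recipe = []
    for offset in range(0, len(data), block_size):
        block = data[offset:offset + block_size]
        fingerprint = hashlib.sha256(block).digest()
        # Only the first occurrence of a block is stored; later ones
        # are represented by the fingerprint (the "pointer") alone.
        store.setdefault(fingerprint, block)
        recipe.append(fingerprint)
    return store, recipe


def restore(store, recipe) -> bytes:
    """Rebuild the original data by following the pointers in order."""
    return b"".join(store[fingerprint] for fingerprint in recipe)


if __name__ == "__main__":
    data = b"A" * BLOCK_SIZE * 3 + b"B" * BLOCK_SIZE + b"A" * BLOCK_SIZE
    store, recipe = dedupe(data)
    print(len(recipe), "blocks referenced,", len(store), "blocks stored")
    assert restore(store, recipe) == data
```

Note that this sketch trusts the hash to identify identical blocks; a more careful version would compare the bytes on a fingerprint match before discarding the duplicate.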
Re: (Score:2)
That's still a particular type of compression, isn't it?
Not really. Compression is taking a chunk of data and replacing it with a different, smaller chunk of data plus instructions (albeit in abbreviated form) about how to turn it back into the original chunk of data. Dedupe is taking a chunk of data and replacing it with a pointer to a "remote", identical chunk of data.
Compression is nearly always applied on a per-file basis, whereas dedupe is applied on a per-volume basis. Conceptually similar to th
Re: (Score:3, Funny)
Re: (Score:2)
Re: (Score:2)
Diffs are fine until you lose the root file upon which they are based. Then you lose everything you've never changed. You need to do periodic full backups.
Re: (Score:2)
No, it's not.
Differential backups take a single filesystem and see what changed, either at the file level (whole changed, updated, or new files) or at the block level (changed blocks within files).
Block level deduplication is noticing that the storage appliance on which you back up 100 desktops and 10 servers has 50 copies of the same version of each data block in each Microsoft OS file from XP, 25 from Win 7, and 35 from Fedora, and only storing 1 copy of each of those blocks rather than 100 separate ones. It's
Re: (Score:2)
Although there is nothing to say compression of data might not also happen. I don't believe compression and de-duplication are mutually exclusive.
This is actually a good argument for de-duplication to run on the device. It can surf thru files more or less at leisure looking for duplicate blocks all over the file system, without tying up the server's bus/controller.
That could be done independent of File System compression, which generally, as you pointed out, works best on large blocks of repetitive bytes
Re: (Score:2)
they don't have to do this at the file level.
they do it at the block level.
so in your example, since the only change would be the signature on the bottom of each email, the email blocks themselves would be deduped, and the signatures would be retained.
think of backing up a whole bunch of similar desktops in an enterprise situation where the majority of the OS files are going to be the same or have only slight variations.
even if the files have slight variations, only the actual bits that are different would b
Re: (Score:2)
I think what you're talking about is single instance storage in your mail server. But as you mentioned, it only works well on identical emails and attachments.
No dedupe system that I'm aware of does what you'd need to do to dedupe forwarded emails. It's technically possible by recognizing similar messages and doing diffs on them to find identical sections. But it's computationally difficult and there's not much payback -- better to go after the low-hanging fruit and dedupe all of the identical gif's and mp
Re: (Score:2)
Did it cost less than buying 40% more disks? Heck, did it cost less than building another fileserver with 100% more disk and then syncing between them?
Re: (Score:2)
Since the dedupe license came for "free" with my filer, yes, that 40% improvement cost less than buying 40% more disks.
And yes, it's much cheaper than building another fileserver with 100% more disk and syncing between them. How much do you think it costs to build a fileserver with 150TB of disk space, and how would you recommend that I sync the 75TB of data between them? I don't think this is a job for rsync.
I do actually replicate between two identical (nearly identical) arrays, but I use my array vendor'
Re: (Score:2)
aside from the mentioned 'to reduce duplicate data to increase available storage space' are there any other benefits to de-duplicating your storage?
An intelligent caching layer will only store the deduped data once, allowing it to cache more data, get more cache hits, reduce physical disk I/O, and improve response times.
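To make the caching point concrete, here is a toy sketch in Python. Everything in it is an assumption for illustration (the class name `DedupeAwareCache`, the `block_map` table from logical addresses to physical block IDs, the simple LRU policy); it is not any vendor's design. The idea is just that the cache is keyed by the physical (deduped) block, so every logical address that points at the same block shares one cache entry.

```python
from collections import OrderedDict


class DedupeAwareCache:
    """Toy LRU read cache keyed by physical block ID rather than logical address."""

    def __init__(self, block_map, backing, capacity):
        self.block_map = block_map  # logical address -> physical block ID (hypothetical dedupe table)
        self.backing = backing      # physical block ID -> bytes (stands in for the disk)
        self.capacity = capacity    # maximum number of physical blocks held in cache
        self.cache = OrderedDict()  # physical block ID -> bytes, oldest entry first
        self.hits = 0
        self.disk_reads = 0

    def read(self, logical_addr):
        phys = self.block_map[logical_addr]
        if phys in self.cache:
            self.cache.move_to_end(phys)  # mark as most recently used
            self.hits += 1
            return self.cache[phys]
        # Cache miss: go to "disk", then remember the block.
        self.disk_reads += 1
        data = self.backing[phys]
        self.cache[phys] = data
        if len(self.cache) > self.capacity:
            self.cache.popitem(last=False)  # evict the least recently used block
        return data


if __name__ == "__main__":
    # Three VM images whose first block is the same OS data, deduped to block 0.
    block_map = {("vm1", 0): 0, ("vm2", 0): 0, ("vm3", 0): 0, ("vm1", 1): 1}
    backing = {0: b"shared OS block", 1: b"vm1 user data"}
    cache = DedupeAwareCache(block_map, backing, capacity=2)
    for addr in [("vm1", 0), ("vm2", 0), ("vm3", 0)]:
        cache.read(addr)
    print("disk reads:", cache.disk_reads, "cache hits:", cache.hits)
```

A cache keyed by logical address would have gone to disk three times here and used three slots for the same bytes.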