RAID's Days May Be Numbered
Posted by kdawson from the time-to-try-flit dept.
storagedude sends in an article claiming that RAID is nearing the end of the line because of soaring rebuild times and the growing risk of data loss. "The concept of parity-based RAID (levels 3, 5 and 6) is now pretty old in technological terms, and the technology's limitations will become pretty clear in the not-too-distant future — and are probably obvious to some users already. In my opinion, RAID-6 is a reliability Band Aid for RAID-5, and going from one parity drive to two is simply delaying the inevitable. The bottom line is this: Disk density has increased far more than performance and hard error rates haven't changed much, creating much greater RAID rebuild times and a much higher risk of data loss. In short, it's a scenario that will eventually require a solution, if not a whole new way of storing and protecting data."
reallocate on write (Score:3, Informative)
Or just regenerate and write the one sector from the parity data since all modern hard disks reallocate bad sectors on write.
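A minimal sketch of regenerating one unreadable sector from parity, assuming a single-parity (RAID-5 style) stripe in which the parity sector is the byte-wise XOR of the data sectors. The function names and sample data are made up for illustration; a real array would then write the recovered bytes back so the drive can reallocate the bad sector.

```python
from functools import reduce

def xor_blocks(blocks):
    """Byte-wise XOR of equally sized byte strings."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

def rebuild_sector(surviving_sectors, parity_sector):
    """Recover the one missing sector of a stripe from the survivors plus parity."""
    return xor_blocks(list(surviving_sectors) + [parity_sector])

# Hypothetical three-data-disk stripe with 4-byte "sectors".
data = [b"AAAA", b"BBBB", b"CCCC"]
parity = xor_blocks(data)

# Pretend the sector on the second disk is unreadable.
recovered = rebuild_sector([data[0], data[2]], parity)
print(recovered)
```

Only the sectors in the affected stripe are read, which is why this is far cheaper than rebuilding a whole drive.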
Solved a Long Time Ago (Score:5, Informative)
Re:simple idea (Score:5, Informative)
Enterprise arrays copy all the good data off the drive to a spare drive, use RAID to recover the failed sector(s), then fail the broken disk.
Re:Hardware RAID is dead (Score:3, Informative)
> First of all, "Hardware RAID" is still software, just executed by dedicated circuits. The distinction is kind of moot.
I'm not sure where in my post you saw anything about a comparison between Hardware RAID or Software RAID.
> So my guess is that you're not working for a storage vendor. I haven't seen many people switch to SW RAID recently.
I work for NetApp. I didn't think it mattered much in the post I made though. To your second point, as all of the NetApp Enterprise storage systems use software based RAID I can happily confirm that many hundreds of thousands of customers have switched to software RAID.
As you mentioned earlier though, the point is moot, since when you're delivering an enterprise array to a customer it doesn't matter if the array uses RAID cards provided by a 3rd party vendor, uses RAID cards built in-house, or uses software RAID to write the data that the customer gives you. The ingress point for the customer is a physical port (IP/FC typically) and that port provides RAID capabilities. Maybe that's also hardware RAID?
ZFS (Score:5, Informative)
This is something the ZFS creators have been talking about for some time, and been actively trying to solve.
ZFS now has triple parity, as well as actively checksumming every disk block.
Re:reallocate on write (Score:5, Informative)
Re:ZFS (Score:5, Informative)
I thought I should add:
ZFS speeds up rebuilding a RAID (called resilvering) compared with traditional non-intelligent or non-filesystem-based RAIDs by rebuilding only the blocks that actually contain live data; there's no need to rebuild EVERYTHING if only half the filesystem is in use.
ZFS also starts the resilvering process by rebuilding the most IMPORTANT parts first: the filesystem metadata. It then works its way down the tree to the leaf nodes, rebuilding data. This way, if more disks fail, you have rebuilt as much data as possible. If the filesystem metadata is hosed, everything is hosed.
ZFS tells you which files, if any, are corrupt because too few replicas remain after disk failures.
All this on top of double or triple parity. :)
Re:I thought RAID was about spindle count (Score:5, Informative)
You don't rely on RAID to avoid data loss; you rely on it as a first line in providing continuity. We run backups of large systems here, but we tend to do other things too: synchronous live mirroring between sites of the critical data. And better system design. There are some systems where, whilst we _could_ go back to tape (or VTL) at a pinch, having to do so would be a disaster in itself.
We're designing systems that permit rapid service recovery (the most live critical data) and a second tier of online recovery to get the rest back. We just can't afford the downtime.
Double-spindle failures on RAID systems are just one of those things that you _will_ see. Deciding whether a system deserves some other measure of redundancy is mostly an actuarial, rather than a technical, decision.
Re:RAID is here to stay (Score:5, Informative)
And when RAID 6 has a high enough risk that it's worth expanding the scheme, everyone will start switching from double parity schemes to triple parity schemes, since they're much less expensive in terms of spindle count than RAID 6+1.
I don't think you've quite understood the problem described. You can have an infinite number of parity disks, but it does you no good if recovering one data disk causes another data disk to fail.
Imagine a disk fails on every 100TB of reads (10^14). You have ten 1TB data disks. Imagine you keep them in perfect rotation so they've spent 10, 20, 30, 40, 50, 60, 70, 80, 90 and 100% of their lifetime. The last disk dies and you replace it with a new drive (0%). To rebuild the drive you read 1TB from each data disk and use whatever parity you need. They've now spent 11, 21, 31, 41, 51, 61, 71, 81, 91 and 1% (your new disk) of their lifetime and you can read another 9TB before you need a new disk.
Now we try doing the same with ten 10TB disks and the same reliability. The last disk dies and you replace it, only now you must read 10TB from each disk. Instead of adding 1% to the lifetime it adds 10% so that they've spent 20, 30, 40, 50, 60, 70, 80, 90, 100 and 10% (your new disk) of their lifetime. But now another disk fails, you can recover that but then another will fail and another and another and another.
Basically, parity does not solve that issue. If you had a mirror, you would instead copy the mirrored disk with significantly less wear on the disks. RAID is very nice as a high-level check that the data isn't corrupted but it's a very inefficient way of rebuilding a whole disk.
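The arithmetic in this thought experiment can be sketched as follows. This is a toy model of the post's own assumptions (each disk dies after 100TB of reads, the array is kept in perfect 10% rotation, and the replacement disk is charged the same rebuild cost as the survivors); the function names and the round cap are hypothetical.

```python
def rebuild_wear(wear_pct, disk_tb, lifetime_tb=100):
    """Replace every worn-out disk, then charge each disk one full-disk pass."""
    cost = 100 * disk_tb / lifetime_tb          # % of lifetime used by one rebuild
    survivors = [w for w in wear_pct if w < 100]
    replaced = len(wear_pct) - len(survivors)
    return [w + cost for w in survivors] + [cost] * replaced

def cascade(wear_pct, disk_tb, lifetime_tb=100, max_rounds=50):
    """Count rebuilds until no disk is at end of life (capped at max_rounds)."""
    rounds = 0
    while any(w >= 100 for w in wear_pct) and rounds < max_rounds:
        wear_pct = rebuild_wear(wear_pct, disk_tb, lifetime_tb)
        rounds += 1
    return rounds, wear_pct

start = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
print(cascade(start, disk_tb=1))    # one rebuild, then the array is stable
print(cascade(start, disk_tb=10))   # every rebuild wears out the next disk
```

With 1TB disks a single rebuild settles the array; with 10TB disks each rebuild pushes the next-oldest disk to 100%, so the loop only stops at the cap.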
Re:Hardware RAID is dead (Score:4, Informative)
When I think of software RAID, I think of parity data being handled by the operating system, being done on x86 chips as part of the kernel or offloaded via a driver (thinking Fake-RAID).
If you're abstracting your storage away from the operating system that uses it, say via iSCSI or NFS or SMB to a dedicated storage box, like a NetApp filer or a Celerra, then I would consider that hardware RAID, personally speaking. If you're saying that these dedicated storage boxes manage parity, mirroring and so on all done with the same chip that's also running their local operating systems, then I have to admit that yes, that sounds like software RAID to me, but the real distinction I've come to draw between software and hardware RAID is a matter of performance and feature set. If said boxes give the same or better performance (I/Ops and throughput) to a workload as a dedicated, internal storage system managed by something like my 9650SE, then hell..... who cares, right? Aside from being rather impressed that such is possible without dedicated XOR chips, that is.
Re:Bogus outdated thinking (Score:4, Informative)
And name 3 people you know who run raid-5 on their personal PCs, and I'll show you 3 guys who can't afford an SSD drive.
Yeah, every time an article on storage catches my eye, I have to check laptop SSD prices. So far, each time I do this, for the cost of a drive the size I need, I could buy a new snowboard, or a laptop, bike, half a holiday, room full of beer... etc. I really want one, but so far I haven't been able to look at that list and say "I'd rather have an SSD!"
Re:simple idea (Score:4, Informative)
They do to varying degrees of success but just because a disk can't read a particular sector doesn't mean that the drive is faulty - it could be a simple error on the onboard controller that is causing the issue.
FC/SAS drives mostly leave error handling up to the array rather than doing it themselves because the arrays can typically make better decisions as to how to deal with the problem and helps cope with time sensitive applications. The array can choose to issue additional retries, reboot the drive while continuing to use RAID to serve the data, etc.
Consumer SATA drives, on the other hand, try really hard to recover from the problem, for example by retrying again and again with different methods to get the sector. While admirable, that leads to the behaviour we see in consumer land where the PC just "locks up". The assumption here is that there is no RAID available, and so reporting an error back to the host is "a bad thing". The enterprise SATA drives we're seeing on the market are starting to disable this automatic functionality so that they behave correctly when inserted into RAID arrays.
Usually ;-)
Re:Bogus outdated thinking (Score:1, Informative)
Actually I run a RAID 5 array off of a server in my home: eight 147 GB 15,000 RPM SAS drives. I'm a photographer and have tens of thousands of images that I need an affordable storage solution for, and RAID 5 does the trick, along with off-site backup.
For one, SSDs are simply not practical or cost-efficient yet, and certainly not in the size and quantity I would require. For two, your argument simply doesn't wash: even if you use SSDs, that doesn't eliminate the need for a RAID array, not for someone who truly needs the fault tolerance and redundancy that are the reason for having an array in the first place. Your argument is simply to build an array at four times the cost. Sure, I can afford to spend 6 grand building an SSD RAID, but the real question is why would I when I can have an enterprise-class solution for $1500. Your summation is just ridiculous.
On a side note, if someone has a true need for RAID and they're using a software RAID solution then they're asking for problems. A hardware solution should be the ONLY consideration for a real RAID setup.
Re:Wrong title. Or dramatization again? (Score:2, Informative)
Re:Solved a Long Time Ago (Score:3, Informative)
Well, the point of the article is that if it takes your array 6 hours to rebuild instead of 4 because capacities have gone up but the failure rate of the hardware is unchanged, you have a problem. The problem is that you are more likely to experience another failure before the first one has been mitigated. If you have that additional failure on most RAIDs (unless you are doing 5-5 or 1-5 or some other RAID-over-RAID scheme), you get downtime. The volume is offline and must be restored from some other location.
The solution is usually a cluster or remote hot site or something like that. It would be nice to have fast rebuild times back. There are lots of situations where five nines is not a requirement but downtime should still be avoided; shorter exposure windows for array rebuilds are a good thing.
Re:Bogus outdated thinking (Score:5, Informative)
The problem is IT guys and PHBs that think RAID=Backup.
It's not and it never has been a backup solution. RAID is high availability and nothing more.
RAID does its job perfectly for high availability and will continue to do so for decades. Sorry, but I have yet to see any other technology deliver the capacity I use for the small 30TB database we have at work. Our RAID 50 array works great. We also realtime mirror that to the backup SQL server (not for backup of data but backup of the entire server, so that when SQL1 goes offline SQL2 picks up the work).
SQL2 is backed up to a SDAT tape magazine nightly.
RAID does what it's supposed to do perfectly. Its days are not numbered, because no technology other than RAID can provide high availability.
Re:RAID is here to stay (Score:3, Informative)
Actually, reliability quickly scales towards RAID 1+0 as the number of drives increases. In a 14 drive array, a single drive failure in both is fine. A second drive failure has the possibility of destroying the RAID 1+0 array, but the chance of the right drive failing is low. With 3 total drive failures, RAID 6 will fail, while RAID 1+0 has a low probability of failure.
Rebuild times are also much shorter on RAID 1+0 as only a single drive has to be read, which reduces heat produced and the chance of a second failure.
There are some papers that describe the math of the statistical analysis to prove it, but I can't track them down at the moment. It is rather counterintuitive. But you have significantly less drive space, so RAID 6 may still be the better option in some circumstances.
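For a rough feel of the numbers, here is a small combinatorial sketch for the 14-drive case. It assumes the RAID 1+0 array is seven two-way mirror pairs, that the failed drives are chosen uniformly at random, and that all failures land before any rebuild completes; those assumptions and the function names are mine, not taken from the papers mentioned above.

```python
from fractions import Fraction
from math import comb

def raid10_loss_probability(n_drives, failures):
    """P(some mirror pair has lost both drives) after `failures` random failures."""
    pairs = n_drives // 2
    # Survivable outcomes: the failed drives all sit in different pairs.
    safe = comb(pairs, failures) * 2 ** failures
    return 1 - Fraction(safe, comb(n_drives, failures))

def raid6_loss_probability(failures):
    """Dual parity survives any two failures and no combination of three."""
    return Fraction(1 if failures >= 3 else 0)

for k in (1, 2, 3):
    print(k, raid10_loss_probability(14, k), raid6_loss_probability(k))
```

Under these assumptions RAID 1+0 loses data with probability 1/13 on the second failure and 3/13 on the third, while RAID 6 goes from certain survival at two failures to certain loss at three.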
Re:There are always more solutions... (Score:3, Informative)
Who says there are no errors with optical media?
I've seen a CD with light shining through after 5 years.
Re:simple idea (Score:2, Informative)
The speed of a 15k drive means that the outer edge of the 3 1/2" drive is spinning pretty fast... getting close to the speed of sound
3.5in * 3.14 * 15000r/m * 60m/h * 1ft/12in * 1mi/5280ft = 156mi/h
That's still pretty fast, but not nearly the speed of sound at STP.
Re:Bogus outdated thinking (Score:3, Informative)
Granted, our smallest config is 9TB; We're somewhat overkill for a home user. But if you need a company-wide NAS...
Commodity hardware, standard networking (Gig and 10Gig Ethernet frontend, Infiniband backend), and a very smart filesystem (Capable of protecting from up to 4 simultaneous whole-node failures) == a killer combination; It takes some seriously bad luck for data-loss to become a problem.
Re:simple idea (Score:4, Informative)
Re:Bogus outdated thinking (Score:2, Informative)
Re:simple idea (Score:3, Informative)
Re:simple idea (Score:4, Informative)
Speed of sound at sea level: 340.29 m/s verify [google.com]
((3.5 inches) * (2.54 (cm / inches)) * pi) * (((15000 / minute) * (1 minute)) / (60 second)) * (0.01 (meter / centimeter)) = 69.8218967 m / s verify [google.com]
If my calculation is correct, the outer edge of a 3.5" plate spinning at 15000 RPM is moving at 69.82m/s, which is about 20% of speed of sound. It's fast, but it's nowhere near the speed of sound.
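The same calculation as a short script, for anyone who wants to try other sizes or spindle speeds. It treats 3.5 inches as the platter diameter, as the posts in this thread do, and uses the 340.29 m/s figure quoted above; the function name is made up.

```python
import math

def rim_speed_m_per_s(diameter_in, rpm):
    """Linear speed of the outer edge of a disc of the given diameter."""
    circumference_m = math.pi * diameter_in * 0.0254  # inches to metres
    return circumference_m * rpm / 60                 # revolutions per second

speed = rim_speed_m_per_s(3.5, 15000)
mph = speed / 0.44704            # 1 mph is exactly 0.44704 m/s
fraction_of_sound = speed / 340.29

print(f"{speed:.2f} m/s, {mph:.1f} mph, {fraction_of_sound:.0%} of the speed of sound")
```

Both the mph figure and the m/s figure posted in this thread fall out of the same formula.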
Re:simple idea (Score:5, Informative)
Air is necessary for the read/write head to operate. The piece that comes into close proximity of the platter is essentially a tiny hovercraft. It's about the size of a pepper flake, and has a microscopic pattern called an "air bearing" carved into the side facing the platter. Designing this air bearing is an exercise in fluid dynamics -- it is the shape of the bearing and how air flows over it that allows the read/write head to skim over the surface of the platter at a distance measured in microns without actually contacting the surface of the platter.
If the read/write head does contact the surface of the platter, that is called a head crash, and is bad.
Re:Bogus outdated thinking (Score:3, Informative)
To do a true raid-5, cost of the drives is fairly negligible.
While you are absolutely correct about cost, I think your definition of what a true raid-5 is needs a little work.
The purpose of RAID-n is to survive failures with near-zero downtime. The larger the disparity between capacity and performance grows as array sizes increase, the less these RAIDs serve their purpose. The chance of a drive failure while rebuilding a multi-TB array is quite significant, an occurrence that RAID-n was supposed to minimize to near-zero levels.
In the future, there will only be RAID-0 and RAID-JBOD for conventional drives. Uptime will have to be solved another way, because RAID-n solves it less and less as the years (and thus, capacity) tick away.
Re:Dear Seagate, Western Digital, et. al: (Score:1, Informative)
And then the onboard hard drive controller fails. Zap. Game over.
RAID on discrete disks is really good at avoiding that kind of hardware failure.
Re:simple idea (Score:3, Informative)
No - to reconstruct 1 sector you have to read one sector from every other drive, then write 1 sector to the replacement drive. Effectively, to reconstruct you have to read the whole RAID. So the read and write speeds both count.
Re:Bogus outdated thinking (Score:1, Informative)
I use Linux software RAID5 with three 640GB SATA drives in my home PC which serves as my MythTV DVR as well as my fileserver. At the time I assembled it, this was the sweet spot of price, size, performance, and power efficiency.
I have an almost identical host located about 400 miles away in a relative's home, serving as a mirror for my fileserver content (but not DVR content, which I consider disposable but still RAID protected just for convenience of recovery if a simple disk error hits me). That host has a mixture of 640 GB drives and 320 GB drives because it has actually evolved over six years since I originally assembled it with three 160 GB drives.
I replaced drives with larger cheap ones when something failed or its SMART data was looking iffy or I needed more space, always maintaining RAID5 level protection for my data except during brief degraded array events. I always purchased cost effective replacements which were usually larger, allowing a size progression like this: 3x 160 GB ... 2x 160 GB + 1x 320 GB (ignore upper 160 GB) ... 2x 160 GB + 2x 320 GB (migrate into logically 3x 320 GB by having upper and lower half of each 320 GB drive associate with one of the 160 GB drives) ... 3x 320 GB; repeat part of this sequence with 640 GB replacing 320 GB drives. By the way, I also migrated the host hardware through CPU speed upgrades, chassis and motherboard upgrades, whole conversion from AMD to Intel CPU, and many operating system replacements in that six year period. During all of this, Linux MD RAID let me maintain the same data arrays on the same disk set while swapping out such other components.
My backup strategy is to keep generational backups on the same disks (separate RAID5 filesystems) on each host and frequently synchronize the main fileserver image over the Internet with rsync. So recent changes are propagated between hosts and each keeps its own running generational snapshots so I can recover from a complete system loss with a worst case effort of sneaker-net to carry bulk data 400 miles. In practice, I've never experienced any complete system loss, though I have had to rebuild the boot OS remotely in order to gain access to the still intact data arrays after one peculiar hardware error event.
In my experience, Linux MD RAID has been wonderful. I was able to do the above-mentioned reconfiguration of disks in a live system (only powered down temporarily to physically install and remove internal SATA drives). I create the RAID arrays over disk partitions, so I can selectively add and remove the disk zones in chunks using mdadm. Rather than try to resize2fs on these multi-year filesystems, I admit that I did tar/reformat/untar some filesystems one time to resize and defrag.
Re:Bogus outdated thinking (Score:3, Informative)
Try a rebuild on a much larger aggregate running a dual parity array under load. Trust me, they can easily run days. Say you have a 16 disk aggregate using 1TB 7200RPM disks. Because you need every block in a stripe to reconstruct parity, you need to read from the other disks to reconstruct; so 14 reads and 1 write per block.
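A back-of-the-envelope sketch of why this runs long. The I/O count follows the 16-disk dual-parity example above; the throughput figures are purely hypothetical placeholders for "idle" and "under load", not measurements from any particular array.

```python
def rebuild_io_tb(n_disks, parity_disks, disk_tb):
    """Total TB read and written to rebuild one failed disk."""
    reads_per_block = n_disks - parity_disks   # 14 for 16 disks with dual parity
    return reads_per_block * disk_tb, disk_tb

def rebuild_hours(disk_tb, rebuild_mb_per_s):
    """Hours to refill one disk at a sustained per-disk rebuild rate."""
    return disk_tb * 1e6 / rebuild_mb_per_s / 3600

read_tb, write_tb = rebuild_io_tb(n_disks=16, parity_disks=2, disk_tb=1)
print(read_tb, "TB read,", write_tb, "TB written")

# Hypothetical sustained rebuild rates in MB/s.
for rate in (50, 5):
    print(rate, "MB/s ->", round(rebuild_hours(1, rate), 1), "hours")
```

If production load squeezes the rebuild down to a few MB/s per disk, the same 1TB rebuild stretches from hours into days.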
You're also misunderstanding how the SSD caching works for ZFS. Blocks are only pulled in after repeated requests, which isn't going to be the case for a resilver. There will be at least some benefit to read-ahead caching in memory, but even that has sharply diminishing returns, particularly with the ZFS rebuild strategy of reconstructing at a file level rather than a linear block rebuild. That approach has significant benefits though. By walking through the metadata instead of blindly copying blocks you don't have to rebuild empty space, and if - god forbid - you lose more than one drive in a RAID-Z or two drives in a RAID-Z2 array, you still have a partial recovery to work with.
Re:simple idea (Score:2, Informative)
Re:simple idea (Score:3, Informative)
I'm surprised that nobody has mentioned the issue of failure of the drive material itself at higher rotational velocities.
I believe CDs are limited to 52X because the polycarbonate they are constructed of explodes when you get too much higher than that (with a safety factor of course).
A metal hard drive probably can take more speed, but I'm sure that at some point you get deformation of the platter. You also have bearings/etc to deal with. 30k is a pretty fast rotation rate - and we're talking about a device that is always-on.
Additionally, even 10k SCSI drives aren't exactly consumer-grade hardware. We're already getting into the high-end realm, and the whole point of RAID was the "I."
Re:simple idea (Score:3, Informative)
You know google does the conversion for you: pi*3.5 inches * 15,000 minute^-1 in m/s [google.com] = 69.8 m / s
Re:simple idea (the bad old days) (Score:2, Informative)
In either those old open ones or the "new" sealed ones, the head flies on a cushion of air, but the distance from head to platter is microscopic; a piece of dust is big in comparison. In the old open drives, if the head hit even a tiny piece of dirt, it could "crash" into the platter and gouge out a rip. If you haven't heard it, it was actually fairly loud and startling.
Re:simple idea (Score:2, Informative)
You're not likely to see 30k RPM drives any time soon. The speed of a 15k drive means that the outer edge of the 3 1/2" drive is spinning pretty fast... getting close to the speed of sound ...It's why CDROM speeds haven't gone up much since the old day of 52x...
Perhaps I haven't taken a math class in a while, but my cocktail napkin calculation says that the edge of a 3.5 inch disc spinning 15,000 times per minute will travel at just over 156 miles/hour. Nowhere near 761 mph (speed of sound).
3.5 x Pi = 11 inch circumference x 15000 = 164,933 inches per minute / 12 inches/foot / 5280 feet/mile * 60 minutes/hour = 156 mph.
Furthermore, while I don't argue your point that they are spinning pretty fast, I disagree with your assertion that CDROM's haven't increased because of this. More like, I believe CDROMs are simply not manufactured within sufficient tolerances, as indicated by their frequent vibrations when they spin up, and such vibrations could cause them to shatter.
For amusement: http://www.powerlabs.org/cdexplode.htm [powerlabs.org]
Re:Bogus outdated thinking (Score:3, Informative)
As for the rebuild times, fine, go buy FASTER drives.
Hard drives are getting bigger faster than they are getting faster.
Hard drives are getting bigger faster than they are getting more reliable.
In an enterprise setting, SATA based storage is a reality, for cost reasons, in tiers 2 and 3.
Your suggestion that this problem is solved simply by buying faster drives is a poor one.
And in a few generations of high speed drives, the problem will manifest regardless.
Henry's article is not as clear as it could be, however. He's really talking about the pending failure of traditional RAID sets as we know them, such as aggregates of N drives in a set, or drives hung off a RAID controller. RAID as an algorithm for error correction is nowhere near failure. Look at the manner in which Isilon does it. All the data in an Isilon system is part of a clustered RAID approach, but this is distributed in data packets far different than standard block. All nodes in an Isilon cluster participate in a "RAID rebuild" when it's needed; the system is capable of multigigabyte per second RAID rebuild, and it only rebuilds what is needed, not the "disk". This can all be done with economical SATA drives.
Note, however, that Isilon's RAID is not really RAID at all. I.e., it's not about arrays of disks, but rather parity-based correction of lost file redundancy data. I.e., it's more object based, such as Henry was alluding to.
As for the classic RAID set, Henry is quite right when he says that it is trying to die. RAID rebuild times are already in excess of 24 hours, and are going to be that much worse with 2TB and 4TB drives. With longer RAID rebuild times, pDATALOSS increases notably, particularly if you are aware of the Google and Carnegie findings that drives actually tend to fail at the same time. I.e., pFAIL of a HD is not independent of pFAIL of other HDs in a RAID set. They tend to fail together.
C//
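To put a rough shape on how pDATALOSS grows with rebuild time, here is a sketch using the simplest possible model: independent drives with a constant failure rate. The 3% annualized failure rate is a placeholder, not a figure from the article or the studies mentioned, and since the comment above argues that failures are correlated, the result should be read as an optimistic lower bound.

```python
import math

HOURS_PER_YEAR = 365 * 24

def p_failure_during_rebuild(surviving_disks, rebuild_hours, afr=0.03):
    """P(at least one more drive fails before the rebuild finishes).

    Assumes independent drives with a constant hazard rate derived from
    the annualized failure rate `afr` (a hypothetical 3% by default).
    """
    rate_per_hour = -math.log(1 - afr) / HOURS_PER_YEAR
    return 1 - math.exp(-surviving_disks * rate_per_hour * rebuild_hours)

for hours in (4, 24, 72):
    p = p_failure_during_rebuild(surviving_disks=15, rebuild_hours=hours)
    print(f"{hours:>3} h rebuild: {p:.4%}")
```

Even in this idealized model the exposure grows almost linearly with rebuild time; correlated failures would only make it worse.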
Re:fill the drive with helium (Score:3, Informative)
don't get it, why not just seal them hermetically with helium inside, and not worry about outside air pressure?
1. Hasn't been necessary
2. Helium is expensive
3. Sealing something Helium-tight is expensive, about as bad as trying to seal in hydrogen*
4. Fairly sensitive to pressure - not a problem in a non-airtight HD, but a problem in a sealed HD that's heating up.
5. Cooling can be an issue
*Mostly because He tends to stay monoatomic, H pairs up into H2. End result is that the H2 molecule is around the same size as a He atom.
Re:fill the drive with helium (Score:4, Informative)
Filling the drive with helium should help;
Yeah. For about half a week. Helium has the smallest "gas particles" there are - Hydrogen atoms would be smaller, but those really like to bond, and an H_2 molecule is quite a bit larger than a Helium atom.
That's why He leaks out of everything. No exception. It diffuses through "leakproof" welds for vacuum tanks. It diffuses through the steel walls of tanks (albeit more slowly). That's also why He is used in leakage detection: if you see fewer than $not_so_few He atoms on the outside of the container you test within a couple of seconds after you injected a little bit of He, the container is considered airtight.
The only way to keep a He atmosphere in your drive would be to constantly refill it. I don't think that there'll be any scenario where this would seem like an even remotely good idea.
Re:simple idea (Score:3, Informative)
Besides which I have no idea what the speed of sound has to do with the theoretical upper limit of the speed of a spinning disk. It's not like an airplane wing with a trailing shock wave. I would think there would be much more pressing problems that are keeping us from seeing 30K RPM hard drives anytime soon, like:
- Shear strength of the platter material
- Total mass of the platter, especially near the edge
- Heat generated in the bearings
- Energy necessary to spin the platter at that speed
- Torsional forces from rotating the drive while it's spinning
And probably down near the bottom of the list of potential problems:
- Cavitation and/or shock waves from the air around the spinning platter.