
Everything You Know About Disks Is Wrong

modapi writes "Google's wasn't the best storage paper at FAST '07. Another, more provocative paper looking at real-world results from 100,000 disk drives got the 'Best Paper' award. Bianca Schroeder, of CMU's Parallel Data Lab, submitted 'Disk Failures in the Real World: What Does an MTTF of 1,000,000 Hours Mean to You?' The paper crushes a number of (what we now know to be) myths about disks, such as vendor MTBF validity, 'consumer' vs. 'enterprise' drive reliability (spoiler: no difference), and RAID 5 assumptions. StorageMojo has a good summary of the paper's key points."
  • Re:MTBF (Score:5, Informative)

    by Wilson_6500 ( 896824 ) on Tuesday February 20, 2007 @09:45PM (#18091058)
    Um, but doesn't the summary of the paper say that there is no infant mortality effect, and that failure rates increase with time, and thus the bathtub curve doesn't actually apply?
  • Re:MTBF? RTFA. (Score:5, Informative)

    by Vellmont ( 569020 ) on Tuesday February 20, 2007 @09:58PM (#18091198) Homepage
    You might get an MTBF of, say, two years, when the reality is that the distribution has a big spike at one month, with the rest of the failures forming a wide bell curve centered at, say, five years.


    Well, the article actually says that drives don't have a spike of failures at the beginning. It also says failure rates increase with time. So you're right that MTBF shouldn't be taken as a prediction for a single drive, since the failure rate at five years is going to be much higher than at one year.

    The other thing the article claims is that the stated MTBF is simply wrong. It mentions a stated MTBF of 1,000,000 hours and an observed MTBF of 300,000 hours. That's pretty bad. It's also quite interesting that the "enterprise" level drives aren't any better than the consumer-level drives.
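
    A rough way to see how bad that gap is: convert each MTBF into an annualized failure rate using the usual vendor arithmetic (AFR ~ hours per year / MTBF). This is just a back-of-envelope sketch in Python, assuming the drives are powered on continuously; the numbers are the ones quoted above, not anything new from the paper.

        # Convert an MTBF in hours into the annualized failure rate it implies,
        # using the simple vendor arithmetic AFR ~ hours_per_year / MTBF.
        HOURS_PER_YEAR = 24 * 365  # 8760, assuming the drive is powered 24/7

        def afr(mtbf_hours: float) -> float:
            """Annualized failure rate implied by an MTBF, as a fraction."""
            return HOURS_PER_YEAR / mtbf_hours

        for label, mtbf in [("stated", 1_000_000), ("observed", 300_000)]:
            print(f"{label}: MTBF {mtbf:,} h -> ~{afr(mtbf):.1%} of drives per year")
        # stated:   MTBF 1,000,000 h -> ~0.9% of drives per year
        # observed: MTBF 300,000 h   -> ~2.9% of drives per year
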
  • by EmbeddedJanitor ( 597831 ) on Tuesday February 20, 2007 @10:07PM (#18091272)
    It is just a matter of time. Depending on the technology (e.g. flash) it might be a short-to-medium time or a long time.

    If something has an MTBF of 1 million hours (that's 114 years or so), then you'll be a long time dead before it fails.

    At this stage, the only reasonable non-volatile solid-state alternative is NAND flash, which costs approximately 2 cents per MByte ($20/GByte) and dropping. NAND flash has far slower transfer speeds than HDD, but it is far smaller, uses less power, and is mechanically robust. NAND flash typically has a lifetime of 100k erasure cycles and needs special file systems to get robustness and long life.

  • by Thagg ( 9904 ) <thadbeier@gmail.com> on Tuesday February 20, 2007 @10:10PM (#18091312) Journal
    What's interesting about both of these papers is that widely held beliefs are shown to be, in fact, myths.

    The Google paper shows that relatively high temperatures and high usage rates don't affect disk life. The current paper shows that interface (SCSI/FC vs. ATA) had no effect either. The Google paper shows a significant infant mortality that the CMU paper didn't, and the Google paper shows some years of flat reliability where the current paper shows decreasing reliability from year one.

    They both show that the failure rate is far higher than the manufacturers specify, which shouldn't come as a surprise to anybody with a few hundred disks.

    I'm particularly pleased to see a stake driven through the heart of "SCSI disks are more reliable." Manufacturers have been pushing that principle for years, saying that "oh, we bin-out the SCSI disks after testing" or some other horseshit, but it's not true and it's never been true. The disks are sometimes faster, but they're not "better".

    Thad
  • Re:moving parts (Score:5, Informative)

    by NMerriam ( 15122 ) <NMerriam@artboy.org> on Tuesday February 20, 2007 @10:19PM (#18091382) Homepage

    I thought flash memory had a lower read/write cycle expectancy before crapping out?


    They do have a limited write/erase lifetime for each block, BUT the controllers automatically distribute data over the least-used blocks (since there's no performance penalty to non-linear storage), and you wind up getting the maximum possible lifetime from well-built solid-state drives (assuming no other failures).

    So in practice, the lifetime of modern solid state will be better than spinning disks as long as you aren't reading and writing every sector of the disk on a daily basis.
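
    For what it's worth, the idea is simple enough to sketch. This is a toy illustration of wear leveling in Python, not any real controller's flash translation layer; the class and the numbers are made up for the example.

        # Toy wear-leveling sketch: each logical write is steered to the
        # least-worn physical block, so erase cycles spread evenly across
        # the device instead of burning out one hot spot.
        class ToyFlash:
            def __init__(self, num_blocks: int, erase_limit: int = 100_000):
                self.erase_counts = [0] * num_blocks  # erases per physical block
                self.mapping = {}                     # logical block -> physical block
                self.data = {}                        # physical block -> contents
                self.erase_limit = erase_limit

            def write(self, logical_block: int, payload: bytes) -> None:
                # Only blocks that still have erase budget left are candidates.
                usable = [i for i, c in enumerate(self.erase_counts)
                          if c < self.erase_limit]
                if not usable:
                    raise IOError("all blocks worn out")
                # Pick the least-worn candidate (this is the wear leveling).
                target = min(usable, key=lambda i: self.erase_counts[i])
                self.erase_counts[target] += 1        # erase-before-write costs a cycle
                self.mapping[logical_block] = target  # remap the logical block
                self.data[target] = payload

        flash = ToyFlash(num_blocks=8)
        for _ in range(1000):
            flash.write(0, b"same logical block, over and over")
        print(flash.erase_counts)  # the 1000 writes are spread across all 8 blocks
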
  • That's wrong (Score:3, Informative)

    by ArbitraryConstant ( 763964 ) on Tuesday February 20, 2007 @10:22PM (#18091398) Homepage
    It didn't conclude that RAID 5 doesn't help; it concluded that RAID 5 doesn't help as much as people think, because people assume the probability of another failure before the rebuild is complete is negligible, and they're wrong.

    It helps, and distributing the data more helps more. Someone concerned about multi-drive failures can, for example, use a 3-way RAID 1 array, or a RAID 6 array (which can tolerate the loss of any 2 drives).
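
    To put rough numbers on it, here is a back-of-envelope sketch (Python) of the chance of losing a RAID 5 set during a rebuild. The failure rate, rebuild time, and array size are assumptions for illustration, not figures from the paper; the point is how quickly the naive independence estimate falls apart once failures are correlated.

        # Odds of a second failure during a RAID 5 rebuild window.
        AFR = 0.03           # assumed annual failure rate per surviving drive
        REBUILD_HOURS = 24   # assumed rebuild time onto the replacement drive
        SURVIVORS = 7        # drives left in an 8-drive RAID 5 set after one failure

        # Naive independence assumption: each survivor fails during the window
        # with probability AFR * (window / year).
        p_one = AFR * REBUILD_HOURS / (24 * 365)
        p_loss = 1 - (1 - p_one) ** SURVIVORS
        print(f"naive array-loss probability per rebuild: {p_loss:.3%}")

        # The paper's point: failures cluster, so the real post-failure rate is
        # some multiple of the naive one.  Scale it to see how fast risk grows.
        for factor in (2, 5, 10):
            p = 1 - (1 - p_one * factor) ** SURVIVORS
            print(f"failure rate x{factor}: array-loss probability ~{p:.2%}")
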
  • forget RAID? (Score:3, Informative)

    by juventasone ( 517959 ) on Tuesday February 20, 2007 @10:38PM (#18091520)

    Translation: one array drive failure means a much higher likelihood of another drive failure ... Further, these results validate the Google File System's central redundancy concept: forget RAID, just replicate the data three times.

    The fact that another drive in an array is more likely to fail if one has already failed makes a lot of sense, but the conclusion to forget RAID doesn't. Arrays are normally composed of the same drive model, even the same manufacturing batch, and are in the same operating environment. If something is "wrong" with any of these three variables and it causes a drive to fail, it's common sense that the other drives have a good chance of following. I've seen real-world examples of this.

    In my real-world situations, the RAID still did its job, the drive was replaced, and nothing was lost, despite subsequent failure of other drives in the array. Sure, you can get similar reliability at a lower price by replicating data, but I think that's always been understood to be the case. Furthermore, as someone else in the forum mentioned, enterprise-class RAIDs are often used primarily for performance reasons. A modern hardware RAID controller (with a dedicated processor and RAM) can deliver storage performance unattainable outside of a RAID.

  • Re:moving parts (Score:5, Informative)

    by wik ( 10258 ) on Tuesday February 20, 2007 @10:45PM (#18091610) Homepage Journal
    Not true. Transistors at really small dimensions (e.g., 32nm and 22nm processes) will experience soft breakdown during (what used to be) normal operational lifetimes. This will be a big problem in microprocessors because of gate oxide breakdown, NBTI, electromigration, and other processes. Even "solid-state" parts have to tolerate current, electric fields, and high thermal conditions and gradually break down, just like mechanical parts. Don't go believing that your storage will be much safer, either.
  • Re:moving parts (Score:2, Informative)

    by scoot80 ( 1017822 ) on Tuesday February 20, 2007 @10:56PM (#18091702) Journal
    Flash memory will last about 100,000 write cycles before you burn it out. As the parent mentioned, a controller writes the data to several different locations, at different times, thus increasing the lifetime. What this means, though, is that your flash disk has to be considerably bigger than the capacity it actually exposes.
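
    As a rough sense of scale, here is the arithmetic with made-up but plausible figures (drive size and daily write volume are assumptions; the 100,000-cycle figure is the one quoted above). With ideal wear leveling, the cycle limit is rarely what kills the drive.

        # Back-of-envelope flash endurance, assuming perfect wear leveling and
        # ignoring write amplification and over-provisioning overheads.
        capacity_gb = 32          # assumed drive capacity
        erase_cycles = 100_000    # per-block erase limit quoted above
        writes_per_day_gb = 20    # assumed daily write volume

        total_write_budget_gb = capacity_gb * erase_cycles        # 3,200,000 GB
        lifetime_days = total_write_budget_gb / writes_per_day_gb
        print(f"~{total_write_budget_gb / 1e6:.1f} PB of writes, "
              f"~{lifetime_days / 365:.0f} years at {writes_per_day_gb} GB/day")
        # ~3.2 PB of writes, ~438 years at 20 GB/day
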
  • by tedgyz ( 515156 ) * on Tuesday February 20, 2007 @10:58PM (#18091726) Homepage
    All the hard drives I installed in my family's computers have failed in the last 5 years - including mine. :-(

    "Waaaah!" they cry when I tell them there is no hope for the family photos, barring a media reclamation service == $$$

    I tell everyone: "Assume your hard drive will fail at any moment, starting now! What is on your hard drive that you would be upset if you never saw it again?"
  • by markov_chain ( 202465 ) on Tuesday February 20, 2007 @11:02PM (#18091760)
    I've never had a hard drive fail. I buy one new one a year and drop the smallest one. I run four at a time in a beige-box PC. They are a mix of all sorts of manufacturers (usually from a CompUSA sale for less than $0.30/GB).

    - I never turn off the PC.
    - The case has no cover.
  • Re:MTBF (Score:3, Informative)

    by kidgenius ( 704962 ) on Tuesday February 20, 2007 @11:16PM (#18091902)
    Well, I guess you don't really understand reliability, then. You also don't understand MTBF/MTTF (hint: they aren't the same). What they have said is a big "no duh" to anyone in the field. MTTF works regardless of whether or not your failure rate is constant with time. Also, there are failure distributions beyond just the exponential, such as the Weibull; the exponential is a special case of the Weibull. Using this distribution you can accurately calculate an MTTF. Now, the MTBF will not match the MTTF initially, but given enough time it will eventually match the MTTF. All of this information is very useful to anyone who actually knows what to do with those numbers.
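
    For the curious, the MTTF of a Weibull distribution has a simple closed form, MTTF = eta * Gamma(1 + 1/beta), where eta is the scale and beta the shape. A quick Python sketch with made-up parameter values (beta = 1 is the exponential special case; beta > 1 means the failure rate rises with age, which is what the field data shows):

        # Mean time to failure of a Weibull(eta, beta) failure distribution.
        from math import gamma

        def weibull_mttf(eta_hours: float, beta: float) -> float:
            """MTTF = eta * Gamma(1 + 1/beta)."""
            return eta_hours * gamma(1 + 1 / beta)

        for beta in (1.0, 1.5, 2.0):
            print(f"beta = {beta}: MTTF = {weibull_mttf(100_000, beta):,.0f} h")
        # beta = 1.0: MTTF = 100,000 h   (exponential case)
        # beta = 1.5: MTTF = 90,275 h
        # beta = 2.0: MTTF = 88,623 h
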
  • Re:MTBF (Score:4, Informative)

    by kidgenius ( 704962 ) on Tuesday February 20, 2007 @11:35PM (#18092064)
    No, they don't. Hard drive manufacturers state an MTTF, which is very different from MTBF. The two can be similar, but they are not interchangeable. The author of this paper has calculated MTBF and tried to compare it to MTTF, which is WRONG. They really should've consulted a reliability engineer; anyone worth their salt would see the difference. One of them varies with time; the other is static, unchanging with age.
  • by MadMorf ( 118601 ) on Wednesday February 21, 2007 @12:01AM (#18092246) Homepage Journal
    Most enterprise-level operations that rely on their data replace drives before they fail.

    You worked at an unusual place!

    I'm a Tech Support Engineer for a large storage system manufacturer and I can tell you that NONE of our customers replace disks before they fail unless our OS detects a "predictive failure" for the disk. Our customers are some of the biggest names in business from all over the planet.
  • Re:Amazing! (Score:1, Informative)

    by Anonymous Coward on Wednesday February 21, 2007 @12:07AM (#18092290)
    "RAID should make your hard disk access a lot faster. That is, unless you go for software RAID"

    This is wrong! SOFTWARE raid is faster. Why? Consider:
    - The CPUs one buys are usually the latest and greatest.
    - A 1.6GHz Athlon XP can process raid5 data at >3GB/s. This is significantly greater than your bus speed.
    - If you're waiting on a disk read, chances are your CPU isn't doing much anyway. (That said, you need to do very little to process a disk read; it's the disk writes that require checksumming.)
    - A raid controller adds an extra step into the disk->cpu latency
    - A raid card microprocessor is spec'ed at whatever rate is needed to max a bus, or, often, significantly less. This means that any processing needed will incur a higher latency than if the data were processed by the CPU.

    Roughly, for the hardware solution, all advantages are:
    - Data can be considered flushed once it reaches the RAID card rather than the disk, thanks to battery-backed RAM (this mainly matters for ACID databases on systems without a UPS or redundant power supplies)
    - Batch systems may see reduced CPU use. This highly depends on the device driver being well written.
    - Bus usage will be divided by 3 for small (sub ((n-1)/2)*block size, where n is the number of disks in the raid) writes, due to not having to do a read and write to update the parity.

    You'll note that all of these advantages are on writes! Also, the last advantage is less important than it may seem. Very few small random-I/O write-bound loads exist. (E.g., databases will try to rearrange data to make large linear writes, requiring a bus usage of n/(n-1) in the software case.)

    To reiterate, usually the issue with data access isn't bandwidth but latency. A hardware solution will not decrease this, except under specialised loads.
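
    For reference, the "processing" in RAID 5 is just a byte-wise XOR across the data strips, which is why a general-purpose CPU handles it so easily. A minimal Python sketch of parity generation and single-strip reconstruction (illustrative only, obviously not how a real RAID driver is written):

        # RAID 5 parity is the XOR of the data strips; any one missing strip
        # can be rebuilt by XOR-ing the parity with the surviving strips.
        def raid5_parity(strips: list[bytes]) -> bytes:
            parity = bytearray(len(strips[0]))
            for strip in strips:
                for i, byte in enumerate(strip):
                    parity[i] ^= byte
            return bytes(parity)

        def rebuild_missing(survivors: list[bytes], parity: bytes) -> bytes:
            return raid5_parity(survivors + [parity])

        data = [b"AAAA", b"BBBB", b"CCCC"]      # three data strips
        p = raid5_parity(data)                  # parity strip
        assert rebuild_missing([data[0], data[2]], p) == data[1]
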

  • by Detritus ( 11846 ) on Wednesday February 21, 2007 @12:09AM (#18092304) Homepage
    MTBF tells you the failure rate over the item's service lifetime, which, for hard disks, is commonly five years.
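
    Concretely, the way that vendor number is usually read is with a constant-failure-rate (exponential) assumption: the chance of surviving a service life of t hours is exp(-t / MTBF). The sketch below just does that arithmetic for the 1,000,000-hour figure; the paper's whole point is that the constant-rate assumption is optimistic, so treat this as the vendor's math, not reality.

        # Survival probability over a 5-year service life under the usual
        # exponential (constant failure rate) reading of MTBF.
        from math import exp

        MTBF_HOURS = 1_000_000
        SERVICE_LIFE_HOURS = 5 * 8760   # five years, powered on continuously

        p_survive = exp(-SERVICE_LIFE_HOURS / MTBF_HOURS)
        print(f"P(drive survives its 5-year service life) = {p_survive:.1%}")
        # about 95.7% -- noticeably rosier than the field data in the paper
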
  • Re:MTBF (Score:4, Informative)

    by angio ( 33504 ) on Wednesday February 21, 2007 @01:56AM (#18092954) Homepage
    Your statement doesn't make a lot of sense. a) Hard drives are, for all intents and purposes, non-repairable systems. Therefore, there *is* no repair; MTTF is the only useful metric. b) MTBF = MTTF + the time to repair. Assuming that's zero, then for any useful failure engineering, hard drive MTBF = hard drive MTTF. That's about all you've got if you're expressing the statistic as a single number. The reason MTBF is treated as a function of time is to cope with the assumption that the system is less reliable after a repair, which doesn't apply in this case.

    Now, you can have all sorts of distributions that you draw that mean from, but a mean is a mean.
  • by duffbeer703 ( 177751 ) * on Wednesday February 21, 2007 @02:09AM (#18093010)

    That may be the new 'theory', but we all know about theory vs. reality. Here in reality, if you put a couple of dozen new drives into service, you keep one or two spare hard drives to replace the ones that WILL fail in the first week, especially with the consumer-grade drives typical in workstation deployments. If you only have one dud out of twenty, it was a good rollout.

    This study looks pretty realistic to me; in fact it's better data than the Google paper's because they are looking at different usage scenarios. The study also jibes with vendors' warranty periods -- right around the 3-year mark (end of warranty), failures start going up.

    I take issue with your "real world vs. theory" argument about workstation disks versus server disks as well, only because I have my own numbers. Based on numbers that my company gathers for its 50,000 workstations, the disk failure rate is around 1.9% annually. (Still a lot of disks -- see the quick arithmetic at the end of this comment.) There are exceptions -- those numbers are driven upward by one deployment of workstations from a vendor that had a 22% failure rate (the PCs were replaced by the vendor). Server disks are in the same ballpark -- slightly less than 2%.

    Vendors provide more evidence of that fact. Many servers are being shipped with SATA disks, often the same as what you'll find in workstations. If SATA was less reliable, that would increase the vendor's support costs and they wouldn't ship them.

    You're totally right about RAID-5... it can be a dangerous thing for an inept admin. Bad disks often come in batches, and bad controllers can ruin your day. A redundant array of bad data isn't very helpful ;)
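
    (Quick arithmetic on the workstation figures above, just to show the scale; the inputs are the ones quoted in this comment.)

        # Expected annual disk replacements implied by the figures above.
        workstations = 50_000
        annual_failure_rate = 0.019     # ~1.9% per year

        print(f"~{workstations * annual_failure_rate:.0f} failed disks per year")
        # ~950 failed disks per year -- still a lot of disks
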

  • by ForestGrump ( 644805 ) on Wednesday February 21, 2007 @02:41AM (#18093142) Homepage Journal
    The Google paper was posted a day or two ago; let me find it.
    Here you go:
    http://hardware.slashdot.org/article.pl?sid=07/02/18/0420247 [slashdot.org]
  • by the_womble ( 580291 ) on Wednesday February 21, 2007 @05:25AM (#18093796) Homepage Journal
    There are some good reasons to shut down:

    1) Electricity consumption
    2) Power cuts (unless you have a UPS and software for a clean shutdown installed, what happens if there is a power cut while you are away?).
    3) Power fluctuations (my power supply blew dramatically after one a few months ago) and lightning.
    4) Heat (in a hot climate)
  • Re:Amazing! (Score:3, Informative)

    by petermgreen ( 876956 ) <plugwash@nOSpam.p10link.net> on Wednesday February 21, 2007 @05:30AM (#18093816) Homepage
    If anything, RAID should make your hard disk access a lot faster. That is, unless you go for software RAID, which will put a hit on your processor.
    AFAICT, Linux software RAID is actually pretty good nowadays, at least as long as you stick to the basic RAID levels.

    Beware of the very common fake-hardware controllers (really software, but with some BIOS and driver magic to make the array bootable and generally behave like hardware RAID from the user's point of view). These often have far worse performance than Linux software RAID, and many of them only support Windows.

  • by empaler ( 130732 ) on Wednesday February 21, 2007 @06:09AM (#18093962) Journal
    I actually only have good experiences with WD and was about to order a new batch of SATA disks (now-ish).
  • Re:That's wrong (Score:3, Informative)

    by petermgreen ( 876956 ) <plugwash@nOSpam.p10link.net> on Wednesday February 21, 2007 @06:13AM (#18093976) Homepage
    However, the failure point is reached at a random point within the distribution, so while the probability of another failure at any point in time is not zero, it is pretty small.

    There are three real dangers with raid

    The first is that arrays are typically built out of identical drives, usually drives from the same batch, and then all the drives are run for the same time periods. This means that if there is a design or manufacturing fault that causes a failure peak at a certain number of operational hours, there is a good chance that more than one drive in your array will fail at about the same time.

    The second is that the drives in an array are typically in one machine, running off one power supply (or one pair of redundant power supplies) and connected to one controller. This means that faults with other hardware in the machine can destroy multiple hard drives at once.

    The third is failure of the controller. In many cases the controller stores information on how the data is laid out in its own non-volatile memory (some better controllers do store it on the disks themselves). While this doesn't destroy the actual data, it can easily put it beyond the ability of non-experts to reassemble the array in a way that gets the data back (and if they make a mistake they can easily destroy the data they were trying to recover). There is also the problem that getting a suitable replacement controller may be difficult.
  • Re:Amazing! (Score:3, Informative)

    by drsmithy ( 35869 ) <drsmithy@nOSPAm.gmail.com> on Wednesday February 21, 2007 @06:37AM (#18094054)

    That is, unless you go for software RAID, which will put a hit on your processor.

    This myth needs to die. No remotely modern processor takes a meaningful performance hit from the processing overhead of RAID.

    However, I think if you're going to make the investment to go with RAID 5, then buying a proper hardware controller won't add a significant amount to the cost of your set up.

    Decent RAID5-capable controllers are hundreds of dollars. Software RAID is free and - in most cases - faster, more flexible and more reliable.

  • Re:Amazing! (Score:3, Informative)

    by drsmithy ( 35869 ) <drsmithy@nOSPAm.gmail.com> on Wednesday February 21, 2007 @06:41AM (#18094070)

    Uh sorta. Depends on the raid type. Striped will be faster, mirrored will be about as fast, raid 5 is gonna be the slowest, even in hardware.

    Compared to a single disk, RAID5 is still going to be faster (except perhaps for the odd corner-case here and there).

    Also, in many cases, software RAID5 is faster than hardware RAID5.

  • by asuffield ( 111848 ) <asuffield@suffields.me.uk> on Wednesday February 21, 2007 @09:29AM (#18094868)

    Love the RAID5 stat, though... Perhaps this study will finally convince people to only use RAID for performance or huge-JBOD reasons, never for (the illusion of) reliability.


    It's true that you should never buy anything for the illusion of reliability, but the article does not claim RAID is not a good way to get reliability.

    First, let's look at the common mistake when people think about RAID: "If the probability of a drive failure is X, then the probability of two drives in a RAID volume failing is X*X, which is much smaller". That's nonsense, as the article demonstrates - the probability is only X*X if the events are independent, which they are clearly not.

    But the idea was nonsense even before that. The statement is taking the wrong attitude to the problem - it is considering the probability of data loss at *one point in time*. That's not actually what you care about - if your server dies on Tuesday, it is no comfort to you that it did not die on Monday. Here is a more sensible way to look at what is going on (ignoring backups for the moment):

    Every drive is going to fail, typically within the first ten years of its life. So if you have a non-RAID system, the probability of data loss is 100% - certain. Really. Without RAID, sooner or later, you are going to lose that volume. What RAID gives you is a moderate chance of getting through the inevitable drive failures without losing the volume, and that's a chance that you never had at all without RAID. Different configurations can modify how large that chance is, but the essential feature of RAID is that you get the chance.

    So what do backups get you? It's basically the same thing, except that you've got to rebuild the server. So if you just have backups and no RAID, it is a certainty that sooner or later your server is going to have significant amounts of downtime while it's being rebuilt from the backup. If downtime bothers you, you need RAID, period. Exactly what kind of RAID depends on what chance you want to take (standard risk management calculation), but there's just no contest between "certain failure" and "chance of avoiding failure" - even a 10% chance of surviving a disk failure is infinitely better than no chance (and the actual figure should be much better than that).

    Lastly, what happens if you have RAID and no backups? It should be apparent that you get the same scenario as RAID with backups, only with a higher chance of failure. So there's no fundamental reason not to do that - line up the figures along with RAID+backup solutions in your risk management analysis, and pick the cheapest option for the level of risk you (or your insurance company) are willing to accept.

    The impact of this study is a nice improvement in the accuracy of that analysis. Neither more nor less. If you're running large servers, this would be a good time to pull out those numbers and take another look at them (if you don't have those numbers on file, this study is not for you).
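
    To make the parent's framing concrete, here is a tiny risk sketch in Python. Every number in it is an assumption for illustration (annual failure rate, years of service, replacement window), and the mirror calculation leans on exactly the independence assumption the article attacks, so read the RAID figure as a best case rather than a prediction.

        # Chance of losing the volume over a service period, with and without
        # a 2-way mirror, under naive independence assumptions.
        AFR = 0.03                      # assumed annual failure rate per drive
        YEARS = 5                       # assumed service period
        REBUILD_WINDOW_YEARS = 2 / 365  # assumed 2 days to replace a failed drive

        # Single drive: the volume is gone the first time the drive fails.
        p_loss_single = 1 - (1 - AFR) ** YEARS

        # 2-way mirror: loss requires the second drive to fail inside the
        # rebuild window after the first failure (rough, independent estimate).
        expected_first_failures = 2 * AFR * YEARS
        p_loss_mirror = expected_first_failures * (AFR * REBUILD_WINDOW_YEARS)

        print(f"single drive over {YEARS} years: ~{p_loss_single:.1%}")   # ~14%
        print(f"2-way mirror over {YEARS} years: ~{p_loss_mirror:.3%}")   # ~0.005%
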
  • by darCness ( 151868 ) * on Wednesday February 21, 2007 @10:31AM (#18095436)
    "There is this assumption permeating the whole society that if something is expensive, it _must_ automatically be better"

    This is known as the Veblen Effect [wikipedia.org] based on work by Thorstein Veblen [wikipedia.org].
  • Re:moving parts (Score:3, Informative)

    by Maximum Prophet ( 716608 ) on Wednesday February 21, 2007 @03:43PM (#18099856)
    If you look at the numbers for the failure of the system RAM and assume that most machines have much, much more disk space than RAM, SSDs don't make sense. They are faster, but you won't get better MTBFs. On the HPC1 and COM1 groups of machines, the memory was replaced almost as often as the hard drives. If you had to replace all that HD space with RAM, your failure rate would go through the roof.

"I've seen it. It's rubbish." -- Marvin the Paranoid Android

Working...