Data Storage / Hardware

Everything You Know About Disks Is Wrong

modapi writes "Google's wasn't the best storage paper at FAST '07. Another, more provocative paper looking at real-world results from 100,000 disk drives got the 'Best Paper' award. Bianca Schroeder, of CMU's Parallel Data Lab, submitted Disk failures in the real world: What does an MTTF of 1,000,000 hours mean to you? The paper crushes a number of (what we now know to be) myths about disks such as vendor MTBF validity, 'consumer' vs. 'enterprise' drive reliability (spoiler: no difference), and RAID 5 assumptions. StorageMojo has a good summary of the paper's key points."
  • MTBF (Score:5, Interesting)

    by seanadams.com ( 463190 ) * on Tuesday February 20, 2007 @09:36PM (#18090970) Homepage
    MT[TB]F has become a completely BS metric because it is so poorly understood. It only works if your failure rate is constant with respect to time. Even if you test for a stupendously huge period of time, it is still misleading because of the bathtub curve effect. You might get an MTBF of, say, two years, when the reality is that the distribution has a big spike at one month, with the rest of the failures forming a wide bell curve centered at, say, five years.

    Suppose a tire manufacturer drove their tires around the block, and then observed that not one of the four tires had gone bald. Could they then claim an enormous MTBF? Of course not, but that is no less absurd than the testing being reported by hard drive manufacturers.
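    For a sense of scale, here is a minimal back-of-the-envelope sketch (my own arithmetic, not anything from the paper) of what a 1,000,000-hour MTTF would imply if the failure rate really were constant -- which is exactly the assumption the bathtub curve breaks:

    ```python
    # Minimal sketch, assuming a constant failure rate (exponential lifetime model).
    # This is the assumption behind quoting a single MTTF number; it ignores the
    # infant-mortality spike and the wear-out tail described above.
    import math

    HOURS_PER_YEAR = 24 * 365  # 8760

    def implied_afr(mttf_hours: float) -> float:
        """Probability a drive fails within one year under a constant hazard rate."""
        return 1.0 - math.exp(-HOURS_PER_YEAR / mttf_hours)

    for mttf in (1_000_000, 1_500_000):
        print(f"MTTF {mttf:>9,} h -> implied AFR {implied_afr(mttf):.2%}")
    # ~0.87% and ~0.58% per year -- the single number says nothing about *when*
    # in a drive's life those failures actually cluster.
    ```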
  • i'll tell you (Score:3, Interesting)

    by User 956 ( 568564 ) on Tuesday February 20, 2007 @09:43PM (#18091042) Homepage
    Bianca Schroeder, of CMU's Parallel Data Lab, submitted Disk failures in the real world: What does an MTTF of 1,000,000 hours mean to you?

    It means I should be storing my important, important data on a service like S3. [amazon.com]
  • Re:moving parts (Score:2, Interesting)

    by Nimloth ( 704789 ) on Tuesday February 20, 2007 @09:57PM (#18091188)
    I thought flash memory had a lower read/write cycle expectancy before crapping out?
  • by Anonymous Coward on Tuesday February 20, 2007 @10:02PM (#18091232)
    Also, residential power is less clean than datacenter power. Bad power can take out the drive electronics.
  • Re:MTBF (Score:4, Interesting)

    by gvc ( 167165 ) on Tuesday February 20, 2007 @10:04PM (#18091252)

    MT[TB]F has become a completely BS metric because it is so poorly understood. It only works if your failure rate is constant with respect to time. Even if you test for a stupendously huge period of time, it is still misleading because of the bathtub curve effect. You might get an MTBF of, say, two years, when the reality is that the distribution has a big spike at one month, with the rest of the failures forming a wide bell curve centered at, say, five years.
    The simplest model for survival analysis is that the failure rate is constant. That yields an exponential distribution, which I would not characterize as a bell curve. The Weibull distribution more aptly models things (like people and disks) that eventually wear out; i.e. the failure rate increases with time (but not linearly).

    With the right model, it is possible to extrapolate life expectancy from a short trial. It is just that the manufacturers have no incentive to tell the truth, so they don't. Vendors never tell the truth unless some standardized measurement is imposed on them.
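    To make the distinction concrete, here is a small illustrative sketch; the shape and scale parameters are invented for demonstration, not fitted to any drive data:

    ```python
    # Illustrative only: instantaneous failure (hazard) rates for an exponential
    # model vs. a Weibull model with shape > 1. Parameters are made up, not fitted.

    def exponential_hazard(t_hours: float, mttf_hours: float) -> float:
        # Constant hazard: the same failure rate at any age.
        return 1.0 / mttf_hours

    def weibull_hazard(t_hours: float, shape: float, scale: float) -> float:
        # h(t) = (k / lam) * (t / lam)**(k - 1); increases with age when k > 1.
        return (shape / scale) * (t_hours / scale) ** (shape - 1)

    for years in (1, 2, 3, 4, 5):
        t = years * 8760
        print(f"year {years}: exponential {exponential_hazard(t, 1_000_000):.2e}/h, "
              f"Weibull {weibull_hazard(t, shape=1.6, scale=150_000):.2e}/h")
    ```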

  • Cyrus IMAP (Score:3, Interesting)

    by More Trouble ( 211162 ) on Tuesday February 20, 2007 @10:05PM (#18091258)
    From StorageMojo's article: Further, these results validate the Google File System's central redundancy concept: forget RAID, just replicate the data three times. If I'm an IT architect, the idea that I can spend less money and get higher reliability from simple cluster storage file replication should be very attractive.

    For best-of-breed open source IMAP, that means Cyrus IMAP replication.
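    As a rough intuition for why plain replication can win, here is a toy sketch under assumptions that are entirely my own (independent failures, no re-replication during the window considered) -- the very assumptions that correlated failures and slow rebuilds violate in practice:

    ```python
    # Toy model, assuming independent disk failures and no re-replication within
    # the window considered. Real systems re-create lost copies quickly, so this
    # is pessimistic; correlated failures push the other way.
    def p_lose_all_copies(p_disk: float, n_copies: int) -> float:
        """Probability that every copy's disk fails within the same window."""
        return p_disk ** n_copies

    p = 0.05  # assumed per-disk failure probability over the window (made up)
    for n in (1, 2, 3):
        print(f"{n} copy/copies: data-loss probability {p_lose_all_copies(p, n):.6%}")
    ```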
  • by gelfling ( 6534 ) on Tuesday February 20, 2007 @10:32PM (#18091474) Homepage Journal
    I wonder if anyone looked at what actually failed in the drives? An arm, a platter, an actuator, a board, an MPU?

    Would an analysis tell us that SSDs are not only faster but more reliable and if so by how much?
  • by Lumpy ( 12016 ) on Tuesday February 20, 2007 @10:38PM (#18091516) Homepage
    Or she forgot to put in the part that enterprise drives are replaced on a schedule BEFORE they fail. At Comcast I used to have 30-some servers with 25-50 drives each scattered about the state. Every hard drive was replaced every 3 years to avoid failures. These servers (TV ad insertion servers) made us between $4,500 and $13,000 a minute while they were in operation, in spurts of 15 minutes down, 3-5 minutes inserting ads. Downtime was not acceptable, so we replaced them on a regular basis.

    Most enterprise-level operations that rely on their data replace drives before they fail. In fact, the replacement rate was increased to every 2 years, not for failure prevention but for capacity increases.
  • by RebornData ( 25811 ) on Tuesday February 20, 2007 @10:43PM (#18091588)
    What's interesting to me is that neither of these papers mentions the issue of pre-installation handling. The good folks over at Storage Review [storagereview.com] seem to be of the opinion [storagereview.com] that the shocks and bumps that happen to a drive between the factory and the final installation are the most significant factor in drive reliability (much more than brand, for example).

    The Google paper talks a bit about certain drive "vintages" being problematic, but I wonder if they buy drives in large lots, and perhaps some lots might have been handled roughly during shipping. If they could trace each hard drive back to the original order, perhaps they could look to see if there's a correlation between failure and shipping lot.

    -R

  • by Anonymous Coward on Tuesday February 20, 2007 @10:51PM (#18091652)
    I'm particularly pleased to see a stake driven through the heart of "SCSI disks are more reliable."

    I have been saying that for at least 10 years. Back then I worked at a large government contractor and we set up what was then a very large 2 TB array of SCSI drives (about 100 drives). Those damn things were "industrial grade," certified by a large, well-known server vendor, yet we were losing 2 or 3 drives per day for several months. Totally ridiculous, because I extrapolated the failure rates of IDE drives from another government setup and found they were actually much better than the SCSI drives, and they weren't even rated for heavy-duty usage.

    Of course prior to this article the group-think Slashweenies would moderate me into oblivion (probably will anyway, but meh).
  • by DarkVader ( 121278 ) on Tuesday February 20, 2007 @11:02PM (#18091770)
    1 in 20 drive failures? What are you using, Western Digital drives? I don't see anything close to that failure rate, more like 1 in 300.

    I don't deploy "enterprise" drives, they're overpriced, and the few I did install years ago proved to be less reliable than "consumer" drives. My real world experience is that the "consumer" drives are generally reliable, I just plan on a 2-3 year replacement schedule.

    I can't disagree with RAID being fallible depending on what takes out the drive, though.
  • Re:moving parts (Score:4, Interesting)

    by blackest_k ( 761565 ) on Tuesday February 20, 2007 @11:37PM (#18092090) Homepage Journal
    Still doesn't mean it will last: I've got a 1 GB USB flash drive here, dead in less than 8 weeks with very few reads and writes. It will not identify itself. It might have 99,900 write cycles left, but it's still trashed.
    Let's face it, there is no reliable storage medium; the only way to be safe is multiple copies.

       
  • by Reziac ( 43301 ) * on Tuesday February 20, 2007 @11:53PM (#18092200) Homepage Journal
    Well, I can contribute my own anecdotes ;) Once they're fully set up, my everyday machines are never powered down again (except to upgrade the hardware), nor do the HDs spin down. They are also on good quality power supply units, AND are protected by a good UPS, AND have good cooling. Those 3 points can make all the difference in the world to their longevity, regardless of use patterns.

    Right now my everyday HDs number thus:

    6.4GB W.D. -- new in 1998, has always run 24/7. No SMART but probably has upward of 70,000 hours uptime. (Its identical twin failed about a year ago, but it had always clanked louder while doing thermal recalibration. This one is still quiet.)

    8.4GB W.D. -- new in 1998, used about 12hrs/day thru 2002, offline 2002-2006, running 24/7 for the past year. No SMART but probably has about 25,000 hours uptime.

    45GB W.D. -- SMART data: 42093 hours uptime, 181 power cycles (mainly as hard resets).

    40GB W.D. -- SMART data: 3919 hours uptime, 197 power cycles. (Dated 2002; found in trash in 2006)

    60GB W.D. -- SMART data: 28056 hours uptime, 100 power cycles (mainly as hard resets)

    Running 24/7 pretty much eliminates thermal stress and the "what do you mean you're not powering up today?!!" that happens sometimes with older HDs.

    Other points of conventional wisdom about running fulltime:
    1) "It causes more bearing wear." I wonder if that's so -- might the lubricant stay better distributed when it never chills down and never gets a chance to settle and congeal??
    2) "It's more likely to stiction if it does sit til it's cold." In my experience it's the opposite -- the HD with only intermittent use is far more likely to stiction, and sometimes can be cured permanently by letting 'em run for a few days solid.

    One of the points in TFA was that over 40% of RMA'd HDs proved to have nothing wrong with them. This is in line with my own observations (in fact, closer to 100% in SOHO/home-user environments) -- many supposed HD failures are actually user or software errors, not the hardware at all.

    I don't know that this is at all helpful :) But my recommendation to my clients is that if they don't want to run 24/7, they should not power the machine on and off more than once a day.
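    If you want to pull the same power-on-hours and power-cycle counts quoted above, something like the sketch below works; it assumes Linux with smartmontools installed, root privileges, and an ATA drive at /dev/sda, and the column layout varies by drive, so treat the parsing as best-effort:

    ```python
    # Best-effort sketch: read Power_On_Hours and Power_Cycle_Count via smartctl.
    # Assumes smartmontools is installed and the script runs with root privileges;
    # attribute names and columns differ between vendors and drive types.
    import subprocess

    def smart_counters(device: str = "/dev/sda") -> dict:
        out = subprocess.run(["smartctl", "-A", device],
                             capture_output=True, text=True, check=False).stdout
        wanted = {"Power_On_Hours", "Power_Cycle_Count"}
        counters = {}
        for line in out.splitlines():
            fields = line.split()
            if len(fields) >= 10 and fields[1] in wanted:
                counters[fields[1]] = fields[9]  # RAW_VALUE column
        return counters

    if __name__ == "__main__":
        print(smart_counters())
    ```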

  • by bill_mcgonigle ( 4333 ) * on Wednesday February 21, 2007 @12:33AM (#18092480) Homepage Journal
    Well, the article actually says that drives don't have a spike of failures at the beginning.

    Hmm, the Google paper says they do, from 3-6 months (Figure 2).

    Which leaves us with confirmation that 50% of all studies are wrong.
  • by cats-paw ( 34890 ) on Wednesday February 21, 2007 @12:52AM (#18092598) Homepage
    I keep hearing this persistent rumor that it's disk spin-up which is the most significant contribution to disk failure. The moral of the story is that systems which are left on 24/7 are less likely to see HD failures than systems turned on/off everyday.

    Now if that's really true, wouldn't it be quite simple for the manufacturers to spin up the disk more slowly by putting in very simple and reliable motor control circuitry?

    Does anyone have any real evidence, i.e. not anecdotal, that this is really true?

  • Re:forget RAID? (Score:2, Interesting)

    by Ragin'Cajun ( 135704 ) on Wednesday February 21, 2007 @01:50AM (#18092902) Homepage
    I used to work at a company that made network-attached storage appliances. Amazingly enough, one source of drive failures was the hot spare spinning up! The current draw during the spinup would cause a voltage dip on the power plane, which could lead to a read or write error on one of the neighbouring drives. Unfortunately, the most common cause of the hot spare spinning up was...another drive failing. So suddenly a second drive fails because of a read or write error.

    The thing is, sometimes getting a read error doesn't actually mean the media is bad. There could have been some power fluctuation during the write, so the checksum doesn't match the data and the drive's controller returns a failure during the read. But if you rewrite that sector, it will be fixed (e.g. during an unconditional format).
  • Re:Amazing! (Score:3, Interesting)

    by Kadin2048 ( 468275 ) <slashdot.kadin@xox y . net> on Wednesday February 21, 2007 @01:58AM (#18092968) Homepage Journal
    Somewhere around I have an Apple 20MB hard drive that is getting on 15 years old. Sure, it hasn't seen a lot of usage recently, but I still fire it up every once in a while. (It makes the greatest turbine-like startup sound; seriously, it's like a 747.) Connects to the floppy disk controller. Has its own power supply.

    I'm sure there are people around with even older, still-working-fine gear. A while back, I saw some DEC disk packs for the early removable-platter hard drives selling on eBay, as pulls-from-working equipment. I'm not sure what exactly was going through the minds of the designers when they were building stuff, a decade or two ago, but they just seemed to not be planning for obsolescence in the same way that the people churning out today's disposable gear are. (Although the sample is clearly biased: looking at the 20-year-old gear from 1986 that's still around today might make you think that everything then was bulletproof, but in reality all the crappy stuff is already 30 feet down in some landfill somewhere.)

    I suspect in 20 years, people will look back at 2006 gear as the height of reliability, just because it'll only be the really exceptionally well-built pieces of gear that will still be around. The Deathstars and other crap drives that failed will long be forgotten.
  • by yoprst ( 944706 ) on Wednesday February 21, 2007 @02:05AM (#18092988)
    It's broadcasting, dude! No downtime is allowed. Here in Soviet Russia we (broadcasters) do exactly the same, except that we prefer 2-year period.
  • Re:Amazing! (Score:3, Interesting)

    by 10Ghz ( 453478 ) on Wednesday February 21, 2007 @07:30AM (#18094248)
    "If anything, RAID should make your hard disk access a lot faster. That is, unless you go for software RAID, which will put a hit on your processor."

    Since we are talking about IO-bound operations, does that matter? I mean, the CPU is hardly ever the bottleneck these days; the hard drive quite often is. So even if soft-RAID puts more load on the CPU, does it cause any slowdown? Especially if it makes IO faster?
  • by Anonymous Coward on Wednesday February 21, 2007 @08:02AM (#18094394)
    Unfortunately, this paper is severely flawed. Similar to the Google paper, it is written by academics with little understanding of the subject matter, but a strong desire to publish lengthy papers.

    To write a meaningful paper, there is a lot of data about the drives and the systems they are used in that needs to be collected. These are initial conditions and operating conditions that any real system scientist will tell you cannot be ignored (to say the least). One cannot look at drives in the abstract, but must look at many details of how they are used, including the storage systems they are part of.

    Google, to their credit, did collect the SMART measurements. That is a good start, but not sufficient data to support the conclusions of the Google paper.

    For example, the orientation of each drive needs to be taken into account. What percentage of the drives analyzed were mounted horizontally vs. vertically? How were the drives themselves mounted? Specific mounting techniques result in a greater incidence of particular failure patterns. How were the drives cooled? Particular cooling techniques similarly result in specific failure patterns. What sort of data usage patterns were in use? What levels of RAID were used across the various drives?

    I see no measurements of vibration in this paper. Drive orientation and drive vibration (including system-based vibration) are two factors that are very important in determining drive reliability. Drives have a certain resistance to vibration (and shock) that varies based on the directionality of the vibration.

    We also see no meaningful treatment of the conditions for the HPC1, COM1, and COM2 systems. In HPC1 and COM1 we see massive failure levels for memory, likely indicating severe heat problems in those systems. In the COM2 system, we see a very high incidence of motherboard failure, again most likely indicating heat problems (or possibly bad caps). Specific heat conditions are operating conditions for drives that must be taken into account. Maybe the early onset of wear-out degradation is at least in part due to heat?

    I have merely touched on several important elements of study that were neglected in both papers. To gain a real understanding of drive failure in the "real world", real and comprehensive data is needed first. Otherwise we are dealing with merely variations on the "GIGO / Garbage In Garbage Out" theme.

    Also, I see a number of irrational conclusions being put forth by readers -- no value in RAID, just replicate your data 3 times? This sounds a bit like how to get home from Oz. It works in the movies. But it doesn't work as well in real life.

    RAID1 is a very solid solution for many businesses (and their corresponding data usage models), especially if there is a hot spare on the system as well. Many studies have shown the business value of the simple, transparent, low-cost redundancy that RAID1 delivers. Even simple probability theory will tell you that RAID1 has clear potential for reliability improvements (that are well measured and proven in the real world).
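    That "simple probability theory" can be sketched in a few lines; this is a deliberately naive independence model (the failure probability, rebuild window, and factor of two for either drive failing first are all assumptions of mine), so read it as an illustration rather than a sizing tool:

    ```python
    # Naive sketch: a 2-way mirror loses data only if one drive fails and the
    # surviving drive also fails before the rebuild onto a spare completes.
    # Assumes independent failures and a roughly uniform hazard over the year.
    def mirror_annual_loss(p_drive: float, rebuild_hours: float) -> float:
        p_second_during_rebuild = p_drive * (rebuild_hours / 8760.0)
        return 2.0 * p_drive * p_second_during_rebuild  # either drive may fail first

    p = 0.04        # assumed per-drive annual failure probability (illustrative)
    rebuild = 12.0  # assumed hours to rebuild onto a hot spare
    print(f"single drive: {p:.2%}, mirrored pair: {mirror_annual_loss(p, rebuild):.5%}")
    ```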

    I see a lot of analysis of RAID5 which people in the real world know is not a good choice for data that matters. There is no sane recovery procedure for RAID5. The drive access patterns tend to result in a lot of vibration as well.

    Overall, I am disappointed that with all the investment that large organizations make in purchasing and deploying storage, they seem to have no one in their organization that (1) understands the mechanics and physics of even a single disk drive, (2) understands the concept of initial conditions, (3) understands the concept of operating environment/conditions (4) has the willingness to make actual measurements vs. barf up a bunch of hearsay, and (5) truly wants to understand the reliability of storage systems vs. take pot shots at the drive industry.

    Each of these papers, CMU's and Google's, is incomplete. There is not enough data to support the conclusions. There is not even enough data to support almost any conclusion beyond the basic observation, "drives fail, some days more than others."
