Backblaze Dishes On Drive Reliability In their 50k+ Disk Data Center

Online backup provider Backblaze runs hard drives from several manufacturers in its data center (56,224 of them, they say, by the end of 2015), and as you'd expect, the company keeps a close eye on how well they hold up. Yesterday they published a stats-heavy look at the performance, and especially the reliability, of all those drives, which makes fun reading even if you're only running a drive or ten at home. One upshot: they buy a lot of Seagate drives. Why? "A relevant observation from our Operations team on the Seagate drives is that they generally signal their impending failure via their SMART stats. Since we monitor several SMART stats, we are often warned of trouble before a pending failure and can take appropriate action. Drive failures from the other manufacturers appear to be less predictable via SMART stats."
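As a sketch of the kind of SMART-based early warning the Operations team describes: the attribute set below (5, 187, 188, 197, 198) is an assumption drawn from Backblaze's other published posts, not from this article, and the function is purely illustrative.

```python
# Hypothetical early-warning check over SMART attributes. The watched set
# is an assumption based on Backblaze's published material, not this article.
WATCHED_ATTRS = {
    5:   "Reallocated_Sector_Ct",
    187: "Reported_Uncorrect",
    188: "Command_Timeout",
    197: "Current_Pending_Sector",
    198: "Offline_Uncorrectable",
}

def smart_warnings(raw_values):
    """Return the watched attributes whose raw value is non-zero.

    raw_values: dict mapping SMART attribute ID -> raw value, e.g. as
    parsed from `smartctl -A` output.
    """
    return {WATCHED_ATTRS[attr]: value
            for attr, value in raw_values.items()
            if attr in WATCHED_ATTRS and value > 0}

# A drive reporting pending sectors would be flagged:
print(smart_warnings({5: 0, 187: 0, 197: 8, 194: 31}))
# -> {'Current_Pending_Sector': 8}
```

A non-zero raw value on any of these counters doesn't guarantee imminent failure, but per the quote above it is often warning enough to migrate data off the drive first.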
  • by damn_registrars ( 1103043 ) <damn.registrars@gmail.com> on Wednesday February 17, 2016 @02:09PM (#51528717) Homepage Journal
    Considering how awful their failure rates are in general, they need to get good at reporting them beforehand or they (as a company) won't exist much longer. After all, investing in quality is clearly too expensive...
  • Around here, Seagate 6TB disks cost about 50% more than WD Red NAS disks, and Hitachi disks are more expensive still. So all these graphs are basically in line with the old adage: you get what you pay for.

    The comment about Seagate's SMART being more on point seems to make those disks a nice compromise.

    Funny enough, there is this saying in Switzerland: "Sie geit oder sie geit ned." (where "Sie geit" sounds awfully close to "Seagate"), which roughly translates to "It works or it doesn't" and is a stab at the sometimes abysmal failure rates they had back when.

    • Funny enough, considering there is this saying in Switzerland: "Sie geit oder sie geit ned." (where "Sie geit" sounds awfully close to "Seagate") which roughly translates to "It works or it doesn't" and is a stab at the sometimes abysmal failure rates they had back when.

      Here in the USA, especially around the Monterey Bay Area where Seagate was (and still is) located, we just called them "Seizegate" for the tendency of their drives to fail due to stiction.

    • We have a rack-mountable QNAP NAS device that our field support people back up files to when they are rebuilding a workstation. We used 3TB Seagates from the compatibility list in it, and I had constant problems; we've replaced them with WD Reds, and the problems have gone away. In retrospect, seeing that Seagate drives report SMART events earlier, it makes sense that I had all the problems: the QNAP firmware drops, and refuses to reattach, any disk in an mdadm array that has SMART errors. Granted, if
  • Sorry WD fans (Score:5, Interesting)

    by Solandri ( 704621 ) on Wednesday February 17, 2016 @02:21PM (#51528831)
    Can't help but feel for all the people who read Backblaze's previous report, decided Seagate was junk, and bought WD instead. I tried to warn them that the model of the drive mattered more than the manufacturer, because each manufacturer tries new technologies and new cost-cutting strategies with each different model. Sometimes it works and the model is reliable. Sometimes it doesn't and the model is unreliable. But everyone was eager to jump on the bash-Seagate, praise-WD bandwagon and ignored me.

    Well, WD was least reliable this time around. The Seagate stats in the previous report were probably being skewed by just one or two bad models. Seagate's score is skewed this time by one bad model too, but with the passage of time that model makes up a tiny portion of their Seagate sample, so it doesn't spike the score like before. (You can pretty much ignore WD in the 4TB graph, as a sample size of just 46 drives means the confidence interval is a 0.3%-8.8% failure rate.)

    At least Backblaze addressed my criticism from before - they've broken down the stats to individual drive models. And you can see that, like I said, there's huge variability in reliability between models within a manufacturer's lineup. Now they just need to add confidence intervals to the graphs.
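For readers wondering why a 46-drive sample gives such a wide range, here is a hedged sketch using the Wilson score interval; the method behind the 0.3%-8.8% figure above isn't stated in the report, and the failure count used here is a made-up illustration.

```python
from math import sqrt

def wilson_interval(failures, n, z=1.96):
    """Approximate 95% Wilson score interval for a binomial failure rate."""
    p = failures / n
    denom = 1 + z * z / n
    center = p + z * z / (2 * n)
    margin = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return (center - margin) / denom, (center + margin) / denom

# With only 46 drives and, say, one observed failure, the interval is very
# wide -- a single failure either way swings the estimated rate a lot.
lo, hi = wilson_interval(1, 46)
print(f"{lo:.1%} - {hi:.1%}")
```

With tens of thousands of drives of a popular model, the same calculation collapses to a narrow band, which is why the fleet-wide numbers are meaningful while the 46-drive WD sample isn't.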
    • I wish I had seen Backblaze's previous report. I have a whole lot of Seagate paperweights [dadatho.me]. I couldn't do anything but laugh when one of their SNs ended in FML [dadatho.me]

      In comparison, all of the WD Reds that I bought to replace those (and their warrantied replacements) are still going strong. I did everything 'right': spread out my purchases, bought from Newegg and Amazon, kept them cool, etc. Of the 12 or so 2 and 3TB Seagate drives, my current FreeNAS machine has all of 1 or 2 still running. And one of

    • I don't know if I'd say it was 1 or 2 bad models that plagued Seagate. When I buy drives, I go by the ratings on Amazon and Newegg, and regardless of the drive model there always seem to be a lot more reviews of Seagate drives failing than of other brands.

      • by Gondola ( 189182 )

        The problem with this tactic is that manufacturers will change their manufacturing methodology over time. An extremely well-reviewed model can be replaced later in its product life by a worse version that retains the same exact model number. If you go to NewEgg and Amazon and look at hard drive reviews for the best drives, then look at only the more recent reviews, you may see a big drop in the average rating for some models. Bait and switch. So, be careful!

    • by epine ( 68316 )

      Can't help but feel for all the people who read Backblaze's previous report and decided Seagate was junk and bought WD instead.

      Why feel for them? By your own inefficient market hypothesis, every course of action is a crap shoot. The report was great for me, because we actually had one or two of those highly suspect drives in service.

      But in the larger scheme, you're absolutely right. Every vendor has manufactured a few duds. IBM, Hitachi, Seagate, Western Digital. Every company has made some poor models

    • The 3TB Seagate (ST3000DM001) wasn't in the main table because it had a 28%/year failure rate and they've all been retired. It's not that they bought a small number of them; they ripped them out, and I've been doing the same. The 4TB Seagates have been about average in reliability.

  • by FirstOne ( 193462 ) on Wednesday February 17, 2016 @02:36PM (#51528969) Homepage

    ""When will your hard drive fail" [slashdot.org]

    I pointed out that Blackblaze chassis configuration improperly stressed the fragile SATA/Power connectors by implementing a vertical disk drive mounting configuration, [slashdot.org].
    Where the mass of drive(&vibration) is placed upon the fragile SATA data and power connectors.

    This type of vertical drive storage/raid cabinet is not conducive for long term/reliable drive lifespan., thus any number of other factors could kick in and cause a premature failure.

    • Re: (Score:3, Insightful)

      by Anonymous Coward

      Considering they are hitting 5-6 years on a decent population of their drives, I think they are doing OK.

  • I'm impressed by the HGST drives, less than 1% failure rate. I haven't touched the Deskstar line of drives since the IBM Deathstar debacle, but I think it's time to take a second look. Hopefully they have not switched over to Western Digital's technology.

    • by tlhIngan ( 30335 )

      I'm impressed by the HGST drives, less than 1% failure rate. I haven't touched the Deskstar line of drives since the IBM Deathstar debacle, but I think it's time to take a second look. Hopefully they have not switched over to Western Digital's technology.

      Well, HGST drives are still more expensive than Seagate or WD drives of similar capacity.

      Remember a hard drive is a very high precision mechanical device that has traditional economic pressures applied to them - everyone wants more for less dollars. So the

  • Bad sectors? (Score:5, Interesting)

    by nbritton ( 823086 ) on Wednesday February 17, 2016 @03:00PM (#51529253)

    What is Backblaze doing to check the drives for bad sectors? I manage a 10,000-disk OpenStack Swift installation, and I've noticed that automatic sector remapping doesn't always work correctly: a portion of drives (maybe 3%) have a few bad sectors that need to be manually remapped using ddrescue. I ended up having to write a custom monthly cron job script that runs badblocks to first identify these drives, and then ddrescue to force a sector remap.
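A minimal sketch of the badblocks-then-remap workflow the parent describes, assuming `badblocks` output of one bad block number per line. The device path is a placeholder, and the generated rewrite commands are dd-style stand-ins for the commenter's ddrescue step, shown only to illustrate the mechanism.

```python
# Sketch: parse the bad-LBA list that `badblocks` prints, then emit one
# rewrite command per block. Overwriting an unreadable sector is what
# triggers the drive's auto-reallocation onto a spare sector.

def remap_commands(badblocks_output, device="/dev/sdX", block_size=4096):
    """Turn `badblocks -b 4096` output into per-block rewrite commands.

    device and block_size are hypothetical placeholders; a real script
    would derive them from the drive being scanned.
    """
    bad = [int(line) for line in badblocks_output.split() if line.isdigit()]
    return [
        f"dd if=/dev/zero of={device} bs={block_size} count=1 "
        f"seek={lba} conv=notrunc"
        for lba in bad
    ]

for cmd in remap_commands("1975216\n1975217\n"):
    print(cmd)
```

Note that this destroys the data in the affected blocks, so in practice you would only force a rewrite after the filesystem or object store has re-replicated the data elsewhere.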

    • It may be different with 10,000 disks vs 4 disks, but I wouldn't trust a drive once it has one remapped (or pending remap) sector. I'd be worrying about replacing it, not remapping, because it tends to be a sign of impending failure.

      • It may be different with 10,000 disks vs 4 disks, but I wouldn't trust a drive once it has one remapped (or pending remap) sector. I'd be worrying about replacing it, not remapping, because it tends to be a sign of impending failure.

        Of the drives with sector errors (n = 286), the number of bad sectors typically ranged from 4 to 16, with a median of 8. However, values above 25 bad sectors were statistical outliers, more than 3 standard deviations from the mean. Our policy now is to replace any drive with more than 25 bad sectors.
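The outlier rule above can be sketched as follows; the fleet numbers below are made up, and only the 3-standard-deviation idea comes from the parent comment.

```python
from statistics import mean, stdev

def sector_outliers(bad_sector_counts, sigmas=3):
    """Flag drives whose bad-sector count is more than `sigmas` standard
    deviations above the fleet mean, per the parent's replacement policy."""
    mu = mean(bad_sector_counts)
    sd = stdev(bad_sector_counts)
    cutoff = mu + sigmas * sd
    return cutoff, [c for c in bad_sector_counts if c > cutoff]

# Hypothetical fleet: most drives show 4-16 bad sectors, one shows far more.
counts = [4, 6, 8, 8, 8, 10, 12, 16] * 4 + [90]
cutoff, flagged = sector_outliers(counts)
print(round(cutoff), flagged)
```

Deriving a fixed cutoff like "25 bad sectors" from the distribution, as the parent did, then makes the replacement decision a simple threshold check instead of recomputing statistics per drive.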

  • by FlyHelicopters ( 1540845 ) on Wednesday February 17, 2016 @03:19PM (#51529429)

    All things fail, including hard drives. The question isn't "if", it is "when".

    Picking between WD and Seagate hoping to get a "good drive" is missing the point: what happens when both drives fail?

    Do you have your data backed up?

    I run both Crashplan and Backblaze, I also have a copy stored on Amazon Glacier and important files on OneDrive. I also have two external drives that I rotate backups on and keep unplugged.

    For most people, what I do is "overkill", but I've lost data before... never again...

    • I think the point here isn't that there's a drive or manufacturer out there that doesn't fail. The point is that with such a huge sample, you can draw somewhat useful trends and comparisons between failure rates on a macro scale that no ordinary user would be able to produce themselves. If you look at 56,000 disks and see that Seagate accounts for a larger percentage of drives and a lower equivalent failure rate among manufacturers, you can *generally* expect that buying a drive of an equivalent model a
      • While those are fair points, and good advice... I still have a concern...

        I don't think there is a large enough disclaimer that Backblaze runs their equipment in a 24/7 environment that is quite different than most users. Oh sure, they say it and it is there, but I think it deserves highlighting.

        If you look at the percentage failure rates, they are higher across the board than what I've seen. Sure, drives fail, but honestly I have some of those same Seagate drives in a server here and they have been running

        • The data might be from more rigorous conditions, but that doesn't make it useless. If a drive model exhibits a low failure rate even under supposedly awful conditions, then that reflects even better on the drive. If anything, I'd be more concerned about ways in which their environment is better than a typical consumer environment, such as how a forced-airflow server in a temperature-controlled datacenter is probably going to keep the drives at a better (or at least more consistent) temperature than some ran
        • by drsmithy ( 35869 )

          I suspect Backblaze is quite hard on drives and the rates are worse than you'd see outside of that environment. It is also worth noting that those drives are not all installed in the same type of "pod". Backblaze has changed pod designs a few times and now uses an "anti-vibration" system they didn't used to.

          Your typical home desktop/server drive is likely to see a far harsher life than your average Backblaze drive.

  • We actually didn’t retire these 1TB WD drives – they just changed jobs. We now use many of them to “burn-in” Storage Pods once they are done being assembled. The 1TB size means the process runs quickly, but is still thorough. The burn-in process pounds the drives with reads and writes to exercise all the components of the system. In many ways this is much more taxing on the drives than life in an operational Storage Pod. Once the “burn-in” process is complete, the WD 1TB

  • by Fencepost ( 107992 ) on Wednesday February 17, 2016 @04:27PM (#51529955) Journal
    One of the significant notes is that it seems the Seagate 4TB drives are doing much better than some earlier versions, and that WD is no longer doing so well.

    Another thing that gets brought up every time one of these is released is "Why are they still using Seagate drives if they're so bad?" and the answer is simple: it remains a balancing act between cost and reliability. Backblaze has the redundancy and processes in place to not worry about single-drive failures, so FOR THEIR USAGE the lower drive cost is more important. If you're on a smaller setup where you have everything on just a few drives with inadequate redundancy, a few dollars extra for better reliability is worth the cost.

    When you really get down to it Backblaze is looking at cost per gigabyte per day, and if ($LESS_RELIABLE_DRIVE_COST + $DRIVE_REPLACEMENT_COST) is lower than ($MORE_RELIABLE_DRIVE_COST) then they're going with the cheaper option.
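That break-even comparison can be sketched with made-up numbers; all prices, labor costs, and failure rates below are hypothetical illustrations, not Backblaze's figures.

```python
def effective_cost(drive_price, replace_labor, annual_failure_rate,
                   capacity_tb, years=5):
    """Expected cost per TB over a service life, folding in the expected
    number of replacements. All inputs here are hypothetical."""
    expected_replacements = annual_failure_rate * years
    total = drive_price + expected_replacements * (drive_price + replace_labor)
    return total / capacity_tb

# Made-up comparison: a cheap drive failing at 5%/yr vs a pricier one at 1%/yr.
cheap = effective_cost(110, 25, 0.05, 4)
reliable = effective_cost(160, 25, 0.01, 4)
print(round(cheap, 2), round(reliable, 2))
```

With these particular made-up inputs the cheaper drive still wins despite the higher failure rate, which is the comment's point: at Backblaze's scale, replacement is routine and cheap, so raw purchase price dominates.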
    • by AmiMoJo ( 196126 )

      For home use it's worth paying a little more for a Hitachi (HGST) drive. They are owned by WD, but use different tech, different factories, etc. You pay more but get better reliability.

  • A: They're cheap
    B: They scream really loud before they die, hopefully when someone's listening.
    C: They're cheap.

    I'll stick with Western Digital and HGST.

    If they die off that infrequently in their sweatbox environments, the chances that they're going to die under normal desktop use are orders of magnitude less.

  • by dbIII ( 701233 ) on Wednesday February 17, 2016 @09:25PM (#51531833)
    Consider the conditions - this is selecting for the environment of a lot of drives packed into poorly ventilated cases so those that cope best with heat will win.
    While heat over time is a common cause of drive failure there are others, so the results are not so useful for drives in desktop cases or in well ventilated servers (eg. ones with hot-swap bays so there is no way to pack the drives in as densely as Backblaze do).
  • SMART monitoring is where modern OSes utterly fail; it should be a core part of OS functionality. The OS should warn you when a SMART stat goes bad, but MS et al. would rather put some stupid shopping experience into the OS instead.
