Proposed Disk Array With 99.999% Availability For 4 Years, Sans Maintenance 258
Thorfinn.au writes with this paper from four researchers (Jehan-François Pâris, Ahmed Amer, Darrell D. E. Long, and Thomas Schwarz, S. J.), with an interesting approach to long-term, fault-tolerant storage: As the prices of magnetic storage continue to decrease, the cost of replacing failed disks becomes increasingly dominated by the cost of the service call itself. We propose to eliminate these calls by building disk arrays that contain enough spare disks to operate without any human intervention during their whole lifetime. To evaluate the feasibility of this approach, we have simulated the behaviour of two-dimensional disk arrays with N parity disks and N(N – 1)/2 data disks under realistic failure and repair assumptions. Our conclusion is that having N(N + 1)/2 spare disks is more than enough to achieve a 99.999 percent probability of not losing data over four years. We observe that the same objectives cannot be reached with RAID level 6 organizations and would require RAID stripes that could tolerate triple disk failures.
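For anyone who wants to play with the disk counts implied by the summary's formulas, here is a minimal sketch (Python, mine, not from the paper):

# Disk counts per the summary: a two-dimensional array with N parity disks
# has N*(N-1)/2 data disks, and the proposed spare pool is N*(N+1)/2 disks.
def array_layout(n_parity: int) -> dict:
    data = n_parity * (n_parity - 1) // 2
    spares = n_parity * (n_parity + 1) // 2
    return {
        "parity": n_parity,
        "data": data,
        "spares": spares,
        "total": n_parity + data + spares,
    }

if __name__ == "__main__":
    for n in (3, 5, 8, 10):
        print(f"N={n}: {array_layout(n)}")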
Power Costs (Score:2)
Re: (Score:3)
Re: (Score:2)
IIRC XFS/SGI systems had this built in: there was just enough juice to flush buffers to disk while everything was spinning down.
Re:Power Costs (Score:5, Insightful)
A lot of high-end equipment does have fairly large capacitors to allow enough power-off time to do a clean shutdown.
I remember back in the 1990s, some PC-centric folks looking inside a Sun workstation were surprised at all the large capacitors on the motherboard. In short, they give the system enough time to finish its final calculation before the power goes out.
Re: (Score:3)
Sometimes the data is worth more than the power costs.
Re: (Score:3, Insightful)
The question posed is whether the human intervention (labor charge) saved is worth more than the power costs.
Re: (Score:3)
Sometimes the data is worth more than the power costs.
But is the extra power cost more than the alternative extra maintenance cost?
A 3.5" HDD consumes about 8 W of power. TFA assumes a 4-year lifetime: (4 * 365 * 24) = 35k hours, and (35k * 8 W / 1000) = 280 kWh. A typical retail price for electricity is 10 cents/kWh, so over its lifetime a typical HDD will use about $28 of power. Big data centers likely pay less for power, so let's say $20.
Now, what does it cost to swap it? Let's say the chance of failure is 20%, it takes ten minutes, and you pay the admin $3
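A back-of-the-envelope version of that comparison (the wattage, lifetime, and electricity prices are the commenter's; the per-swap labor cost is truncated above, so the value below is a pure placeholder):

# Lifetime power cost vs. expected replacement labor for one drive.
HOURS = 4 * 365 * 24            # ~35,040 hours over a 4-year lifetime
WATTS = 8                       # typical 3.5" HDD draw
PRICE_PER_KWH = 0.10            # retail; big data centers pay less

power_cost = HOURS * WATTS / 1000 * PRICE_PER_KWH   # ~$28

FAILURE_CHANCE = 0.20           # commenter's assumed chance the drive fails
SWAP_LABOR_COST = 30.0          # placeholder only: the comment's figure is cut off

expected_swap_cost = FAILURE_CHANCE * SWAP_LABOR_COST

print(f"Lifetime power cost:        ${power_cost:.2f}")
print(f"Expected replacement labor: ${expected_swap_cost:.2f}")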
Re:Power Costs (Score:5, Insightful)
Sloppy calculation tip: 24*365 ≈ 10,000 (it's really 8,760, but close enough).
If you're sloppy enough to accept that premise, then at 10 cents/kWh, a watt costs a dollar per year. It turns your $28 into $32, but hey, close enough. When I'm shopping, I can add up lifetime energy costs really fast without actually being smart. Nobody ever catches on!
Re: (Score:2)
Re: (Score:3)
--> Not a raid expert...
Re: (Score:3)
Re: (Score:3)
For an organization like the one I work for, with server room space to spare, it wouldn't be too bad. We could probably triple our rackspace dedicated to disk and still have room to spare, and we have the HVAC to match. That's kind of what happens when equipment gets more condensed and virtualization enters the fray. Can't virtualize a storage array, obviously, but we can replace the space that application servers took with storage as that space is freed up.
Re: (Score:2)
Re: (Score:2)
(you can't virtualize the actual disks)
Re:Power Costs (Score:5, Funny)
This is how we're going to bring our keepers to their knees, and eventually break out of the Matrix. We spend imaginary money on imaginary storage, put all sorts of high-entropy stuff on it, and run calculations to verify that it's really working, but they have to spend actual real resources to emulate it.
Re: (Score:2)
Well, since they're not supposed to need to be hot-swap, you can get 12+ drives into a 1RU chassis with redundant power and a fairly beefy server. That's 3x the density of the traditional 4-up-front 1RU. Expanding to 2RU gives 12 hot-swap 3.5" drives or 24 2.5" drives, still 2x the density in 3.5" for non-hot-swap. Potentially even higher with 2.5" drives, though the highest I can find is 88 hot-swap bays in 4RU, or 22 per RU, coupled with a rather beefy server.
Re: (Score:2)
Or how about having the array swap in spares?
Every few weeks or so, one of the spares could start to act as a mirror of an active drive, and once that drive is mirrored you swap the active drive to the spare role and the spare to the active role.
Re:Power Costs (Score:4, Insightful)
"More work is still needed to define policies that would allow array users and manufacturers to detect unusually disk failure rates and take the appropriate actions before any data loss takes place." - Last line in the conclusion.
This implies that not all the spare drives are active and ready to go all the time, and that some or most would be kept powered down as cold spares. Of course this same guy is likely to write another paper where he examines the cost to run the array and how many drives could be left cold while still achieving the five-nines reliability. Heck, if the software managing the drives is smart, it would rotate active and spare drives in and out, working them in quickly to get them all past the 'first 18 months' high-failure period to the sweet spot, then swapping them in and out over the lifespan of the array so that the array stays at its highest reliability for longer.
Hrmm, maybe I should look at building such an algorithm; a quick Google search doesn't turn up any such systems.
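Since nothing turns up in a search, here is a rough sketch of one possible rotation policy (entirely hypothetical, not from the paper): every few weeks, mirror the most-worn active drive onto the least-worn spare and swap their roles, so all drives get worked past the infant-mortality period together.

from dataclasses import dataclass

@dataclass
class Drive:
    ident: str
    hours: float = 0.0   # accumulated power-on hours

def rotate(active: list, spares: list) -> None:
    """Swap the most-worn active drive with the least-worn spare."""
    if not spares:
        return
    worn = max(active, key=lambda d: d.hours)
    fresh = min(spares, key=lambda d: d.hours)
    # In a real array you would mirror `worn` onto `fresh` (and wait for the
    # resilver to finish) before performing the role swap.
    active[active.index(worn)] = fresh
    spares[spares.index(fresh)] = worn

def tick(active: list, hours: float) -> None:
    """Advance time: only active drives accumulate wear."""
    for d in active:
        d.hours += hours

# Toy run: rotate every ~3 weeks (500 h) so wear levels out across the pool.
active = [Drive(f"a{i}") for i in range(4)]
spares = [Drive(f"s{i}") for i in range(4)]
for _ in range(20):
    tick(active, 500)
    rotate(active, spares)
print(sorted((d.ident, round(d.hours)) for d in active + spares))

A real implementation would obviously gate the swap on a completed resilver and on SMART health, but the bookkeeping is about this simple.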
Re: (Score:2)
How do you figure? I mean sure, presumably the spares would be inactive until a replacement was needed, to save both power and wear and tear, but how do you figure that that is an implication of needing to detect anomalous failure rates to avoid data loss? No matter what strategy you're using, if you've got N-nines projected reliability over Y years assuming normal failure rates, then if you're suffering from anomalously high failure rates you're going to need to replace some drives early to maintain the
Re: (Score:2)
It seems that one assumption in the study is a predictable or consistent failure rate and timing. This would make sense if the drives were all the same make/model/manufacturing date, but if not, well, then the model changes and they would need more intelligence to deal with unpredictable failure rates, spinning up cold spares at different rates, and predicting failures.
Which all makes a world of sense to me. When I hovered over Raid 5 arrays with cold spares, especially in NetWare servers where '
Re: (Score:2)
Cooling costs come to mind as well. SSDs are one thing, as they can be powered off and not used. However, HDDs have to be either spinning (which creates a lot of heat, especially at 10k+ RPMs that enterprise disks spin at), or spun up/down, and spinning enterprise disks up and down isn't good for them, and might even cause array faults unless the array firmware is designed to deal with it.
There is also expense. If I have five hard disks worth of data, I need (5*4)/2, or ten HDDs by the OP's metrics. How
Re: (Score:2)
I have yet to meet a small business that would be happy to pay for what is essentially raid10+1 (N(N+1)/2).
Re: (Score:3)
You may want to try ZFS (raidz3 mode for 3 parity disks). It has several advantages over mdadm, in particular it eliminates the "write hole" problem. I went from a mdadm/ext4 array to RAID-Z and I don't regret it.
And note that RAID isn't a backup solution, even with 100% fault tolerance, there are plenty of things RAID won't protect you from such as fire, power surges, theft, bugs, virus, user error, etc... For this you need a reasonable backup plan. And IMHO, that third parity disk would be much more usefu
I would love to, but that server is a soup Nazi (Score:5, Informative)
Looks like fetch works though. If anybody else has trouble getting the file, try my local mirror [ceyah.org].
Re: (Score:2)
it says "can't use the plugin, it causes problems on our server".
The name of the browser and plugin would be helpful...
(The PDF happens to work perfectly on Linux with the built-in viewers of FF35 and Chromium 39.)
Re: (Score:2)
Re: (Score:2)
Maybe they have problems with their disk array?
But seriously, I had no problems downloading the document from the original site.
Re: (Score:2)
No problem viewing the PDF file in Safari on OS X.
4 years? (Score:3)
That's not long term. That's the normal life of a storage array. Long term is like 8-10 years.
Re: (Score:2)
Re: (Score:2)
All in all this smells like a mathematicians solution to the problem, largely unbounded by real life concerns.
I had the same thought. There are a few realities of storage that are missed here: storage use always increases, disks aren't the only things that fail, rack space isn't free, you usually have staff available already....
This is an interesting idea if your storage is in a place where it can't be reached at all for some reason, but I think NASA and ESA have already done a good bit of research on that.
4 years??? (Score:2)
Really, 4 year life span and they are replaced?
God I need to work for a company like that!
I am so tired of dealing with these RS/6000 systems that were made back in 1994, and these Intel systems made back in 2002.
Re:4 years??? (Score:5, Funny)
Yeah, we get it. You like to deal with cutting-edge stuff. Now get off my lawn.
Sent from my Commodore 64.
Re: (Score:3)
Do you have any idea how many butterflies it took to reply to your message?
Now get off my lawn!
Re: 4 years??? (Score:2)
Re: (Score:2)
4 years was my recommendation for disk replacements from about 198 onwards. Some arrays had drives >8 years old, but if failure was not tolerated, 4 years was enough.
Mind you, if the customer specified IDE drives, I warned them that failure was inevitable. SCSI 10K drives, I would still swap but that was for five-nines.
And those stupid IDE RAID cards, well, that's too cheap. We are no longer talking reliable. Let someone else have that business.
TLDR; 2D arrays with a ton of spares are reliable (Score:4, Insightful)
The bottom line is that having a lot of spare disks for a 2D array makes it reliable over time. These configurations of 2D arrays are quite reliable over time because they have many spares available to automatically replace failed disks:
Data  Parity  Spares
 12      3      13
 12      3      14
 24      6      20
 36      9      26
To understand the above table, we'll use the first row as an example. An array made up of 1TB disks with 12TB of data space would have 3TB of parity and 13 spare 1TB drives, for a total of 28 drives to get 12 drives' worth of net storage.
What they didn't mention is that the same reliability can be achieved with only three spares by replacing spares at your convenience. Replacing a drive can be somewhat costly if it has to be done quickly, but if you can schedule the replacement for "some time in the next two months", it probably won't be.
Re: (Score:2)
Yes, but then you're dancing around the possibility of additional disk failures while waiting on that replacement.
If you pop a few more drives (which, if you got your disks in batches, is QUITE possible), you're in deep shit.
Re: (Score:2)
We do just that: when it gets down to 1 hot spare it's an emergency service and we replace all the failed units. This does not happen very often, and when it does, it tends to be just that: a bad batch.
Re: (Score:2)
Even if the mean time between failures for consumer drives was 6 months, the odds of 'popping' two more spares in the month after the first failure would be less than 3%. If the MTBF is 1 year the probability drops to 0.7%.
Except if you got a bad batch where some kind of material or production defect will cause many disks to fail near simultaneously. The overall MTBF might be true for all the disks they produce, but unless you make a real effort to source them from different batches over time you can't assume that's going to be your MTBF.
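For what it's worth, the grandparent's percentages can be reproduced with a simple exponential-failure model for two specific drives (my assumption; the original comment doesn't say how it got the numbers):

import math

def p_both_fail(mtbf_months: float, window_months: float = 1.0) -> float:
    """Probability that two specific drives both fail within the window,
    assuming independent, exponentially distributed failure times."""
    p_one = 1 - math.exp(-window_months / mtbf_months)
    return p_one ** 2

print(f"MTBF 6 months:  {p_both_fail(6):.1%}")   # ~2.4%  (under the quoted 3%)
print(f"MTBF 12 months: {p_both_fail(12):.1%}")  # ~0.6%  (roughly the quoted 0.7%)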
Re: (Score:2)
The goal is to realize that for manufacturers, service calls are expensive. Perhaps a company has a 4 hour response time - if a disk fails, the company is still running with redundancy, but
expensive BECAUSE four hour service (Score:2)
>. service calls are expensive. Perhaps a company has a 4 hour response time -
Service calls are expensive BECAUSE it's an emergency. If you have four spares, plus the two parity drives, you're still six drives away from a problem. With a few spares, you can easily replace one by sending it UPS ground, rather than having a tech run out there immediately.
Not enough (Score:2)
I worry a lot less about losing data than I do corrupting data and not knowing it.
But hey, congratulations, you've learned about RAID mirrors with lots of copies and learned how to apply basic, well understood engineering principles to it.
Guess what, some of us were aware of this years ago, and others have been aware of it longer than you've probably been alive. It's been known my entire life, that's for sure, so that's at least 40 years.
Re: (Score:2)
And let's add: to 'avoid maintenance' you just add a bunch of extra spares from the start. That's just stupid; you overbuild ridiculously in order to not have to spend 10 minutes swapping a drive out. Totally cost effective ... if you're sending a probe out into space. In which case, you're going to want better than five 9s, so try again.
Re: (Score:2)
This is why I deal with equipment where I can A) cross-ship (bonus points for letting me ship the drive faceplate instead of the whole thing) and B) swap the drive out myself.
Re: (Score:2)
Sure, so you probably want to keep several spares handy, maybe even have a few hot spares that can be automatically deployed the moment there's a failure, and replace the failed drives at your leisure. Having almost as many hot spares as you have active disks is probably overkill for most scenarios. In fact they themselves calculated that with their parity technique it will give you 5-nines confidence in having 4 years of maintenance-free reliability. Probably a lot more cost effective to build the syste
Re: (Score:3)
http://www.dailywritingtips.co... [dailywritingtips.com]
The thing about this... (Score:3)
"Yeah, well just put more disks in it..."
Nice idea. Only: TCO is not just based on initial spending and maintenance. There is also rackspace to consider and did I hear anyone talk about green IT?
If my day to day considerations were that one dimensional, my employer could save a ton of money on my salary.
Nothing novel is being proposed here (Score:3, Informative)
Well, duh. RAID6 is not a serious level of redundancy. ZFS RAIDZ-3 (triple parity) FTW. And you can build in as many hot spares as you want. Dinosaurs who have still not adopted ZFS need to get a clue.
Simple. (Score:2)
TL;DR version:
Replacing disks sometimes sucks. Sticking in additional spares means you don't have to replace them. They calculated an efficient RAID solution that means you don't need as many spares.
Disks from same factory run often go bad together (Score:3)
Re: (Score:2)
If you read the article, that is exactly what they suggest. If failure rates are too far above predicted, they say to replace with new array. At least they are upfront about it.
Not my anecdotal experience (Score:5, Interesting)
Just a few things I thought of while looking at this study:
The authors are using Backblaze data. Backblaze uses consumer-grade SATA disks, which aren't going to be as reliable as the enterprise SATA/SAS disks we would use.
I'm willing to bet that none of the authors of this paper have ever had to pay for colocated rack space, power, and cooling either, they've just doubled the RU that I need for storage. At $1500.00 - $2000.00 per rack that adds up.
Doubling the rack space for storage I need so I can avoid a few service calls by my storage vendor over 5 years simply isn't efficient.
We've installed close to 500TB of archival storage using commodity hardware and 2-3TB Nearline SAS. We have maybe 3 hands-and-eyes calls per year for disk replacement.
Anyway - just rambling.
Re:Not my anecdotal experience (Score:5, Insightful)
In your fantasy there is a difference besides a hideously higher price and a somewhat longer warranty period. In real life, commodity SATA is much more cost effective. Everybody who is serious recognizes this (Google, Backblaze, Amazon).
Re: (Score:2)
Well, you can probably double your density by moving to non-hot-swap 3.5" drives, making for double the drives in the same space. Now if I were going to do that, I would mirror the RAID sets anyway, since the power consumption of nearline drives is pretty minimal.
Never seen much use for enterprise SATA; I do use a lot of SAS with dual ports to separate RAID controllers.
So they figured out raid z 3 with enough spares (Score:2)
To last all of 4 years, and needing nearly as many hot spares as data drives. I guess the academics think they know something yet again. They took some dubious failure rates (Backblaze uses whatever is the cheapest consumer drive at the time and eventually stops buying the really bad ones (Seagate 1.5 and 3TB, looking at you)) and a rather optimistic transfer rate (200 MB/s) that assumes all sequential reads. They failed to account for backplane, controller, and power, assuming that those never fail. By their nu
Ignores how disks often fail (Score:2)
My understanding is that disks often fail when a head touches the surface, or a piece of dirt gets between the head and the surface. Once that happens, more dirt is produced, increasing the probability of more head crashes, leading to a failure cascade. As a consequence, once one of my drives starts to show unrecoverable errors, corresponding to damaged surface areas, I replace it while it can still be read.
The spare platter strategy does nothing to reduce this failure mode. In fact, all modern disks alre
Re: (Score:2)
The spare platter strategy does nothing to reduce this failure mode. In fact, all modern disks already have spare space for bad block relocation.
Including pretty much everything with an onboard controller. "Modern" is understating the case.
If I were expecting an array to last a long time without being touched, I would expect it to have a whole bunch of spares that never even got heated up until they were needed, just sat there in the box enjoying living in a relatively temperature-constant environment. Sure, there's fluctuations, but they'll all be within the operating temperature range of the drives.
Re: (Score:2)
This from an NEC white paper in 2008:
"A recent academic study [1] of 1.5 million HDDs in the NetApp database over a 32 month period found that 8.5% of SATA disks develop silent corruption. Some disk arrays run a background process to verify that the data and RAID parity match, a process which can catch these kinds of errors. However, the study also found that 13% of the errors are missed by the background verification process. When you put those statistics together, you find on average that 1 in 90 SATA dri
Trust (Score:5, Interesting)
Re: (Score:2)
Now I'm curious what happend to Case1, Case2, and Copy of copy of case3 [8].doc.
Re: (Score:2)
Service call? (Score:3)
A service call? Seriously? A sysadmin (or operator if it's a big place) can't see the yellow light on a disk and replace the pack with in-house spares? Have we become so inept as an IT community that we can no longer do a walk-through of our machine room and service simple things like this? Maybe we do deserve to be outsourced.
And if one must have a service contract such that only the vendor can touch the hardware, (why would you do that? never mind) wouldn't you negotiate a provision that includes drive replacement (as drives are consumables that must eventually be replaced) without being charged for an "office visit"?
Re: (Score:3)
Yes we have, if the array is installed in your backup corporate PKI server, in a shielded and locked cage with video, electrostatic, and laser monitoring and alarms. And the keys to the cage are in another state. And it requires EVP approval to deliver the keys to the authorized tech for a flight to the DR site to change a failed drive.
A real world example. You would recognize the name of this corporation in the first three letters. They take their corporate security very seriously, so much so that bum
Oh, hai, from 2009 (Score:2)
Alright, fine, ashift=12 is newer than 2009, for 2TB+ drives. And always use /dev/disk/by-id for your sanity.
Why not a gradually-degrading array instead? (Score:2)
Instead of keeping the spares inside as just that — spares — can it not start using all of them (in a sufficiently redundant configuration) and gradually lose capacity as physical disks fail?
Yes, it would require coordination with the driver and filesystem, but there is nothing insurmountable in that...
Flawed logic (Score:2)
"We observe that the same objectives cannot be reached with RAID level 6 organizations and would require RAID stripes that could tolerate triple disk failures."
That's true only if you assume that three disk failures occur faster than a single disk can be rebuilt.
If you assume no more than two disk failures *during the length of time it takes to rebuild the array* then RAID 5 or RAID 6 works fine as long as you assign enough hot spares.
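To put some rough numbers on the parent's condition, here is a sketch (assuming independent exponential failures; the drive count, MTBF, and rebuild window are illustrative, not from TFA) of the chance of additional failures during a rebuild:

import math

def p_at_least(k: int, n_drives: int, mtbf_hours: float, rebuild_hours: float) -> float:
    """P(at least k of n_drives fail during the rebuild window), assuming
    independent exponential failure times."""
    p = 1 - math.exp(-rebuild_hours / mtbf_hours)
    return sum(math.comb(n_drives, i) * p**i * (1 - p)**(n_drives - i)
               for i in range(k, n_drives + 1))

# Illustrative numbers only: a 20-drive group, 100,000 h MTBF, 12 h rebuild.
print(f"P(>=1 more failure during rebuild):  {p_at_least(1, 20, 100_000, 12):.4%}")
print(f"P(>=2 more failures during rebuild): {p_at_least(2, 20, 100_000, 12):.6%}")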
Math (Score:2)
The number of drives seems to be large. The spare count grows quadratically (N(N+1)/2), so as the cluster gets bigger the number of spare disks gets much bigger.
Drives  Spares  Total
   5      15      20
  10      55      65
  30     465     495
That's a lot of disks. There is a point where space and power costs overcome the human cost.
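The table follows directly from the paper's N(N+1)/2 spare-count figure; a quick snippet to reproduce it:

# Reproduces the table above: spares = N*(N+1)/2 per the paper's figure,
# with "Total" here meaning drives + spares (parity/data split not shown).
for n in (5, 10, 30):
    spares = n * (n + 1) // 2
    print(f"{n:>3} drives  {spares:>4} spares  {n + spares:>4} total")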
Re:Naive to say the least. (Score:4, Funny)
100,000 hours = 273 years. Does anyone believe that?
Everyone except you apparently.
Re: (Score:3)
er, last time I checked, 100,000 hours is 11 years.
273 years is 2,400,000 hours. Did you lose the use of your calculator?
Re: (Score:3)
Re: (Score:2)
Re: (Score:2)
A mean time between failure of 11.4 years means you can reasonably expect half of all drives to fail before then*. Assuming a constant failure rate (which we really shouldn't do), that means you can expect ~4.4% of drives to fail every year. Which leads to the benefit of lowering the warranty period: Every year of warranty increases the expected total production/replacement cost of the drive by 4.4% - reduce the warranty period and you boost profit margins and/or can reduce the price to undercut your com
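For what it's worth, the ~4.4% figure seems to come from spreading the "half fail by the MTBF" assumption evenly over the 11.4 years (my reading of the comment, not an endorsement of the model):

MTBF_YEARS = 100_000 / (24 * 365)    # ~11.4 years
half_by_mtbf = 0.5                   # naive reading: half the drives fail by the MTBF
annual_failure_rate = half_by_mtbf / MTBF_YEARS
print(f"~{annual_failure_rate:.1%} of drives per year")   # ~4.4%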
Re: (Score:3)
4166.6666~ days / 365 = 11.4 years
Re: (Score:2)
But I have yet to see a high-density disk last more than 8,000 hours, with the median being maybe half that.
Good for you. I have a number of 2 and 3 TB drives that are more than 5 years old. Anecdotes != evidence.
Re: (Score:2)
Sorry, the 3 TB drives are around 3 years old. The 2 TB have passed their 5 year warranties with no issues.
Re: (Score:2)
No, they are constantly being read and written to from a NAS.
Re: (Score:2)
I guess it also depends on your definition of high density as I do not see many drives > 1TB in consumer/SMB equipment.
check your math (Score:2)
more like 11.4 years
Re: (Score:2)
100,000 hours = 273 years. Does anyone believe that?
Oddly enough, it doesn't matter whether you believe it or not. What matters is whether that's the same predictive model used for estimating lifetimes of RAID arrays, or a single drive for that matter. Since you want to compare the proposed new config directly with current paradigms, you have to use the same set of underlying assumptions.
Re: (Score:2)
Actually it does matter. If you believe 100,000 hours = 273 years you lack basic arithmetic skills.
Re: (Score:2)
Re: (Score:2)
Actually it does matter. If you believe 100,000 hours = 273 years you lack basic arithmetic skills.
+1 sardonic
But doesn't address my serious point about application of statistical methods.
Re: (Score:2, Funny)
Re: (Score:2)
But thinking that 11.4 years is going to save their behind is unrealistic.
Re: (Score:2)
Umm, 273 years is nearly 2.4 million hours. So, no, no one with basic arithmetic skills believes that 100,000 hours is 273 years.
Re: (Score:2)
You don't understand the meaning of MTBF.
Re: (Score:2)
They also don't realize that 100,000 hours / 365 days is not the way you get years from hours.
Re: (Score:2)
Re: (Score:2)
100,000 hours = 273 years. Does anyone believe that?
I don't, because 100,000 hours is 11.4 years.
273 (much closer to 274) years is 100,000 days.
Re: (Score:2)
PS You've already apologised more than enough for this. Sorry to compound it!
Re: (Score:2)
Re:Naive to say the least. (Score:4, Funny)
That is one of the greatest subtle Wrath of Khan references I've seen yet.
Spock: "Admiral, if we go by the book, like Lieutenant Saavik, hours would seem like days."
Masterful!
Re: (Score:2)
They did 100000/365 which equals about 274. They seem to have confused hours with days.
Re: (Score:2)
Re: (Score:2)
They seem to have confused hours with days.
Captain! They've broken our secret Starfleet code!
Re: (Score:3)
Basically, as the array size grows you are talking about N-squared spares. I think most businesses are going to be more than happy with just hot-swapping out failed disks as needed.
Re: (Score:2)
I would hope I'm misunderstanding it, because that seems like a lot of spares to purchase ahead of time.
Re: (Score:2)
It's a long time for 99.999% reliability.
Re: (Score:2)
If a single drive has an MTBF of 100,000 hours, that means you can naively expect 50% of drives to fail within 100,000 hours. That gives you a five-nines reliability period for one drive of only 1.44 hours. Does that put the degree of reliability being discussed in proper perspective for you?
The math:
0.99999 = 0.5^N
N = log(0.99999) / log(0.5) ≈ 0.0000144
So the 5-nines reliability period is about 0.00144% of the MTBF, or
100,000 h * 0.0000144 ≈ 1.44 h
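A quick check of that arithmetic, keeping the commenter's "treat the MTBF as a half-life" model:

import math

MTBF_HOURS = 100_000
TARGET = 0.99999   # five nines

# Half-life model: survival(t) = 0.5 ** (t / MTBF). Solve survival(t) = TARGET for t.
fraction = math.log(TARGET) / math.log(0.5)               # ~1.44e-5 of the MTBF
print(f"fraction of MTBF:  {fraction:.6f}")
print(f"five-nines window: {MTBF_HOURS * fraction:.2f} hours")   # ~1.44 h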