Some HPE SSDs Fail After 3 Years and 9 Months, Company Warns (hpe.com)
New submitter AllHail writes: HPE SAS solid state drives are affected by a firmware problem which causes these drives to stop working after 32768 power-on hours (3 years and 9 months). If these drives are not flashed with updated firmware before the failure, the drives and the data on them become unrecoverable at that time. If several of these drives are installed and operated together in a RAID, they are going to fail almost simultaneously. Patch or assume the risk of failure, says Hewlett Packard Enterprise.
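As a quick check of the figures in the summary, here is a small sketch that converts 32768 power-on hours into years and months. The helper name `hours_to_years_months` and the 365.25-day year are assumptions made for this illustration; they are not taken from HPE's bulletin.

```python
HOURS_LIMIT = 32768  # power-on hours quoted in the summary; equal to 2**15

def hours_to_years_months(hours, days_per_year=365.25):
    """Convert continuous power-on hours to (whole years, rounded months)."""
    days = hours / 24
    years = int(days // days_per_year)
    # Leftover days expressed in average-length months
    months = (days - years * days_per_year) / (days_per_year / 12)
    return years, round(months)

print(HOURS_LIMIT == 2 ** 15)              # True
print(hours_to_years_months(HOURS_LIMIT))  # (3, 9)
```

The result only holds for drives that are powered on continuously; a drive that spends time powered off would reach the limit later on the calendar.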
That's an oopsie (Score:2)
That 'Enterprise' label on the name is sitting rather loosely today.
Re: (Score:2)
How exactly do you mitigate this, though? You tend to buy storage servers already loaded with drives, and those drives will have about the same number of power-on hours.
No, the right solution is to not have a completely idiotic bug that irrecoverably trashes the drive after a certain number of power on hours. HPE should be liable for this and I hope they get sued into the dirt.
Re: (Score:1)
How exactly do you mitigate this, though?
Hot spares.
Granted you still have the POTENTIAL risk of too many drives failing while rebuilding, but for servers where it's really, really important that data not be lost - you want it to be set to rebuild immediately after a failure.
My primary DB server has 8 drives - 6 of them are in a RAID-6 config with 2 setup as hot-spares. I'd need to experience 3 drive failures before one of the spares finishes rebuilding before I lost any data (and in that case I've got backups to LTO-7 tapes as well - it's just t
Re:That's an oopsie (Score:4, Insightful)
Yes, that's how you address issues with regular failures. As your storage array ages, you expect more failures, but not all at the same time.
This is different, because it's a bug. The drives don't fail with greater frequency as they get older, they all drop dead at 32768 hours, presumably because some dunce coded some critical value as a 16-bit int. You're *definitely* going to get lots of drives failing simultaneously in that situation.
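The 16-bit guess in the comment above can be illustrated with a short sketch. It assumes the counter is stored in a signed 16-bit field, which is the commenter's speculation and not a confirmed detail of the firmware; `tick_int16` is a hypothetical helper.

```python
import ctypes

def tick_int16(hours):
    """Add one hour and store the result in a signed 16-bit field (assumed width)."""
    return ctypes.c_int16(hours + 1).value

counter = 32766
history = []
for _ in range(3):
    counter = tick_int16(counter)
    history.append(counter)

print(history)  # [32767, -32768, -32767]
```

A signed 16-bit value tops out at 32767, so the hour that should read 32768 wraps to a negative number instead. What the real firmware does with such a value is not shown here.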
Re: (Score:3)
No amount of hot spares will help when your entire array, including the hot spares themselves, dies at the same time.
Re:That's an oopsie (Score:4, Insightful)
The problem is that hot spares are just that, hot -- they're powered on and just not participating in storage, so they would be affected by a bug influencing power on time.
I know most IT people, myself included, are sort of bought into the idea that power cycles kill drives, not powered-on hours, but I still wonder if it would make more sense to have "hot spares" sit powered off most of the time. They could be spun up at some interval, with some basic I/O testing to validate they are functional, and then spun back down.
I'm kind of doubting that a monthly spin up/spin down cycle would materially alter the drive's reliability -- in a five year life cycle, this is only 60 power cycles, and mostly moot for solid state devices. It wouldn't materially change an array rebuild time -- most of the time those are measured in hours or longer and the spin up time is what, 30 seconds?
For spinning rust, this would contribute to power savings, cooling savings, and might even improve the lifetime of spare drives which could be sitting in a powered on state for years without any use.
Re: (Score:2)
What does "spin up" and "spin down" even mean in the SSD context? Surely they enter an idle state when not in use which nonetheless keeps them instantly accessible, so what sysadmin would want to physically deny them power? This doesn't even change that, it would just make me say "HPE is on the shit list alongside Seagate -- no more drives from them" and replace before the power on limit.
Re: (Score:2)
This is not like normal failures, where a particular drive fails due to mechanical problems and, with a good RAID environment, you have days or weeks to fix the problem. This is all your drives failing at once. In terms of storage, this is the worst possible scenario. Heck, if you had an offsite backup location powered on at the same time, it would be down too.
Re: (Score:2)
That's not as much of a rule as you think. I've had single drives in batches of 8-10 fail between 5 and 7 years into use, and once they were replaced, the others continued to work for years more (typically the server is retired from production use at that point, but we will still run non-critical things like testbeds on them).
That said, no, the hot spares in that system were purchased a year after the server went online.
Re: (Score:2)
Maybe it's an even bigger oopsie to have drives with the same power on hours, so they'd all fail simultaneously. Is burning in drives too tedious to be worth the increase in redundancy it offers? (Even in mechanical drive RAIDs.)
OK, let's pretend you didn't know about this bug. That no one did. And you staged your new RAID deployments as you've suggested. I'd certainly find it discomforting to find a hard drive failing each week, every week, until the whole RAID array was eventually replaced.
I'd probably be questioning the entire damn chassis by the 4th or 5th drive failure regardless of array size.
Re: (Score:2)
Although it's a corporate truism that executives rarely look much beyond the current quarter and its financial results. So SSDs that last for at least 12 quarters must seem almost immortal.
Re: (Score:2)
In financial time 12 quarters is equal to 4 executives and three golden parachutes. That means it's the next next next guy that will take the hit. And no one liked that guy anyway.
Re: (Score:2)
That's IBGYBG [urbandictionary.com] thinking at work. It's fine to eat the golden goose, as long as the consequences land on the next guy.
Re:That's an oopsie (Score:5, Interesting)
It always has been an unjustified label. I have seen rack after rack of drive failures for all sorts of explainable reasons.
We have drives every bit as capable in the consumer space. The only problem is that scalability is limited arbitrarily because "reasons". The IT world is full of this mentality. Server OS vs Workstation OS... it has been and always will be a farce concocted to create the idea that something is more valuable than the other.
Ebay did it right the first time. They designed their software to survive hardware failure because they would dumpster dive for systems to use and they failed often.
the year 2019 and we still operate like hardware can never be allowed to fail. Failure should be built in and software should be created with the idea that underlying hardware will fail and to recover from it. The way things are now... the "Enterprise" grade crowd now gets to charge a premium for essentially nothing!
Re: (Score:2)
the year 2019 and we still operate like hardware can never be allowed to fail. Failure should be built in and software should be created with the idea that underlying hardware will fail and to recover from it.
Yup. The fact is that "failure" is one of the few things you can count on.
Re: (Score:2)
That's a completely different risk model from having the entire fleet of drives keel over within a day of each other, though.
Re:That's an oopsie (Score:4, Funny)
It's "enterprise" as in "criminal enterprise".
Re: (Score:3)
Anyone who works professionally knows that the "Enterprise" label is a warning of overpriced and poorly designed products, created by a company large enough to hire so many lawyers for its terms of service that you have a snowball's chance in hell of winning any legal liability claim against it.
Re: (Score:2)
Wow, he's as big as a whole generation now, congrats Chris!
Nobody has a larger collection of zombie rat trolls as pets than creimer. Does he even feed you guys, or do you sneak out to dumpster dive while he's sleeping?
There are just never enough bits in your integers (Score:3)
Re: (Score:2)
And that's why one should use languages that don't have those issues.
Re: (Score:2)
While true, there are languages that adapt the size of the integer to the value contained. But that comes at the cost of overhead. So they're not what you want for an embedded system.
That said, neither C nor C++ tell you the size of the integer. Some external and hardware specific libraries do, but the languages themselves don't. (I don't know the current status, but the guarantee used to only be that a long int would be at least as long as an int, and a short int wouldn't be larger than an int, and it
Re: (Score:2)
You may also know this type because compilers got stricter with printf-types and having to use %llu when printing out a uint64_t.
I know this one again and again and again. It is so much easier to learn a new trick than to change an old one.
Re: (Score:2)
but sizeof(char) ≤ sizeof(short) ≤ sizeof(int) ≤ sizeof(long) ≤ sizeof(long long) is what I said.
However if the specific inttypes are now a part of the language, then I'm referring to an obsolete standard. I thought they were a recommended extension or something, and that compilers for embedded systems didn't need to implement them.
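For reference, the sizes under discussion can be inspected from Python through `ctypes`, which reports what the platform's C compiler uses. This is only a sketch of one machine's answer. As far as I know, the C standard fixes minimum ranges (for example at least 16 bits for `int`) rather than exact sizes, and the fixed-width names such as `int16_t` come from `<stdint.h>`, added in C99.

```python
import ctypes

# Sizes in bytes of the basic C integer types on the machine running this script.
sizes = {
    "char": ctypes.sizeof(ctypes.c_char),
    "short": ctypes.sizeof(ctypes.c_short),
    "int": ctypes.sizeof(ctypes.c_int),
    "long": ctypes.sizeof(ctypes.c_long),
    "long long": ctypes.sizeof(ctypes.c_longlong),
}
print(sizes)

# Check that each type is at least as large as the one before it.
ordered = list(sizes.values())
print(all(a <= b for a, b in zip(ordered, ordered[1:])))

# Fixed-width types have the same size everywhere they exist.
print(ctypes.sizeof(ctypes.c_int16), ctypes.sizeof(ctypes.c_uint64))  # 2 8
```

The first line of output varies by platform, which is the point: code that needs exactly 16 or 64 bits is safer asking for that width by name.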
Re:There are just never enough bits in your intege (Score:5, Insightful)
An additional question might be "why does the functioning of the drive critically depend on the power-on counter anyway?"
Re:There are just never enough bits in your intege (Score:4, Insightful)
Re: (Score:2)
Yeah, that's my suspicion too. It would be hilarious if it was intentional as you suggest.
Re: (Score:2)
To be fair, most of them have spare sectors and various cell-failure detection mechanisms to mark bad cells and swap in others. The flash should wear out long before the controller electronics do though, so most SSDs should slowly lose capacity rather than fail catastrophically.
Re: (Score:2)
Never seen one write so much that the NAND wore out.
Data forensics consultants who desolder the NAND chips and recover the data externally don't usually find any problems with the NAND, either.
It is always the controllers that die, even though they're low power CMOS-type ICs that get almost no wear and tear from operation, and in other types of electronic devices they last decades under normal conditions.
Re: (Score:2)
Where are the SSD metadata stored? Mapping tables, SMART counters, stuff like that. If the mapping table failed first, the drive would just die, right?
Disclaimer: I have no idea what I'm talking about.
Re: (Score:3)
There was a similar case that may explain why.
Power-on hours is a useful metric for predicting drive failure, which is why it's commonly included in SMART data. Especially for enterprise they may just replace the drive after a certain number of power-on hours because they know that failure rates go up at a certain point.
In the similar case when the number went negative it caused a sanity check in the firmware to fail. The firmware would assume that the data was corrupt and try to re-load it in an endless lo
Re: (Score:2)
In engineering, if the thing fails when X doesn't work, then it depends on X. Agreed, they may not have done it on purpose, but it also means they didn't do a thorough design review, which is supposed to be one of the things that differentiates "enterprise" from "consumer" and justifies the sticker shock.
Re: (Score:2)
So you want a language that doesn't have the possibility of bugs?
I call the language Ohmmm, it accepts inputs in this reality, and then sleeps until humanity evolves into pure energy beings, and then it will provide the output.
If you think you have a bug, you don't, you're just not waiting patiently enough.
Also, it doesn't need a compiler, you just write your pseudocode or flow chart anywhere you want and then you wait. Patiently.
Re: (Score:2)
Well, it will be fine for everything that has transitioned to 64 bit hardware. Well, almost everything. But if you coded something that needs to be long on 32 bit hardware as a short, you don't have grounds for complaint.
The thing is, lots of embedded systems don't see any need for 64 bit hardware, and it's not clear that they will detect any problem ahead of time. Imagine the havoc if one day all the Alexi's stop working. Or the IoT locks. etc. (I'm assuming that things like GPS satellites have alrea
Re: (Score:2)
That is awesome. I wish it was more common.
Re: (Score:2)
It amazes me how often basic stuff like overflowing counters is not tested in embedded code. It amazes me that people don't look at anything counting time and immediately calculate the maximum period before the counter fails to see if it is safe.
This smacks of incompetence. It's 15 bits so they are presumably using an int16. Might even be an int... Most of these controllers are ARM 32 bit based but you do get 8 and 16 bit cores doing power management stuff. Anyway, there are only two explanations for pickin
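The "calculate the maximum period" habit described above takes only a few lines. This is a sketch with hypothetical helper names (`max_hours`, `years`), assuming an hour-resolution counter and continuous operation.

```python
def max_hours(bits, signed):
    """Largest hour count a counter of the given width can hold."""
    return 2 ** (bits - 1) - 1 if signed else 2 ** bits - 1

def years(hours):
    """Convert hours of continuous operation to years (365.25-day year)."""
    return hours / 24 / 365.25

for bits, signed in [(16, True), (16, False), (32, True)]:
    h = max_hours(bits, signed)
    kind = "signed" if signed else "unsigned"
    print(f"{bits}-bit {kind}: {h} hours, about {years(h):.1f} years")
```

Under these assumptions a signed 16-bit hour counter runs out in under four years, the unsigned version roughly doubles that, and a 32-bit counter outlives any drive.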
There are just never enough bits in your SSD (Score:2)
So on a 1TB hard drive, there were just 15 bits to count uptime ?
That's hilarious.
Handy if you're likely to get sued... (Score:1)
Those SSDs could command a premium for some very special applications. Such as storing super vital top secret corporate (or government) information that might have to be embarrassingly revealed in legal proceedings down the road. Such as DNC strategy documents from 2016... (although they could always ask the Russians for copies).
"Whooops, bad luck, that data was unfortunately stored on some of our top-end HPE SSDs, and now it's just a fading memory".
As HPE itself will soon be, no doubt.
Re: (Score:1)
It's easy. They let their printer division do the firmware.
Re: (Score:2)
Hey, sorry, Democrats! I didn't mean to hurt your feelings.
How about quoting the useful part? (Score:4, Informative)
“HPE ProLiant, Synergy, Apollo, JBOD D3xxx, D6xxx, D8xxx, MSA, StoreVirtual 4335 and StoreVirtual 3200 are affected. 3PAR, Nimble, Simplivity, XP and Primera are not affected.”
There’s also a table of specific models with manufacture dates in TFA.
Re: (Score:2)
I guess people should stay tuned...
Actually, I would hope that any enterprise sysadmin running HPE gear would already have signed up to be on HP's support alerts mailing list.
Yikes (Score:3)
This is why I default to RAID10 with SSDs, and make sure to use 2 different drive types in each pair.
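A toy model of the reasoning above: if every drive of one model dies at the same moment, a mirror built from two of them is lost, while a mixed pair keeps one copy. The model names "A" and "B" and the function `surviving_pairs` are invented for this sketch, and it ignores everything else that matters in a real array (rebuild load, controller behavior, firmware interactions).

```python
def surviving_pairs(pairs, affected_models):
    """Count RAID1 pairs that keep at least one drive when every drive
    of an affected model fails simultaneously."""
    return sum(
        1 for a, b in pairs
        if a not in affected_models or b not in affected_models
    )

same_model = [("A", "A"), ("A", "A")]  # both mirrors built from model A only
mixed = [("A", "B"), ("A", "B")]       # each mirror pairs model A with model B

print(surviving_pairs(same_model, {"A"}))  # 0
print(surviving_pairs(mixed, {"A"}))       # 2
```

This only covers a common-mode failure tied to one model; it says nothing about the compatibility concerns raised in the reply.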
Re: (Score:2)
This is why I default to RAID10 with SSDs, and make sure to use 2 different drive types in each pair.
That's quite interesting. RAID10 is basically a grouping of RAID1 pairs. Due to bad experiences in the early days of RAID I have always made it a standard practice to avoid mixing drive types in RAID1. In fact, I've even been reluctant to mix different firmware revisions on the same drive models.
I have used mismatched drive models here and there out of necessity, but generally as a stopgap measure. I'm curious to know what other people make of this practice.
OR... (Score:1)
Re: (Score:2)
THAT won't achieve it by itself, but I inferred that the implication is that it will force a lot of people to install said "patches", which probably come as a binary blob with who knows what in it.
3PAR, Nimble, Simplivity, XP, Primera not affected (Score:1)
3PAR, Nimble, Simplivity, XP and Primera are not affected. Phew
Re: (Score:2)
Wait, was "Simplivity" seriously a product name?
OMG it's worse than that, it's an IT services company.
I thought for sure that was thrown in to make fun of stupid names for products.
This is a major corporate failure (Score:3)
You have *got* to be joking. Where I used to work, before I retired, I'd have been running in circles, worrying about 170 or so servers and workstations, figuring out which ones had them. Updating firmware was something we did *only* when it might fix an issue. I can't imagine a large shop... or, for that matter, a large cluster with that issue.
This is where HPE should be *required* to notify all purchasers, and all vendors who sold drives, or systems with the drives, to contact the purchasers "proactively", not wait for the failure.
Oh, that's right, that would be "burdensome regulation"....
Re: (Score:1)
Great news! This absolutely will fix the issue!
So your position is that they should have notified people about the bug first and discovered it later?
No, that would be violating the b
Re: (Score:2)
> Oh, that's right, that would be "burdonsome regulation"....
If you're an HP shop and this happens to you, then never buy HP again.
That's called "customer regulation" which economists recognize as the most pervasive and effective form of regulation.
That said, people only buy HP servers and drives and support so they have somebody to blame when everything goes to hell and their boss is breathing down their necks. We're about to see how well that theory holds, apparently.
Re: (Score:2)
So you couldn't do something as basic as keeping a list of what hardware was installed on which servers?
Re: (Score:2)
Sure, in an ideal world every corporation would be appropriately staffed for the size of the infrastructure. Unfortunately that's the exception rather than the rule. I myself in a past life (OK... just over a decade ago) became the sole sysadmin/engineer/architect at a surprisingly large company running infrastructure on about 100 physical servers in 20 different locations plus a couple of storage arrays. I spent so much of my time with my hair on fire trying to resolve issues that the idea that I would hav
Re: (Score:2)
In my experience you need to keep on top of firmware updates, because vendors will invariably put the phone down after telling you to update the firmware before they will progress the support call. Scrabbling around doing firmware updates when you have a problem is not fun.
Anyway, I work in the public sector in Scotland, and the Scottish government has mandated that we have CyberEssentials Plus, so I have to suck it up and apply updates in a timely manner regardless.
Re: (Score:2)
This is where HPE should be *required* to notify all purchasers, and all vendors who sold drives, or systems with the drives, to contact the purchasers "proactively", not wait for the failure.
Based on responses here on Slashdot, they are doing precisely this. Look, I'm all for government regulation, but this wouldn't be burdensome so much as completely pointless.
Okay (Score:2)
"...which causes these drives to stop working after 32768 power-on hours..."
Okay, if that's not hilarious I don't know what is.
That number...it's almost as if I'd seen it before...
Re: (Score:2)
Yes it's the first byte of the character generator ROM in the VIC-20 memory map!
Good catch!
Re: (Score:3)
The VIC-20 truly was a marvel, doing things that haven't been duplicated since. Such as storing 32768 in a single byte.
Re: (Score:2)
The VIC-20 truly was a marvel, doing things that haven't been duplicated since. Such as storing 32768 in a single byte.
I still remember how to do this, it is called a lookup table.
Math is wasteful, it should be avoided in software. And in hardware.
2 ^ 15 (Score:1)
"31000 hours ought to be enough for anyone"
-Gill Bates
HP is failing (Score:5, Interesting)
Right now we have PC orders as far back as April which are still unfilled. We have other orders where only certain parts of the order have arrived.
After the new Intel chips came out we were told the rate at which we receive equipment would get better and even closer to our SLA which says 15 business days. As yet, in this entire year, not once has HP come close to getting us our orders in the time their contract states.
We were told earlier this month by our third-party provider that HP printers will take at least a month to arrive after they are ordered.
Now HP says their SSDs will fail in less than 4 years, without any warning at all.
And yet, for whatever reason, Xerox wants to pay a premium for HP because of "synergies".
HP is dying, and no one at HP cares.
Re: (Score:2)
Right now we have PC orders as far back as April which are still unfilled. We have other orders where only certain parts of the order have arrived.
That's a different HP though. HP split into HP and HPE a while back. HP does workstations, desktops, laptops, printers, etc. HPE is servers, networking, storage, etc. They are totally separate companies these days.
Re: (Score:2)
I understand they are different companies, but both are still run by the same people as before.
Thus the issues I described.
Re: (Score:2)
After the new Intel chips came out we were told the rate at which we receive equipment would get better
You're buying systems with Intel chips in them on purpose? You're part of the problem.
Regardless, HP has been garbage since... well, basically forever. HP-PA was the worst performing architecture among its fellows. They squandered DEC completely. I had a HP laptop I got from an employer and it took me literally about 24 hours of phone time to get it replaced when the GPU failed due to a known problem (G71 die bonding failure.) It was kind of working (crashing on overheat, so I could use it as long as I used
Warranty (Score:3)
The warranty period is only three years. Why is this a problem?
Re: (Score:2)
Where I live there is also the legal warranty, which is usually much longer, but you might have to go to court to enforce it.
Good grief... (Score:2)
For starters ... who uses a signed integer to count something?
Re: (Score:2)
Top bit could be parity.
Yeah, I'm grasping...
Re: (Score:1)
Anyone with a clue.
Unsigned values in C and C++ are a leading cause of bugs, including security bugs.
Unsigned size_t was a major flaw. STL size() returning an unsigned value just made it worse.
Anyone who codes in C or C++ knows this.
Re: (Score:2)
Anyone who codes in C or C++ knows this.
That's the most ignorant thing you could possibly say about the subject.
Congratulations.
Gateway to forced obsolescence (Score:3)
Could this be a test (deliberate or not) of software based disabling of a product one day after the warranty to force you to buy the new one? This is like when they remove existing app features in an update to force you to pay for the Pro version or the upgrade.
Should it be legal? No protection for the consumer on that?
Who is/are the SSD OEMs? (Score:2)
The HPE support doc covers what HPE customers need to do, but everyone else needs to be wondering: is this just specific to HPE firmware (pre-HPD8), or are other SSDs in the wild at risk?
Since HPE probably didn't manufacture the SSDs, who did? Or is it not possible to tell?
Recall? (Score:3)
Yup (Score:5, Interesting)
Got 32 servers at work, filled with these drives running a vSAN. And 30 days to patch them. HP actually called us yesterday. I think that’s a first that they called us about problems.
Re: (Score:2)
HP actually called us yesterday. I think that’s a first that they called us about problems.
Their liability insurance company probably told them that they have to do it, and they have to comply to maintain their policy in this most critical hour.
Re: (Score:2)
Sounds like a probable explanation.
All manufacturers have problems (Score:2)
Once we had ~50 enterprise-grade SSDs from a large vendor (I cannot name the vendor) fail at the same time; we lost an entire DC (thankfully no client impact).
It was a firmware bug that showed up under our particular workload.
Re: (Score:2)
If you're buying HPE storage arrays, HPE will only support it if you have qualified HPE drives in the array. You don't get the option of mixing and matching drives from different vendors, they all have to be HPE and they (generally) all have to be from the same series.
It has long frustrated me that you can't put your own storage in HP (and other enterprise-type vendors) arrays. Why should you have to pay anywhere up to 5x to 10x the market rate for drives that have the magically "qualified" and "supported"
Guess the ball is on operating system vendors now (Score:2)
I think Linux should detect the hardware model and warn users about the problem before their SSDs fail.
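A rough sketch of the warning logic such a check might use, given power-on hours that have already been read from each drive. Everything here is an assumption for illustration: the inventory is made up, the names `hours_remaining` and `drives_at_risk` are hypothetical, and a real check would also have to match the affected models and firmware versions listed in HPE's bulletin, which this does not do.

```python
LIMIT_HOURS = 32768  # failure point quoted in the story summary

def hours_remaining(power_on_hours, limit=LIMIT_HOURS):
    """Hours left before the drive reaches the limit (never negative)."""
    return max(limit - power_on_hours, 0)

def drives_at_risk(inventory, warn_within_hours=24 * 90):
    """Return (name, hours_left) for drives inside the warning window,
    soonest failure first. The 90-day window is an arbitrary choice."""
    flagged = [
        (name, hours_remaining(hours))
        for name, hours in inventory.items()
        if hours_remaining(hours) <= warn_within_hours
    ]
    return sorted(flagged, key=lambda item: item[1])

# Made-up example data: drive label -> power-on hours
inventory = {"bay0": 31000, "bay1": 32700, "bay2": 12000}
print(drives_at_risk(inventory))  # [('bay1', 68), ('bay0', 1768)]
```

How the hours get read (for example from SMART data) is deliberately left out, since that depends on the drive interface and tooling.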
Wow, I'm glad I paid extra for "Enterprise" SSDs (Score:2)
Wow, I'm glad I paid extra for "Enterprise" SSDs. I mean, it's all of HP's special sauce that they bake into the firmware of an otherwise unremarkable consumer drive that makes it worth 5x to 10x the market rate for flash storage. It's great to see that HP are actually making their own firmware instead of shipping the drives with the OEM's firmware on them.
These must be Nexus-6 drives (Score:2)
The Nexus-7 drives don't have the lifespan limitation.