Some HPE SSDs Fail After 3 Years and 9 Months, Company Warns (hpe.com)

New submitter AllHail writes: HPE SAS solid-state drives are affected by a firmware bug that causes them to stop working after 32,768 power-on hours (3 years and 9 months). If the drives are not flashed with updated firmware before that point, the drives and the data on them become unrecoverable. If several of these drives were installed and brought online together in a RAID, they will fail almost simultaneously. Patch or assume the risk of failure, says Hewlett Packard Enterprise.
  • That 'Enterprise' label on the name is sitting rather loosely today.

    • Maybe it's an even bigger oopsie to have drives with the same power on hours, so they'd all fail simultaneously. Is burning in drives too tedious to have the increase in redundancy it offers? (Even in mechanical drive RAIDs.)
      • How exactly do you mitigate this, though? You tend to buy storage servers already loaded with drives, and those drives will have about the same number of power-on hours.

        No, the right solution is to not have a completely idiotic bug that irrecoverably trashes the drive after a certain number of power on hours. HPE should be liable for this and I hope they get sued into the dirt.

        • How exactly do you mitigate this, though?

          Hot spares.

          Granted you still have the POTENTIAL risk of too many drives failing while rebuilding, but for servers where it's really, really important that data not be lost - you want it to be set to rebuild immediately after a failure.

          My primary DB server has 8 drives - 6 of them in a RAID-6 config, with 2 set up as hot spares. I'd need to experience 3 drive failures before one of the spares finishes rebuilding before I lost any data (and in that case I've got backups to LTO-7 tapes as well - it's just t…

          • by ceoyoyo ( 59147 ) on Tuesday November 26, 2019 @01:09PM (#59457878)

            Yes, that's how you address issues with regular failures. As your storage array ages, you expect more failures, but not all at the same time.

            This is different, because it's a bug. The drives don't fail with greater frequency as they get older, they all drop dead at 32768 hours, presumably because some dunce coded some critical value as a 16-bit int. You're *definitely* going to get lots of drives failing simultaneously in that situation.

          • No amount of hot spares will help when your entire array, including the hot spares themselves, dies at the same time.

          • by LostMyAccount ( 5587552 ) on Tuesday November 26, 2019 @01:39PM (#59458096)

            The problem is that hot spares are just that, hot -- they're powered on and just not participating in storage, so they would be affected by a bug influencing power on time.

            I know most IT people, myself included, have sort of bought into the idea that power cycles kill drives, not power-on hours, but I still wonder if it would make more sense to have "hot spares" sit powered off most of the time. They could be spun up at some interval, given some basic I/O testing to validate that they are functional, and then spun back down.

            I'm kind of doubting that a monthly spin up/spin down cycle would materially alter the drive's reliability -- in a five year life cycle, this is only 60 power cycles, and mostly moot for solid state devices. It wouldn't materially change an array rebuild time -- most of the time those are measured in hours or longer and the spin up time is what, 30 seconds?

            For spinning rust, this would contribute to power savings, cooling savings, and might even improve the lifetime of spare drives which could be sitting in a powered on state for years without any use.

            • by Mal-2 ( 675116 )

              What do "spin up" and "spin down" even mean in the SSD context? Surely they enter an idle state when not in use which nonetheless keeps them instantly accessible, so what sysadmin would want to physically deny them power? This doesn't even change that; it would just make me say "HPE is on the shit list alongside Seagate -- no more drives from them" and replace before the power-on limit.

          • This is not like normal failures, where a particular drive fails due to mechanical problems and, with a good RAID environment, you would have days or weeks to fix the problem. This is all your drives failing at once. In storage terms, this is the worst possible scenario. Heck, if you had an offsite external site powered on at the same time, your backup location would be down too.

        • Some storage manufacturers do burn in on drives. Maybe ask them? (Or just ask the drives for their power on hours if they support that function, like nearly every modern drive with SMART attributes.)
        • How to mitigate? Install the updated firmware ASAP. Done.
        • by Bengie ( 1121981 )
          My cousin worked in a petabyte-scale datacenter where data loss was not acceptable. They made sure their RAID arrays never used bleeding-edge tech, and used a mix of hard drives from different brands, batches, and models. They scrutinized everything as closely as the controllers. On top of this, at the RAID level, which is per node, the nodes themselves were also balanced for replication such that "types" of nodes had different mixes of hard drives and other storage hardware that could contribute to data loss, like…
      • Maybe it's an even bigger oopsie to have drives with the same power on hours, so they'd all fail simultaneously. Is burning in drives too tedious to have the increase in redundancy it offers? (Even in mechanical drive RAIDs.)

        OK, let's pretend you didn't know about this bug. That no one did. And you staged your new RAID deployments as you've suggested. I'd certainly find it discomforting to find a hard drive failing each week, every week, until the whole RAID array was eventually replaced.

        I'd probably be questioning the entire damn chassis by the 4th or 5th drive failure regardless of array size.

    • Although it's a corporate truism that executives rarely look much beyond the current quarter and its financial results. So SSDs that last for at least 12 quarters must seem almost immortal.

      • In financial time 12 quarters is equal to 4 executives and three golden parachutes. That means it's the next next next guy that will take the hit. And no one liked that guy anyway.

        • by Mal-2 ( 675116 )

          That's IBGYBG [urbandictionary.com] thinking at work. It's fine to eat the golden goose, as long as the consequences land on the next guy.

      • I don't see the problem. These drives had a 3 year warranty, so 3 years and 9 months is just about right.

        /s
    • Re:That's an oopsie (Score:5, Interesting)

      by SirAstral ( 1349985 ) on Tuesday November 26, 2019 @12:35PM (#59457674)

      It always has been an unjustified label. I have seen rack after rack of drive failures, for all sorts of explainable reasons.

      We have drives every bit as capable in the consumer space. The only problem is that scalability is limited arbitrarily because "reasons". The IT world is full of this mentality. Server OS vs Workstation OS... it has been and always will be a farce concocted to create the idea that something is more valuable than the other.

      eBay did it right the first time. They designed their software to survive hardware failure because they would dumpster-dive for systems to use, and those systems failed often.

      It's 2019 and we still operate like hardware can never be allowed to fail. Failure should be built in, and software should be created with the idea that the underlying hardware will fail and needs to recover from it. The way things are now... the "Enterprise" grade crowd gets to charge a premium for essentially nothing!

      • It's 2019 and we still operate like hardware can never be allowed to fail. Failure should be built in, and software should be created with the idea that the underlying hardware will fail and needs to recover from it.

        Yup. The fact is that "failure" is one of the few things you can count on.

      • by Mal-2 ( 675116 )

        That's a completely different risk model from having the entire fleet of drives keel over within a day of each other, though.

    • by JustAnotherOldGuy ( 4145623 ) on Tuesday November 26, 2019 @12:55PM (#59457794) Journal

      It's "enterprise" as in "criminal enterprise".

    • Anyone who works professionally knows that the "Enterprise" label is a warning of an overpriced and poorly designed product, made by a company large enough to hire so many lawyers, and write such terms of service, that you have a snowball's chance in hell of winning any legal liability claim against it.

    • by Gabest ( 852807 )
      Gamer >>> Enterprise
  • by dcsmith ( 137996 ) on Tuesday November 26, 2019 @12:18PM (#59457572)
    Please ensure that the patch is in place before 03:14:08 UTC on 19 January 2038...
    • And that's why one should use languages that don't have those issues.

      • Just because your programming language doesn't tell you how much memory it's allocating for a variable, doesn't mean it isn't a finite amount.
        • by HiThere ( 15173 )

          While true, there are languages that adapt the size of the integer to the value contained. But that comes at the cost of overhead. So they're not what you want for an embedded system.

          That said, neither C nor C++ tells you the size of the integer. Some external and hardware-specific libraries do, but the languages themselves don't. (I don't know the current status, but the guarantee used to only be that a long int would be at least as long as an int, and a short int wouldn't be larger than an int, and it…

          • by tlhIngan ( 30335 )

            While true, there are languages that adapt the size of the integer to the value contained. But that comes at the cost of overhead. So they're not what you want for an embedded system.

            That said, neither C nor C++ tells you the size of the integer. Some external and hardware-specific libraries do, but the languages themselves don't. (I don't know the current status, but the guarantee used to only be that a long int would be at least as long as an int, and a short int wouldn't be larger than an int, and it didn't…

            • You may also know this type because compilers got stricter with printf-types and having to use %llu when printing out a uint64_t.

              I trip over this one again and again and again. It is so much easier to learn a new trick than to change an old one.

            • by HiThere ( 15173 )

              but sizeof(char) = sizeof(short) = sizeof(int) = sizeof(long) = sizeof(long long) is what I said.

              However if the specific inttypes are now a part of the language, then I'm referring to an obsolete standard. I thought they were a recommended extension or something, and that compilers for embedded systems didn't need to implement them.

      • Comment removed based on user account deletion
        • by ceoyoyo ( 59147 ) on Tuesday November 26, 2019 @01:11PM (#59457896)

          An additional question might be "why does the functioning of the drive critically depend on the power-on counter anyway?"

          • by Killall -9 Bash ( 622952 ) on Tuesday November 26, 2019 @01:25PM (#59457990)
            They fucked up the time bomb. It was SUPPOSED to be a random amount of time after the 3 year warranty expires.... Whoever programmed this probably fucked it up on purpose, because fuck managers that ask you to do shit like this.
            • by ceoyoyo ( 59147 )

              Yeah, that's my suspicion too. It would be hilarious if it was intentional as you suggest.

              • I highly suspect a lot of SSD manufacturers do this. NAND wears out, but how often do you see an SSD just refuse to work at all? Not read errors, or bad sectors... but just full-on no workey? That's pretty much the only failure mode I've personally seen with SSDs. Never seen one write so much that the NAND wore out.
                • by ceoyoyo ( 59147 )

                  To be fair, most of them have spare sectors and various cell-failure detection mechanisms to mark bad cells and swap in others. The flash should wear out long before the controller electronics do though, so most SSDs should slowly lose capacity rather than fail catastrophically.

                • Never seen one write so much that the NAND wore out.

                  Data forensics consultants who desolder the NAND chips and recover the data externally don't usually find any problems with the NAND, either.

                  It is always the controllers that die, even though they're low power CMOS-type ICs that get almost no wear and tear from operation, and in other types of electronic devices they last decades under normal conditions.

                • Where are the SSD metadata stored? Mapping tables, SMART counters, stuff like that. If the mapping table failed first, the drive would just die, right?

                  Disclaimer: I have no idea what I'm talking about.

          • by AmiMoJo ( 196126 )

            There was a similar case that may explain why.

            Power-on hours is a useful metric for predicting drive failure, which is why it's commonly included in SMART data. Especially for enterprise they may just replace the drive after a certain number of power-on hours because they know that failure rates go up at a certain point.

            In the similar case, when the number went negative it caused a sanity check in the firmware to fail. The firmware would assume that the data was corrupt and try to re-load it in an endless loop…

          • Comment removed based on user account deletion
            • by ceoyoyo ( 59147 )

              In engineering, if the thing fails when X doesn't work, then it depends on X. Agreed, they may not have done it on purpose, but it also means they didn't do a thorough design review, which is supposed to be one of the things that differentiates "enterprise" from "consumer" and justifies the sticker shock.

      • So you want a language that doesn't have the possibility of bugs? Good luck with that. How about we just admit that you don't actually know anything about software development but like to play an expert on Slashdot and call it a day?
        • So you want a language that doesn't have the possibility of bugs?

          I call the language Ohmmm, it accepts inputs in this reality, and then sleeps until humanity evolves into pure energy beings, and then it will provide the output.

          If you think you have a bug, you don't, you're just not waiting patiently enough.

          Also, it doesn't need a compiler, you just write your pseudocode or flow chart anywhere you want and then you wait. Patiently.

    • Y2038 problem ( https://en.wikipedia.org/wiki/... [wikipedia.org] ) will be fine! (Said in the voice of that dog sitting, burning down and saying "This is fine.")
      • by HiThere ( 15173 )

        Well, it will be fine for everything that has transitioned to 64 bit hardware. Well, almost everything. But if you coded something that needs to be long on 32 bit hardware as a short, you don't have grounds for complaint.

        The thing is, lots of embedded systems don't see any need for 64-bit hardware, and it's not clear that they will detect any problem ahead of time. Imagine the havoc if one day all the Alexas stop working. Or the IoT locks, etc. (I'm assuming that things like GPS satellites have already…

        • I know of one installation that did a lot of work coordinating data from satellites, all Unix. Back in the '90s, a manager asked, 'Say, I've been hearing a lot about this Y2K thing, will that affect us?' 'Nah, we're good to the year 2036.' 'Huh. Can you make that later?' One quick script to scan all the software, identifying all locations where the time variable was hardcoded as 32 bits, a few hours to run, and then approval to change them all to 64 bit. They're good to the year 584 billion or so now...
    • by AmiMoJo ( 196126 )

      It amazes me how often basic stuff like overflowing counters goes untested in embedded code, and that people don't look at anything counting time and immediately calculate the maximum period before the counter fails, to check that it's safe.

      This smacks of incompetence. It's 15 bits, so they are presumably using an int16. Might even be an int... Most of these controllers are 32-bit ARM based, but you do get 8- and 16-bit cores doing power management stuff. Anyway, there are only two explanations for picking…

    • So on a 1TB hard drive, there were just 15 bits to count uptime?
      That's hilarious.

  • Those SSDs could command a premium for some very special applications. Such as storing super vital top secret corporate (or government) information that might have to be embarrassingly revealed in legal proceedings down the road. Such as DNC strategy documents from 2016... (although they could always ask the Russians for copies ).

    "Whooops, bad luck, that data was unfortunately stored on some of our top-end HPE SSDs, and now it's just a fading memory".

    As HPE itself will soon be, no doubt.

  • by 93 Escort Wagon ( 326346 ) on Tuesday November 26, 2019 @12:25PM (#59457612)

    "HPE ProLiant, Synergy, Apollo, JBOD D3xxx, D6xxx, D8xxx, MSA, StoreVirtual 4335 and StoreVirtual 3200 are affected. 3PAR, Nimble, Simplivity, XP and Primera are not affected."

    There’s also a table of specific models with manufacture dates in TFA.

  • by b0bby ( 201198 ) on Tuesday November 26, 2019 @12:25PM (#59457614)

    This is why I default to RAID10 with SSDs, and make sure to use 2 different drive types in each pair.

    • This is why I default to RAID10 with SSDs, and make sure to use 2 different drive types in each pair.

      That's quite interesting. RAID10 is basically a grouping of RAID1 pairs. Due to bad experiences in the early days of RAID I have always made it a standard practice to avoid mixing drive types in RAID1. In fact, I've even been reluctant to mix different firmware revisions on the same drive models.

      I have used mismatched drive models here and there out of necessity, but generally as a stopgap measure. I'm curious to know what other people make of this practice.

  • Did the government pressure them to put in a backdoor to make everyone's data accessible? With all the government illegal spying going on lately, this is certainly something to consider.
    • That's an awesome theory! Please enlighten us and explain how using an int instead of a long accomplishes such a wonderful backdoor feature!
      • by Chaset ( 552418 )

        THAT won't achieve it by itself, but I infer the implication is that it will force a lot of people to install said "patches", which probably come as binary blobs with who knows what in them.

        • It is firmware on an SSD. It maps ATA drive-access requests to virtual sectors on the drive, and there is only a very limited amount of space available for the firmware. You have nothing to worry about, other than your need to seek psychological help for your paranoia.
  • 3PAR, Nimble, Simplivity, XP and Primera are not affected. Phew

    • by chrish ( 4714 )

      Wait, was "Simplivity" seriously a product name?

      OMG it's worse than that, it's an IT services company.

      I thought for sure that was thrown in to make fun of stupid names for products.

  • You have *got* to be joking. Where I used to work, before I retired, I'd have been running in circles, worrying about 170 or so servers and workstations, figuring out which ones had them. Updating firmware was something we did *only* when it might fix an issue. I can't imagine a large shop... or, for that matter, a large cluster with that issue.

    This is where HPE should be *required* to notify all purchasers, and all vendors who sold drives, or systems with the drives, to contact the purchasers "proactively", not wait for the failure.

    Oh, that's right, that would be "burdensome regulation"....

    • "Updating firmware was something we did *only* when it might fix an issue."

      Great news! This absolutely will fix the issue!

      "This is where HPE should be *required* to notify all purchasers, and all vendors who sold drives, or systems with the drives, to contact the purchasers "proactively", not wait for the failure."

      So your position is that they should have notified people about the bug first and discovered it later?

      "Oh, that's right, that would be "burdensome regulation"...."

      No, that would be violating the b…

    • > Oh, that's right, that would be "burdensome regulation"....

      If you're an HP shop and this happens to you, then never buy HP again.

      That's called "customer regulation" which economists recognize as the most pervasive and effective form of regulation.

      That said, people only buy HP servers and drives and support so they have somebody to blame when everything goes to hell and their boss is breathing down their necks. We're about to see how well that theory holds, apparently.

    • by barakn ( 641218 )

      So you couldn't do something as basic as keeping a list of what hardware was installed on which servers?

      • Sure, in an ideal world every corporation would be appropriately staffed for the size of the infrastructure. Unfortunately that's the exception rather than the rule. I myself in a past life (OK... just over a decade ago) became the sole sysadmin/engineer/architect at a surprisingly large company running infrastructure on about 100 physical servers in 20 different locations plus a couple of storage arrays. I spent so much of my time with my hair on fire trying to resolve issues that the idea that I would hav…

    • by jabuzz ( 182671 )

      In my experience you need to keep on top of firmware updates, because vendors will invariably put the phone down after telling you to update the firmware before they will progress the support call. Scrabbling around doing firmware updates when you have a problem is not fun.

      Anyway, I work in the public sector in Scotland, and the Scottish government has mandated that we have Cyber Essentials Plus, so I have to suck it up and apply updates in a timely manner regardless.

    • This is where HPE should be *required* to notify all purchasers, and all vendors who sold drives, or systems with the drives, to contact the purchasers "proactively", not wait for the failure.

      Based on responses here on Slashdot, they are doing precisely this. Look, I'm all for government regulation, but this wouldn't be burdensome so much as completely pointless.

  • "...which causes these drives to stop working after 32768 power-on hours..."

    Okay, if that's not hilarious I don't know what is.

    That number...it's almost as if I'd seen it before...

    • Yes it's the first byte of the character generator ROM in the VIC-20 memory map!
      Good catch!

      • by ceoyoyo ( 59147 )

        The VIC-20 truly was a marvel, doing things that haven't been duplicated since. Such as storing 32768 in a single byte.

        • The VIC-20 truly was a marvel, doing things that haven't been duplicated since. Such as storing 32768 in a single byte.

          I still remember how to do this, it is called a lookup table.

          Math is wasteful, it should be avoided in software. And in hardware.

    • stop working after 32768 power-on hours...That number...it's almost as if I'd seen it before...

      "31000 hours ought to be enough for anyone"
      -Gill Bates

  • HP is failing (Score:5, Interesting)

    by smooth wombat ( 796938 ) on Tuesday November 26, 2019 @01:00PM (#59457822) Journal

    Right now we have PC orders as far back as April which are still unfilled. We have other orders where only certain parts of the order have arrived.

    After the new Intel chips came out we were told the rate at which we receive equipment would get better and even closer to our SLA which says 15 business days. As yet, in this entire year, not once has HP come close to getting us our orders in the time their contract states.

    We were told earlier this month by our third-party provider, HP printers will take at least a month to arrive after they are ordered.

    Now HP says their SSDs will fail in less than 4 years, without any warning at all.

    And yet, for whatever reason, Xerox wants to pay a premium for HP because of "synergies".

    HP is dying, and no one at HP cares.

    • by EvilSS ( 557649 )

      Right now we have PC orders as far back as April which are still unfilled. We have other orders where only certain parts of the order have arrived.

      That's a different HP though. HP split into HP and HPE a while back. HP does workstations, desktops, laptops, printers, etc. HPE is servers, networking, storage, etc. They are totally separate companies these days.

      • I understand they are different companies, but both are still run by the same people as before.

        Thus the issues I described.

    • After the new Intel chips came out we were told the rate at which we receive equipment would get better

      You're buying systems with Intel chips in them on purpose? You're part of the problem.

      Regardless, HP has been garbage since... well, basically forever. HP-PA was the worst-performing architecture among its peers. They squandered DEC completely. I had an HP laptop I got from an employer, and it took me literally about 24 hours of phone time to get it replaced when the GPU failed due to a known problem (G71 die-bonding failure). It was kind of working (crashing on overheat, so I could use it as long as I used…

  • by Retired ICS ( 6159680 ) on Tuesday November 26, 2019 @01:15PM (#59457916)

    The warranty period is only three years. Why is this a problem?

    • Where I live there is also the legal warranty, which is usually much longer, but which you might have to go to court to enforce.

  • ... stop working after 32768 power-on hours ...

    For starters ... who uses a signed integer to count something?

    • by Viol8 ( 599362 )

      Top bit could be parity.

      Yeah, I'm grasping...

    • by rokicki ( 132380 )

      Anyone with a clue.

      Unsigned values in C and C++ are a leading cause of bugs, including security bugs.

      Unsigned size_t was a major flaw. STL size() returning an unsigned value just made it worse.

      Anyone who codes in C or C++ knows this.

  • by backslashdot ( 95548 ) on Tuesday November 26, 2019 @01:52PM (#59458206)

    Could this be a test (deliberate or not) of software based disabling of a product one day after the warranty to force you to buy the new one? This is like when they remove existing app features in an update to force you to pay for the Pro version or the upgrade.

    Should it be legal? Is there no protection for the consumer on that?

  • The HPE support doc covers what HPE customers need to do, but everyone else should be wondering: is this specific to HPE firmware (pre-HPD8), or are other SSDs in the wild at risk?

    Since HPE probably didn't manufacture the SSDs, who did? Or is it not possible to tell?

  • by twocows ( 1216842 ) on Tuesday November 26, 2019 @02:33PM (#59458478)
    This seems like the kind of thing you'd normally issue a recall notice for. "Patch or assume the risk of failure" seems like they don't want to take responsibility and would rather shift blame to customers. I suspect they're going to lose a lot of business in about four years.
  • Yup (Score:5, Interesting)

    by ruddk ( 5153113 ) on Tuesday November 26, 2019 @03:06PM (#59458700)

    Got 32 servers at work, filled with these drives running a vSAN. And 30 days to patch them. HP actually called us yesterday. I think that’s a first that they called us about problems.

    • HP actually called us yesterday. I think that’s a first that they called us about problems.

      Their liability insurance company probably told them that they have to do it, and they have to comply to maintain their policy in this most critical hour.

  • You should never buy from a single vendor. Mix and match.

    Once, we had ~50 enterprise-grade SSDs from a large vendor fail at the same time (I cannot name the vendor); we lost an entire DC (thankfully with no client impact).

    It was a firmware bug that showed up under our particular workload.

    • If you're buying HPE storage arrays, HPE will only support it if you have qualified HPE drives in the array. You don't get the option of mixing and matching drives from different vendors, they all have to be HPE and they (generally) all have to be from the same series.

      It has long frustrated me that you can't put your own storage in HP (and other enterprise-type vendors') arrays. Why should you have to pay anywhere up to 5x to 10x the market rate for drives that have been magically "qualified" and "supported"?

  • I think Linux should detect the hardware model and warn users about this problem, before people's SSDs start failing.

  • Wow, I'm glad I paid extra for "Enterprise" SSDs. I mean, it's all of HP's special sauce that they bake into the firmware of an otherwise unremarkable consumer drive that makes it worth 5x to 10x the market rate for flash storage. It's great to see that HP is actually making its own firmware instead of shipping the drives with the OEM's firmware on them.

  • The Nexus-7 drives don't have the lifespan limitation.
