HPE Says Firmware Bug Will Brick Some SSDs Starting in October this Year (zdnet.com) 97

An anonymous reader writes: Hewlett Packard Enterprise (HPE) issued a security advisory last week warning customers about a bug in the firmware of some SAS SSDs (Serial-Attached SCSI solid-state drives) that causes them to fail after reaching 40,000 hours of operation -- which is 4 years, 206 days, and 16 hours after the SSD has been put into operation. HPE says that, based on when affected SSDs were manufactured and sold, the earliest failures are expected to occur in October this year. The company released firmware updates last week to address the issue. HPE warns that if companies fail to install the update, they risk losing both the SSD and the data. "After the SSD failure occurs, neither the SSD nor the data can be recovered," the company explained.
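
The "4 years, 206 days, and 16 hours" figure can be checked with a few lines of arithmetic. This is only a sketch: the helper name `breakdown` is made up for illustration, and it assumes 365-day years, which is what the quoted figures imply. The second call uses the 32,768-hour figure from the earlier bug mentioned in the comments.

```python
def breakdown(hours):
    """Split a power-on-hours count into (years, days, hours).

    Assumes 365-day years (no leap days), matching the figures in the advisory.
    """
    days, rem_hours = divmod(hours, 24)
    years, rem_days = divmod(days, 365)
    return years, rem_days, rem_hours


print(breakdown(40_000))  # (4, 206, 16)
print(breakdown(32_768))  # (3, 270, 8)
```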

  • by Myself ( 57572 ) on Wednesday March 25, 2020 @11:20AM (#59870132) Journal

    ...from the 32,768 power-on hours bug [slashdot.org]?

    • by AmiMoJo ( 196126 )

      Well they say 40,000 so maybe it's different. Let's see...

      40k hours = 2,400,000 minutes = 144,000,000 seconds. Doesn't seem close to any powers of 2, but they might not be using seconds as their base. They could use any time base really.
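
The "not close to any powers of 2" check above can be reproduced as a rough sketch. The helper `nearest_powers` is a hypothetical name, and the three time bases tried are just the ones named in the comment; the real firmware could count in something else entirely.

```python
def nearest_powers(n):
    """Return the powers of two immediately at-or-below and above a positive integer n."""
    lower = 1 << (n.bit_length() - 1)
    return lower, lower * 2


HOURS = 40_000
for unit, per_hour in [("hours", 1), ("minutes", 60), ("seconds", 3600)]:
    count = HOURS * per_hour
    lower, upper = nearest_powers(count)
    print(f"{count:>11,} {unit:<7} lies between {lower:,} and {upper:,}")
```

In seconds, 144,000,000 falls between 2**27 (134,217,728) and 2**28 (268,435,456), so none of these three bases lands on a power of two.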

      • Re: (Score:2, Informative)

        by OrangeTide ( 124937 )

        (2**32) / (4.57 years) is roughly 30 Hz. The hours might be a rotation count of the spindle, normalized down to some common factor of usual spindle speeds (7200, 5400, etc) . But it may be a coincidence.

        The "Power On Hours Count" for SMART is notorious for using a unit size that is manufacturer dependent: "Despite what the name suggests, the raw value of the attribute is stored using all sorts of measurement units (hours, half-hours, or ten-minute intervals to name a few) depending on the manufacturer of th
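
The "roughly 30 Hz" estimate above is the tick rate at which a 32-bit counter would wrap after 40,000 hours. A minimal sketch of that calculation follows; `overflow_rate_hz` is an illustrative name, and nothing in the advisory says the drive actually counts this way.

```python
def overflow_rate_hz(bits, hours):
    """Tick rate (Hz) at which an unsigned counter of `bits` bits wraps after `hours` hours."""
    return 2 ** bits / (hours * 3600)


print(round(overflow_rate_hz(32, 40_000), 2))  # 29.83
```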

        • Re: (Score:2, Troll)

          by Dog-Cow ( 21281 )

          You could try reading the fucking headline before you post about counting spindle rotations.

          • HPE says firmware bug will brick some SSDs starting in October this year
            HPE releases firmware patch to prevent some SSDs from failing after reaching 40,000 hours of operation.

            Did you read it? Try re-reading it.

            The bug is nearly identical to another issue the company disclosed in November 2019, which also permanently crashed HPE SAS SSDs, but after 32,768 hours of operation (3 years, 270 days, and 8 hours). However, today's bug impacts far fewer SAS SSD models than the previous issue.

            Just like last year, HPE said it learned of the bug from another SSD manufacturer that uses its products.

            Did you read that bit? Try re-reading it.

            And finally you can try reading the limited information in the official release notes [hpe.com]

            The issue affects SSDs with an HPE firmware version prior to HPD7 that results in SSD failure at 40,000 hours of operation (i.e., 4 years, 206 days 16 hours). Neither the SSD nor the data can be recovered after the SSD failure occurs.

            So pray tell what information did you glean from the headline that explains the mechanism of failure satisfactorily? I'll wait.

        • by Khyber ( 864651 )

          "The hours might be a rotation count of the spindle"

          Or not because this is modern times and we're talking about SOLID STATE DRIVES which have no moving parts.

          I mean, critical thinking isn't that hard to do if you even just RTFS.

    • by esperto ( 3521901 ) on Wednesday March 25, 2020 @11:34AM (#59870196)
      Thought the same thing, but apparently it is another issue affecting a different set of drives, and, as they are saying the first to crap out will be only in October, they probably only found it out because of what happened last year, where people actually lost data. Still amazed how something like this can happen, and it only shows how the "enterprise" label is not much more than a gimmick.
      • by TheGratefulNet ( 143330 ) on Wednesday March 25, 2020 @12:19PM (#59870396)

        no, it's more a statement about how bad HP is, these days, in terms of quality and design.

        in fact, they don't do work anymore, themselves; they are now mostly a 'labelling' oem, having others do the real work.

        when real actual honest HP was still in business, decades ago, they were at the top of their game. now, they are a junk company, not trustable any more than some low end whitebox vendor.

        • by msauve ( 701917 )
          >real actual honest HP was still in business

          They are still in business, it's just that they're now called Keysight.
          • by Myself ( 57572 )

            The test and measurement division known as Keysight was never closely related to any of the computer-hardware divisions.

            And the server-hardware division known as HP Enterprise is actually Compaq, absorbed by merger. It doesn't even share much blood with the desktop PCs division.

            • by msauve ( 701917 )
              >The test and measurement division known as Keysight was never closely related to any of the computer-hardware divisions.

              No, but it traces back directly to the roots of HP, which was founded with a test/instrumentation product (an audio oscillator). It deserves (and would honor) the HP name. The computer stuff came much later.
        • no, it's more a statement about how bad HP is, these days, in terms of quality and design.

          Horrible that this bug exists. However it seems it was caught and a patch made available many months before a single customer even had the opportunity to be affected. As much as the bug shouldn't exist in the first place the world would be better for it if all bugs were identified and handled in this way.

      • While it would be nice to think HPE learned from past experience, this is a large company we are talking about here. I wouldn't discount the possibility that some pre-production samples they had been using internally just crapped out on them.
    • by sjames ( 1099 )

      TFA made it clear that this issue is distinct from the 32,768 issue. RTFA!

  • This requires seriously poor design and QA. There is an initial issue of holding some sort of counter on the wrong integer type (which, by now, should be caught by any mid-experienced developer), but then to let a wrong count from such a counter basically destroy the device and its data?
    And it is not even the first instance, according to the article there was a separate "32768" hours issue that disabled other SSDs disclosed a few months ago (https://www.zdnet.com/article/hpe-tells-users-to-patch-ssds
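
As an illustration of the "wrong integer type" point: 32,768 is exactly 2**15, one past the largest value a signed 16-bit integer can hold. The sketch below shows what happens if an hour counter is kept in such a type. This is an assumption made for illustration only; neither HPE nor the article confirms the actual mechanism, and `tick_int16` is a made-up helper.

```python
import ctypes


def tick_int16(hours):
    """Add one hour to a counter stored as a signed 16-bit integer (hypothetical storage type)."""
    # ctypes.c_int16 does no overflow checking, so the value wraps like it would in C firmware.
    return ctypes.c_int16(hours + 1).value


print(tick_int16(32_766))  # 32767, still fine
print(tick_int16(32_767))  # -32768, the counter wraps to a negative value
```

Any code that then treats a negative hour count as impossible, or uses it as an index or a divisor, could fail in ways far worse than a wrong SMART reading.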

    • Why would you blame HPE's "SSD firmware department" for this? One of the previous bugs I'm aware of (which was after 1,700 hours of power-on time) was an Intel issue with the S4510 and S4610-series SSDs:

      https://downloadcenter.intel.c... [intel.com]

      This bug appears to be with SanDisk SSDs, given the model numbers affected. This article:
      https://blocksandfiles.com/202... [blocksandfiles.com]

      shows a picture of one of the affected models, and on the label you can see "Supplier Model Number: SXKLTK" and "Supplier Part Number: SDLTOCKR-016T-5C1

      • by gweihir ( 88907 )

        HPE is to blame as they re-sold an enterprise product, apparently without running any real tests on it.

    • by gweihir ( 88907 )

      This requires seriously poor design and QA. There is an initial issue of holding some sort of counter on the wrong integer type (which, by now, should be caught by any mid-experienced developer),

      Remember that software that Boeing used to kill > 300 people recently? Made on the cheap. This one here just erases critical data and hence is probably made even cheaper. It is high time that not using experienced actual software engineers to write code on this level is classified as gross negligence per default and that the vendor that screwed up is liable for any and all damage done by them not using qualified personnel.

  • by TXJD ( 5534458 )
    Lots of these drives end up on eBay, hopefully there won't be too many bricked devices.
  • becomes scheduled obsolescence, you know things are fucked.
  • Serious here: Does HPE sell these SSDs to any OEM (Dell,Apple, etc)? Are they sold solely to corporate accounts or to consumers as well?
    If these drives are in consumer machines, I'm not sure there is any way of tracing to original purchasers, let alone whoever was given them as a birthday present, or second-hand sales.

    • by Pascoea ( 968200 )
      What's your definition of "consumer machines"? It's not likely to be in grandma's desktop or junior's gaming rig (not from an OEM anyway) they're just too expensive compared to their consumer brethren. As far as enterprise gear, even if they are OEM'd to other server vendors those will be serial number tracked by the vendor. But yeah, second-hand turns into a shit-show real fast. If I had bought a server off eBay any time recently I'd be checking firmware post haste, at least by October, anyway.
    • They're SAS drives, which wouldn't work in most consumer PCs. The article says that HP doesn't even sell them separately, but that they may be used in some servers.

      These are expensive drives, on the order of $2000 for the 800GB version. I'm betting there weren't a lot of them given out as Christmas presents.

    • Serious here: Does HPE sell these SSDs to any OEM (Dell,Apple, etc)? Are they sold solely to corporate accounts or to consumers as well?
      If these drives are in consumer machines, I'm not sure there is any way of tracing to original purchasers, let alone whoever was given them as a birthday present, or second-hand sales.

      TFS indicates that these are SAS (Serial Attached SCSI) drives, sold by HPE (the 'E' stands for Enterprise), not HP Ink (pun intended, the consumer division). So, it is easy to infer these are Enterprise drives.

      AFAIK, Apple has no machines that have native SAS ports (not even the cheesegrater 2.0), so Apple is not an option. Actually, now that I think about it, I am hard pressed to think of consumer machines that support SAS drives right out of the box, so not sold to consumers.

      As for Dell, they do have an Enterp

    • by tlhIngan ( 30335 )

      Serious here: Does HPE sell these SSDs to any OEM (Dell,Apple, etc)? Are they sold solely to corporate accounts or to consumers as well?
      If these drives are in consumer machines, I'm not sure there is any way of tracing to original purchasers, let alone whoever was given them as a birthday present, or second-hand sales.

      Not likely. HPE likely bought the drive from an OEM and rebadged it, which is how 99% of the stuff happens. It does mean that the OEM drives might have the same firmware bug, except we're not

  • If you do find yourself a victim later, is there any chance of a fix by, e.g., SMD-soldering on a chip with new firmware?

    • Since it says something about data being unrecoverable, perhaps it has to do with some mechanism of the hardware full-disk encryption; perhaps when it hits that run-time length something corrupts the stored encryption key. So replacing an EEPROM, if one is even used, may not get your data back, but might at least give you a usable drive. Though I'm not even sure whether SSDs use an EEPROM or some other type of separate flash memory for the firmware, or whether it just gets stored on the same flash as the data.
    • I worked on an SSD product and we kept the firmware in a section of NAND itself and had a mask ROM on the processor that could read the first few "boot blocks" of NAND. Prototype devices had a SMD socket that looked like a coffin. Production devices couldn't be ISP'd, so it was a lot of work with a heat gun to remove them and replace them with a working chip. Unfortunately all your data and configuration was on the old chip.

  • On a new level.

  • '40000' is a little too rounded-off, base-10, don't you think? Is it really 40960 hours?
    • I think 39,768.2157 hours makes more sense, if you're into powers-of-two. That's 17,179,869,184 cycles at 7200 RPM. Probably shifted down by 4 before being logged, and thus filling a 32-bit value.
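
The arithmetic in the comment above can be reproduced as a quick sketch. It only checks the numbers; whether any firmware actually counts this way is pure speculation in the thread.

```python
REVS_PER_SECOND = 7200 / 60   # 7200 RPM is 120 revolutions per second
cycles = 17_179_869_184       # the cycle count quoted above, which equals 2**34

hours = cycles / REVS_PER_SECOND / 3600
print(round(hours, 4))        # 39768.2157

# How far the count must be shifted to relate to a 32-bit value:
print(cycles >> 4 == 2 ** 30)  # True
print(cycles >> 2 == 2 ** 32)  # True
```

Note that shifting 2**34 down by 4 bits gives 2**30, which is well inside a 32-bit range. A shift of 2 bits is what lands on 2**32, one past the largest unsigned 32-bit value, so the "shifted down by 4" detail does not quite line up with "filling a 32-bit value".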

      • SSDs don't have an RPM to them. Unless you're saying it runs a 120Hz clock? That would be an odd tick rate in a DC system.
        • I'm probably stretching to make the numbers fit. Maybe a holdover from the HDD era, as SSDs have the same SMART fields as an old drive. So one has to wonder what "spindle start/stop count" and "spin up time" even mean in the context of an SSD.

    • It makes sense if it is in the error checking; that would be a human-generated time period.

  • by williamyf ( 227051 ) on Wednesday March 25, 2020 @12:01PM (#59870308)

    Normally, these SAS SSDs are NOT used in servers, except in very rare cases. More often than not, they are used in storage arrays, either as some sort of cache, as tiered storage, or in an AFA (All-Flash Array). While in a PC/server it may behove you to make the drive read-only after the designated number of hours, in an array the drives will probably be in some sort of RAID configuration (modern ones are 0, 1 or 6; legacy or clueless people still use 5, 10 or 01). In these situations, having a read-only drive is as bad as having it fail completely.

    Having said that, it is amazingly stupid to brick the drive upon reaching the designated number of safe hours, instead of just throwing an alarm through the standard S.M.A.R.T. mechanism...

    And to top it off, this happens twice?! This points to either sheer incompetence of the HPE people in charge of the firmware, or HPE incorrectly using/modifying an off-the-shelf firmware (or a combination of both).

    I hope that their arrays have adequate mechanisms for updating the firmware of drives one by one without rebuilds, like many array vendors have.

    • I'd imagine, depending on the use cases and data on the drives, many places probably won't even bother updating firmware. A firmware update on a disk alone risks corrupting the data on the drive. They'll probably opt to replace either all the disks or the full storage arrays. At nearly 4 years old these drives are at end of life for any sane operator. I don't even let disks go past 3 years in my own usage at home in my NAS. After 3 years I usually need larger disks anyways for capacity reasons and the ticking time
  • Knowing HP, this was some feature they were working with where they sold a license to use the drives at 3/4 the regular new price, but you had to repurchase a license to keep using them every 40k hours of uptime over and over. But they forgot to disable it.

    Sounds about right.

    • by Agripa ( 139780 )

      Knowing HP, this was some feature they were working with where they sold a license to use the drives at 3/4 the regular new price, but you had to repurchase a license to keep using them every 40k hours of uptime over and over. But they forgot to disable it.

      Sounds about right.

      It is a lot like HP's "limited" capacity ink cartridges.

  • Programmers need to be held accountable for their bullshit software
    • by Nkwe ( 604125 )

      Programmers need to be held accountable for their bullshit software

      Programmers? What about executives at companies that don't want to pay the cost for a complete software development life cycle (as in real design, real code reviews, and real testing)? Maybe we also should arrest people who make purchasing decisions based on cost - If you decide to purchase the $200 drive instead of the $205 one with the only reason being to save $5, you are encouraging companies to cut corners.

    • Burn them! Stake them! Quarter them!

      Wait, what if it was the PHB who did it?

  • Isn't it handled easily by fwupdmgr?
    • HP also doesn't provide firmware updates for hardware they are no longer selling.
      • by psergiu ( 67614 )

        Also you need to have a valid and active support contract for that specific product type to be able to download anything off the support site.
        Even if you find a boatload of those SSDs second-hand with less than 40k hours on them - good luck getting the "good" firmware.

  • Lifetime timer? (Score:4, Interesting)

    by Locke2005 ( 849178 ) on Wednesday March 25, 2020 @01:43PM (#59870824)
    Sounds like planned obsolescence to me. I got upset when the smoke/carbon dioxide detectors in the house both started beeping every 30 seconds within a few minutes of each other in the middle of the night. Replaced the batteries; still beeping. Come to find out, they intentionally set a 7 year timer that disables the detector after 7 years, to force you to buy a new one! I wouldn't put it past HP intentionally forcing you to buy new SSDs...
    • by DRJlaw ( 946416 )

      Sounds like planned obsolescence to me. I got upset when the smoke/carbon dioxide detectors in the house both started beeping every 30 seconds within a few minutes of each other in the middle of the night. Replaced the batteries; still beeping. Come to find out, they intentionally set a 7 year timer that disables the detector after 7 years, to force you to buy a new one! I wouldn't put it past HP intentionally forcing you to buy new SSDs...

      Let me get this straight, you got upset because a device that uses a

      • Let me get this straight, you got upset because a device that uses a radioactive source with a short half-life (ionization smoke detector) or a chemical detector that slowly degrades (carbon monoxide detector) has a 7 year timer that keeps it from running past the point where the device can fail to detect dangerous levels of smoke or carbon monoxide, so that you may not get a timely alarm and instead die.

        Half life of the radioactive source in smoke detectors is well over four hundred years. The CO component does degrade but rendering the whole thing useless because of that is BS.

        • by DRJlaw ( 946416 )

          Half life of the radioactive source in smoke detectors is well over four hundred years.

          Which means after 10 years you've lost 1.6% of the 1/5000th of a gram of americium-241 that is in the smoke detector. How much loss do you think that detector can tolerate and still be sensitive enough to trigger while giving you a reasonable escape time?

          The CO component does degrade but rendering the whole thing useless because of that is BS.

          "The whole thing"? Then don't buy a combination detector, genius.

          But no, you'
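
The "1.6% after 10 years" figure quoted above follows from the standard exponential-decay formula. The sketch below assumes the commonly cited americium-241 half-life of about 432.2 years; that number is not given in the thread, which only says "well over four hundred years", and `fraction_lost` is an illustrative helper name.

```python
HALF_LIFE_YEARS = 432.2  # assumption: commonly cited Am-241 half-life, not stated in the thread


def fraction_lost(years, half_life=HALF_LIFE_YEARS):
    """Fraction of the original radioactive source that has decayed after `years`."""
    return 1 - 0.5 ** (years / half_life)


print(f"{fraction_lost(10):.1%}")  # 1.6%
```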

          • by Khyber ( 864651 )

            " How much loss do you think that detector can tolerate and still be sensitive enough to trigger while giving you a reasonable escape time?"

            I've got detectors from the 1960s from an old industrial building that still operate very well.

            Americium-241 emits primarily alpha radiation, which is stopped by just about anything, even small levels of steam in the air. If it stops emitting (that's unlikely for several generations,) the detector FAILS ON, assuming the battery/power and other circuitry are good. That's

            • by DRJlaw ( 946416 )

              Americium-241 emits primarily alpha radiation, which is stopped by just about anything, even small levels of steam in the air. If it stops emitting (that's unlikely for several generations,) the detector FAILS ON, assuming the battery/power and other circuitry are good.

              That's all very nice, and not how ionization smoke detectors work [systemsensor.com]. The alpha radiation ionizes the air, which chemically interacts with the particulates in smoke, which changes the conductivity of the air in comparison to a sealed reference

              • by Khyber ( 864651 )

                An ionization smoke detector uses americium-241 to do more than ionize air. Any detectable difference due to smoke is detected and an alarm is generated. That difference can be caused by interference with a radiation detector or an ionization chamber or collector plate not generating a charge due to radiation being blocked. Several types of detection were in use in the 1960s. I've actually worked in a fire department (Memphis, volunteer squad, primarily worked with them doing inspection of in-building detec

                • by DRJlaw ( 946416 )

                  An ionization smoke detector uses americium-241 to do more than ionize air.

                  Nope. Don't care about a claim that's already been shown to be false. You're wasting your time making source-free claims.

                  Several types of detection were in use in the 1960s

                  Not relevant to the home smoke detectors in use 60 years later.

                  I've actually worked in a fire department (Memphis, volunteer squad, primarily worked with them doing inspection of in-building detectors.

                  Your go-to claim is that you've actually worked relevant indus

      • It's a Kidde Smoke/Carbon Monoxide detector, and to be fair, apparently the Carbon Monoxide detector has a short expected lifetime. The Kidde Smoke detectors without Carbon Monoxide detection don't do this. And... replacement units were $54 _each_.
    • by antdude ( 79039 )

      Is there a list of these smoke/carbon dioxide detectors that expire?

      • by Agripa ( 139780 )

        Is there a list of these smoke/carbon dioxide detectors that expire?

        All of them expire now because it is required by state level legislation.

    • Like the way that at least some of their switches require their SFPs? A handful of years ago I saw that they wanted to charge $2200 or so for an SFP, where I could buy them from ChampionOne for $110. They were all made by Finisar and were identical.

      • by Agripa ( 139780 )

        At least it was not like Dell who kept the ATX power connector on their motherboards but changed the pin assignments so that replacing the Dell "ATX" power supply with a standard ATX power supply destroys the motherboard or worse.

        • Oh hey if we're going to get into connector conspiracy:

          The Sun 4/110 had a one-off sw (SCSI Weird) interface that IIRC applied termination power to pin 26, so it was possible to fry non-Sun peripherals unless one put a Sun device first in the chain or carefully cut pin 26 on a cable.

          Sony back in the day would equip portable CD players with a DC power plug with tip/ring reversed from normal polarity.

  • HPE warns that if companies fail to install the update, they risk losing both the SSD and the data.

    The fundamental law of storage design is that fuck-ups above the lowest layer of media integrity control should not melt the underlying raw data structures.

    Having a bug: dime a dozen.

    Melting the entire universe: Second career in the Mexican construction trade under an assumed identity.

    • by bored ( 40072 )

      These kinds of failures are frequently caused by planned-obsolescence bugs. That is, the drive should go into some kind of fail-safe read-only mode when it hits some number of age-related metrics, and then after some additional time simply refuse to work. But then some bug trips and it goes right from working to refusing to boot far enough to upgrade the firmware.

  • It seems all it does is being much more expensive. It certainly is not carefully tested...

  • Anybody know who the drive vendor is (I'm pretty sure HP doesn't make them, just rebrands them)?