Data Storage AI

Hard Drive Shortage Intensifies as AI Training Data Pushes Lead Times Beyond 12 Months (tomshardware.com) 24

Lead times for high-capacity hard drives have exceeded 52 weeks as AI workloads drive unprecedented demand for warm storage that sits between fast SSDs and offline tape archives, according to TrendForce. Western Digital notified customers of price increases across its entire hard drive portfolio citing demand for "every capacity" in its product line.

The shortage stems from AI infrastructure requirements including training datasets, model checkpoints and inference logs that consume petabytes of storage space. These files are too large for primary SSD storage but must remain accessible for quick retrieval. Hard drive manufacturers have not significantly expanded production capacity in approximately a decade. Cloud service providers are evaluating QLC SSDs for cold data storage despite costs remaining four to five times higher per gigabyte than mechanical drives. Memory suppliers are developing SSD products specifically for this intermediate storage tier.

  • These files are too large for primary SSD storage but must remain accessible for quick retrieval

    I think you mean "not used often enough to warrant the price of SSD storage" not "too large for SSD storage."

    If you think a multi-petabyte file is too big to fit on SSD storage: combining multiple physical storage devices into one virtual device has been a thing for a long time now.

    • by znrt ( 2424692 )

      it's indeed weird, they're throwing money around like crazy for the training race, why would they bother with even a 200% cost increase for storage? not to mention that ssd is cheaper energy wise in the long run.

      i have not looked at the figures, but maybe the increased production times have to do with production capacity or supply chain problems?

      • Re:Too large? (Score:4, Informative)

        by DDumitru ( 692803 ) <.moc.ocysae. .ta. .guod.> on Monday September 15, 2025 @05:11PM (#65661856) Homepage
        You need to do the math.

        1. QLC SSDs are roughly 4X the cost of HDD for a given amount of space.
        2. While power is lower for SSDs, the cost savings do not even come close to the up-front cost.
        2a. The power for storage is literally in the noise compared to the GPU/AI compute power requirements.
        3. The "other option" would be to use more older (i.e. smaller) HDDs, and even off-lease mining HDDs, not QLC SSDs.
        4. NAND capacity is limited like HDD capacity in terms of exabytes. HDDs are still larger, although it is getting closer. Neither has the ability to ramp up quickly.

        The people building these systems do all the math. It is not knee-jerk. They know every detail of reliability, power, space, availability, duty cycles, and multi-vendor supply, and take all of these into account.
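
A minimal sketch of the arithmetic behind points 1 and 2, not anyone's actual procurement model. The 4X QLC-to-HDD price ratio comes from the comment above and the roughly $100K per PB HDD figure from another commenter in this thread; the power draw per petabyte, the electricity price, and the five-year lifetime are hypothetical placeholders chosen only to show the shape of the calculation.

```python
HOURS_PER_YEAR = 365 * 24


def energy_cost_usd(watts_per_pb, years, usd_per_kwh):
    """Electricity cost of keeping one petabyte powered for `years`."""
    kwh = watts_per_pb * HOURS_PER_YEAR * years / 1000
    return kwh * usd_per_kwh


def total_cost_usd(capex_per_pb, watts_per_pb, years, usd_per_kwh):
    """Up-front cost plus electricity for one petabyte."""
    return capex_per_pb + energy_cost_usd(watts_per_pb, years, usd_per_kwh)


def break_even_years(capex_gap, watts_saved_per_pb, usd_per_kwh):
    """Years of power savings needed to pay back the extra up-front cost."""
    yearly_saving = energy_cost_usd(watts_saved_per_pb, 1, usd_per_kwh)
    return capex_gap / yearly_saving


# Figures taken from the thread.
HDD_CAPEX_PER_PB = 100_000               # "about $100K per PB for hard disk"
QLC_CAPEX_PER_PB = 4 * HDD_CAPEX_PER_PB  # "roughly 4X the cost of HDD"

# Hypothetical assumptions, not measurements.
HDD_WATTS_PER_PB = 400
QLC_WATTS_PER_PB = 100
USD_PER_KWH = 0.10
YEARS = 5

hdd_total = total_cost_usd(HDD_CAPEX_PER_PB, HDD_WATTS_PER_PB, YEARS, USD_PER_KWH)
qlc_total = total_cost_usd(QLC_CAPEX_PER_PB, QLC_WATTS_PER_PB, YEARS, USD_PER_KWH)
payback = break_even_years(
    QLC_CAPEX_PER_PB - HDD_CAPEX_PER_PB,
    HDD_WATTS_PER_PB - QLC_WATTS_PER_PB,
    USD_PER_KWH,
)

print(f"HDD {YEARS}-year cost per PB: ${hdd_total:,.0f}")
print(f"QLC {YEARS}-year cost per PB: ${qlc_total:,.0f}")
print(f"Years of power savings to recover the price gap: {payback:,.0f}")
```

Under these placeholder numbers the electricity line is tiny next to the purchase price, which is the point being made; different power or tariff assumptions change the totals but would have to be off by orders of magnitude to flip the comparison.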
        • Idle power for HDDs can be very low if the spindles are not kept spinning. For training that takes many days, occasional HDD spin-up times of 10 seconds may be acceptable. In that case, it's basically only the HDD electronics board that is powered. And when comparing similar capacities, multiple SSDs are needed to equal the capacity of one HDD, so it's not clear that in idle mode SSDs consume less power than HDDs.

          • by bn-7bc ( 909819 )
            Yes but how many hdds domyou need tomstripe together to get the red/write performance of one ssd, esp on random access, and how much power would that array connsume
            • by bn-7bc ( 909819 )
              And ofc I forgot to read back my post before hitting submit, if the typos are to distracting plz tell me and I'll reposed with ( hopefully) fewer typos
          • by Kisai ( 213879 )

            They are always kept spinning during training. If the server has enough disk cache, they will only be hit when written to. However, a 320GB model being trained might still be using 16 20TB drives for the actual data being trained on if they're keeping it around and not just ripping it from servers on the fly.

        • by allo ( 1728082 )

          The most effective system for many tasks is HDD + SSD cache. And that's also quite cost effective. And you can scale it dynamically. The HDD is still too small? Add another SSD as cache. It's fast enough? Just keep with one.

          And I don't think power is a factor at all. A server PSU probably has more energy loss due to not being 100% efficient than the hard drive needs. For a company investing a huge amount of energy in GPU computing that's the least concern. If they care at all, then because each of the watt

    • by Kisai ( 213879 )

      Nah. Most models in use are still only about 500GB tops. That's perfectly suitable for an SSD to load for inference.

      The problem is training because checkpoints are 500GB per iteration. So you may write a checkpoint every X many iterations and take the best one, so you aren't doing this with a 2TB SSD. The end result is that if you are trying to get the best model, you can't just set it and forget it, you have to set it and check it, and if you're throwing away 10 checkpoints without checking you might overs
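
A rough sketch of the capacity arithmetic in the comment above, assuming decimal units (1 TB = 1000 GB). The 500GB checkpoint size and the 2TB SSD come from the comment, and the 20TB drive size from another comment in this thread; the count of 100 retained checkpoints is a hypothetical number picked for illustration.

```python
def checkpoints_that_fit(capacity_gb, checkpoint_gb):
    """Whole checkpoints that fit on a device of the given capacity."""
    return capacity_gb // checkpoint_gb


def drives_needed(total_gb, drive_gb):
    """Smallest number of drives whose raw capacity covers total_gb."""
    return -(-total_gb // drive_gb)  # integer ceiling division


CHECKPOINT_GB = 500        # from the comment
SSD_GB = 2_000             # the 2TB SSD from the comment
HDD_GB = 20_000            # 20TB drives, as mentioned elsewhere in the thread
RETAINED_CHECKPOINTS = 100  # hypothetical

on_ssd = checkpoints_that_fit(SSD_GB, CHECKPOINT_GB)
total_gb = RETAINED_CHECKPOINTS * CHECKPOINT_GB
hdds = drives_needed(total_gb, HDD_GB)

print(f"Checkpoints that fit on one 2TB SSD: {on_ssd}")
print(f"Raw space for {RETAINED_CHECKPOINTS} checkpoints: {total_gb / 1000:.0f} TB")
print(f"20TB drives needed (no redundancy): {hdds}")
```

This counts raw capacity only; replication, RAID overhead, and the training data itself would push the drive count higher.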

    • by twms2h ( 473383 )

      combining multiple physical storage devices into one virtual device has been a thing for a long time now.

      Apart from the price tag: You would need many more ports for connecting SSDs on the server than you would need for HDDs with a higher capacity each.

  • The market will clear, when a huge provider snaps up a large percentage of every little player's clients. Or municipalities tire of selling scarce power to people who don't vote in their elections. It's an analog of waves of bank failures. Too much compute; quicker than too much money.
  • While I welcome our cheaper but slower SSD overlords, isn't the storage the price driver?

    Would 4TB of NAND (or whatever) care what you wrap around it?

    • I think the "developing products" in the summary is a bit of a misstatement. I was at Flash Memory Summit (now Future Memory Summit) and a lot of the noise is about storage other than NAND. Nothing compelling yet, but everyone is trying. A lot of this is that NAND is not standing still. A couple of vendors are likely to introduce PLC (5 level cells) as shipping solutions as early as next year, and for a lot of workloads, they are quite capable.
  • But apparently WesternDigital also wants its share of the AI pie by increasing their prices.

    • by thegarbz ( 1787294 ) on Monday September 15, 2025 @06:50PM (#65662048)

      Yes everything looks like a conspiracy to the short sighted. The reality is there was zero reason to expand HDD production at a time when more and more data storage requirements trended towards SSDs. Sales of HDDs have plummeted since their peak in 2010, and it sure as heck isn't worth attempting to predict a bubble 3 years out and investing $1bn to expand HDD production because maybe some AI techbros will briefly hoover up data before their industry implodes.

      • by allo ( 1728082 )

        EAMR seems also to be such a hack. I wonder if they find a better method or if SSDs make the price/value race before that.

        • EAMR seems also to be such a hack.

          Everything we've ever created is "such a hack". It's the application of physics in ways to solve a problem. EAMR is no more a hack than changing the magnetic head orientation is. Or changing the size of the write head, or the material of the platter, or making heads aerodynamic. It's just engineering.

          • by allo ( 1728082 )

            Maybe. Still it "feels" like heating up the surface is some rather crude workaround and would increase wear and tear?

            • There's nothing to wear in the underlying material. The heat here allows for changes in how magnetic fields respond; the temperature effect on magnetic structure is well understood. The only thing new here is how incredibly miniature it is and how they found a way to focus the heat into the required spot.

              It's not a hack, it's R&D, science and engineering and something that has been actively researched in spindle HDDs for 20 years now, and has been the subject of research for nearly 70 years. Check

              • by allo ( 1728082 )

                Fair point, but one can still question some parts of it. I'd believe you that it may not wear down the material of the platter. But when it is in research for 20 years or more, does it mean it is ready yet? And a laser or microwave device is another part that can die. I'd hope one would still be able to read the disk, but if not there is no way you can just put a new laser diode in there. Also interesting about the laser disk, I thought of it more like being written similar to a CD. Using more platters I wo

  • When the AI bubble finally bursts, there will be lots of cheap HDDs.
  • When the next generation of larger drives comes, you can buy the (then) medium-sized ones the LLM companies used. And as the usage pattern is probably WORM they could still be quite good.

  • I've been buying Huawei SSD and hard disk. I buy tape in tens of petabytes, disk in petabytes and SSD in hundreds of terabytes. Huawei is shipping cheap 64TB SSDs, but they need Huawei backplanes. So $120k gets you started with 100TB across 3 controllers and dual hundred gig switches. Growth is much cheaper. It looks like about $260k per petabyte. I'm paying about $100K per PB for hard disk but on 18TB drives. I expect $80K for 30TB drives when we switch. But that will put 3PB in a single chassis which at 2
