Hard Drive Shortage Intensifies as AI Training Data Pushes Lead Times Beyond 12 Months (tomshardware.com) 24
Lead times for high-capacity hard drives have exceeded 52 weeks as AI workloads drive unprecedented demand for warm storage that sits between fast SSDs and offline tape archives, according to TrendForce. Western Digital notified customers of price increases across its entire hard drive portfolio citing demand for "every capacity" in its product line.
The shortage stems from AI infrastructure requirements including training datasets, model checkpoints and inference logs that consume petabytes of storage space. These files are too large for primary SSD storage but must remain accessible for quick retrieval. Hard drive manufacturers have not significantly expanded production capacity in approximately a decade. Cloud service providers are evaluating QLC SSDs for cold data storage despite costs remaining four to five times higher per gigabyte than mechanical drives. Memory suppliers are developing SSD products specifically for this intermediate storage tier.
Too large? (Score:2)
These files are too large for primary SSD storage but must remain accessible for quick retrieval
I think you mean "not used often enough to warrant the price of SSD storage" not "too large for SSD storage."
If you think a multi-petabyte file is too big to fit on SSD storage: combining multiple physical storage devices into one virtual device has been a thing for a long time now.
Re: (Score:3)
it's indeed weird, they're throwing money around like crazy for the training race, why would they be bothered by even a 200% cost increase for storage? not to mention that ssd is cheaper energy-wise in the long run.
i have not looked at the figures, but maybe the increased production times have to do with production capacity or supply chain problems?
Re:Too large? (Score:4, Informative)
1. QLC SSDs are roughly 4X the cost of HDDs for a given amount of space.
2. While power is lower for SSDs, the cost savings do not come close to offsetting the up-front cost.
2a. The power for storage is literally in the noise compared to the GPU/AI compute power requirements.
3. The "other option" would be to use more older (i.e. smaller) HDDs, and even off-lease mining HDDs, not QLC SSDs.
4. NAND capacity is limited like HDD capacity in terms of exabytes. HDD capacity is still larger, although NAND is getting closer. Neither has the ability to ramp up quickly.
The people building these systems do all the math. It is not knee-jerk. They know every detail of reliability, power, space, availability, duty cycles, and multi-vendor supply, and take all of these into account.
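To make points 1 and 2 concrete, here is a rough sketch of the comparison in Python. Only the "roughly 4X" price ratio comes from the comment above; every other figure (price per TB, watts per TB, electricity price, service life, fleet size) is a hypothetical placeholder chosen for illustration, not measured data.

```python
def total_cost(capacity_tb, price_per_tb, watts_per_tb, years, usd_per_kwh):
    """Purchase price plus electricity over the service life, in USD."""
    purchase = capacity_tb * price_per_tb
    kwh = capacity_tb * watts_per_tb * 24 * 365 * years / 1000
    return purchase + kwh * usd_per_kwh


# Hypothetical placeholders (assumptions, not real market or power data).
CAPACITY_TB = 1000               # a 1 PB tier
YEARS = 5                        # assumed service life
USD_PER_KWH = 0.10               # assumed electricity price
HDD_PRICE_PER_TB = 15.0          # assumed
SSD_PRICE_PER_TB = 4 * HDD_PRICE_PER_TB  # the "roughly 4X" from point 1
HDD_WATTS_PER_TB = 0.4           # assumed
SSD_WATTS_PER_TB = 0.1           # assumed

hdd_total = total_cost(CAPACITY_TB, HDD_PRICE_PER_TB, HDD_WATTS_PER_TB, YEARS, USD_PER_KWH)
ssd_total = total_cost(CAPACITY_TB, SSD_PRICE_PER_TB, SSD_WATTS_PER_TB, YEARS, USD_PER_KWH)

# Split the difference into its two parts: up-front premium vs. power savings.
upfront_premium = CAPACITY_TB * (SSD_PRICE_PER_TB - HDD_PRICE_PER_TB)
power_savings = (ssd_total - hdd_total - upfront_premium) * -1

print(f"HDD total: ${hdd_total:,.0f}")
print(f"SSD total: ${ssd_total:,.0f}")
print(f"SSD up-front premium: ${upfront_premium:,.0f}")
print(f"SSD power savings over {YEARS} years: ${power_savings:,.0f}")
```

With these placeholder inputs the power savings are a small fraction of the up-front premium, which is the shape of the argument in point 2. Different assumed prices or power draws would shift the numbers, so plug in real quotes before drawing conclusions.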
Re: (Score:2)
Idle power for HDDs can be very low if the spindles are not kept spinning. For training that takes many days, occasional HDD spin-up times of 10 seconds may be acceptable. In that case, it's basically only the HDD electronics board that is powered. And when comparing similar capacities, multiple SSDs are needed to equal the capacity of one HDD, so it's not clear that in idle mode SSDs consume less power than HDDs.
Re: (Score:2)
They are always kept spinning during training. If the server has enough disk cache, they will only be hit when written to. However, a 320GB model being trained might still be using 16 20TB drives for the actual data being trained on, if they're keeping it around and not just ripping it from servers on the fly.
Re: (Score:2)
The most effective system for many tasks is HDD + SSD cache. And that's also quite cost effective. And you can scale it dynamically. The HDD is still too small? Add another SSD as cache. It's fast enough? Just keep with one.
And I don't think power is a factor at all. A server PSU probably loses more energy by not being 100% efficient than the hard drive needs. For a company investing a huge amount of energy in GPU computing that's the least concern. If they care at all, then because each of the watt
Re: (Score:2)
Nah. Most models in use are still only about 500GB tops. That's perfectly suitable for an SSD to load for inference.
The problem is training, because checkpoints are 500GB per iteration. So you may write a checkpoint every X iterations and take the best one, so you aren't doing this with a 2TB SSD. The end result is that if you are trying to get the best model, you can't just set it and forget it, you have to set it and check it, and if you're throwing away 10 checkpoints without checking you might overs
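The checkpoint arithmetic in the comment above can be sketched as follows. The 500GB checkpoint size and the 2TB SSD come from the comment; the iteration count and save interval are hypothetical values picked only to show how quickly retained checkpoints add up.

```python
CHECKPOINT_GB = 500        # from the comment above
SSD_GB = 2000              # the 2TB SSD from the comment above
TOTAL_ITERATIONS = 100_000 # hypothetical training length
SAVE_EVERY = 1_000         # hypothetical "every X iterations"

# Number of checkpoints written if none are thrown away.
checkpoints_kept = TOTAL_ITERATIONS // SAVE_EVERY

# Storage needed to retain all of them, in decimal terabytes.
storage_tb = checkpoints_kept * CHECKPOINT_GB / 1000

# How many checkpoints fit on the single SSD.
fit_on_ssd = SSD_GB // CHECKPOINT_GB

print(f"{checkpoints_kept} checkpoints need {storage_tb:.0f} TB")
print(f"A {SSD_GB} GB SSD holds {fit_on_ssd} checkpoints")
```

Under these assumptions, keeping every checkpoint for later comparison needs tens of terabytes, while the 2TB SSD holds only a handful, which is why the checkpoints end up on a larger, cheaper tier.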
Re: (Score:2)
combining multiple physical storage devices into one virtual device has been a thing for a long time now.
Apart from the price tag: You would need many more ports for connecting SSDs on the server than you would need for HDDs with a higher capacity each.
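The port-count argument is a ceiling division, sketched below. The 20TB HDD size appears elsewhere in this thread; the 8TB per-SSD capacity and the 1PB target are hypothetical assumptions, and the conclusion flips if the SSDs used are larger than the HDDs.

```python
import math


def devices_needed(target_tb, device_tb):
    """Smallest number of devices whose combined capacity reaches target_tb."""
    return math.ceil(target_tb / device_tb)


TARGET_TB = 1000   # hypothetical 1 PB pool
HDD_TB = 20        # 20TB drives, as mentioned elsewhere in the thread
SSD_TB = 8         # hypothetical per-SSD capacity

hdd_ports = devices_needed(TARGET_TB, HDD_TB)
ssd_ports = devices_needed(TARGET_TB, SSD_TB)

print(f"HDDs (and ports) needed: {hdd_ports}")
print(f"SSDs (and ports) needed: {ssd_ports}")
```

This ignores redundancy (RAID or erasure-coding overhead) and assumes one port per device, so treat it as a lower bound on the port count for either option.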
Not for long (Score:1)
"developing products"? (Score:2)
While I welcome our cheaper but slower SSD overlords, isn't the storage the price driver?
Would 4TB of NAND (or whatever) care what you wrap around it?
More production usually means low prices (Score:2)
But apparently Western Digital also wants its share of the AI pie by increasing its prices.
Re:More production usually means low prices (Score:4, Insightful)
Yes, everything looks like a conspiracy to the short-sighted. The reality is there was zero reason to expand HDD production at a time when more and more data storage requirements trended towards SSDs. Sales of HDDs have plummeted since their peak in 2010, and it sure as heck isn't worth attempting to predict a bubble 3 years out and investing $1bn to expand HDD production because maybe some AI techbros will briefly hoover up data before their industry implodes.
Re: (Score:2)
EAMR also seems to be such a hack. I wonder if they will find a better method, or if SSDs will win the price/value race before that.
Re: (Score:2)
EAMR seems also to be such a hack.
Everything we've ever created is "such a hack". It's the application of physics in ways to solve a problem. EAMR is no more a hack than changing the magnetic head orientation is. Or changing the size of the write head, or the material of the platter, or making heads aerodynamic. It's just engineering.
Re: (Score:2)
Maybe. Still, it "feels" like heating up the surface is a rather crude workaround, and wouldn't it increase wear and tear?
Re: (Score:2)
There's nothing to wear in the underlying material. The heat here allows for changes in how magnetic fields respond, and the temperature effect on magnetic structure is well understood. The only thing new here is how incredibly miniature it is and how they found a way to focus the heat into the required spot.
It's not a hack, it's R&D, science and engineering and something that has been actively researched in spindle HDDs for 20 years now, and has been the subject of research for nearly 70 years. Check
Re: (Score:2)
Fair point, but one can still question some parts of it. I'd believe you that it may not wear down the material of the platter. But if it has been in research for 20 years or more, does that mean it is ready yet? And a laser or microwave device is another part that can die. I'd hope one would still be able to read the disk, but if not, there is no way you can just put a new laser diode in there. Also interesting about the laser disk, I thought of it more like being written similar to a CD. Using more platters I wo
Oh good (Score:2)
Wait for the cheap used drives (Score:2)
When the next generation of larger drives comes, you can buy the (then) medium-sized drives of the LLM companies used. And as the usage pattern is probably WORM, they could still be in quite good shape.
Huawei is closer to 2.5x the price (Score:2)