Cerebras' Wafer-Size Chip Is 10,000 Times Faster Than a GPU (venturebeat.com)
An anonymous reader quotes a report from VentureBeat: Cerebras Systems and the federal Department of Energy's National Energy Technology Laboratory today announced that the company's CS-1 system is more than 10,000 times faster than a graphics processing unit (GPU). On a practical level, this means AI neural networks that previously took months to train can now train in minutes on the Cerebras system.
Cerebras makes the world's largest computer chip, the WSE. Chipmakers normally slice a wafer from a 12-inch-diameter ingot of silicon to process in a chip factory. Once processed, the wafer is sliced into hundreds of separate chips that can be used in electronic hardware. But Cerebras, started by SeaMicro founder Andrew Feldman, takes that wafer and makes a single, massive chip out of it. Each piece of the chip, dubbed a core, is interconnected in a sophisticated way to other cores. The interconnections are designed to keep all the cores functioning at high speeds so the transistors can work together as one. [...] A single Cerebras CS-1 is 26 inches tall, fits in one-third of a rack, and is powered by the industry's only wafer-scale processing engine, Cerebras' WSE. It combines memory performance with massive bandwidth, low latency interprocessor communication, and an architecture optimized for high bandwidth computing.
Cerebras's CS-1 system uses the WSE wafer-size chip, which has 1.2 trillion transistors, the basic on-off electronic switches that are the building blocks of silicon chips. Intel's first 4004 processor in 1971 had 2,300 transistors, and the Nvidia A100 80GB chip, announced yesterday, has 54 billion transistors. Feldman said in an interview with VentureBeat that the CS-1 was also 200 times faster than the Joule Supercomputer, which is No. 82 on a list of the top 500 supercomputers in the world. [...] In this demo, the Joule Supercomputer used 16,384 cores, and the Cerebras computer was 200 times faster, according to energy lab director Brian Anderson. The CS-1 costs several million dollars and uses 20 kilowatts of power.
Powering the chip (Score:4, Funny)
So how is it powered? 1.2V at 800A? :)
Re: (Score:3)
At any rate, wouldn't want to be "that guy" who goofed up clean room procedures and ruined the whole thing with a smidgen of dandruff.
Re:Powering the chip (Score:4, Informative)
1 million copper posts.
https://spectrum.ieee.org/semi... [ieee.org]
Re:Powering the chip (Score:5, Informative)
I suppose the scale of the supporting equipment is surprisingly small, considering the amount of material dedicated to powering/cooling/interfacing/mounting a conventional processor die.
Re: (Score:2)
Well, what's shown here is not the cooling, just the heat-transfer mechanism. Using water is a great option there, but you still need a radiator able to dissipate 20 kW somewhere, and that's not shown in the picture. Water cooling scales really well thanks to its high thermal mass: for a typical 300 W GPU you need a copper block half a centimeter thick. The heat per square centimeter coming off this thing isn't much more than a typical GPU's, so you just need to make the block longer and wider, and keep the flow high enough that the water doesn't get too hot.
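Back-of-the-envelope in Python (the 10 K coolant temperature rise is my assumption, not a Cerebras spec):

power_w = 20_000       # heat load (W), per the article
delta_t_k = 10.0       # allowed coolant temperature rise (K), assumed
c_p = 4186.0           # specific heat of water, J/(kg*K)
mass_flow = power_w / (c_p * delta_t_k)   # kg/s of water needed
print(f"{mass_flow:.2f} kg/s, about {mass_flow * 60:.0f} L/min")
# -> roughly 0.48 kg/s (~29 L/min): a lot for a PC loop, trivial for rack plumbing.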
Re: (Score:2)
On the topic of cooling the giant die, you also have to pay careful attention to stress from thermal expansion/contraction, so water cooling seems like the only sane option (able to regulate flow to numerous points if desired, or, as you said, just keep the overall flow high).
Re: (Score:2)
Holy smokes, wow. The engineering challenge to deliver that much current must have been incredible.
Re: (Score:2)
Damn, you beat me to it.
Re:Powering the chip (Score:4, Insightful)
large etched aluminum rails running about the place.
Aluminum is the wrong material to use.
Copper has nearly twice the electrical conductivity and will also do a better job of transferring heat out.
Gold or silver would be even better.
Compared to the cost of fabbing this monster, the cost of the precious metals would be minimal.
Re: (Score:2)
They probably didn't think of that. How foolish of them.
Re: (Score:2)
They probably didn't think of that. How foolish of them.
Actually, they did think of it. They don't use aluminum.
Re: (Score:2)
Gold or silver would be even better.
Silver would be better than copper for both electrical and thermal conductivity, but gold isn't as conductive as copper by either measure, and is closer to aluminum than it is to copper for electrical conductivity. Gold's main benefit in electronics is its inertness.
Re: (Score:3)
And also its softness, which makes it easier to bond. You can use palladium-coated copper to get corrosion protection, but it's still harder than gold. However, it is cheaper.
The die bonding wires have more conductivity than the semiconductor paths, so it hardly seems to matter what you use in those terms. What matters most is the cost vs. the percentage of failed bonds. We use gold because it's easy and reliable; we use other things when they are cheaper, and we have made strides in making those other materials more reliable.
Re: (Score:2)
Aluminum is the wrong material to use.
It might seem that way, but my understanding is that copper has a tendency to spoil the electrical characteristics of silicon, so aluminum is used. The resources from which I learned this are old, however, so I'd be interested to know if this problem has ever been solved.
Re: (Score:2)
A thin layer of copper silicide [wikipedia.org] can act as a diffusion barrier.
Re: (Score:1)
I'm not seeing how your calculation adds up to the 20,000 watts of power that this 1-foot diameter chip uses.
Re: (Score:3)
20,000 watts of power that this 1-foot diameter chip uses.
A typical electric stove burner uses 1500 watts. So this uses 13 times that amount of power.
Holy cow.
Re: (Score:2)
I was thinking along the lines of 20 space heaters at medium setting, but that works too. Either way, it's got to have effective cooling or it's going to fry quickly with that much heat inside that small space. It's challenging, but not ridiculous and indeed TFA has a picture with the caption: "The cooling system takes up most of the CS-1. The WSE chip is in the back left corner. "
Re: (Score:3)
No. 0.8V at 20,000A. It takes quite an exotic interconnect.
https://spectrum.ieee.org/semi... [ieee.org]
Re: (Score:2)
0.8V at 20,000A, apparently:
https://spectrum.ieee.org/semiconductors/processors/cerebrass-giant-chip-will-smash-deep-learnings-speed-barrier
The WSE’s 1.2 trillion transistors are designed to operate at about 0.8 volts, pretty standard for a processor. There are so many of them, though, that in all they need 20,000 amperes of current. “Getting 20,000 amps into the wafer without significant voltage drop is quite an engineering challenge—much harder than cooling it or addressing the yield problems,” says Lauterbach.
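To see why, a toy IR-drop calculation in Python (the path resistance is an illustrative assumption, not a measured Cerebras figure):

current_a = 20_000     # total supply current (A), per the article
v_core = 0.8           # core voltage (V), per the article
r_path = 5e-6          # assumed effective power-delivery resistance (ohms)
v_drop = current_a * r_path
print(f"IR drop: {v_drop * 1000:.0f} mV ({v_drop / v_core:.1%} of the 0.8 V rail)")
# Even 5 microohms costs 100 mV, 12.5% of the rail, which is why the wafer
# is fed vertically through roughly a million parallel copper posts.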
Don't put it in the wrong way (Score:1)
I'm not sure you want to spend a few million dollars on a single-chip system. Besides the single point of failure, even if you have the money, supercomputer designers don't just forklift the entire system every few years.
Most designs simply cycle individual nodes out on a schedule for faster ones, and can likewise add nodes if the workload actually calls for it. This system will also need special programming and will only be useful for a single type of workload. Not very useful for most markets that need general-purpose compute.
Re:Don't put it in the wrong way (Score:4, Insightful)
single point of failure
Individual cores can be disabled.
This system will also need special programming
So does a TPU. But all the gnarly stuff is wrapped up in Python libraries. The user model is dead-simple. My 16-year-old son is using cloud-based TPUs for a high school science project.
only be useful for a single type of workload.
Yup, it is only useful for a single market ... worth $100 billion (and growing).
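For what it's worth, the cloud-TPU user model really is a few lines of Python. A minimal TensorFlow sketch, assuming a Colab-style environment where TPUClusterResolver(tpu="") can find the TPU (the model is a throwaway placeholder):

import tensorflow as tf

# Attach to the TPU and build a model under its distribution strategy.
resolver = tf.distribute.cluster_resolver.TPUClusterResolver(tpu="")
tf.config.experimental_connect_to_cluster(resolver)
tf.tpu.experimental.initialize_tpu_system(resolver)
strategy = tf.distribute.TPUStrategy(resolver)

with strategy.scope():  # layers created here are replicated across TPU cores
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(128, activation="relu"),
        tf.keras.layers.Dense(10),
    ])
    model.compile(optimizer="adam",
                  loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True))
# model.fit(...) then runs on the TPU with no further changes.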
Re: (Score:2)
If you can't spend a million dollars on a dedicated neural network accelerator, it's probably not for you.
This thing's targeted at big laboratories. Like OpenAI trying to figure out how to get around the projected billions of dollars of GPU time that's required to get GPT-4 up. THOSE people are the target market.
Re: (Score:2)
More distributed, less powerful but specially designed nodes is where the market is going right now.
Except if you want to test more models than distributed, less powerful machines will allow. If this lets you test 20 models in the time it took to do one before, you may gain an advantage. There may not be a huge market for this, but I can see enough demand from a number of industries that they will probably sell quite a few.
Yields? (Score:4, Interesting)
Re: (Score:3)
If the chip were split up into 100 scalable components that can be fused off, then yield issues in a region turn into a 1% hit on performance. More likely, the repeated logic is replicated at several more orders of magnitude than that.
Still, yield is an issue at this scale. You could take a hit where a chip is not viable. If some bus is not redundant enough, or if there is some centralized clock logic (PLLs?) or I/O that can't be handled redundantly, then you lose a whole wafer. That adds up to some pretty massive losses.
Re: (Score:2)
If you're using 5N-purity silicon, you'll get the same yields as with chips.
If you're using regular silicon, wafer scale integration (WSI) typically gives you 30-50%.
Re: (Score:2)
The article from IEEE mentioned being able to deal with a 1% failure rate on the cores, not sure if that's what they are actually seeing.
Where are the gains coming from? (Score:5, Interesting)
One answer might be defect rates. AMD sees something like a 6.5% defect rate on its 8-core chips, but with the equivalent of 400-800 such chips per wafer, the probability of a defect somewhere must geometrically approach 100% [wolframalpha.com], unless Cerebras has some ultra-low-defect process or some means of dynamic compartmentalization.
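One way to put numbers on it is the classic Poisson yield model, Y = exp(-D*A), where D is defect density and A is die area. A Python sketch, back-calculating D*A from the quoted 6.5% figure (treating that figure as a per-die failure probability is my assumption):

import math

y_die = 1 - 0.065                 # 8-core die yield implied by the 6.5% figure
da = -math.log(y_die)             # expected defects per die-sized area
for n in (1, 100, 400, 800):
    print(f"{n:3d} dies of area: monolithic yield = {math.exp(-da * n):.2e}")
# 400 dies of area yields ~2e-12 (effectively never), so a monolithic
# wafer-scale chip is only viable with redundancy and fusible routing.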
Re: (Score:2)
Each piece of the chip, dubbed a core, is interconnected in a sophisticated way to other cores. The interconnections are designed to keep all the cores functioning at high speeds so the transistors can work together as one
so I'm guessing that they create a lot of cores per wafer, with fusible interconnects between them. When they find a defect in a core, they remove it from the wafer via the fusible interconnects. Since the interconnects themselves could have flaws, presumably there are ways to work around that as well (hence the adjective 'sophisticated').
Re:Where are the gains coming from? (Score:4, Informative)
https://spectrum.ieee.org/semi... [ieee.org]
If your chip is nearly the size of a sheet of letter-size paper, then you’re pretty much asking for it to have defects.
But Lauterbach saw an architectural solution: Because the workload they were targeting favors having thousands of small, identical cores, it was possible to fit in enough redundant cores to account for the defect-induced failure of even 1 percent of them and still have a very powerful, very large chip.
I'm still confused how this innovation wasn't figured out by IBM, ARM, etc. Oh well, good for Cerebras. Hoping for a chip cheaper than seven figures down the road!
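The spare-core sizing behind that quote is a simple binomial exercise. A sketch with the article's numbers (the 4-sigma margin is my choice, not Cerebras's):

import math

n, p = 400_000, 0.01                 # cores and tolerated defect rate, per the article
mean = n * p                         # expected bad cores: 4,000
sigma = math.sqrt(n * p * (1 - p))   # ~63, normal approximation to the binomial
spares = math.ceil(mean + 4 * sigma) # provision mean plus a 4-sigma margin
print(f"{spares} spare cores ({spares / n:.2%}) cover ~99.997% of wafers")
# Provisioning barely over 1% of the cores as spares absorbs the defects.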
Re: (Score:2)
I guess it was figured out.
But neither ARM nor IBM is into "neural network processors" at the moment.
Re: (Score:2)
But neither ARM nor IBM is into "neural network processors" at the moment.
The big three are Google, Tesla, and Nvidia.
Re: (Score:2)
Can you explain why you would want a chip like this at all? Why not just use a load of smaller chips?
Seems like having one massive chip brings a lot of problems (like powering and cooling it) so there must be some very specific workloads that benefit from physically close integration, and which is still somehow cheaper than having many smaller parts doing the same work.
Re: (Score:3)
There are 400,000 cores on this chip. Let's say you can pack 16 of these cores into one chip (Amazon is only putting 4 cores into their ML chips) ... that's still 25,000 chips. Let's say you can put 8 chips into one server, that's 3,125 servers. Let's say you can put those servers into 1U cases. Most racks can accommodate 42U. That's 75 racks for what they're getting done in 1/3 rack.
I invented some of these numbers, namely the 16 cores, the 8 chips, and the 1U. Odds are that the 16 and 8 would be smaller, which would push the rack count even higher.
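Spelling the parent's arithmetic out in Python (the 16-core, 8-chip, and 1U figures are the parent's guesses, not vendor specs):

import math

total_cores = 400_000
chips = total_cores / 16          # assumed 16 cores per chip -> 25,000 chips
servers = chips / 8               # assumed 8 chips per 1U server -> 3,125 servers
racks = math.ceil(servers / 42)   # 42U per rack -> 75 racks
print(f"{chips:,.0f} chips, {servers:,.0f} servers, {racks} racks vs. 1/3 of a rack for one CS-1")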
Re: (Score:2)
I think your numbers must be a bit off; it's wafer-scale, so that would be around 500 mid-range x86 CPUs' worth of die area. If they have 400,000 cores in the same space, they must be vastly simpler cores than mid-range x86 CPUs.
This suggests that you would see around 500 cores per chip. They will need more space, but perhaps not that much more, as this thing needs interconnects and power delivery as well.
They must be targeting a very, very specific and specialist application to make this worthwhile, and I wonder how big that market really is.
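A quick area sanity check (the 215 mm x 215 mm WSE die size is the commonly reported figure; the 100 mm^2 mid-range x86 die is my assumption):

wse_mm2 = 215 * 215               # reported WSE die: ~46,225 mm^2
cores = 400_000
x86_die_mm2 = 100                 # assumed mid-range x86 die area
print(f"per-core area: {wse_mm2 / cores:.3f} mm^2")
print(f"equivalent x86 dies: {wse_mm2 / x86_die_mm2:.0f}")
# ~0.116 mm^2 per core: these are tiny message-passing AI cores, nowhere
# near x86-class, on roughly 460 mid-range dies' worth of silicon.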
Re: (Score:2)
I wonder if, in this successful part, massive interconnect bandwidth is what makes the difference.
Re: (Score:2)
But I mean what was it about the application that meant the interconnect was so critical and that having a possibly larger number of independent (and much easier to make) cores would have been inferior?
What did you end up using?
Re: (Score:2)
It was. Sir Clive Sinclair even wrote a paper on it for New Scientist in the 1980s. It's not a new idea. It's just a new implementation.
Re: (Score:2)
I'm still confused how this innovation wasn't figured out by IBM, ARM, etc.
Maybe the heat? I can't find any indication of the clock frequency; these Cerebras cores might be running at a lower frequency (which might still beat a regular processor; the brain runs at an extremely low frequency).
Re: (Score:3)
I'm still confused how this innovation wasn't figured out by IBM, ARM, etc.
Because it's not innovation. It's a simple engineering problem. The innovation here is finding a customer willing to spend 7 figures on a single CPU.
GPU companies have been using this approach for decades. The RTX 3090 has 10,496 cores. The question is how much scaling is worth for your business model, given the number of potential customers out there.
Re: (Score:2)
I thought those many GPU cores were first cut from their wafers and then reassembled?
Re: (Score:2)
It was well known that the larger the chip, the lower the yield. It's why shrinking a chip makes it exponentially cheaper. Chips that are area-driven, like memory often have to have mitigations to handle the fact that the larger the die, the greater chance of a fault. Flash memory and camera chips combat this by declaring you can have bad blocks and bad pixels in the sensor array, just deal with it.
Also remember that the density of transistors matters too.
Re: (Score:2)
Sir Clive Sinclair developed the fundamental technique in the 1980s. An actual good idea from him.
Re: (Score:1)
Well, the $2.5 million that it costs has to be for something.
Simple.. (Score:1)
Faster than *a* GPU, rather than the 100-200 GPUs that could come from the same wafer, AND it's a dedicated AI accelerator, while AI is only a small fraction of what a GPU spends silicon on.
So, it would be surprising if it was *not* so much faster.
I am sure, however, it also has its limitations: algorithms that map better to a cluster of GPUs than to their architecture...
None of this should be a surprise, including the marketing hype.
Re: (Score:2)
the interconnect between GPUs is a major bottleneck. And for really big scales you need some specialized chips just to provide a switch fabric between many GPUs.
Re: (Score:2)
the interconnect between GPUs is a major bottleneck. And for really big scales you need some specialized chips just to provide a switch fabric between many GPUs.
Depends what you're doing with the GPUs. The Bitcoin-mining motherboards, for example (https://www.techradar.com/uk/news/best-mining-motherboards), have a load of PCIe x1 slots. You ship a small amount of data to a GPU, crunch for a while, and get a small amount of data back.
Deep learning isn't quite that easy, because the amounts of data tend to be much larger.
Re: (Score:2)
That chip has thousands of cores.
The defective ones are simply disabled, and some smart routing avoids them.
Re: (Score:3)
I suspect a large part of the speed-up is from their processor being optimized for sparse matrices: they aggregate nonzero weights during matrix multiplication and only do floating-point arithmetic on those.
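A toy version of that idea in Python/NumPy (the scheme is illustrative, not Cerebras's actual dataflow):

import numpy as np

rng = np.random.default_rng(0)
w = rng.standard_normal((512, 512))
w[rng.random(w.shape) < 0.9] = 0.0           # ~90% sparse weight matrix
x = rng.standard_normal(512)

rows, cols = np.nonzero(w)                   # coordinates of nonzero weights
y = np.zeros(512)
np.add.at(y, rows, w[rows, cols] * x[cols])  # multiply-accumulate nonzeros only

assert np.allclose(y, w @ x)                 # same answer as the dense product
print(f"FLOPs performed: {rows.size} of {w.size} ({rows.size / w.size:.0%})")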
1 trillion perfect transistors? (Score:2)
So they must be wiring these things in some sort of redundant way because it seems unlikely one gets a trillion perfect transistors. I wonder how they do that?
Re: (Score:2)
Lots of latency is due to distance. When using WSI, your distances are cut to about 1/1,000th. That's a dramatic latency reduction right there.
Then there are the interconnects themselves. When you go off-chip, you're going via solder, via gold, via more solder, via whatever the pins are soldered to (aluminium?), via the tracks on the PCB, and back up into the next chip. All those connections produce signal loss. And that's if you're going to just the next chip; if you're talking about going from memory through all the support chips to another processor, the hops multiply.
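The distance arithmetic is easy to check (the 3 GHz clock, signal velocity, and hop lengths are all assumptions):

c = 3.0e8                    # m/s in vacuum
v = 0.5 * c                  # rough on-chip/board signal velocity, assumed
tick = 1 / 3.0e9             # one cycle of an assumed 3 GHz clock: ~333 ps
print(f"signal travel per tick: {v * tick * 100:.0f} cm")   # ~5 cm
print(f"1 mm on-wafer hop: {1e-3 / v / tick:.2f} ticks")    # ~0.02 ticks
print(f"10 cm board-level hop: {0.1 / v / tick:.1f} ticks") # ~2.0 ticks
# A neighbor-core hop is a rounding error; a board-level round trip costs
# whole cycles before counting SerDes and protocol overhead.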
Re: (Score:2)
I'm really quite excited by this, if it's reality instead of hype. nVidia is charging extortionate rates for their compute gear and they need a boot up the rear from some good competition.
Re: (Score:2)
Here is some interesting info:
https://community.cadence.com/... [cadence.com]
https://community.cadence.com/... [cadence.com]
Re: (Score:2)
1) purpose built hardware is significantly faster than generalized.
2) MASSIVE bottlenecks occur on interconnects
3) speed of light has been a factor for almost half a century. distance matters the more you push the limits. how far does light travel in 1 clock tick now? latency matters in some operations.
4) RAM... I'd suspect they have designed in just the right amount of on-board RAM for their needs
5) a single chip with enough functioning cores... one reason it costs so much is likely because they produce so few wafers with enough working cores.
Re: (Score:2)
why is it not the standard?
From the summary: "Cerebras costs several million dollars"
But will it run Crysis? (Score:2)
Re: (Score:2)
Anybody who says they are not a racist is in denial and lying.
Race is a social construct without a scientific basis.
Can't be racist if you don't believe in race.
You can believe that people with certain characteristics are more likely to do x without being racist, so long as your reason is that they are in a different situation from other people, and not for genetic reasons.
25x the silicon, 400x the speed?! (Score:2)
This sounds like an IBM-Watson type of marketing division going crazy with the numbers.
Re: (Score:2)
Latency is your biggest issue with speed. That's why scale matters so much. Your second killer is noise. The third big killer is buffering/synchronization. Finally, there's memory speed. Here, distances are slashed like anything, much less room for noise, and you can avoid most of the problems that plague CPU daughterboards.
Re: (Score:2)
400x the power...
How many of the cores are defective? (Score:2)
OK, so, how many of the cores are defective and deactivated on that huge wafer?
Take your time. I'll hang up and listen for your answer.
Re: (Score:2)
Depends on the grade of silicon, now that 5N-purity silicon is cheap.
If they're using high-end, 1% or less.
If they're using typical, might be 30%.
Re: (Score:2)
According to this article, " it was possible to fit in enough redundant cores to account for the defect-induced failure of even 1 percent of them and still have a very powerful, very large chip".
https://spectrum.ieee.org/semi... [ieee.org]
Re: (Score:2)
Yep, brain-farted on that one in the article.
Core type? (Score:1)
I read TFA. It does not seem to mention the CPU instruction set: x86, Arm, Power, etc.? Sounds like a Microsoft JEDI-project thing, along with the recent /. article on a secure core on wafer with CPU. And the Apple M1 chip.
Yay? (Score:2)
Cerebras' Wafer-Size Chip Is 10,000 Times Faster Than a GPU ... On a practical level, this means AI neural networks that previously took months to train can now train in minutes on the Cerebras system.
So... Skynet [wikipedia.org] or Colossus [wikipedia.org] can come 10,000 times faster now -- hurray?
[ Threw that second one in for the youngsters. :-) ]
Re: (Score:2)
Translated:
Instead of taking 51,000 seconds to simulate 1 second of activity on 100,000 neurons, it will now take only 5.1 seconds. So 1/5th the speed of the human brain.
I'd love to see Terminator re-worked to this speed.
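Spelled out (the 51,000 s baseline is the parent's figure, applied to the headline 10,000x claim):

baseline_s = 51_000          # GPU time to simulate 1 s of 100k-neuron activity
t = baseline_s / 10_000      # headline speedup -> seconds per simulated second
print(f"{t:.1f} s per simulated second = {1 / t:.2f}x real time")
# -> 5.1 s, i.e. 0.20x: about 1/5th the speed of that 100k-neuron slice of brain.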
Re: (Score:2)
So... Skynet or Colossus can come 10,000 times faster now -- hurray?
I presume you meant "sooner" and not "faster". Although, you know, rule 34.
Re: (Score:2)
So... Skynet or Colossus can come 10,000 times faster now -- hurray?
I presume you meant "sooner" and not "faster". Although, you know, rule 34.
Right, and with "fewer" not "less" distractions. :-)
#Thanks-for-the-grammer-slam-I-should-know-better-my-wife-was-an-English-teacher
(For old time's sake) (Score:2)
Can you imagine a Beowulf cluster of these?
Re: (Score:2)
I came to the comments to post this. And I woke up my four-digit slashdot ID after about a five-year sleep...
Wait, wait, let him finish ... (Score:3, Funny)
... 10,000 times faster than a GPU from 1996.
Faster than at doing what? (Score:4, Insightful)
What is it that the Cerebras wafer is fast at doing? It certainly wasn't MLPerf, as the company didn't submit any results, saying, "We did not spend one minute working on MLPerf, we spent time working on real customers." Of course, such bravado would be more impressive with a substantial list of customers.
Yes, there are a lot of reasons to criticize benchmarks and benchmark results. However, there's a lot more to criticize in the opacity that comes with a lack of any comparative benchmarks. Yes, the component counts are impressive. But projected performance based on component counts is theoretical; that's why the Top500 list differentiates between Rmax and Rpeak numbers. You'd think that an industry upstart with a true market-changing product would be eager to demonstrate how much faster its system is. Perhaps Cerebras has real systems that demonstrate real performance to its secret customers. However, guarding those results from the public just seems a tad fishy.
Re: (Score:2)
I'd never even heard of MLPerf until today. I work with DL and have shipped stuff with it.
This CPU is wicked smaaht (Score:2)
It can calculate the asteroid's moment of impact, while still in the factory.
It can figure out the crypto-wallet password of that recently deceased startup billionaire before the hooker calls in to report a suicide.
Wafer Scale Integration (WSI) (Score:2)
It has been tried many times since the 1970s, '80s, and '90s. There is a good Wikipedia article about it. Even Clive Sinclair's company tried to make it work.
Well done them if they have made this work. Personally I'd want a Wafer Scale SSD!
Beowulf Cluster (Score:5, Insightful)
Seriously, a super computer on a single chip, and no one has suggested combining them together? What is Slashdot these days...
Re: (Score:3)
Seriously, a super computer on a single chip, and no one has suggested combining them together? What is Slashdot these days...
It already *IS* the Beowulf cluster. Putting a cluster of these together would be a Beowulf clusterfuck.
Re: (Score:2)
Seriously, a super computer on a single chip, and no one has suggested combining them together? What is Slashdot these days...
But does it run Linux? And did NetCraft confirm it??
Must have a fantastic yield (Score:2)
I guess the price tag for the successful parts just absorbs the cost of all the ones that didn't make it.
Or in computing terms... (Score:2)
That's cool (Score:2)
Cerebras' Chip Is 10,000 Times Slower Than GPU at (Score:3)
Million monkeys (Score:3)
A million monkeys, in principle, will write Shakespeare's plays a million times faster than one monkey.
Actual performance of the device depends on more than just the sum of processing power. Massively parallel processing has issues with interprocessor communications.
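Amdahl's law makes the same point; a sketch in Python (the parallel fractions are assumed for illustration):

def amdahl(n_cores: int, parallel_frac: float) -> float:
    """Ideal speedup when only parallel_frac of the work parallelizes."""
    return 1 / ((1 - parallel_frac) + parallel_frac / n_cores)

for pf in (0.99, 0.999, 0.9999):
    print(f"{pf:.2%} parallel: {amdahl(400_000, pf):,.0f}x on 400,000 cores")
# Even 99.99% parallel work tops out near 10,000x; the serial and
# communication fraction, not the core count, sets the ceiling.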