Graphics Hardware Technology

Cerebras' Wafer-Size Chip Is 10,000 Times Faster Than a GPU (venturebeat.com)

An anonymous reader quotes a report from VentureBeat: Cerebras Systems and the federal Department of Energy's National Energy Technology Laboratory today announced that the company's CS-1 system is more than 10,000 times faster than a graphics processing unit (GPU). On a practical level, this means AI neural networks that previously took months to train can now train in minutes on the Cerebras system.

Cerebras makes the world's largest computer chip, the WSE. Chipmakers normally slice a wafer from a 12-inch-diameter ingot of silicon to process in a chip factory. Once processed, the wafer is sliced into hundreds of separate chips that can be used in electronic hardware. But Cerebras, started by SeaMicro founder Andrew Feldman, takes that wafer and makes a single, massive chip out of it. Each piece of the chip, dubbed a core, is interconnected in a sophisticated way to other cores. The interconnections are designed to keep all the cores functioning at high speeds so the transistors can work together as one. [...] A single Cerebras CS-1 is 26 inches tall, fits in one-third of a rack, and is powered by the industry's only wafer-scale processing engine, Cerebras' WSE. It combines memory performance with massive bandwidth, low latency interprocessor communication, and an architecture optimized for high bandwidth computing.

Cerebras's CS-1 system uses the WSE wafer-size chip, which has 1.2 trillion transistors, the basic on-off electronic switches that are the building blocks of silicon chips. Intel's first 4004 processor in 1971 had 2,300 transistors, and the Nvidia A100 80GB chip, announced yesterday, has 54 billion transistors. Feldman said in an interview with VentureBeat that the CS-1 was also 200 times faster than the Joule Supercomputer, which is No. 82 on a list of the top 500 supercomputers in the world. [...] In this demo, the Joule Supercomputer used 16,384 cores, and the Cerebras computer was 200 times faster, according to energy lab director Brian Anderson. Cerebras costs several million dollars and uses 20 kilowatts of power.

This discussion has been archived. No new comments can be posted.

  • by dskoll ( 99328 ) on Wednesday November 18, 2020 @07:17PM (#60740852) Homepage

    So how is it powered? 1.2V at 800A? :)

    • Came to ponder the same thing, considering that modern desktop processors can exceed 100A... I would imagine they bond many power wires and/or have large etched aluminum rails running about the place. I would also not be surprised if it was intended to work with many discrete VRMs.

      At any rate, wouldn't want to be "that guy" who goofed up clean room procedures and ruined the whole thing with a smidgen of dandruff.
      • Re:Powering the chip (Score:4, Informative)

        by mlyle ( 148697 ) on Wednesday November 18, 2020 @10:34PM (#60741406)

        1 million copper posts.

        https://spectrum.ieee.org/semi... [ieee.org]

        • Re:Powering the chip (Score:5, Informative)

          by thegreatbob ( 693104 ) on Wednesday November 18, 2020 @10:50PM (#60741438) Journal
          That's really slick, thanks. Found a page with an image (I believe it's a render) of an assembled module: https://www.eetimes.com/poweri... [eetimes.com]

          I suppose the scale of the supporting equipment is surprisingly small, considering the amount of material dedicated to powering/cooling/interfacing/mounting a conventional processor die.
          • Well, what is shown here is not the cooling, just the heat transfer mechanism. Using water is a great option there, but you still need a radiator able to dissipate 20kW somewhere, and that's not shown in the picture. Watercooling scales really well due to its high thermal mass; for a typical 300W GPU you need a copper block half a cm thick. The heat per square cm coming off this thing isn't much more than a typical GPU so you just need to make the block longer, wider, and keep flow high enough that the water d

            • True, the complete cabinet is a lot more substantial (15U). Even if their performance claims are off by a factor of 10 (and limited in scope), still pretty impressive density.
            • And yeah, there's really not that much more on a modern GPU board than my old Radeon 9800... probably about twice as many RAM chips and VRM components, and 5x as many decoupling capacitors. Only reason it's huge is the beefy Socket A cooler I drilled/tapped/mounted on it.

              On the topic of cooling the giant die, also have to pay careful attention to stress from thermal expansion/contraction, so water cooling seems like the only sane option (able to regulate flow to numerous points if desired, or as you said,
        • by dskoll ( 99328 )

          Holy smokes, wow. The engineering challenge to deliver that much current must have been incredible.

        • by cusco ( 717999 )

          Damn, you beat me to it.

      • by ShanghaiBill ( 739463 ) on Wednesday November 18, 2020 @11:21PM (#60741476)

        large etched aluminum rails running about the place.

        Aluminum is the wrong material to use.

        Copper has nearly twice the electrical conductivity and will also do a better job of transferring heat out.

        Gold or silver would be even better.

        Compared to the cost of fabbing this monster, the cost of the precious metals would be minimal.

        • by troon ( 724114 )

          They probably didn't think of that. How foolish of them.

          • They probably didn't think of that. How foolish of them.

            Actually, they did think of it. They don't use aluminum.

        • Gold or silver would be even better.

          Silver would be better than copper for both electrical and thermal conductivity, but gold isn't as conductive as copper by either measure, and is closer to aluminum than it is to copper for electrical conductivity. Gold's main benefit in electronics is its inertness.

          • And also its softness, which makes it easier to bond. You can use palladium-coated copper to get corrosion protection, but it's still harder than gold. However, it is cheaper.

            The die bonding wires have more conductivity than the semiconductor paths, so it hardly seems to matter what you use in those terms. What matters most is the cost vs. the % of failed bonds. We use gold because it's easy and reliable, we use other things when they are cheaper. We have made strides in making using other things more relia

        • Aluminum is the wrong material to use.

          It might seem that way, but my understanding is that copper has a tendency to spoil the electrical characteristics of silicon, so aluminum is used. The resources from which I learned this are old, however, so I'd be interested to know if this problem has ever been solved.

    • by quall ( 1441799 )

      I'm not seeing how your calculation adds up to the 20,000 watts of power that this 1-foot diameter chip uses.

      • You read the whole summary, you cheater!
      • 20,000 watts of power that this 1-foot diameter chip uses.

        A typical electric stove burner uses 1500 watts. So this uses 13 times that amount of power.

        Holy cow.

        • I was thinking along the lines of 20 space heaters at medium setting, but that works too. Either way, it's got to have effective cooling or it's going to fry quickly with that much heat inside that small space. It's challenging, but not ridiculous and indeed TFA has a picture with the caption: "The cooling system takes up most of the CS-1. The WSE chip is in the back left corner. "

    • by mlyle ( 148697 )

      No. 0.8V at 20,000A. It takes quite an exotic interconnect.

      https://spectrum.ieee.org/semi... [ieee.org]

    • by eth1 ( 94901 )

      0.8V at 20,000A, apparently:

      https://spectrum.ieee.org/semiconductors/processors/cerebrass-giant-chip-will-smash-deep-learnings-speed-barrier

      The WSE’s 1.2 trillion transistors are designed to operate at about 0.8 volts, pretty standard for a processor. There are so many of them, though, that in all they need 20,000 amperes of current. “Getting 20,000 amps into the wafer without significant voltage drop is quite an engineering challenge—much harder than cooling it or addressing the yield problems,” says Lauterbach.
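A quick sanity check on the figures in this sub-thread, as a minimal sketch: the 0.8 V and 20,000 A values come from the IEEE Spectrum quote, the 20 kW system figure from the story summary, and the 1.2 V / 800 A pair is the original guess at the top of the thread. The function name and the idea of subtracting the wafer's draw from the system total are illustrative assumptions, not anything Cerebras has published.

```python
# Back-of-the-envelope check of the power figures quoted in the thread.
CORE_VOLTAGE_V = 0.8        # from the IEEE Spectrum quote
SUPPLY_CURRENT_A = 20_000   # from the IEEE Spectrum quote
SYSTEM_POWER_W = 20_000     # 20 kW, from the story summary


def dc_power_watts(voltage_v: float, current_a: float) -> float:
    """DC power delivered at a given voltage and current: P = V * I."""
    return voltage_v * current_a


wafer_w = dc_power_watts(CORE_VOLTAGE_V, SUPPLY_CURRENT_A)
guess_w = dc_power_watts(1.2, 800)       # the "1.2V at 800A" guess
overhead_w = SYSTEM_POWER_W - wafer_w    # assumed to be everything that is not the wafer

print(f"wafer:          {wafer_w / 1000:.1f} kW")
print(f"original guess: {guess_w / 1000:.2f} kW")
print(f"rest of system: {overhead_w / 1000:.1f} kW")
```

Under these numbers the wafer alone accounts for most of the quoted system power, and the original guess is short by more than an order of magnitude.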

  • Comment removed based on user account deletion
    • by ShanghaiBill ( 739463 ) on Wednesday November 18, 2020 @11:48PM (#60741546)

      single point of failure

      Individual cores can be disabled.

      This system will also need special programming

      So does a TPU. But all the gnarly stuff is wrapped up in Python libraries. The user model is dead-simple. My 16-year-old son is using cloud-based TPUs for a high school science project.

      only be useful for a single type of workload.

      Yup, it is only useful for a single market ... worth $100 billion (and growing).

    • If you can't spend a million dollars on a dedicated neural network accelerator, it's probably not for you.

      This thing's targeted at big laboratories. Like OpenAI trying to figure out how to get around the projected billions of dollars of GPU time that's required to get GPT4 up. THOSE people are the target market.

    • by b0bby ( 201198 )

      > More distributed, less powerful but specially designed nodes is where the market is going right now.

      Except if you want to test more models than distributed, less powerful machines will allow you. If this lets you test 20 models in the time it took you to do one before, you may gain an advantage. There may not be a huge market for this, but I can see that there's enough demand from a number of industries that they will probably sell quite a few.

  • Yields? (Score:4, Interesting)

    by AutodidactLabrat ( 3506801 ) on Wednesday November 18, 2020 @07:26PM (#60740888)
    1%? 0.01%? Has to be something like that unless every single cell is spared with programmable routing
    • If the chip were split up into 100 scalable components that can be fused off, then yield issues in a region turn into a 1% hit on performance. More likely the repeated random logic is present in many more orders of magnitude than that.

      Still, yield is an issue at this scale. You could take a hit where a chip is not viable. If some bus is not redundant enough, or if there is some centralized clock logic (PLLs?) or I/O that can't be handled redundantly, then you lose a whole wafer. Adds up to some pretty mass

    • by jd ( 1658 )

      If you're using 5N isotopically pure silicon, you'll get the same yields as with chips.
      If you're using regular silicon, wafer scale integration (WSI) typically gives you 30-50%.

    • by b0bby ( 201198 )

      The article from IEEE mentioned being able to deal with a 1% failure rate on the cores, not sure if that's what they are actually seeing.

  • by DavenH ( 1065780 ) on Wednesday November 18, 2020 @07:31PM (#60740898)
    Something smells fishy. 10000x performance, just by not slicing the silicon wafer into smaller components. If the interconnection gains by keeping everything on one superchip were 4 orders of magnitude better than splitting them up, why is it not the standard?

    One answer might be defect rates. AMD gets like 6.5% defect rates on its 8-core chips, but with the equivalent of 400-800 cores per wafer, that defect rate must geometrically approach 100% [wolframalpha.com], that is unless Cerebras has some ultra-low defect process, or some means of dynamic compartmentalization.
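The compounding argument above can be sketched in a few lines. This is a rough model only: it assumes defects in each chip-sized region are independent, treats the 6.5% figure quoted in the comment as a per-region defect probability, and uses a hypothetical helper name.

```python
def p_at_least_one_defect(per_region_defect_rate: float, regions: float) -> float:
    """Chance that at least one of `regions` independent regions has a defect."""
    return 1.0 - (1.0 - per_region_defect_rate) ** regions


PER_CHIP_DEFECT_RATE = 0.065  # figure quoted in the comment for 8-core chips
CORES_PER_CHIP = 8

results = {}
for cores_per_wafer in (400, 800):
    regions = cores_per_wafer / CORES_PER_CHIP
    results[cores_per_wafer] = p_at_least_one_defect(PER_CHIP_DEFECT_RATE, regions)
    print(f"{cores_per_wafer} cores ({regions:.0f} regions): "
          f"{results[cores_per_wafer]:.1%} chance of at least one defect")
```

With these inputs the probability of a flawless wafer is only a few percent at best, which is why a wafer-scale part needs some way to tolerate bad regions instead of requiring a perfect wafer.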

    • They say

      Each piece of the chip, dubbed a core, is interconnected in a sophisticated way to other cores. The interconnections are designed to keep all the cores functioning at high speeds so the transistors can work together as one

      so I'm guessing that they create a lot of cores per wafer, with fusible interconnects between them. When they find a defect in a core, they remove it from the wafer via the fusible interconnects. Since the interconnects themselves could have flaws, presumably there are ways to work around that as well (hence the adjective 'sophisticated').

      • by DavenH ( 1065780 ) on Wednesday November 18, 2020 @07:53PM (#60740976)
        Right. This is a far more technically satisfying article I found:

        https://spectrum.ieee.org/semi... [ieee.org]

        If your chip is nearly the size of a sheet of letter-size paper, then you’re pretty much asking for it to have defects.

        But Lauterbach saw an architectural solution: Because the workload they were targeting favors having thousands of small, identical cores, it was possible to fit in enough redundant cores to account for the defect-induced failure of even 1 percent of them and still have a very powerful, very large chip.

        I'm still confused how this innovation wasn't figured out by IBM, ARM, etc. Oh well, good for Cerebras. Hoping for a chip cheaper than seven figures down the road!

        • I wonder if it is very flexible to desired architecture of each neural net, or if such speedups are obtained by hardcoding a lot of stuff, so it couldn't efficiently handle e.g. sparse networks or only supports certain activation functions?
        • I guess it was figured out.
          But neither ARM nor IBM are into "neural network processors" at the moment.

          • But neither ARM nor IBM are into "neural network processors" at the moment.

            The big three are Google, Tesla, and Nvidia.

        • It has been tried numerous times. I imagine IBM tried it. Remember Trilogy? My company tried it, in fact I was the head design guy for the research part. I said it would not work (it did not) due to defect density. Triple redundant buses and modules that could be disabled. I imagine the thing that is changing from the 80's is defect densities are getting better. Good for them though. WSI brings all sorts of surprise challenges. Probably was a fun project.
          • by AmiMoJo ( 196126 )

            Can you explain why you would want a chip like this at all? Why not just use a load of smaller chips?

            Seems like having one massive chip brings a lot of problems (like powering and cooling it) so there must be some very specific workloads that benefit from physically close integration, and which is still somehow cheaper than having many smaller parts doing the same work.

            • There are 400,000 cores on this chip. Let's say you can pack 16 of these cores into one chip (Amazon is only putting 4 cores into their ML chips) ... that's still 25,000 chips. Let's say you can put 8 chips into one server, that's 3,125 servers. Let's say you can put those servers into 1U cases. Most racks can accommodate 42U. That's 75 racks for what they're getting done in 1/3 rack.

              I invented some of these numbers, namely the 16 cores, the 8 chips, and the 1U. Odds are that the 16 and 8 would be smaller,
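The rack estimate in this comment is easy to reproduce and to rerun with other guesses. A minimal sketch: the 400,000 cores, 16 cores per chip, 8 chips per server, 1U servers and 42U racks are the commenter's numbers (several of them self-described inventions), and `racks_needed` is a hypothetical helper.

```python
import math


def racks_needed(total_cores: int, cores_per_chip: int,
                 chips_per_server: int, servers_per_rack: int) -> tuple[int, int, int]:
    """Return (chips, servers, racks), rounding up at each packaging level."""
    chips = math.ceil(total_cores / cores_per_chip)
    servers = math.ceil(chips / chips_per_server)
    racks = math.ceil(servers / servers_per_rack)
    return chips, servers, racks


# The commenter's scenario: 16 cores per chip, 8 chips per 1U server, 42U racks.
base = racks_needed(400_000, 16, 8, 42)
# Same layout with the 4 cores per chip attributed to Amazon's ML chips.
small = racks_needed(400_000, 4, 8, 42)

print("16 cores/chip:", base)
print(" 4 cores/chip:", small)
```

Shrinking the cores-per-chip guess only makes the comparison more lopsided, which matches the direction the comment was heading.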

              • by AmiMoJo ( 196126 )

                I think your numbers must be a bit off, it's wafer scale so that would be around 500 mid range x86 CPUs worth of die area. If they have 400,000 cores in the same space they must be vastly simpler cores than mid range x86 CPUs.

                This suggests that you would see around 500 cores per chip. They will need more space but perhaps not that much more space, as this thing needs interconnects and power delivery as well.

                They must be targeting a very, very specific and specialist application to make this worthwhile and I

            • Interconnect is one of the big ones. Even inter-reticle spacing is ridiculously tight compared to any other off-chip method. In our case there were also advantages because the redundancy was helpful for failures not from yield. The "chip" was for space use. I'd done back of the envelope calculations early on and just did not see our fab putting out such a part successfully. Today fabs are incredible in what they can produce. Nothing short of awesome.

              I wonder if in this successful part if massive interconn
              • by AmiMoJo ( 196126 )

                But I mean what was it about the application that meant the interconnect was so critical and that having a possibly larger number of independent (and much easier to make) cores would have been inferior?

                What did you end up using?

                • I think you are missing capacitance. Off chip means a driver needed and all the ESD protection associated with it, and the requisite power consumed to toggle the external pin cap up/down. In MOS technologies almost all power is switching power, which is directly proportional to cap driven. So by keeping all the interconnect on wafer, they are probably saving a few pF per line. In our case, it was strictly internal R&D which did not pan out. So they kept flying the way they did before. I don't know the d
        • by jd ( 1658 )

          It was. Sir Clive Sinclair even wrote a paper on it for New Scientist in the 1980s. It's not a new idea. It's just a new implementation.

        • I'm still confused how this innovation wasn't figured out by IBM, ARM, etc.

          Maybe the heat? I don't find any indication of the frequency, and these Cerebras might be running at a lower frequency (which might still beat a regular processor, the brain has a hyper low frequency).

        • I'm still confused how this innovation wasn't figured out by IBM, ARM, etc.

          Because it's not innovation. It's a simple engineering problem. The innovation here is finding a customer willing to spend 7 figures on a single CPU.

          GPU companies have been using this approach for decades. The RTX3090 has 10496 cores. The question is how much is it worth scaling for your business model compared to the number of potential customers there are out there.

          • by DavenH ( 1065780 )
            Then I guess this approach was waiting for the deep learning wave; there are now lots of customers for cloud computing to do deep analytics or AI applications.

            I thought those many GPU cores were first divided up from their wafers then reassembled?

        • How? Because it is horribly cost inefficient. Costs are usually per wafer in manufacturing, but sales are per chip. In this case IBM or AMD or Intel would have to sell a wafer sized chip for the same price as 100 or so chips to get any profit. And those companies are in it to make profit. It is not that no one has thought of it.
        • by tlhIngan ( 30335 )

          I'm still confused how this innovation wasn't figured out by IBM, ARM, etc

          It was well known that the larger the chip, the lower the yield. It's why shrinking a chip makes it exponentially cheaper. Chips that are area-driven, like memory, often have to have mitigations to handle the fact that the larger the die, the greater chance of a fault. Flash memory and camera chips combat this by declaring you can have bad blocks and bad pixels in the sensor array, just deal with it.

          Also remember that the density of tr

      • by jd ( 1658 )

        Sir Clive Sinclair developed the fundamental technique in the 1980s. An actual good idea from him.

    • by quall ( 1441799 )

      Well, the $2.5 million that it costs has to be for something.

    • Faster than *a* GPU, rather than the 100-200 GPUs that could come from the same wafer, AND it's a dedicated AI accelerator, which is a small fraction of what a GPU spends silicon on.

      So, it would be surprising if it was *not* so much faster.

      I am sure, however, it also has its limitations - algorithms that map better to a cluster of GPUs than to their architecture...

      None of this should be a surprise, including the marketing hype.

      • the interconnect between GPUs is a major bottleneck. And for really big scales you need some specialized chips just to provide a switch fabric between many GPUs.

        • the interconnect between GPUs is a major bottleneck. And for really big scales you need some specialized chips just to provide a switch fabric between many GPUs.

          Depends what you're doing with the GPUs. The bitcoining motherboards for example (https://www.techradar.com/uk/news/best-mining-motherboards) have a load of PCIe x1 slots. You ship a small amount of data to a GPU, crunch for a while and get a small amount of data back.

          Deep learning isn't quite that easy because the amounts of data tend to be much la

    • That chip has thousands of cores.
      The defective ones are simply disabled, and some smart routing avoids them.

    • I suspect a large part of the speed up is from their processor being optimized for sparse matrices, they aggregate non zero weights during matrix multiplication and only do floating point arithmetic for those.

    • So they must be wiring these things in some sort of redundant way because it seems unlikely one gets a trillion perfect transistors. I wonder how they do that?

      • For comparison, the Nvidia 2080 Ti has 18.6 billion transistors at a 12nm node. About 65 of them would equal 1.2 trillion transistors. At 775mm^2 die size, a 300mm wafer has a max of 91 chips at 100% usage. Realistically it will be less due to practical implementations.
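The two ratios in this comparison can be re-derived directly. A minimal sketch, taking the comment's inputs at face value (18.6 billion transistors and a 775 mm^2 die for the GPU, 1.2 trillion transistors for the WSE, a 300 mm wafer) and ignoring edge loss and scribe lines, which is the "100% usage" assumption the comment states; the variable names are illustrative.

```python
import math

WSE_TRANSISTORS = 1.2e12     # from the story summary
GPU_TRANSISTORS = 18.6e9     # as quoted in the comment
GPU_DIE_AREA_MM2 = 775.0     # as quoted in the comment
WAFER_DIAMETER_MM = 300.0

# How many GPUs' worth of transistors the WSE holds.
gpus_by_transistors = WSE_TRANSISTORS / GPU_TRANSISTORS

# Upper bound on whole dies per wafer if every square millimetre were usable.
wafer_area_mm2 = math.pi * (WAFER_DIAMETER_MM / 2) ** 2
max_dies_by_area = math.floor(wafer_area_mm2 / GPU_DIE_AREA_MM2)

print(f"GPU equivalents by transistor count: {gpus_by_transistors:.1f}")
print(f"wafer area: {wafer_area_mm2:.0f} mm^2, max dies by area: {max_dies_by_area}")
```

The transistor ratio lands at roughly 64.5 and the area bound at 91 dies, so on these inputs the wafer-scale part holds about two thirds of the transistors that a wafer of such GPU dies would at perfect utilization.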
    • by jd ( 1658 )

      Lots of latency due to distance. When using WSI, your distances are cut to about 1/000th. That's dramatic latency reduction right there.

      Then there's the interconnects themselves. When you take it off chip, you're going via solder via gold via more solder via whatever the pins are soldered to (aluminium?) via the tracks on the PCB and back up to the chip. All those connections produce loss of signal. And that's if you're going to just the next chip. If you're talking about memory to all the support chips to

      • by DavenH ( 1065780 )
        Very interesting. I can see the non-linear gains from a lot of those aspects.

        I'm really quite excited by this, if it's reality instead of hype. nVidia is charging extortionate rates for their compute gear and they need a boot up the rear from some good competition.

    • Here is some interesting info:
      https://community.cadence.com/... [cadence.com]

      https://community.cadence.com/... [cadence.com]

    • 1) purpose built hardware is significantly faster than generalized.
      2) MASSIVE bottlenecks occur on interconnects
      3) speed of light has been a factor for almost half a century. distance matters the more you push the limits. how far does light travel in 1 clock tick now? latency matters in some operations.
      4) RAM... I'd suspect they have designed in just the right amount of on-board RAM for their needs
      5) a single chip with enough functioning cores... one reason it costs so much, is likely because they produce

    • why is it not the standard?

      From the summary: "Cerebras costs several million dollars"

  • Also how fast can it mine bitcoin?
  • This sounds like an IBM-Watson type of marketing division going crazy with the numbers.

    • by jd ( 1658 )

      Latency is your biggest issue with speed. That's why scale matters so much. Your second killer is noise. The third big killer is buffering/synchronization. Finally, there's memory speed. Here, distances are slashed like anything, much less room for noise, and you can avoid most of the problems that plague CPU daughterboards.

    • 400x the power..

  • OK, so, how many of the cores are defective and deactivated on that huge wafer?

    Take your time. I'll hang up and listen for your answer.

    • by jd ( 1658 )

      Depends on the grade of silicon, now that 5N isotopically pure silicon is cheap.

      If they're using high-end, 1% or less.
      If they're using typical, might be 30%.

    • by b0bby ( 201198 )

      According to this article, " it was possible to fit in enough redundant cores to account for the defect-induced failure of even 1 percent of them and still have a very powerful, very large chip".

      https://spectrum.ieee.org/semi... [ieee.org]

  • I read TFA. Does not seem to be mentioning the CPU instruction set. x86-Arm-Power, etc? Sounds like a project Microsoft JEDI thing, along with the recent /. article on secure core on wafer with CPU. And the Apple M1 chip.

  • Cerebras' Wafer-Size Chip Is 10,000 Times Faster Than a GPU ... On a practical level, this means AI neural networks that previously took months to train can now train in minutes on the Cerebras system.

    So... Skynet [wikipedia.org] or Colossus [wikipedia.org] can come 10,000 times faster now -- hurray?

    [ Threw that second one in for the youngsters. :-) ]

    • by jd ( 1658 )

      Translated:

      Instead of taking 51,000 seconds to simulate 1 second of activity on 100,000 neurons, it will now take only 5.1 seconds. So 1/5th the speed of the human brain.

      I'd love to see Terminator re-worked to this speed.

    • So... Skynet or Colossus can come 10,000 times faster now -- hurray?

      I presume you meant "sooner" and not "faster". Although, you know, rule 34.

      • So... Skynet or Colossus can come 10,000 times faster now -- hurray?

        I presume you meant "sooner" and not "faster". Although, you know, rule 34.

        Right, and with "fewer" not "less" distractions. :-)

        #Thanks-for-the-grammer-slam-I-should-know-better-my-wife-was-an-English-teacher

  • Can you imagine a Beowulf cluster of these?

  • by Qbertino ( 265505 ) <moiraNO@SPAMmodparlor.com> on Wednesday November 18, 2020 @08:54PM (#60741176)

    ... 10 000 times faster than a GPU from 1996.

  • by larryjoe ( 135075 ) on Wednesday November 18, 2020 @10:07PM (#60741346)

    What is it that the Cerebras wafer is fast at doing? It certainly wasn't MLPerf, as the company didn't submit any results, saying, "We did not spend one minute working on MLPerf, we spent time working on real customers." Of course, such bravado would be more impressive with a substantial list of customers.

    Yes, there are a lot of reasons to criticize benchmarks and benchmark results. However, there's a lot more to criticize with the opacity that comes with a lack of any comparative benchmarks. Yes, the component counts are impressive. But projected performance based on component counts is theoretical. That's why the Top500 list differentiates between max and peak numbers. You'd think that an industry upstart with a true market-changing product would be eager to demonstrate how much faster their system is. Perhaps Cerebras has real systems that demonstrate real performance to their secret customers. However, guarding these results from the public just seems a tad bit fishy.

  • It can calculate the asteroid's moment of impact, while still in the factory.

    It can figure out the crypto-wallet password of that recently deceased startup billionaire before the hooker calls in to report a suicide

  • Has been tried many times since the 1970s, 80s, 90s. There is a good Wikipedia article about it. Even Clive Sinclair's company tried to make this work.

    Well done them if they have made this work. Personally I'd want a Wafer Scale SSD!

  • Beowulf Cluster (Score:5, Insightful)

    by nullchar ( 446050 ) on Thursday November 19, 2020 @12:04AM (#60741578)

    Seriously, a super computer on a single chip, and no one has suggested combining them together? What is Slashdot these days...

    • Seriously, a super computer on a single chip, and no one has suggested combining them together? What is Slashdot these days...

      It already *IS* the Beowulf cluster. Putting a cluster of these together would be a Beowulf clusterfuck.

    • Seriously, a super computer on a single chip, and no one has suggested combining them together? What is Slashdot these days...

      But does it run Linux? And did NetCraft confirm it??

  • I guess the pricetag for the successful parts just absorbs all of the ones that didn't make it.

  • It's the same computing power as a raspberry pi 10 in 2035...
  • That's cool. Is it web-scale, though? Does it support sharding?
  • by fennec ( 936844 ) on Thursday November 19, 2020 @10:24AM (#60742778)
    running Crysis
  • by groobly ( 6155920 ) on Thursday November 19, 2020 @01:18PM (#60743522)

    A million monkeys, in principle, will write Shakespeare's plays a million times faster than one monkey.

    Actual performance of the device depends on more than just the sum of processing power. Massively parallel processing has issues with interprocessor communications.
