AI Hardware

Cerebras Systems' WSE-2 Chip: 2.6 Trillion Transistors + 850,000 Cores = 'the Fastest AI Processor on Earth' (siliconangle.com) 49

SiliconANGLE reports on why investors poured another $250 million into Cerebras Systems Inc: Enterprises typically use graphics processing units in their AI projects. The fastest GPU on the market today features about 54 billion transistors. Cerebras Systems' chip, the WSE-2, includes 2.6 trillion transistors that the startup says make it the "fastest AI processor on Earth."

WSE-2 stands for Wafer Scale Engine 2, a nod to the unique architecture on which the startup has based the processor. The typical approach to chip production is carving as many as several dozen processors into a silicon wafer and then separating them. Cerebras Systems is using a vastly different method: The startup carves a single large processor into the silicon wafer that isn't broken up into smaller units.

The 2.6 trillion transistors in the WSE-2 are organized into 850,000 cores...

Cerebras Systems says that the WSE-2 has 123 times more cores and 1,000 times more on-chip memory than the closest GPU. The chip's impressive specifications translate into several benefits for customers, according to the startup, most notably increased processing efficiency. To match the performance provided by a WSE-2 chip, a company would have to deploy dozens or hundreds of traditional GPU servers... With the WSE-2, data doesn't have to travel between two different servers but only from one section of the chip to another, which represents a much shorter distance. The shorter distance reduces processing delays. Cerebras Systems says that the result is an increase in the speed at which neural networks can run.
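As a rough sanity check on the figures quoted in the summary, the ratios can be worked out directly. This is a minimal sketch using only the numbers above; the names are illustrative, and the implied GPU core count is just what the "123 times" claim works out to, not a figure stated in the article.

```python
# Figures quoted in the summary above.
WSE2_TRANSISTORS = 2.6e12  # 2.6 trillion
GPU_TRANSISTORS = 54e9     # "about 54 billion" for the fastest GPU
WSE2_CORES = 850_000
CORE_RATIO = 123           # "123 times more cores", per Cerebras


def transistor_ratio() -> float:
    """How many times more transistors the WSE-2 has than the GPU."""
    return WSE2_TRANSISTORS / GPU_TRANSISTORS


def implied_gpu_cores() -> float:
    """GPU core count implied by the 123x claim (derived, not quoted)."""
    return WSE2_CORES / CORE_RATIO


print(f"transistor ratio: ~{transistor_ratio():.0f}x")
print(f"implied GPU cores: ~{implied_gpu_cores():.0f}")
```

The transistor ratio (roughly 48x) is much smaller than the core ratio (123x), which suggests the two designs mean quite different things by "core".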


  • What stops GPU manufacturers from replicating this tech if all the difference is the wafer wasn't broken up into literal chips? Haven't Nvidia and AMD been doing this for over a decade with crossfire/SLI over a serial bus? Seems like a cakewalk to simply integrate the discrete GPUs over that same bus on a single wafer.

    • by lorinc ( 2470890 )

      I think they already do. See Nvidia's A16: https://en.wikipedia.org/wiki/... [wikipedia.org] powered by 4xGA107 (RTX 3050 chip). I wouldn't be surprised if they don't put 4 discrete chips together but rather slightly modify the original wafer to make a "multi-chip" chip.

      • I guess it is the architecture that houses the network to handle the data movements between the cores and on-die memory. The only scary part of this article is that they really got the idea from the head of a robot that was crushed in a hydraulic press in their facility. No one knows how it got there, and it helped them see things in a unique way. All I can say is godspeed Cyberdy… erm; I mean Cerebras.
    • This is probably possible but I have the feeling that the real reason that makes the Cerebras system so powerful (according to themselves) is that it is highly specialized for deep learning (so fast matrix multiplications and pretty much nothing else?). GPUs are more flexible. NVIDIA and AMD could probably improve their AI performance significantly by removing everything that is not needed for deep learning (so no more graphics, crypto, ...).

    • by OrangeTide ( 124937 ) on Sunday November 14, 2021 @11:24AM (#61987057) Homepage Journal

      We break a wafer up because if there are defects we can throw those chips out. And when we characterize, it is easier if there are only a few clock domains across the chip, and ideally we can bin chips according to their fastest clock.

      What is a little different now than say 20 years ago, is it is pretty common for a large chip to have fuses or other ways to disable features. When a defect is found, we can kill off some of the shader cores in a small group affected by the defect and still make use of the chip. This is a bit like Intel making the 486DX and turning it into a 486SX if the math-co didn't pass. But at a much finer granularity.

      While Nvidia broke some records a few years ago with a very large die for their Tesla V100 (815 mm² die, 21B transistors), it is nowhere near what wafer scale can do. Of course I'm confident that a Tesla V100 is cheaper, and by a factor greater than the area difference would suggest.

      I wish I knew more about what Cerebras is doing in the manufacturing. They almost certainly have an interesting patent portfolio that could draw interest from Nvidia, AMD, and Intel.
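The defect argument above can be illustrated with the textbook Poisson yield model, Y = exp(-A × D0), where A is the die area and D0 the defect density. This is only a sketch: the defect density below is a made-up illustrative value, not a figure from any foundry, and the 46,225 mm² wafer-scale area is the number quoted elsewhere in this thread.

```python
import math


def poisson_yield(area_mm2: float, defects_per_cm2: float) -> float:
    """Probability that a die of the given area contains zero defects."""
    area_cm2 = area_mm2 / 100.0
    return math.exp(-area_cm2 * defects_per_cm2)


D0 = 0.1  # ASSUMPTION: hypothetical defect density, in defects per cm^2

for name, area in [
    ("small 100 mm^2 die", 100.0),
    ("V100-sized 815 mm^2 die", 815.0),
    ("46,225 mm^2 wafer-scale part", 46_225.0),
]:
    print(f"{name}: zero-defect probability {poisson_yield(area, D0):.3g}")
```

Under that assumed density a defect-free wafer-scale part is essentially impossible, which is why such a design has to tolerate defects rather than avoid them.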

    • by Gravis Zero ( 934156 ) on Sunday November 14, 2021 @11:35AM (#61987117)

      What stops GPU manufacturers from replicating this tech if all the difference is the wafer wasn't broken up into literal chips?

      The specific issue they had to overcome to accomplish this was figuring out how to align multiple exposures on the same wafer... without creating slow interconnects. This may seem trivial but reliably aligning something to a dozen nanometers or so is no easy feat. They don't have hundreds of interconnecting lines, they have thousands of them.

      • I'm guessing that thermal stress across a substrate that large will be a bitch to manage. Provided they can get any working product off the end of the production line.
        • They have been working for years to get to an entire wafer chip, so they have already made them. As for cooling them, considering each chip costs several million dollars, there will be no expense spared to cool them.

          • They have already sold some of the previous generation which used a single wafer CPU array, so they know it's possible. These are just smaller and faster.

      • Hmm, I wonder if they used a mask containing a cluster of something like 64 cores on it, then spread interconnect pads along the sides of each cluster. Once the wafer is etched you connect each cluster along all four sides using the same technology as chip packaging. The gaps could be smaller than the cut lines for conventional wafers, only a fraction of a millimeter.

        Each core then has GHz asynchronous links in the X and Y directions to other nodes within the cluster, and onwards throughout the row and colu

        • Why would you need interconnects on the wafer? Just place the on chip interconnects in a way that they will get connected between the chiplets that cross the reticle border. Perhaps you need to get the foundry to accept some rules they normally don't, but other than that it shouldn't be an issue.
    • What stops people from copying it is all the ugly details of heat and signal path routing. Already the new Apple chips may be showing some signs of signal path distances as they scale from the M1 to the M1 Max, according to AnandTech. Don't know if that's true, but if it is, this monster of a chip will have that issue to confront and work out. And if you are dynamically turning off cores to deal with localized heat your signal paths will be dynamic too. It's probably easier to deal with that in a SIMD s

    • Basically the fact that there didn't seem to be enough of a market for it.
      Basically, when for every game a single 628 mm² RTX 3090 (some $3000) is enough, creating a 46,225 mm² monster at $2+ million apiece didn't seem to be worth it.
      P.S. Manufacturing errors in some places of the GPU can easily be fixed (see the versions with fewer cores, rasters, memory controllers, other blocks i.e. parts of the chip broken and disabled). Yet, there are some areas where manufacturing errors can't probably be fix
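The area and price figures in the comment above can be compared directly. A minimal sketch, taking the commenter's numbers at face value; note that the $2 million figure is described elsewhere in the thread as possibly a whole-system price, so the per-area comparison is only indicative.

```python
# Figures as quoted in the comment above (not independently verified).
RTX3090_AREA_MM2 = 628
RTX3090_PRICE_USD = 3_000
WSE_AREA_MM2 = 46_225
WSE_PRICE_USD = 2_000_000  # lower bound of "$2+ million"

area_ratio = WSE_AREA_MM2 / RTX3090_AREA_MM2
price_ratio = WSE_PRICE_USD / RTX3090_PRICE_USD
usd_per_mm2_gpu = RTX3090_PRICE_USD / RTX3090_AREA_MM2
usd_per_mm2_wse = WSE_PRICE_USD / WSE_AREA_MM2

print(f"area ratio:  ~{area_ratio:.0f}x")
print(f"price ratio: ~{price_ratio:.0f}x")
print(f"price per mm^2: ${usd_per_mm2_gpu:.2f} vs ${usd_per_mm2_wse:.2f}")
```

On these numbers the price grows roughly nine times faster than the silicon area, consistent with the earlier remark that a conventional GPU is cheaper by more than the area difference alone would suggest.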

  • imagine a beowulf cluster of these.

    It is an interesting problem. My guess is that yield of the individual clusters of cores etc. is not 100% for a whole wafer, so the chances are that they all have a bunch of redundant on-chip networking and of course routing as well, so they can just disable the broken parts.

    What's interesting is how much heat this thing dissipates and how they deal with it.

    Also also, dense DRAM and CPUs are normally made on different processes (eDRAM is still not all that common), and

    • "a beowulf cluster of these" +1 funny was my first thought along with have they built one? and how much does it cost?
    • by ceoyoyo ( 59147 )

      Since they're breaking it up into so many small cores, I suspect you're right, and that their yield isn't particularly good.

      • The yield is very good - there was an Anandtech article about it, and the initial 1.5% extra cores to cover for possible defects was too much.
        Also, _any_ manufacturing defect can be "fixed" (i.e. any block can be fused out/disabled).

        As for the number of cores, I assume it's simply a matter of "this is the size/complexity needed for an AI thread, let's make the silicon 10% or 20% larger for a bit of futureproofing".

        • by ceoyoyo ( 59147 )

          850,000 * 1.5% is pretty terrible if any one of those failures results in tossing a whole wafer. So they break it up into lots of little cores so they can have thousands per processor fail and still have something useable. As I said.

          As for the number of cores, I assume it's simply a matter of "this is the size/complexity needed for an AI thread

          Assuming "AI" is an artificial neural network, unless a "core" is a multiply-add operation I can't see how that could be true. I think Occam's razor and your own numb

    • imagine a beowulf cluster of these.

      The WSE-2 will cost (quote) 'arm & leg' according to anandtech. [anandtech.com]

      So a beowulf cluster would be quite a few quadruple amputees.

  • [quote]Enterprises typically use graphics processing units in their AI projects.[/quote] Anybody doing AI projects is probably using GPUs...?
  • A single bad section of transistors and the entire wafer is wasted.

    Given they haven't provided a price, I would venture it's well out of the price range of 99% of individuals and targeted for large organizations … such as governments and Google.

    • Cause, you know laser cutting of reroutable core designs have never been done before (especially for GPUs.)

      What makes you think that making the chip insanely large is any reason why this design rule changes?

    • a key piece of their tech is disabling and rerouting around bad cores so those kinds of issues are dealt with
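The idea of disabling bad cores and routing around them can be sketched in a few lines. This is a toy illustration only, not Cerebras's actual scheme: it assumes a single row of cores with a couple of spares, and the function name and parameters are hypothetical.

```python
from __future__ import annotations

from typing import Optional


def build_core_map(
    physical: int, logical: int, defective: set[int]
) -> Optional[list[int]]:
    """Map logical core indices to working physical cores in one row.

    Returns None when too many cores are bad to supply `logical` cores.
    """
    # Keep only the physical cores that passed test, in physical order.
    working = [p for p in range(physical) if p not in defective]
    if len(working) < logical:
        return None
    # Logical core i is served by the i-th working physical core.
    return working[:logical]


# 10 physical cores, 8 exposed to software, physical cores 2 and 7 are bad.
row_map = build_core_map(physical=10, logical=8, defective={2, 7})
print(row_map)  # [0, 1, 3, 4, 5, 6, 8, 9]
```

Software then only ever sees the logical indices, so a part with a few dead cores looks identical to a perfect one as long as the spares are not exhausted.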

      • by ceoyoyo ( 59147 )

        I wonder if they're going to auction the things.

        We'll start our auction today with a 637,129 core processor. Do I hear $10 million? ...
        And now the lot you've all been waiting for, an 816,000 core device....

    • They have zero rejects, as the design allows for around 1% of cores to be bad before the silicon would be rejected, and this is a healthy enough margin to accept every wafer.

      Cost to them of producing each chip is around 10k. Cost to you for the 15U server (It needs a ton of power and even more cooling) is around 2.5 million dollars.

    • This is the version 2.
      The version 1 has a cost of $2+ million (per something, possibly single-accelerator system).
      As for the price, if you have to ask, you're not rich enough.
      For now, it might be like the market for computers - 5 to 6 in USA, with maybe one in Europe.

  • If they come up with a way of stacking dram on another wafer, and more conventional cpus, your future home super computer would resemble a flying saucer, and you could use it for cooking too.
  • I think the "AI" in this context is the usual ML, but I wonder if we're getting closer (well, we're certainly not getting further away) to having the computational power to simulate neurons to any meaningful degree (or at least a significant number of them). Then the "AI" that's talked about may take on a whole new meaning.

    • by ceoyoyo ( 59147 )

      Neurons aren't magic. You can simulate a neuron on your phone. The behaviour of even the most complicated ones can be learned by quite simple artificial neural networks.

      https://www.biorxiv.org/conten... [biorxiv.org]

    • You could quite easily simulate neurons with FPGAs each with its own management processor with the complexity of a PDP8 (which is effectively the minimal conceivable general purpose logic). You might manage to supervise several FPGAs with each CPU - I have not tried to do it myself.

      In principle, you could use the Motorola 1-bit CPU instead, but you would need more glue logic, and lose the advantage of about 2 billion man years of programming experience to save a few gates - none of them Bill.

  • How about a GPU instead? They'd have to immerse the wafer in a tank of liquid Helium or something.

    • by djb ( 19374 )

      The production servers these go in are 15U, have 12 PSUs, require around 27kW of power, and I doubt the 2.5 million dollar price tag includes Noctua fans.
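To put the power figures above in perspective, here is a quick back-of-the-envelope calculation. It assumes the load is shared evenly across all 12 supplies, which is probably not how it works in practice since some supplies are likely redundant; the names are illustrative.

```python
# Figures as quoted in the comment above.
SYSTEM_POWER_KW = 27.0
PSU_COUNT = 12
RACK_UNITS = 15

# ASSUMPTION: every PSU is active and carries an equal share of the load.
kw_per_psu = SYSTEM_POWER_KW / PSU_COUNT
kw_per_rack_unit = SYSTEM_POWER_KW / RACK_UNITS

print(f"{kw_per_psu:.2f} kW per PSU")
print(f"{kw_per_rack_unit:.2f} kW per rack unit")
```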

      • Yeah, so it would be a good replacement for movie studio render farms. They would be able to render CGI movies like Avatar 100 times faster. Amazon or Google could make it available on the cloud for anyone to use for content creation or even cloud gaming.

  • Other than pattern recognition?
  • 850,000 cores ... or nodes?

    Let me calculate: Unless I erred, there's 2600000000000 / 850000
    = 3058823 transistors per core. I guess over 3 million transistors qualify for calling it a core! (^__^)
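The division above checks out. A minimal sketch that reproduces it, keeping in mind that this is an average over the whole wafer, so it lumps in whatever transistors go to on-chip memory and interconnect:

```python
TRANSISTORS = 2_600_000_000_000  # 2.6 trillion, from the summary
CORES = 850_000

per_core = TRANSISTORS // CORES  # integer division drops the fraction
print(per_core)  # 3058823
```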

    • I guess over 3 million transistors qualify for calling it a core!

      For comparison:

      Suppose each core was an x86 type one. 3 Million transistors gets you in original Pentium (non-MMX) territory. But that ignores transistors used for the on-wafer RAM. For argument's sake, let's say that puts each core a step down, like 486DX class.

      So the whole wafer could then be described as >GHz-clocked 486DX class cores, each with small (~48 Kbyte) but crazy-fast RAM, and a crazy-high bandwidth interconnect between all these cores. In the order of 1 million of them... On a tile of
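The per-core memory estimate above implies a total for the whole wafer. A rough sketch, assuming the commenter's "~48 Kbyte" means 48 KiB per core and that every one of the 850,000 cores has the same amount:

```python
CORES = 850_000
KIB_PER_CORE = 48  # ASSUMPTION: "~48 Kbyte" taken as 48 KiB per core

total_bytes = CORES * KIB_PER_CORE * 1024
print(f"{total_bytes / 1e9:.1f} GB ({total_bytes / 2**30:.1f} GiB)")
```

That lands at roughly 40 GB of on-chip memory, spread thinly across the cores rather than pooled in one place.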

  • The other 2 died because they couldn't get the yield high enough to make it economical. This has a better chance since the market they are going after is already very pricey and likely to stay that way for 18 to 24 months. Eventually GPU prices will drop into reasonable levels. They better have their yield under control by then or they will have trouble.

  • by BobC ( 101861 ) on Sunday November 14, 2021 @12:43PM (#61987351)

    All wafer-scale efforts over the past 50 years have been surges in the battle between the costs and performance penalties of packaging and connecting smaller chips against the costs of yields limiting the number of wafers with enough functionality to sell at a premium.

    Typically, success was found only on older fab lines, where the yields were high and the low transistor density could still prove profitable at scale. Temporarily, as the yields of newer processes always improved as well. Moving wafer-scale projects across fab generations seldom works: The designs are often specific to one specific fab technology, often even just one specific fab line!

    I once worked with an old 2.5" line on the edge of being retired/replaced to produce a truly awesome sensor that used the full area of the wafer. My part was to help build the prototype instrument, then to characterize, calibrate and stabilize the sensor to get the most performance from it. The product was a success, and that one sensor kept that fab line running for nearly another decade, to the point where finding parts for the line equipment became a problem, when a massive endurance run was done to stockpile sensors. I've long since left that company, but they're still making, selling and supporting products using that sensor!

    • by djb ( 19374 )

      Their yield rate is 100%. The 850k figure includes a healthy buffer for disabled cores.

  • This "AI" hype is basically just concentrated stupid. Sure, there are some nice applications of statistical classifiers, but we _know_ that we will never get AGI (the real "AI") that way, because we _know_ throwing more computing power at known approaches just makes them be stupid faster.

    • Damn straight.
    • Sure, there are some nice applications of statistical classifiers, but we _know_ that we will never get AGI (the real "AI") that way,

      Who cares? If they are successful at updating huge models more cost-effectively, stock-picking and self-driving cars alone will easily cost-justify this.

      Philosophizing about what intelligence "really" is tired and useless. Solve a new class of problems, and you will be rewarded.

      • by gweihir ( 88907 )

        Sure, there are some nice applications of statistical classifiers, but we _know_ that we will never get AGI (the real "AI") that way,

        Who cares? If they are successful at updating huge models more cost-effectively, stock-picking and self-driving cars alone will easily cost-justify this.

        Philosophizing about what intelligence "really" is tired and useless. Solve a new class of problems, and you will be rewarded.

        Huge models turn out to not work well. More hidden bad behaviors. Easier to sabotage. Impossible to audit. Sure, for some things that may be acceptable, but in general it is not. My take is the whole "large model" approach of "machine learning" has mostly failed.

  • It's not actual 'AI', stop calling it that.
  • In performance and architectural strategy? What are the trade-offs?
  • by fahrbot-bot ( 874524 ) on Sunday November 14, 2021 @04:00PM (#61987953)

    The 2.6 trillion transistors in the WSE-2 are organized into 850,000 cores...

    640k cores should be enough for anyone.

    • Yeah, but somewhere in the back of Cerebras are a couple of CS interns trying to work out how to make the prototype system mine bitcoins, and in the cellar are a bunch of blackhats from the NSA trying to make it crack cryptographic algorithms.
