Google Hardware Technology

Google Says Its AI Supercomputer is Faster, Greener Than Nvidia A100 Chip (reuters.com) 28

Alphabet's Google released new details about the supercomputers it uses to train its artificial intelligence models, saying the systems are both faster and more power-efficient than comparable systems from Nvidia. From a report: Google has designed its own custom chip called the Tensor Processing Unit, or TPU. It uses those chips for more than 90% of the company's work on artificial intelligence training, the process of feeding data through models to make them useful at tasks such as responding to queries with human-like text or generating images. The Google TPU is now in its fourth generation. Google on Tuesday published a scientific paper detailing how it has strung more than 4,000 of the chips together into a supercomputer using its own custom-developed optical switches to help connect individual machines.

Improving these connections has become a key point of competition among companies that build AI supercomputers because so-called large language models that power technologies like Google's Bard or OpenAI's ChatGPT have exploded in size, meaning they are far too large to store on a single chip. The models must instead be split across thousands of chips, which must then work together for weeks or more to train the model. Google's PaLM model - its largest publicly disclosed language model to date - was trained by splitting it across two of the 4,000-chip supercomputers over 50 days.

  • Who would have thought that a purpose-built chip would be more energy-efficient than a general-purpose one?

    • Re: (Score:3, Interesting)

      by Anonymous Coward

      You want it fast - code the algorithm in some sort of lower-level compiled language (e.g. C, Rust, etc.)
      You want it really fast - code the algorithm in the processor's assembler.
      You want it really really fast - hard-code the algorithm as a circuit.

      No different than adding h.265, AV1, AES, FFT, etc. to a chip.

      Pretty much every decent digital oscilloscope has a custom ASIC signal processor on the front end for this very reason.

    • The A100 is also purpose-built. It doesn't have a video output.
  • Bard still sucks.
    • It's an impressive bit of engineering compared to what I could write. But yeah, Google is behind the industry in some areas.

      • Which makes absolutely no sense when the hardware that these early GPT models were running on came from Google.
  • by Artem S. Tashkinov ( 764309 ) on Wednesday April 05, 2023 @02:48PM (#63428528) Homepage

    Google said it did not compare its fourth-generation chip to Nvidia's current flagship H100 chip because the H100 came to the market after Google's chip and is made with newer technology.

    Google hinted that it might be working on a new TPU that would compete with the Nvidia H100 but provided no details, with Jouppi telling Reuters that Google has "a healthy pipeline of future chips."

    Besides, NVIDIA has the edge: it sells universal computing chips and it sells them to everyone. Google's TPU is used primarily (exclusively) for ML training and is not available to anyone but Google. What's the point of comparing then?

    • by UMichEE ( 9815976 ) on Wednesday April 05, 2023 @03:10PM (#63428578)

      The real question is what's the point of Google talking about these things at all? If they're not going to sell the chips to third parties, then why disclose their performance? I guess they're trying to sell their AI platform and want to reassure us that it's computationally very powerful?

      It's kind of impressive that Google can develop its own chips for something (currently) as niche as this. Developing a large digital chip on a cutting-edge process node is expensive, even when just considering the fixed costs of wafer masks. The people who do the chip design aren't cheap either (I would know). They must be buying/building so many of these things that the tens of millions of dollars in fixed costs associated with development will be made up for by money not paid to nVidia.

      • by Junta ( 36770 ) on Wednesday April 05, 2023 @03:14PM (#63428586)

        They want to brag about GCE-exclusive features. They aggressively pursue competing with AWS and Azure, and part of their game plan is to declare how impossibly clever they are and that no one else is as clever, and that the only way to avail yourself of their cleverness is to buy Google services, because they will not actually let you purchase any of these wonders.

      • by AmiMoJo ( 196126 )

        Many companies have green procurement policies now. Being able to say your AI required less energy to train can be a plus point when competing for contracts.

    • I suspect that Google has two main points in mind: ideally encouraging some ML developers to poke their head in the Google Cloud Services door and have a look (they won't sell you the hardware, at least not without some special arrangement; but they'd be happy to set you up with some VMs that have access to the hardware, same as AWS or Azure); and reminding any analysts who are skeptical of 'bard' vs. 'chatgpt' that, unlike OpenAI's stuff, they don't have to cut Nvidia in on whatever the profits are.

      They'd
  • by greytree ( 7124971 ) on Wednesday April 05, 2023 @02:48PM (#63428530)
    ... and it's gone.
  • Given where processors are today, the focus should be on fast and green code. You should see the shit we run on today's processors, it is a fucking abomination. .Net... java... the list goes on, fucking idiots, and it is not getting any better. When you train people to rely on frameworks and auto garbage management, you end up with shit code and it does not matter how powerful the processor you run it on is - it is still a fucking waste of space and energy. Stop being so fucking lazy, get off your node.js,
  • by xwin ( 848234 ) on Wednesday April 05, 2023 @03:42PM (#63428656)
    Google TPU is irrelevant for most people doing ML training or research. You can't purchase the TPU in a machine which you can use. I have a server on the rack which I use every day and that has an NVIDIA A100 board in it. The only way to use Google's hardware is to pay for their "arm and leg" plan. We did multiple price evaluations and everyone concluded that it is too expensive to pay for a cloud TPU. If you are not Google and doing this on a regular basis, it is by far cheaper to pay for a system in house. It will pay for itself within a year or two, plus it is CAPEX which is depreciated over 5 years. Not so with cloud payments. They are ongoing, and are OPEX.
    • The problem is if you want to use a large model. For example Bloom, which is about on par with GPT3 size-wise, takes 8x A100. They're $15K each, plus the computer they go in. If you utilize it heavily enough, eventually it will amortize out, but if you're only going to get say 5% utilization (a few people running inference on it sporadically) it probably will never beat renting time before it is obsolete.
      • by xwin ( 848234 ) on Wednesday April 05, 2023 @05:37PM (#63428916)
        The A100 does not cost $15K - https://www.amazon.com/NVIDIA-... [amazon.com] . If you are purchasing it as part of a server it will be even cheaper. Large companies have discounts with outfits like Dell and can get GPUs for quite a bit less.
        You have probably never priced Google cloud for any workloads; it gets quite expensive. Good for startups trying to burn through investors' money, not so much for a company which is trying to make money.
        Probably OK for internet hosting when you need to scale across regions and provide 24/7 availability. Not so much for ML training. This is an example of TPUv4 pricing - $9,402.40 per month with 24-hour-a-day use. A relatively simple model like yolo v4 tiny takes about 48 hours to train on an A100. So you can see how the price can quickly add up. This TPU cloud pricing does not count all the storage pricing and all the egress charges that you will pay to transfer your data out. I am currently using the free tier for personal projects and even with careful monitoring still get charged here and there.
        For smaller models the A100 is overkill and regular consumer-grade GPUs will do the job for much less money. The majority of people working on AI are not training GPT models.
        • That's the 40 gig. The problem with that is 80 x 8 is what you can fit on a chassis so it's kind of assumed for some models, and splitting over 40 x 16 hits the network which is very bad.

          Anyways I agree it's worth running the numbers on what you need... the costs are really significant on these big models, one way or another, and may not be necessary.
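
For anyone who does want to run the numbers from this thread, here is a rough sketch in Python. It only uses figures quoted above: the $9,402.40 monthly TPUv4 price, the disputed $15K per A100 for an 8-GPU box, and the 5% utilization example. The 730-hour billing month, the function names, and the choice to ignore chassis, power, storage, egress and staffing costs are all assumptions made for illustration. It also compares a TPU rental price against GPU purchase prices, which are not equivalent hardware, so treat the output as a back-of-the-envelope estimate and not a real quote.

```python
# Back-of-the-envelope rent-vs-buy estimate using figures quoted in the thread.
# Assumption: a billing month is 730 hours (not stated in the thread).
HOURS_PER_MONTH = 730


def hourly_rate(monthly_price, hours_per_month=HOURS_PER_MONTH):
    """Convert a 24-hours-a-day monthly price into a per-hour price."""
    return monthly_price / hours_per_month


def breakeven_months(hardware_cost, monthly_rental, utilization):
    """Months after which buying is cheaper than renting.

    utilization is the fraction of each month the hardware is actually
    busy (1.0 = always, 0.05 = 5%). Renting is assumed to be billed only
    for the busy hours; owned hardware is paid for up front.
    """
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    return hardware_cost / (monthly_rental * utilization)


if __name__ == "__main__":
    tpu_monthly = 9402.40      # TPUv4 price quoted in the thread
    gpu_box = 8 * 15_000       # 8x A100 at the disputed $15K each, no chassis

    print(f"hourly rental: ${hourly_rate(tpu_monthly):.2f}")
    for util in (1.0, 0.5, 0.05):
        months = breakeven_months(gpu_box, tpu_monthly, util)
        print(f"utilization {util:>4.0%}: break-even after {months:.1f} months")
```

With these inputs, full utilization breaks even in a little over a year, while 5% utilization takes more than two decades, which is consistent with both sides of the argument above.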

  • All they do is stunts to keep people on their services short-term so the ad revenue keeps flowing in. As soon as the numbers do not look profitable enough, Bard will get the axe.

  • Drunken Busker, I mean Bard, gets drunk more efficiently. Great, I still won't use it.

  • They selected Adobe's "Eco Green" color for their chip.

    Eco Green is a bright green color with a hexadecimal value of #8CC63F.

  • Can I buy one of these Google AI Supercomputers for $8k to use at home?
