Google Says Its AI Supercomputer is Faster, Greener Than Nvidia A100 Chip (reuters.com) 28
Alphabet's Google released new details about the supercomputers it uses to train its artificial intelligence models, saying the systems are both faster and more power-efficient than comparable systems from Nvidia. From a report: Google has designed its own custom chip called the Tensor Processing Unit, or TPU. It uses those chips for more than 90% of the company's work on artificial intelligence training, the process of feeding data through models to make them useful at tasks such as responding to queries with human-like text or generating images. The Google TPU is now in its fourth generation. Google on Tuesday published a scientific paper detailing how it has strung more than 4,000 of the chips together into a supercomputer using its own custom-developed optical switches to help connect individual machines.
Improving these connections has become a key point of competition among companies that build AI supercomputers because so-called large language models that power technologies like Google's Bard or OpenAI's ChatGPT have exploded in size, meaning they are far too large to store on a single chip. The models must instead be split across thousands of chips, which must then work together for weeks or more to train the model. Google's PaLM model - its largest publicly disclosed language model to date - was trained by splitting it across two of the 4,000-chip supercomputers over 50 days.
Purpose-built more energy-efficient (Score:2)
Who would have thought that a purpose-built chip would be more energy-efficient than a general-purpose one?
Re: (Score:3, Interesting)
You want it fast - code the algorithm in some sort of lower-level compiled language (e.g. C, Rust, etc.)
You want it really fast - code the algorithm in the processor's assembler.
You want it really really fast - hard-code the algorithm as a circuit.
No different than adding h.265, AV1, AES, FFT, etc. to a chip.
Pretty much every decent digital oscilloscope has a custom ASIC signal processor on the front end for this very reason.
Re: (Score:3)
Re: Purpose-built more energy-efficient (Score:1)
what does GP in GPGPU mean?
And yet... (Score:1)
Re: (Score:2)
It's an impressive bit of engineering compared to what I could write. But yeah, Google is behind the industry in some areas.
Re: (Score:1)
The last two paragraphs (Score:5, Funny)
Besides, NVIDIA has the edge: it sells universal computing chips and it sells them to everyone. Google's TPU is used primarily (exclusively) for ML training and is not available to anyone but Google. What's the point of comparing then?
Re:The last two paragraphs (Score:4, Insightful)
The real question is what's the point of Google talking about these things at all? If they're not going to sell the chips to third parties, then why disclose their performance? I guess they're trying to sell their AI platform and want to reassure us that it's computationally very powerful?
It's kind of impressive that Google can develop its own chips for something (currently) as niche as this. Developing a large digital chip on a cutting edge process node is expensive, even when just considering the fixed costs of wafer masks. The people that do the chip design aren't cheap either (I would know). They must be buying/building so many of these things that the tens-of-millions of dollars in fixed costs associated with development will be made up for with money not paid to nVidia.
Re:The last two paragraphs (Score:4, Interesting)
They want to brag about GCE exclusive features. They aggressively pursue competing with AWS and Azure, and part of their game plan is to declare how impossibly clever they are and no one else is as clever, and the only way to avail yourself of their cleverness is to buy Google services, because they will not actually let you purchase any of these wonders.
Re: (Score:2)
Many companies have green procurement policies now. Being able to say your AI required less energy to train can be a plus point when competing for contracts.
Re: (Score:2)
They'd
My org will switch to this cool new Google tool! (Score:3, Funny)
Code, not processor. (Score:1)
Re:Code, not processor. (Score:4, Informative)
Just make everyone spend a year working on a small embedded system. They'll never take infinite resources for granted again.
Google TPU is irrelevant for most people (Score:5, Insightful)
Re: (Score:2)
Re:Google TPU is irrelevant for most people (Score:4, Informative)
You have probably never priced Google cloud for any workloads; it gets quite expensive. Good for startups trying to burn through investors' money, not so much for a company that is trying to make money.
Probably OK for internet hosting when you need to scale across regions and provide 24/7 availability. Not so much for ML training. Here is an example of TPUv4 pricing: $9,402.40 per month at 24 hours a day of use. A relatively simple model like yolo v4 tiny takes about 48 hours to train on an A100, so you can see how the price can quickly add up. This TPU cloud pricing does not include the storage charges or the egress charges that you will pay to transfer your data out. I am currently using the free tier for personal projects and even with careful monitoring still get charged here and there.
For smaller models the A100 is overkill and regular consumer grade GPUs will do the job for much less money. The majority of people working on AI are not training GPT models.
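To put rough numbers on the pricing in the comment above, here is a minimal Python sketch. It uses only the $9,402.40 monthly figure and the 48-hour training time quoted there. Two things are assumptions, not from the thread: a 730-hour billing month, and that the run would take the same 48 hours on a TPUv4 as it does on an A100. The function names are made up for illustration. Storage and egress charges are left out.

```python
# Figure quoted in the comment: TPUv4 at $9,402.40 per month of 24-hour use.
MONTHLY_PRICE_USD = 9402.40

# Assumption (not stated in the comment): a billing month of 730 hours.
HOURS_PER_MONTH = 730


def hourly_rate(monthly_price, hours_per_month=HOURS_PER_MONTH):
    """Convert a monthly always-on price into an hourly rate."""
    return monthly_price / hours_per_month


def run_cost(hours, monthly_price=MONTHLY_PRICE_USD,
             hours_per_month=HOURS_PER_MONTH):
    """Compute-only cost of a run lasting `hours`; no storage or egress."""
    return hours * hourly_rate(monthly_price, hours_per_month)


if __name__ == "__main__":
    # Assumption: the 48-hour A100 training time carries over unchanged.
    print(f"hourly rate: ${hourly_rate(MONTHLY_PRICE_USD):.2f}")
    print(f"48-hour run: ${run_cost(48):.2f}")
```

Under those assumptions a single training run is a few hundred dollars of compute, so the bill depends mostly on how many runs and experiments you do, plus the charges not counted here.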
Re: (Score:2)
Anyways I agree it's worth running the numbers on what you need... the costs are really significant on these big models, one way or another, and may not be necessary.
Do not rely on anything Google (Score:2)
All they do is stunts to keep people on their services short-term so the ad revenue keeps flowing in. As soon as the numbers do not look profitable enough, Bard will get the axe.
drunken busker (Score:2)
Drunken Busker, I mean Bard, gets drunk more efficiently. Great, I still won't use it.
I'm sure (Score:2)
They selected Adobe's "Eco Green" color for their chip.
Eco Green is a bright green color with a hexadecimal value of #8CC63F.
All Well and Good but... (Score:2)