World's Fastest AI Supercomputer Built from 6,159 NVIDIA A100 Tensor Core GPUs

World's Fastest AI Supercomputer Built from 6,159 NVIDIA A100 Tensor Core GPUs (nvidia.com) 57

Posted by EditorDavid on Monday May 31, 2021 @07:34AM from the in-the-chips dept.

Slashdot reader 4wdloop shared this report from NVIDIA's blog, joking that maybe this is where all NVIDIA's chips are going: It will help piece together a 3D map of the universe, probe subatomic interactions for green energy sources and much more. Perlmutter, officially dedicated Thursday at the National Energy Research Scientific Computing Center (NERSC), is a supercomputer that will deliver nearly four exaflops of AI performance for more than 7,000 researchers. That makes Perlmutter the fastest system on the planet on the 16- and 32-bit mixed-precision math AI uses. And that performance doesn't even include a second phase coming later this year to the system based at Lawrence Berkeley National Lab.

More than two dozen applications are getting ready to be among the first to ride the 6,159 NVIDIA A100 Tensor Core GPUs in Perlmutter, the largest A100-powered system in the world. They aim to advance science in astrophysics, climate science and more. In one project, the supercomputer will help assemble the largest 3D map of the visible universe to date. It will process data from the Dark Energy Spectroscopic Instrument (DESI), a kind of cosmic camera that can capture as many as 5,000 galaxies in a single exposure. Researchers need the speed of Perlmutter's GPUs to capture dozens of exposures from one night to know where to point DESI the next night. Preparing a year's worth of the data for publication would take weeks or months on prior systems, but Perlmutter should help them accomplish the task in as little as a few days.

"I'm really happy with the 20x speedups we've gotten on GPUs in our preparatory work," said Rollin Thomas, a data architect at NERSC who's helping researchers get their code ready for Perlmutter. DESI's map aims to shed light on dark energy, the mysterious physics behind the accelerating expansion of the universe.

A similar spirit fuels many projects that will run on NERSC's new supercomputer. For example, work in materials science aims to discover atomic interactions that could point the way to better batteries and biofuels. Traditional supercomputers can barely handle the math required to generate simulations of a few atoms over a few nanoseconds with programs such as Quantum Espresso. But by combining their highly accurate simulations with machine learning, scientists can study more atoms over longer stretches of time. "In the past it was impossible to do fully atomistic simulations of big systems like battery interfaces, but now scientists plan to use Perlmutter to do just that," said Brandon Cook, an applications performance specialist at NERSC who's helping researchers launch such projects. That's where Tensor Cores in the A100 play a unique role. They accelerate both the double-precision floating point math for simulations and the mixed-precision calculations required for deep learning.
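
As a rough sanity check on the "nearly four exaflops" figure, the sketch below multiplies the GPU count from the summary by a per-GPU peak. The 624 TFLOPS value is an assumption: it is NVIDIA's published A100 peak for FP16 Tensor Core math with structured sparsity, and it does not appear in the summary itself. The function name is hypothetical.

```python
# Back-of-the-envelope check of the "nearly four exaflops" claim.
# ASSUMPTION: 624 TFLOPS is the A100's peak FP16 Tensor Core rate with
# structured sparsity (NVIDIA datasheet); the summary does not state it.
A100_PEAK_TFLOPS = 624
GPU_COUNT = 6159  # from the summary


def peak_exaflops(gpus: int, tflops_per_gpu: float) -> float:
    """Aggregate theoretical peak in exaflops (1 exaflop = 1e6 teraflops)."""
    return gpus * tflops_per_gpu / 1e6


print(f"{peak_exaflops(GPU_COUNT, A100_PEAK_TFLOPS):.2f} exaflops")  # 3.84 exaflops
```

This is a theoretical peak, not a measured benchmark, so sustained application performance would be lower.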

This discussion has been archived. No new comments can be posted.

- Re: Mostly useless (Score:2)
  
  by antus ( 6211764 ) writes:
  
  i dont think you either map out the universe with ai, or study sub atomic particles 'for green energy' (is that really a thing?). i may be wrong but i'd assume both sciences require hard facts, not ai approximations creating new information that matches what's known already (probably there are this many stars here, spaced like this?). ai has its place, but this headline reeks of marketing.
  - Re: (Score:2)
    
    by gweihir ( 88907 ) writes:
    
    "AI" cannot do either of these things. The only thing "AI" can do is replicate decisions when there is an ample body of examples, and "AI" always does so with reduced quality.
  - Re: (Score:2)
    
    by Ol Olsoc ( 1175323 ) writes:
    
    i dont think you either map out the universe with ai, or study sub atomic particles 'for green energy' (is that really a thing?).
    I suspect they are talking about zero point energy.
    Which is strange, because many rednecks in the weird area of Youtube have already discovered zpe and perpetual motion.
    - Re: (Score:2)
      
      by Ostracus ( 1354233 ) writes:
      
      It will help piece together a 3D map of the universe, probe subatomic interactions for green energy sources and much more.
      Material science and particle physics. No ZPE required.
  - Re: (Score:1)
    
    by Anonymous Coward writes:
    
    i dont think you either map out the universe with ai, or study sub atomic particles 'for green energy' (is that really a thing?). i may be wrong
    OK, think how data is stored for the purposes of google earth.
    Now, since it doesn't require any math to do, please plot a vector in 3d space to calculate the position of this new object just detected an hour ago, relative to existing map objects velocities and spatial positions, to predict where it should be in 10,11,12,13, and 14 hours from now.
    Oh and we need the results by our next observation window in an hour.
    As for the particle physics, they are brute forcing material properties from adjusting another
  - Re: (Score:1)
    
    by AnilJ ( 1342025 ) writes:
    
    It is marketing or rather sales talk. Scientists always want more money for larger and larger projects. They rationalize this saying it is "basic research". 16 bit and 32 bit calculations? What the heck!!!
- Re:Mostly useless (Score:4, Informative)
  
  by pz ( 113803 ) writes: on Monday May 31, 2021 @08:58AM (#61439322) Journal
  
  Ah, read the article. It has lots of details about what this supercomputer is going to be used for. It isn't making politicians feel good about themselves, it is primarily for rapid, daily analysis of the massively detailed astronomy photos that are coming on line, but there are a good handful of secondary users as well who have already queued up.
  
  - Re: (Score:1)
    
    by gweihir ( 88907 ) writes:
    
    So basically useless except to satisfy some specialized curiosity? Figures.
    - Re:Mostly useless (Score:5, Funny)
      
      by Ostracus ( 1354233 ) writes: on Monday May 31, 2021 @10:02AM (#61439452) Journal
      
      Stand back everyone. If he handwaves any harder he'll blow the forum over.
      
- Re: (Score:2)
  
  by ModelX ( 182441 ) writes:
  
  Sure, some small advances may be made in the misnamed "AI" field, but mostly research there is not a question of computation power and that has been the case for some time. This thing merely serves for some politicians to maintain the illusion they are doing something useful.
  This is not entirely true. A PhD student will be able to take on a problem proportional to computing resources available. This will open up some areas of research that would not be tackled otherwise.
  - Re: (Score:2)
    
    by Ostracus ( 1354233 ) writes:
    
    The summary actually does a good job listing all the use cases of which "AI" is a small part. Deep learning is more descriptive of a larger field. [wikipedia.org] Not that "AI" [wikipedia.org] as a larger body can't be done on this machine; after all it's a computer, just that its strengths favor some problems more than others.
- Re:Mostly useless (Score:5, Informative)
  
  by godrik ( 1287354 ) writes: on Monday May 31, 2021 @11:09AM (#61439644)
  
  NERSC is mostly doing science using supercomputers rather than research in supercomputers. (Though they also do HPC research, and often publish in SC.) Machine Learning is pretty good at solving lots of classic science problems. The best protein folding methods we have today are ML based.
  Astronomy uses a lot of ML nowadays. We generate more astronomy data per day than we can process in a day. So astronomers use ML to separate the parts of the data that are potentially interesting from the parts that are not. That narrows down the amount of data worth looking at with more expensive methods.
  In AI in general, more computational power is useful. Each time we decrease the turnaround time of training a model, it enables a more responsive development cycle. There was a great talk on that at GTC13 by a facebook engineer.
  So yeah, we do need that computational power.
  
- Re: (Score:2)
  
  by Baconsmoke ( 6186954 ) writes:
  
  I'm confused by your statement. Part of AI is Machine Learning, and the speed of Machine Learning is 100% based on computation power. Including the above topic of piecing together astronomical images by the tens or even hundreds of thousands. A slow system will take a ridiculous amount of time working on a project like this. What they built is specifically to do things that Machine Learning is better at than humans. So, even with my limited understanding, it would seem that speed is extremely important.
- Re: (Score:2)
  
  by LifesABeach ( 234436 ) writes:
  
  ok.
  but can it give me 120 frames per second when i play call of duty war zone.
  otherwise.
  so what
In other news (Score:5, Informative)

by DrMrLordX ( 559371 ) writes: on Monday May 31, 2021 @07:42AM (#61439142)

The real news here is that while Perlmutter was completed mostly on time, Aurora - featuring Intel's Ponte Vecchio accelerators - was not. Perlmutter features AMD EPYC 7003 CPUs:
https://www.amd.com/en/press-r... [amd.com]
Here we have AMD and nVidia rolling out their hardware more-or-less on schedule while Intel continues to struggle to produce anything meaningful in the HPC market. Or the cloud/hyperscaler market. Or really any market other than 4c laptops. Speculation was that Frontier might go online before Aurora, and that's looking increasingly likely:
https://www.hpcwire.com/2020/1... [hpcwire.com]

- Re: (Score:3)
  
  by DrMrLordX ( 559371 ) writes:
  
  Actually they're 7003-series CPUs: specifically, the EPYC 7763.
- Re: (Score:1)
  
  by Black Parrot ( 19622 ) writes:
  
  With maybe a bit of cryptocurrency mining on the side.
Why (Score:1)

by AlexHilbertRyan ( 7255798 ) writes:

Why 6159 ? What a strange number.
- Re: (Score:2)
  
  by Jamu ( 852752 ) writes:
  
  3 x 2k + 15
  
  +15 for management of the others?
- Re: (Score:2)
  
  by Gabest ( 852807 ) writes:
  
  That's 6159 kids with no GPU because all the silicon went into this machine.
  - Re: (Score:2)
    
    by ZiggyZiggyZig ( 5490070 ) writes:
    
    "Kids" :D
- Re: (Score:3)
  
  by godrik ( 1287354 ) writes:
  
  Why 6159 ? What a strange number.
  Not really. None of these systems have been built in powers of two for about 20 years. Really people think in terms of cabinets and what you fit in there. The specs of the machine are here:
  https://docs.nersc.gov/systems... [nersc.gov]
  It is 1536 compute nodes with 4 GPUs each, in 12 cabinets, plus one GPU in each of the 15 login nodes.
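
The node counts quoted in this thread do add up to 6,159. A minimal tally, using only the figures given above (the constant names are just labels chosen for this sketch), and reading Jamu's "2k" as 2048, which is an assumption about what was meant:

```python
# Tally of Perlmutter's phase-one A100 GPUs from the figures quoted in the thread.
COMPUTE_NODES = 1536
GPUS_PER_COMPUTE_NODE = 4
LOGIN_NODES = 15
GPUS_PER_LOGIN_NODE = 1

compute_gpus = COMPUTE_NODES * GPUS_PER_COMPUTE_NODE  # 6144
login_gpus = LOGIN_NODES * GPUS_PER_LOGIN_NODE        # 15
total_gpus = compute_gpus + login_gpus

# "3 x 2k + 15" gives the same number if "2k" is read as 2048 (assumption).
assert total_gpus == 3 * 2048 + 15
print(total_gpus)  # 6159
```
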
How? (Score:2)

by TheDarkMaster ( 1292526 ) writes:

Just curious: How is it done to link so many GPUs in one computational unit?
- Re: How? (Score:2)
  
  by IdanceNmyCar ( 7335658 ) writes:
  
  Your question seems to lack specificity. It's as if you are asking how data is shared. Maybe it's just me but the HPC answer to this is pretty straightforward.
  State is shared as infrequently as possible by finding problems that can be easily divided into smaller sets of work with many computations. I didn't RTFA but systems with unique shared memory designs are generally less common. Since these kinds of problems require state to be distributed to individual machines rarely, the network might not even n
  - Re: (Score:2)
    
    by TheDarkMaster ( 1292526 ) writes:
    
    Yes, I was actually thinking in general terms. In the sense of how do you get so many GPUs to act as a single "computer".
    - Re: (Score:2)
      
      by SirSlud ( 67381 ) writes:
      
      The fact that you put quotes around computer means you're aware that the idea of multiple computers acting as a single computer is a subjective judgement. Is it not enough to say they're working on a common problem and the GPUs don't require their working sets of data to be input and/or collected manually? Folding @ home was commonly described as a computer because it was a bunch of computers independently working on the same problem, which was coordinated/managed from one place. These are pretty abstract c
      - Re: (Score:2)
        
        by TheDarkMaster ( 1292526 ) writes:
        
        Well, what I am asking is more at the technical level, I am aware that physically it is a very large set of individual computers. What I’m curious about is how it’s done for them to act as a unit
    - Re: (Score:2)
      
      by ceoyoyo ( 59147 ) writes:
      
      You don't. It's a typical cluster, including the GPUs. If you want to use more than one GPU at a time you need to write your code to do so.
      This works pretty well for deep learning because training involves repeatedly showing a bunch of examples and computing gradients. You can run that in parallel and just average the gradients with only a slight loss in efficiency.
      - Re: (Score:2)
        
        by TheDarkMaster ( 1292526 ) writes:
        
        Let's see if I got it right. So it's just a bunch of networked (or something else that connects them together) common computers that can only operate if the code used is meant to be distributed?
        
          - Re: (Score:3)
            
            by godrik ( 1287354 ) writes:
            
            Essentially yes.
            It is similar to what you would build at home by connecting a few machines with GPUs with an ethernet cable. Now they are using "fancy" ethernet that enables GPU-to-GPU communication even if the GPUs are not in the same compute node. (They call it RDMA.)
            You program a system like this very similarly to how you would program any distributed application. Usually MPI or Hadoop Spark.
            Actually, most multi GPU applications even if they are sitting in the same machine are programmed in a distributed way
        
          - Re: (Score:2)
            
            by ceoyoyo ( 59147 ) writes:
            
            Typically you have a few to a few dozen cores plus several GPUs in a node, all of which are connected by a very fast bus, like a single "computer." Then you usually connect the nodes with a fast interconnect. It looks like Perlmutter uses Cray Slingshot for that, which looks sort of like ethernet, but 200 Gb/s, 1.2 billion packets/s, and with switches that can do 12.8 Tb/s.
            But yes, all of the top supercomputers are clusters.
        
          - Re: How? (Score:2)
            
            by IdanceNmyCar ( 7335658 ) writes:
            
            One thing I think others haven't mentioned that might better fill in the gap is the existence of a Job Manager. The Supercomputer is effectively a bunch of machines networked together. They mentioned in the summary an expansion still planned, which sounded like a different site, so what we call a supercomputer is really a bunch of networked machines you can leverage together to solve a problem. Those machines are generally in a single location but I don't think it's a requirement. The machines are leveraged
- Re: (Score:2)
  
  by drinkypoo ( 153816 ) writes:
  
  The same way we linked machines together back in the nineties, by carving a job up into smaller pieces and handing those pieces off to nodes. Back then it was predominantly DQS, a nifty system that would send your job to nodes with the necessary keywords to support them automatically. Today I have no idea what software is used specifically, though DQS still works :)
  - Re: (Score:2)
    
    by OrangeTide ( 124937 ) writes:
    
    Kubernetes calls them labels. But your jobs are containers and are distributed according to how you've labeled the nodes in your cluster. Standing on the shoulders of giants ...
- Re: (Score:2)
  
  by OrangeTide ( 124937 ) writes:
  
  RDMA (remote direct memory access) plays a big part [nvidia.com] of their current HPC lineup. Basically one GPU can access another GPU's memory over a switched fabric. For GPUs that are close by, such as in the same system or at least the same rack, there is NVLink [nvidia.com], which would have a much lower latency than RDMA.
  The hard part of course is in the software making good decisions on where to keep data so that it is not costly to fetch when it is needed.
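
The data-parallel idea described in this thread (each node computes gradients on its own share of the examples, then the gradients are averaged) can be sketched in a few lines. This is a toy in plain Python, not how Perlmutter's software works: the helper names `gradient`, `split`, and `train` are hypothetical, the model is a one-parameter line fit, and the averaging that a real cluster would do over the network (for example with MPI) is simulated with an ordinary list.

```python
# Data-parallel gradient descent sketch: each "worker" computes a gradient on
# its own shard, then the gradients are averaged before the shared update.
# Illustrative only; no MPI, GPU, or networking library is involved.

def gradient(w, shard):
    """Gradient of mean squared error for the model y = w * x on one shard."""
    return sum(2 * (w * x - y) * x for x, y in shard) / len(shard)


def split(data, workers):
    """Deal examples round-robin into shards, one per worker."""
    return [data[i::workers] for i in range(workers)]


def train(data, workers, steps=200, lr=0.01):
    w = 0.0
    shards = split(data, workers)
    for _ in range(steps):
        grads = [gradient(w, s) for s in shards]  # would run in parallel
        w -= lr * sum(grads) / len(grads)         # average, then update
    return w


# Toy data generated from y = 3x; 8 examples split across 4 workers.
data = [(float(x), 3.0 * x) for x in range(1, 9)]
print(round(train(data, workers=4), 3))  # 3.0
```

With equal-sized shards, the average of the per-shard gradients equals the full-batch gradient, so the four-worker run follows the same path as a single worker. The extra cost on a real system is the communication needed to do the averaging.
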
named after Saul Perlmutter (Score:3)

by pz ( 113803 ) writes: on Monday May 31, 2021 @08:55AM (#61439314) Journal

From the article:
Dark energy was largely discovered through the 2011 Nobel Prize-winning work of Saul Perlmutter, a still-active astrophysicist at Berkeley Lab who will help dedicate the new supercomputer named for him.

- Re: named after Saul Perlmutter (Score:2)
  
  by NateFromMich ( 6359610 ) writes:
  
  Pretty impressive, discovering something that's hypothetical.
- I had to look it up (Score:2)
  
  by Ecuador ( 740021 ) writes:
  
  I was going to write the same thing, as I had to look it up - I had found it quite odd they would make a supercomputer to run Perl.
- Re: (Score:1)
  
  by Shag ( 3737 ) writes:
  
  I used to be affiliated with the Lab and took a lot of data for the "follow-on" research mentioned in the press release, collaborating with Saul and Rollin and a few dozen other people, but I'm used to living people getting things like this named after them, so I scrambled to Google, fearing he had died. Glad to hear he is indeed still alive, and got to kick off the first compute job.
Sounds perfect... (Score:2)

by Baconsmoke ( 6186954 ) writes:

for my render/modelling needs. I could probably render my entire effort of around 27,000 images in a few seconds. That would be pretty nice.
The AI is alive (Score:2)

by nospam007 ( 722110 ) * writes:

It mills Bitcoin on the side for its dark purposes.
The next NERSC announcement: (Score:1)

by Salton Pepper ( 6245830 ) writes:

"Cutting edge supercomputer brought to its knees when attempting to run Crysis". Sorry, I'm a little bit bored at the moment.
but can it play doom? (Score:1)

by zeiche ( 81782 ) writes:

it might be better used by a kid to play fortnight.
A Beowulf Cluster (Score:2)

by flyingfsck ( 986395 ) writes:

I really would like to have a cluster of those!
That explains the gpu shortage? (Score:1)

by Cnox ( 6973744 ) writes:

So that's where my god damn GPU stock went....
Who wants to be the person to explain ... (Score:3)

by Babel-17 ( 1087541 ) writes: on Monday May 31, 2021 @10:55PM (#61441472)

Who wants to explain the chip shortage to the AI when it asks its developers to build it a bride/husband/significant other?
