Supercomputing Space Hardware Science

World's Fastest AI Supercomputer Built from 6,159 NVIDIA A100 Tensor Core GPUs (nvidia.com) 57

Slashdot reader 4wdloop shared this report from NVIDIA's blog, joking that maybe this is where all NVIDIA's chips are going: It will help piece together a 3D map of the universe, probe subatomic interactions for green energy sources and much more. Perlmutter, officially dedicated Thursday at the National Energy Research Scientific Computing Center (NERSC), is a supercomputer that will deliver nearly four exaflops of AI performance for more than 7,000 researchers. That makes Perlmutter the fastest system on the planet on the 16- and 32-bit mixed-precision math AI uses. And that performance doesn't even include a second phase coming later this year to the system based at Lawrence Berkeley National Lab.
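The "nearly four exaflops" figure can be roughly reproduced with back-of-the-envelope arithmetic. The sketch below assumes the headline number is simply the GPU count multiplied by the A100's published peak FP16 Tensor Core rate with structured sparsity (624 teraflops per GPU). That per-GPU rate is not stated in the summary and is an assumption here, as is the function name.

```python
# Back-of-the-envelope check of the "nearly four exaflops" claim.
GPU_COUNT = 6159             # from the summary
PEAK_TFLOPS_PER_GPU = 624    # ASSUMPTION: A100 peak FP16 Tensor Core rate with sparsity


def aggregate_exaflops(gpu_count, tflops_per_gpu):
    """Sum of peak per-GPU rates, converted from teraflops to exaflops."""
    return gpu_count * tflops_per_gpu / 1_000_000


total = aggregate_exaflops(GPU_COUNT, PEAK_TFLOPS_PER_GPU)
print(f"{total:.2f} exaflops")  # prints "3.84 exaflops"
```

Under that assumption the total is a theoretical peak for mixed-precision math, not a measured benchmark result.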

More than two dozen applications are getting ready to be among the first to ride the 6,159 NVIDIA A100 Tensor Core GPUs in Perlmutter, the largest A100-powered system in the world. They aim to advance science in astrophysics, climate science and more. In one project, the supercomputer will help assemble the largest 3D map of the visible universe to date. It will process data from the Dark Energy Spectroscopic Instrument (DESI), a kind of cosmic camera that can capture as many as 5,000 galaxies in a single exposure. Researchers need the speed of Perlmutter's GPUs to capture dozens of exposures from one night to know where to point DESI the next night. Preparing a year's worth of the data for publication would take weeks or months on prior systems, but Perlmutter should help them accomplish the task in as little as a few days.

"I'm really happy with the 20x speedups we've gotten on GPUs in our preparatory work," said Rollin Thomas, a data architect at NERSC who's helping researchers get their code ready for Perlmutter. DESI's map aims to shed light on dark energy, the mysterious physics behind the accelerating expansion of the universe.

A similar spirit fuels many projects that will run on NERSC's new supercomputer. For example, work in materials science aims to discover atomic interactions that could point the way to better batteries and biofuels. Traditional supercomputers can barely handle the math required to generate simulations of a few atoms over a few nanoseconds with programs such as Quantum Espresso. But by combining their highly accurate simulations with machine learning, scientists can study more atoms over longer stretches of time. "In the past it was impossible to do fully atomistic simulations of big systems like battery interfaces, but now scientists plan to use Perlmutter to do just that," said Brandon Cook, an applications performance specialist at NERSC who's helping researchers launch such projects. That's where Tensor Cores in the A100 play a unique role. They accelerate both the double-precision floating point math for simulations and the mixed-precision calculations required for deep learning.

This discussion has been archived. No new comments can be posted.

  • In other news (Score:5, Informative)

    by DrMrLordX ( 559371 ) on Monday May 31, 2021 @06:42AM (#61439142)

    The real news here is that while Perlmutter was completed mostly on time, Aurora - featuring Intel's Ponte Vecchio accelerators - was not. Perlmutter features AMD EPYC 7003 CPUs:

    https://www.amd.com/en/press-r... [amd.com]

    Here we have AMD and NVIDIA rolling out their hardware more-or-less on schedule while Intel continues to struggle to produce anything meaningful in the HPC market. Or the cloud/hyperscaler market. Or really any market other than 4c laptops. Speculation was that Frontier might go online before Aurora, and that's looking increasingly likely:

    https://www.hpcwire.com/2020/1... [hpcwire.com]

  • Why 6159? What a strange number.
  • Just curious: How is it done to link so many GPUs in one computational unit?
    • Your question seems to lack specificity. It's as if you are asking how data is shared. Maybe it's just me but the HPC answer to this is pretty straightforward.

      State is shared as infrequently as possible by finding problems that can be easily divided into smaller sets of work with many computations. I didn't RTFA but systems with unique shared memory designs are generally less common. Since these kinds of problems require state to be distributed to individual machines rarely, the network might not even n

      • Yes, I was actually thinking in general terms. In the sense of how do you get so many GPUs to act as a single "computer".
        • by SirSlud ( 67381 )

          The fact that you put quotes around computer means you're aware that the idea of multiple computers acting as a single computer is a subjective judgement. Is it not enough to say they're working on a common problem and the GPUs don't require their working sets of data to be input and/or collected manually? Folding@home was commonly described as a computer because it was a bunch of computers independently working on the same problem, which was coordinated/managed from one place. These are pretty abstract c

          • Well, what I am asking is more at the technical level. I am aware that physically it is a very large set of individual computers. What I’m curious about is how they are made to act as a unit.
        • by ceoyoyo ( 59147 )

          You don't. It's a typical cluster, including the GPUs. If you want to use more than one GPU at a time you need to write your code to do so.

          This works pretty well for deep learning because training involves repeatedly showing a bunch of examples and computing gradients. You can run that in parallel and just average the gradients with only a slight loss in efficiency.
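The gradient-averaging idea described above can be sketched in a few lines. This is a single-process simulation, not real multi-GPU code: each "worker" stands in for one GPU, the model is a one-parameter linear fit, and all names (`gradient`, `split`, `data_parallel_step`) are made up for illustration.

```python
# Data-parallel gradient averaging, simulated in one process.
# Model: y = w * x, loss = mean((w*x - y)^2). Each "worker" stands in for one GPU.


def gradient(w, xs, ys):
    """d/dw of the mean squared error over one batch."""
    n = len(xs)
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / n


def split(items, parts):
    """Split a list into `parts` equal-sized shards (length must divide evenly)."""
    size = len(items) // parts
    return [items[i * size:(i + 1) * size] for i in range(parts)]


def data_parallel_step(w, xs, ys, workers, lr):
    # Each worker computes a gradient on its own shard (in parallel on real hardware).
    shards = zip(split(xs, workers), split(ys, workers))
    local = [gradient(w, sx, sy) for sx, sy in shards]
    # Stand-in for an all-reduce: average the local gradients so that
    # every worker applies the same update.
    avg = sum(local) / workers
    return w - lr * avg, avg


xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [3.0 * x for x in xs]  # data generated with true w = 3
w = 0.0
for _ in range(200):
    w, _ = data_parallel_step(w, xs, ys, workers=4, lr=0.01)
print(round(w, 3))
```

With equal-sized shards the averaged gradient matches the full-batch gradient, so the arithmetic itself costs nothing extra. On a real cluster the cost comes from communicating the gradients between GPUs at every step.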

          • Let's see if I got it right. So it's just a bunch of ordinary computers, networked (or connected in some other way), that can only work together if the code is written to be distributed?
            • by godrik ( 1287354 )

              Essentially yes.
              It is similar to what you would build at home by connecting a few machines with GPUs with an ethernet cable. Now they are using "fancy" ethernet that enables GPU-to-GPU communication even if the GPUs are not in the same compute node. (They call it RDMA.)

              You program a system like this very similarly to how you would program any distributed application. Usually MPI or Hadoop Spark.

              Actually, most multi-GPU applications are programmed in a distributed way, even when the GPUs sit in the same machine.

            • by ceoyoyo ( 59147 )

              Typically you have a few to a few dozen cores plus several GPUs in a node, all of which are connected by a very fast bus, like a single "computer." Then you usually connect the nodes with a fast interconnect. It looks like Perlmutter uses Cray Slingshot for that, which looks sort of like ethernet, but 200 Gb/s, 1.2 billion packets/s, and with switches that can do 12.8 Tb/s.

              But yes, all of the top supercomputers are clusters.

            • One thing I think others haven't mentioned that might better fill in the gap is the existence of a Job Manager. The supercomputer is effectively a bunch of machines networked together. They mentioned in the summary an expansion still planned, which sounded like it might be at a different site, so what we call a supercomputer is really a bunch of networked machines you can leverage together to solve a problem. Those machines are generally in a single location, but I don't think it's a requirement. The machines are leveraged

    • The same way we linked machines together back in the nineties, by carving a job up into smaller pieces and handing those pieces off to nodes. Back then it was predominantly DQS, a nifty system that would send your job to nodes with the necessary keywords to support them automatically. Today I have no idea what software is used specifically, though DQS still works :)

      • Kubernetes calls them labels. But your jobs are containers and are distributed according to how you've labeled the nodes in your cluster. Standing on the shoulders of giants ...
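The keyword/label matching described in the two comments above can be illustrated with a toy placement routine. This is only a sketch of the general idea, not how DQS or Kubernetes actually schedule work; the node names, labels, jobs, and function names are all hypothetical.

```python
# Toy label-based job placement. All node names, labels, and jobs are hypothetical.


def eligible_nodes(job_requires, nodes):
    """Return names of nodes whose labels include every label the job requires."""
    return sorted(name for name, labels in nodes.items() if job_requires <= labels)


def schedule(jobs, nodes):
    """Place each job on the eligible node with the fewest jobs so far.

    A job that no node can satisfy is mapped to None.
    """
    load = {name: 0 for name in nodes}
    placement = {}
    for job, requires in jobs.items():
        candidates = eligible_nodes(requires, nodes)
        if not candidates:
            placement[job] = None
            continue
        # Break ties on load by node name so the result is deterministic.
        chosen = min(candidates, key=lambda n: (load[n], n))
        load[chosen] += 1
        placement[job] = chosen
    return placement


nodes = {
    "node-a": {"gpu", "a100"},
    "node-b": {"gpu", "a100"},
    "node-c": {"cpu-only"},
}
jobs = {
    "train-1": {"gpu"},
    "train-2": {"a100"},
    "preprocess": {"cpu-only"},
    "needs-fpga": {"fpga"},
}
placement = schedule(jobs, nodes)
print(placement)
```

Here the two GPU jobs are spread across the two GPU nodes, the CPU job goes to the CPU node, and the job asking for a label nobody has stays unplaced.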

    • RDMA (remote direct memory access) plays a big part [nvidia.com] in their current HPC lineup. Basically one GPU can access another GPU's memory over a switched fabric. For GPUs that are close by, such as in the same system or at least the same rack, there is NVLink [nvidia.com], which would have a much lower latency than RDMA.
      The hard part of course is in the software making good decisions on where to keep data so that it is not costly to fetch when it is needed.

  • by pz ( 113803 ) on Monday May 31, 2021 @07:55AM (#61439314) Journal

    From the article:

    Dark energy was largely discovered through the 2011 Nobel Prize-winning work of Saul Perlmutter, a still-active astrophysicist at Berkeley Lab who will help dedicate the new supercomputer named for him.

    • Pretty impressive, discovering something that's hypothetical.
    • I was going to write the same thing, as I had to look it up - I had found it quite odd they would make a supercomputer to run Perl.

    • by Shag ( 3737 )

      I used to be affiliated with the Lab and took a lot of data for the "follow-on" research mentioned in the press release, collaborating with Saul and Rollin and a few dozen other people, but I'm used to living people getting things like this named after them, so I scrambled to Google, fearing he had died. Glad to hear he is indeed still alive, and got to kick off the first compute job.

  • for my render/modelling needs. I could probably render my entire effort of around 27,000 images in a few seconds. That would be pretty nice.
  • It mills Bitcoin on the side for its dark purposes.

  • "Cutting edge supercomputer brought to its knees when attempting to run Crysis". Sorry, I'm a little bit bored at the moment.
  • it might be better used by a kid to play Fortnite.

  • I really would like to have a cluster of those!
  • So that's where my god damn GPU stock went....

  • by Babel-17 ( 1087541 ) on Monday May 31, 2021 @09:55PM (#61441472)
    Who wants to explain the chip shortage to the AI when it asks its developers to build it a bride/husband/significant other?
