NVIDIA Creates a 15B-Transistor Chip With 16GB Bandwidth Memory For Deep Learning (venturebeat.com) 128
An anonymous reader cites a report on VentureBeat: NVIDIA chief executive Jen-Hsun Huang announced that the company has created a new chip, the Tesla P100, with 15 billion transistors and 16GB of high-bandwidth memory for deep-learning computing. It's the biggest chip ever made, Huang said. "We decided to go all-in on A.I.," Huang said. "This is the largest FinFET chip that has ever been done." The chip's 15 billion transistors are roughly three times as many as in most processors or graphics chips on the market. It takes up 600 square millimeters. The chip can run at 21.2 teraflops. Huang said that several thousand engineers worked on it for years. Jim McGregor, writing for Forbes (the link is not accessible to users of ad-blocking tools): It features NVIDIA's new Pascal GPU architecture, the latest memory and semiconductor process, and packaging technology -- all to create the densest compute platform to date. In addition, it combines 16GB of die-stacked second-generation High Bandwidth Memory (HBM2). The memory and GPU are combined into a multi-chip module on a state-of-the-art silicon substrate. The P100 has NVIDIA's NVLink interface technology to connect multiple Tesla P100 GPU modules.
Welcome SkyNET overlords! (Score:3, Funny)
Please enjoy hunting me with your time machine.
Re: (Score:2)
Please enjoy hunting me with your time machine.
https://xkcd.com/652 [xkcd.com].
Re: (Score:2)
(spoiler) Since the Doctor is Skynet, he already has a time machine.
So, (Score:2)
1000 engineers (Score:2)
I'm always amazed that it takes so many engineers. What the heck do they all do? How does one organize this many contributions? Isn't this sort of thing highly automated, with largely repetitive subunits?
Re: (Score:2)
Dr. Ellie Sattler: Woman inherits the earth.
Re: (Score:3)
(with apologies to Michael Crichton)
Ian - God creates intelligence, god destroys intelligence. God creates man, man destroys god. Man creates AI.
Ellie - AI destroys man, woman inherits the earth...
Perhaps a different, more historical view from the 1950s: http://www.alteich.com/oldsite... [alteich.com]
Dwar Ev threw the switch. There was a mighty hum, the surge of power from ninety-six billion planets. Lights flashed and quieted along the miles-long panel. Dwar Ev stepped back and drew a deep breath. "The honor of asking the first question is yours, Dwar Reyn."
"Thank you," said Dwar Reyn. "It shall be a question that no single cybernetics machine has been able to answer." He turned to face the machine. "Is there a God?"
The mighty voice answered without hesitation, without the clicking of a single relay.
"Yes, now there is a God."
Sudden fear flashed on the face of Dwar Ev. He leaped to grab the switch.
A bolt of lightning from the cloudless sky struck him down and fused the switch shut.
Re: (Score:2)
The solution to hard AI will not be direct.
Re:1000 engineers (Score:5, Informative)
There are several factors. First of all, what they are building is a HUGE engineered system which would have taken up a couple of buildings a decade or two ago. The fact that the end product is small doesn't change the complexity. The second part is the fact that it IS so small, which brings its own complications. In addition, semiconductor manufacturing is a very tricky business where even making the simplest thing (e.g., a transistor) takes an enormous amount of planning, characterization, and tool design.
Part of it is the R&D -- nothing like this has been done before, so certain things have to be figured out (heat dissipation, how the proximity of components affects other components, stuff neither of us will understand, etc.). Another huge part is tooling and process -- someone has to design, test, and characterize the fabrication tools and processes (the "automation" you speak of has to be built by someone -- a device this complicated probably can't be built without the automation). The chip is divided into subsystems, each of which needs to be designed, simulated, and optimized. Someone has to integrate all the subsystems and simulate them together. The 1000 people probably include materials scientists, process engineers, electrical engineers of various stripes, semiconductor physicists, mechanical engineers (heat dissipation, packaging, etc.), systems engineers, engineering project managers, etc.
Re: (Score:2)
You could probably train the AI to play Quake 1 pretty effectively.
Re: (Score:2)
Re:1000 engineers (Score:5, Informative)
One organizes many contributions using any number of industry-standard design methodologies. Designing airplanes and cars uses even more engineers.
I suspect NVIDIA is slightly exaggerating and is counting the contribution of many "overhead" engineers who provide value for the whole engineering organization, such as people who work on design tools, design kits, methodology, and the like.
You're right, there are many repeated subunits, but each unit needs a team to optimize it.
For a chip this complex you need:
Logic Designers (who come up with high-level models for the chip and define the instruction set / hardware interface)
Front-end engineers that write Verilog and/or VHDL (I have no idea what NVIDIA uses)
Implementation engineers (who do place and route and parasitic extraction)
Verification engineers (who use various tools to see if everything is as it should be)
Packaging engineers (who work closely with vendors to develop a custom package for the chip/module)
Module engineers (since we have 3D stacked memories on this device the module engineering is far from trivial)
Thermal Engineers (3D modules typically have very complex thermal requirements)
Signal Integrity engineers (since we're going so fast just getting a signal from point A to point B is hard)
Analog/Mixed Signal engineers (for clocking, serial I/O development)
Integration Engineers (for modeling how to put all this together)
System Engineers (for figuring out if this is all going to work)
Software Engineers (for low-level software dev)
CAD Engineers (for developing and maintaining an appropriate computer-aided design flow)
Foundry Engineers (for working with the foundry on the physical production of the wafers... anything this big and complex will need process customization)
ESD engineers (for figuring out and implementing an ESD strategy)
Library Engineers (for customizing and optimizing the standard cell library used in the chip)
Product Engineers (for solving production problems as they arise)
Test Engineers (for developing and implementing tests to show the chip is working as expected)
Application Engineers (who work with early adopters to integrate this chip into their systems)
and on and on and on...
As you can see, an army of engineers is required for a chip this complex to see the light of day. On simpler chips, many of these roles can be played by the same people, but in a chip this big, they need to divide the work or it would never get done.
Re: (Score:2)
For a chip this complex you need:
How many of those job titles and descriptions actually correspond to a college major that an American citizen can learn?
It's a problem mostly seen in the U.S., say labor-market experts, thanks to a rapidly evolving economy and a divide between the country's educational institutions and employers that isn't there in other advanced economies. In Germany and Denmark, for example, the two groups collaborate to ensure training and apprenticeships lead to jobs after graduation. The gap has helped push U.S. job vacancies to 5.5 million, near historic highs. For most of the past year the number of job openings has exceeded the number of new hires, a reflection of employers' difficulty in filling positions.
http://www.wsj.com/articles/colleges-drill-down-on-job-listing-terms-1459704268 [wsj.com]
Re: (Score:2)
I see what you're getting at and I don't disagree. However, as you know, this stuff is really complicated and you need to be specialized in your career to be effective. Most of these jobs are for Electrical Engineers; a few could also be held by people who studied Computer Science or Mechanical Engineering. I'm an Analog/Mixed-Signal Engineer, and while I know Verilog and how to run verification tools, I'm frankly not as competent at those roles as specialists are. It is the way of the world.
I agree you coul
Re: (Score:2)
They almost all correspond to an electrical or computer engineering major, however the vast majority are not really available to new grads. Three or four on that list will accept new grads who have already specialized a bit in their masters programs, and then once they have a better handle on the big picture they can transfer into other roles.
For example, you don't design systems without k
Re: (Score:1)
Wow, "and on and on and on..."
It is amazing how much I take for granted.
Re: (Score:2)
I've actually been wondering and asking, for a while now, why we can't just buy like a 5 cubic inch block of CPU and stuff it in our computers. Yes, I know it will get hot. Yes, I know it'll suck down juice like an arc welder. I'm okay with that. I've got solar and wind. I can cool that down - there are things to do that with.
Seriously, am I the only one that envisions a 5^3" chunk of CPU and all the glorious things I could do with it? Coupled with stacks of those NVRAM critters, piped right next to it, and
Re: (Score:2)
Except you can't remove the heat from inside the block. The reason CPUs are basically flat 2D pieces is so we can glue on cooling fins and cool that bad boy down. Until we figure out how to do micro liquid cooling, under pressure, and interweave that cooling into the chip's design, we won't get real 3D stacks of processors.
Though what we can do is create 4-6 CPUs stacked vertically around a coolant tube. But I believe that runs afoul of pushing parts of the PCB too far away from the rest of the componen
Re: (Score:2)
Re: (Score:3)
Only if you have admin privileges, are a superuser, and enter the right password.
Or say "sudo make me a sandwich". :)
It works on geek girls, anyway, from what I hear.
Now We Know Why Drivers Suck (Score:2)
Re: (Score:2)
Re: (Score:2)
SLI-ed GPUs can't even share VRAM(with some lim
Re: (Score:2)
SLI is an acronym; it stands for Scan Line Interleave. What this means is that each GPU does half a screen's worth of work by rendering alternate lines. I am not sure what the OP's issue with drivers is, but my assumption would be the age of the hardware; 6xx is pretty old and might not be enough to support modern games anymore. I highly doubt the OP's issue is with the driver having parallelization problems.
Re: (Score:2)
SLI was Scan Line Interleave back in 1998 (and the Voodoo5 used small horizontal bands of pixels rather than straight scanlines). Then NVIDIA decided to revive it years later, deciding the letters would stand for Scalable Link Interface, which doesn't really mean anything in particular. It no longer interleaves scanlines.
Re: (Score:2)
One item on a long, long list.
Re: (Score:2)
Yes, let's have the chip designers drop everything and help write the device drivers. Brilliant. I should have the network engineers come upstairs and help me with my Excel spreadsheets.
Re: (Score:2)
Make sure you have a couple of them so they can argue among themselves and reach an agreement.
Worst case, send a random one of them off to fetch you a coffee!
Re: (Score:2)
Re: (Score:2)
Oh God, okay, I suppose you're right, and they had software engineers design this new chip instead of actual chip designers, thus stealing the precious resources you're bitching about.
Re: (Score:1)
Re: (Score:2)
Re: (Score:2, Informative)
It's a FinFET device. You can represent more than 1 binary bit per transistor by using multi-gate transistors.
This is not a factually correct statement. Multi-gate transistors are used because they are more energy-efficient, perform better, and can be scaled to smaller dimensions than traditional planar CMOS devices. The extra gates give better electrostatic control over the MOSFET channel, but they do not allow the device to perform operations on more than one bit of data at once.
https://en.wikipedia.org/wiki/Multigate_device
Re:15B transistors = 16 GB ? (Score:5, Informative)
Oh, for God's sake, I ignored this at first but now it's been modded up.
15 billion is the transistor count for the GPU logic. It's not the transistor count for the HBM2 memory installed alongside the GPU on the interposer. Adding an interposer does not suffice to make it all the same chip (hint from TFS: "multichip module").
FinFET is neither necessary nor sufficient for multi-level-cell-like bit representation. That's also a flash storage technology, not a logic or volatile memory technology (at least in mass-produced products).
It's 15 days to Weed Day. Put down whatever you're smoking and get back to work.
Re: (Score:2)
I've seen nothing that would indicate either way whether this is the truth. It's pure speculation. Others have speculated that to reach that 15 billion number they have to be counting the memory transistors as well. Though this is big at 600mm^2, it isn't that much bigger than previous dies that held a fraction of that number of transistors.
Re: (Score:2)
It has about 50% more transistors than the Oracle SPARC M7 at 10.2 billion, so the increase is reasonable.
Re: (Score:2)
You do realize that 15 billion transistors, if you assume that each holds one bit of stored information (HA!), is less than 2GB of storage?
BTW, it's not speculation, it's from NVIDIA's own press release.
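A quick back-of-envelope check in Python (a sketch using the 15.3 billion figure quoted elsewhere in this thread; the one-bit-per-transistor assumption is deliberately generous):

```python
# Even if every one of the GPU's ~15.3 billion transistors stored a full
# bit -- which logic transistors do not -- the total is well short of 16 GB.
transistors = 15_300_000_000
max_bytes = transistors / 8      # one bit per transistor, 8 bits per byte
max_gib = max_bytes / 2**30
print(round(max_gib, 2))         # ~1.78 GiB, i.e. "less than 2GB"
```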
Re: (Score:2)
There are two sets of memory on this chip if you read the reports. The die itself has an additional layer that is HBM (High Bandwidth Memory) linked directly to the GPU. Think of it like the L1 and L2 cache in x86 chips. There is nothing that indicates how much memory this is (as the quoted memory sizes are for the memory chips attached to the boards). I'm willing to bet the chip has around 10 billion transistors and the remaining 5 are the HBM layer that sits on top.
Re: (Score:2)
There are only two sets of memory if you consider the register file to be memory instead of cache (which you apparently do). The problem is, the published specifications demonstrate that you are simply wrong.
4MB of L2 cache and ~14MB of register file space per GPU [nvidia.com] means there are about 151 million bits associated with cache and "memory." On a chip with 15.3 billion transistors, that comfortably means you have about 15 billion transistors for GPU logic.
There is everything to indicate the specs of
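The parent's arithmetic checks out; a one-line sanity check (Python sketch, using the 4MB and ~14MB figures quoted above):

```python
# 4 MB of L2 cache plus ~14 MB of register file, converted to bits:
l2_bits = 4 * 2**20 * 8
rf_bits = 14 * 2**20 * 8
print(l2_bits + rf_bits)    # 150994944, i.e. roughly 151 million bits
```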
Re: (Score:2)
Since they are talking about bandwidth, I would guess that what they really mean is 16 GB/s. Although I don't see any reference to bandwidth in the article and the only reference I see to 16 is the 16 nm fabrication process.
Re: (Score:1)
The chip has 15B transistors and no RAM. (it does have 4MB L2 cache and 14MB worth of register files)
The entire Tesla P100 package comprises many chips, not just the GPU; collectively they add up to over 150 billion transistors and feature 16GB of stacked HBM2 VRAM.
Re: (Score:2)
I can address 16GByte with only 34 bits. That leaves more than 14.9B transistors left over to do whatever.
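For the curious, the address-width arithmetic (a minimal Python sketch):

```python
import math

capacity_bytes = 16 * 2**30                       # 16 GiB
addr_bits = math.ceil(math.log2(capacity_bytes))  # bits to address every byte
print(addr_bits)                                  # 34
```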
Traditional early adopter killer app (Score:2)
This should provide some astonishing porn.
Re: (Score:2)
LOL, why am I suddenly picturing millions of horny guys getting blown off by a porn AI which has developed attitude?
Except for the people into that whole humiliation thing, I just don't see that being a big selling point. :-P
I think sentient porn is the last thing we want.
Re: (Score:2)
Well, it could learn to categorize it and learn your preferences very fast, so it's not out of the question...
The P100 was already discovered.. (Score:2)
Re: (Score:1)
The reference was P100D, so maybe the car is using two of them?
most nVidia engineers work on all projects (Score:4, Informative)
From what my friends who work at nVidia tell me, most engineers work on all projects. They get sent problems from one GPU and, after fixing those, start working on issues from a CPU or some other project.
Units (Score:1)
16GigaBillion
somewhat deceiving numbers.... (Score:5, Informative)
Re:somewhat deceiving numbers.... (Score:5, Informative)
It turns out that for deep learning, half precision is very commonly used. You are using floats for numbers in a fairly small range, and accuracy isn't key. Half precision speeds up processing and, more importantly, lets you work with twice as much data.
Re: (Score:2)
Re: (Score:2)
Is there really any advantage over 16 bit integer, which would be faster and less complex?
Yes, artificial (and real) neurons deliver a weighted average of positive and negative inputs. So you usually have large positive and negative inputs which subtractively cancel to a moderate output. Integer doesn't handle this subtractive cancellation nearly as well as floating point, which can keep the same precision over large changes in scale.
Re: (Score:3)
Of course they do and do it with 0.00001525878 precision to boot. If that's all you need then you can get by with 16 bit numbers just fine.
Re:somewhat deceiving numbers.... (Score:5, Informative)
I don't think that's a very good explanation. If sigmoid or step neurons only used numbers in the range [0,1], then you could divide the range into 65,536 individual states and use 16 bits to translate [0,65535] into [0,1]. However, sigmoid neurons have many inputs of many different weights, so the total input to a sigmoid neuron can be greater than one. In fact, any one input, after weighting, can be greater than one. The weights themselves can be greater than one. Only the output is constrained to [0,1] by the sigmoid or step function.
In order to represent a number without a lot of accuracy, but keep the ability to represent large and small values, you need a floating-point number. I'm no expert in deep learning, but it does pass the sniff test that a 16-bit float would be good enough for neurons. I assume that NVIDIA has done their homework and determined that FP-8 numbers have too much rounding error to be useful in a neural network.
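The range-versus-precision trade-off is easy to demonstrate with NumPy's float16 type (a minimal sketch; the specific values are illustrative, not anything NVIDIA has published):

```python
import numpy as np

# Large inputs of opposite sign mostly cancel to a moderate output.
# float16 keeps the small residual here because its precision is
# relative to the magnitude, not fixed to one scale.
x = np.array([1000.0, -999.5], dtype=np.float16)
print(x.sum())                      # 0.5

# And unlike a [0,1]-scaled 16-bit integer, float16 spans a wide range:
print(np.finfo(np.float16).max)     # 65504.0
```

A 16-bit integer grid covering the same span would have steps of about 2, so the 0.5 residual above would be lost entirely.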
Re: (Score:2)
Now with floating point 0.5*0.5 = 0.25 which is a smaller number as expected. If you multiply two positive integers like 50*50 you get 2500, so a larger value which requires further operations on it for it to be useful.
The only "further operation" needed is to look at the higher word of the result which takes zero extra effort. For example, if you multiply two 16-bit words then you get a 32-bit result. The "extra effort" is taking the upper 16-bits of the result and ignoring the lower 16-bits.
There may well be good reasons for FP16 to be preferred over using integers, but scaling the result of multiplications isn't one of them.
Re: (Score:2)
The only "further operation" needed is to look at the higher word of the result which takes zero extra effort. For example, if you multiply two 16-bit words then you get a 32-bit result. The "extra effort" is taking the upper 16-bits of the result and ignoring the lower 16-bits.
So, multiplying 100 by 100 equals 0, but starting at 0 and adding 100 for 100 times equals 10000 ?
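The two posters are talking past each other: the "take the upper word" trick assumes the 16-bit values are fixed-point fractions, not plain integers. A Q15 sketch in Python (helper names are hypothetical):

```python
# Q15 fixed point: the int16 value 32767 represents just under 1.0.
def to_q15(x: float) -> int:
    return int(round(x * 32768))

def q15_mul(a: int, b: int) -> int:
    # 16x16 -> 32-bit product, then drop 15 fractional bits to stay in Q15.
    return (a * b) >> 15

half = to_q15(0.5)                  # 16384
print(q15_mul(half, half))          # 8192, which is 0.25 in Q15

# Interpreted as plain integers, discarding the low word fails:
print((100 * 100) >> 16)            # 0 -- the parent's objection exactly
```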
Re: (Score:1)
Shaders for mobile GPUs use half precision quite a bit; however, the small range (-2 to 2) is a problem for many operations, so you end up only being able to use them for less than half of the code.
I for one welcome our () Overlords (Score:5, Funny)
Just imagine a beowulf clust........
Oh, never mind.....
Re: (Score:2)
Re: (Score:2)
Tesla P100? (Score:1)
Re: (Score:3)
Musk can't copyright the name of a famous scientist.
TRADEMARK, not copyright ... and yes he can, but only for a narrow commercial purpose. Elon owns the trademark "Tesla" as a car brand. NVIDIA owns the trademark "Tesla" as a GPU brand.
Re: (Score:2)
Musk can't copyright the name of a famous scientist.
TRADEMARK, not copyright ... and yes he can, but only for a narrow commercial purpose. Elon owns the trademark "Tesla" as a car brand. NVIDIA owns the trademark "Tesla" as a GPU brand.
And C&C owns "Tesla" as a tank
The Most Advanced Hyperscale Datacenter GPU Ever (Score:2)
My brain has unlimited storage capacity and speed (Score:2)
Decent game AI when? (Score:1)
History repeats? (Score:2)
Hmmm... NVIDIA. Giant chip.
Bill Dally, are you going for a "jump approximate" instruction again?
600 square millimeters ???? (Score:2)
Re:600 square millimeters ???? (Score:5, Funny)
Is this right?? 2' x 2' chip?
That's right.
And the package is shaped like Stonehenge.
Re: (Score:2)
Is this right?? 2' x 2' chip?
No, it's not right.
600mm^2 is a chip just under 25mm on a side.
Re: (Score:2)
Square root of 600mm^2 = 24.4948974278mm
24.4948974278 mm = 0.964366 inches (0.964366")
Even still
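For reference, the die-size arithmetic from the posts above (Python sketch):

```python
import math

area_mm2 = 600.0
side_mm = math.sqrt(area_mm2)   # side of a square die of that area
side_in = side_mm / 25.4        # millimeters to inches
print(f"{side_mm:.2f} mm = {side_in:.3f} in")   # 24.49 mm = 0.964 in
```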
20 Tflops at half precision (Score:2)
We all know how this ends (Score:2)
Deep learning leads to Deep Thought leads to forty two.
Or.... (Score:3)
https://www.youtube.com/watch?... [youtube.com]
Numbers don't add up (Score:2)
Tesla P100 (Score:2)
So can it run 600 km on a single charge?
Deep learning about morality and post-scarcity? (Score:2)
An aside from the article: "Huang showed a demo from Facebook that used deep learning to train a neural network how to recognize a landscape painting. They then used the network to create its own landscape painting."
So much for such jobs... How about deep learning about post-scarcity economics?
https://en.wikipedia.org/wiki/... [wikipedia.org]
https://en.wikipedia.org/wiki/... [wikipedia.org]
Also: ""Our strategy is to accelerate deep learning everywhere," Huang said."
How about some deep learning about morality? Imagine training children (or
Imagine what a BEOWULF CLUSTER of these could do!! (Score:1)