Slashdot
NVIDIA Shaking Up the Parallel Programming World
Posted by ScuttleMonkey on Sat May 03, 2008 04:37 AM
from the best-discoveries-made-by-accident dept.
An anonymous reader writes "NVIDIA's CUDA system, originally developed for their graphics cores, is finding migratory uses into other massively parallel computing applications. As a result, it might not be a CPU designer that ultimately winds up solving the massively parallel programming challenges, but rather a video card vendor. From the article: 'The concept of writing individual programs which run on multiple cores is called multi-threading. That basically means that more than one part of the program is running at the same time, but on different cores. While this might seem like a trivial thing, there are all kinds of issues which arise. Suppose you are writing a gaming engine and there must be coordination between the location of the characters in the 3D world, coupled to their movements, coupled to the audio. All of that has to be synchronized. What if the developer gives the character movement tasks its own thread, but it can only be rendered at 400 fps. And the developer gives the 3D world drawer its own thread, but it can only be rendered at 60 fps. There's a lot of waiting by the audio and character threads until everything catches up. That's called synchronization.'"
Related Stories
Submission: NVIDIA is shaking up the parallel programming worl by Anonymous Coward
Technology: AMD Banks On Flood of Stream Apps 124 comments
Slatterz writes "Closely integrating GPU and CPU systems was one of the motivations for AMD's $5.4bn acquisition of ATI in 2006. Now AMD is looking to expand its Stream project, which uses graphics chip processing cores to perform computing tasks normally sent to the CPU, a process known as General Purpose computing on Graphics Processing Units (GPGPU). By leveraging thousands of processing cores on a graphics card for general computing calculations, tasks such as scientific simulations or geographic modelling, which are traditionally the realm of supercomputers, can be performed on smaller, more affordable systems. AMD will release a new driver for its Radeon series on 10 December which will extend Stream capabilities to consumer cards." Reader Vigile adds: "While third-party consumer applications from CyberLink and ArcSoft are due in Q1 2009, in early December AMD will release a new Catalyst driver that opens up stream computing on all 4000-series parts and a new Avivo Video Converter application that promises to drastically increase transcoding speeds. AMD also has partnered with Aprius to build 8-GPU stream computing servers to compete with NVIDIA's Tesla brand."
Technology: NVIDIA's $10K Tesla GPU-Based Personal Supercomputer 236 comments
gupg writes "NVIDIA announced a new category of supercomputers — the Tesla Personal Supercomputer — a 4 TeraFLOPS desktop for under $10,000. This desktop machine has 4 of the Tesla C1060 computing processors. These GPUs have no graphics out and are used only for computing. Each Tesla GPU has 240 cores and delivers about 1 TeraFLOPS single precision and about 80 GigaFLOPS double-precision floating point performance. The CPU + GPU is programmed using C with added keywords using a parallel programming model called CUDA. The CUDA C compiler/development toolchain is free to download. There are tons of applications ported to CUDA including Mathematica, LabView, ANSYS Mechanical, and tons of scientific codes from molecular dynamics, quantum chemistry, and electromagnetics; they're listed on CUDA Zone."
This discussion has been archived.
No new comments can be posted.
The Fine Print: The following comments are owned by whoever posted them. We are not responsible for them in any way.
need some brains (Score:2, Funny)
Re: (Score:2, Funny)
Dumbing down (Score:5, Funny)
Wow, I bet nobody on slashdot knew that!
Re: (Score:2)
HOW does CUDA make it easier? I'm very confident it's not because Nvidia hardware contains lots of stream processors.
Oh well, guess I need to RTFA, an
Re: (Score:2)
So to put it another way, the big thr
Re: (Score:3, Funny)
Re: (Score:2, Informative)
CUDA ("Compute Unified Device Architecture"), is a GPGPU technology that allows a programmer to use the C programming language to code algorithms for execution on the graphics processing unit (GPU).
Re: (Score:2)
Why would anyone but a fairly advanced programmer be interested in the new fads in parallel programming? Besides, the summary is misleading, giving the impression that multithreading is exclusive to multicore processors, which is false; it can give huge benefits in a s
Re: (Score:2)
IAACS, multi-threading and parallel processing are two different but related concepts. The hard part is coming up with a parallel algorithm for certain classes of problems; implementing low-level synchronization is trivial by comparison. OTOH I've seen a lot of programmers stab themselves in the eye with forks.
Where's the story? (Score:4, Informative)
Re: (Score:3, Insightful)
Re: (Score:3, Funny)
Re: (Score:2)
Re: (Score:2)
-Why would character movement need to run at a certain rate? It sounds like the thread should spend most of its time blocked waiting for user input.
-What's so special about the audio thread? Shouldn't it just handle events from other threads without communicating back? It can block when it doesn't
Re: (Score:3, Informative)
You usually have a game-physics engine running, which practically integrates the movements of the characters (character movement) or generally updates the world model (position and state of all objects). Even without input, the world moves on. A fixed rate is usually chosen because it is simpler than a varying time step.
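The fixed-rate update described above can be sketched roughly as follows. This is a minimal illustration, not taken from any particular engine: the function name `simulate`, the 1-D world, and the constant velocity are all assumptions, and the step of 1/64 s is chosen only so the floating-point arithmetic stays exact (a real game would more likely use something like 1/60 s).

```python
def simulate(frame_times, dt=1.0 / 64.0, velocity=2.0):
    """Advance a 1-D position using fixed physics steps of size dt.

    frame_times: duration of each rendered frame, in seconds.
    Returns (position, number_of_physics_steps).
    All names and values here are hypothetical, for illustration only.
    """
    position = 0.0
    accumulator = 0.0  # elapsed time that has not been simulated yet
    steps = 0
    for frame_time in frame_times:
        accumulator += frame_time
        # Run as many fixed-size steps as fit into the elapsed time.
        # The world keeps moving even if no input arrived this frame.
        while accumulator >= dt:
            position += velocity * dt
            accumulator -= dt
            steps += 1
    return position, steps


# Three uneven frames and one long frame cover the same total time,
# so they produce the same number of physics steps.
print(simulate([1 / 32, 3 / 64, 1 / 64]))
print(simulate([6 / 64]))
```

Because the step size never changes, the result depends only on how much time has passed, not on how the renderer happened to slice it into frames.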
-What's so special about the audio
Re: (Score:2)
Also, the article would've done better just talking about the thread manager you mention. That makes more sense than the stuff about semaphores affecting performance positively (unless I misunderstood the sentence about the cache no longer being stale).
And, uh, that drawer comment was a joke...
Thats.. (Score:5, Funny)
Re: (Score:2)
But we already know it's hard to split up all kinds of work evenly.
Anyway, what does CUDA do to help with that?
CUDA helps by... (Score:2)
Um, no, that can't be right...
CUDA = NVIDIA desperate to compete with Intel? (Score:5, Insightful)
First of all, there are very few general purpose applications that special purpose NVIDIA hardware running CUDA can do significantly better than a real general purpose CPU, and Intel intends to cut even that small gap down within a few product cycles. Second, nobody wants to tie themselves to CUDA when it's built entirely for proprietary hardware. Third, CUDA still has a *lot* of limitations. It's not as easy to develop a physics engine for a GPU using CUDA as it is for a general purpose CPU.
Now, I haven't used CUDA lately, so I could be way off base here. However, multi-threading isn't the real challenge to efficient use of resources in a parallel computing environment. It's designing your algorithms to be able to run in parallel in the first place. Most multi-threaded software out there still has threads that have to run on a single CPU, and the entire package bottlenecks on the single CPU running that thread even if other threads are free to run on other processors. This sort of bottleneck can only be avoided at the algorithm level. This isn't something CUDA is going to fix.
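The bottleneck described above is commonly quantified with Amdahl's law, which the comment does not name but which captures the same point: if a fraction of the work must run serially, adding cores only speeds up the rest. A small sketch, with the function name `amdahl_speedup` being an assumption for illustration:

```python
def amdahl_speedup(serial_fraction, cores):
    """Upper bound on speedup under Amdahl's law.

    serial_fraction: share of the run time that cannot be parallelized (0..1).
    cores: number of processors available for the parallel part.
    """
    if not 0.0 <= serial_fraction <= 1.0:
        raise ValueError("serial_fraction must be between 0 and 1")
    if cores < 1:
        raise ValueError("cores must be at least 1")
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / cores)


# With a quarter of the work stuck on one thread, even a huge number of
# cores cannot deliver a 4x speedup.
for n in (2, 8, 128):
    print(n, round(amdahl_speedup(0.25, n), 2))
```

The bound only improves if the serial fraction itself shrinks, which is an algorithm-design problem rather than something a threading API can fix.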
Now, I can certainly see why NVIDIA is playing up CUDA for all they're worth. Video game graphics rendering could be on the cusp of a technological singularity. Namely, ray tracing. Ray tracing is becoming feasible to do in real time. It's a stretch at present, but time will change that. Ray tracing is a significant step forward in terms of visual quality, but it also makes coding a lot of other things relatively easy. Valve's recent "Portal" required some rather convoluted hacks to render the portals with acceptable performance, but in a ray tracing engine those same portals only take a couple lines of code to implement and have no impact on performance. Another advantage of ray tracing is that it's dead simple to parallelize. While current approaches to video game graphics are going to get more and more difficult to work with as parallel processing rises, ray tracing will remain simple.
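To illustrate why ray tracing parallelizes so easily, here is a toy sketch in which every pixel is computed independently of every other pixel. The scene (one sphere, a camera at the origin), the image size, and all function names are assumptions made up for this example. It uses Python threads, which will not actually run faster because of the interpreter lock; the point is only that the work splits cleanly with no shared state between rows.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical toy scene: one sphere in front of a camera at the origin.
WIDTH, HEIGHT = 16, 8
SPHERE_CENTER = (0.0, 0.0, 3.0)
SPHERE_RADIUS = 1.0


def hits_sphere(px, py):
    """Return True if the camera ray through pixel (px, py) hits the sphere."""
    # Map the pixel centre onto an image plane at z = 1, spanning [-1, 1].
    x = (px + 0.5) / WIDTH * 2.0 - 1.0
    y = (py + 0.5) / HEIGHT * 2.0 - 1.0
    d = (x, y, 1.0)
    # Solve |t*d - center|^2 = radius^2 for t; a hit exists when the
    # discriminant is non-negative (the sphere lies in front of the camera).
    a = sum(v * v for v in d)
    b = -2.0 * sum(dv * cv for dv, cv in zip(d, SPHERE_CENTER))
    c = sum(v * v for v in SPHERE_CENTER) - SPHERE_RADIUS ** 2
    return b * b - 4.0 * a * c >= 0.0


def render_row(py):
    """Render one image row; it depends on nothing but its own row index."""
    return "".join("#" if hits_sphere(px, py) else "." for px in range(WIDTH))


def render_serial():
    return [render_row(py) for py in range(HEIGHT)]


def render_parallel(workers=4):
    # Rows are handed to a pool of workers; map() returns them in order.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(render_row, range(HEIGHT)))


print("\n".join(render_parallel()))
```

No locks or synchronization appear anywhere, because no row ever reads or writes data belonging to another row.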
The real question is whether NVIDIA is poised to do ray-tracing better than Intel in the next few product cycles. Intel is hip to all of the above, and they can smell blood in the water. If they can beef up the floating point performance of their processors then dedicated graphics cards may soon become completely unnecessary. NVIDIA is under the axe and they know it, which might explain all the recent anti-Intel smack-talk. Still, it remains to be seen who can actually walk the walk.
Re: (Score:2)
First of all, there are very few general purpose applications that special purpose NVIDIA hardware running CUDA can do significantly better than a real general purpose CPU, and Intel intends to cut even that small gap down within a few product cycles.
That's not strictly true. Off the top of my head: Sorting, FFTs (or any other dense Linear Algebra) and Crypto (both public key and symmetric) covers quite a lot of range. The only real issue for these applications is the large batch sizes necessary to overcome the latency. Some of this is inherent in warming up that many pipes, but most of it is shit drivers and slow buses.
The real question is what benefits will CUDA offer when the vector array moves closer to the processor? Most of the papers with the abo
Re: (Score:2)
The advantage would be (assuming this is the wonderful solution it claims) that you run your task in the CUDA environment: if your client only has a pile of 1U racks then he can at least run it, and if he replaces a few of them with some Tesla [nvidia.com] racks, things will speed up a lot.
I did some programming at college, I do not claim to know anything about the workings of Tesla or CUDA, but it sure sounds rosy if
Re: (Score:2)
NVidia is doing that? an insult to INMOS... (Score:5, Interesting)
Many moons ago, when most slashdotters were nippers, a British company named INMOS provided an extensible hardware and software platform [wikipedia.org] that solved the problem of parallelism, in many ways similar to CUDA.
Ironically, some of the first demos I saw using transputers were raytracing demos [classiccmp.org].
The problem of parallelism and the solutions available are quite old (more than 20 years), but it's only now that limits are reached that we see the true need for it. But the true pioneer is not NVIDIA, because there were others long before them.
Re: (Score:3, Interesting)
Happy Days at UKC.
couldn't resist a quick Inmos story... (Score:5, Interesting)
New programming tools needed (Score:4, Insightful)
But why should I?
What is needed are new, high-level programming languages that figure out how to take a set of instructions and best interface with the available processing hardware on their own. This is where the computer smarts need to be focused today, IMO.
All computer programming languages, and even just plain applications, are abstractions from the computer hardware. What is needed are more robust abstractions to make programming for multiple processors (or cores) easier and more intuitive.
Re: (Score:3, Interesting)
There are a couple of approaches that work well. If you use a functional language, then you can use monads to indicate side effects and the compiler can implicitly parallelise the parts that are free from side effects. If you use a language like Erlang or Pict based on a CSP or a
More investment needed in e.g Erlang (Score:4, Interesting)
The original Inmos Transputer was designed to solve such problems and relied on fast inter-processor links, and the AMD Hypertransport bus is a modern derivative.
So I disagree with you. The processing hardware is not so much the problem. If GPUs are small, cheap and address lots of memory, so long as they have the necessary instruction sets they will do the job. The issue to focus on is still interprocessor (and hence interprocess) links. This is how hardware affects parallelism.
I have on and off worked with multiprocessor systems since the early 80s, and always it has been fastest and most effective to rely on data channels rather than horrible kludges like shared memory with mutex locks. The code can be made clean and can be tested in a wide range of environments. I am probably too near retirement now to work seriously with Erlang, but it looks like a sound platform.
Re: (Score:2, Interesting)
You might be interested in some w
Yes, I read your paper (Score:3, Interesting)
Re:New programming tools needed (Score:5, Interesting)
Consider this parallel programming pseudo-example
find | tar | compress | remote-execute 'remote-copy | uncompress | untar'
This is a 7 process FULLY parallel pipeline (meaning non-blocking at any stage - every 512 bytes of data passed from one stage to the next gets processed immediately). This can work with 2 physical machines that have 4 processing units each, for a total of 8 parallel threads of execution.
Granted, it's hard to construct a UNIX pipe that doesn't block.. The following variation blocks on the xargs, and has less overhead than separate tar/compress stages but is single-threaded
find name-pattern | xargs grep -l contents-pattern | tar-gzip | remote-execute 'remote-copy | untar-unzip'
Here the messages being passed are serialized/linearized data.. But that's the power of UNIX.
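The pipeline idea above can be mimicked inside one program. The sketch below is only an analogy to the shell pipeline, not a replacement for it: each stage is a thread, the pipes are small bounded queues, and data moves in 512-byte chunks so a later stage can start working before an earlier one has finished. The stage names and the use of zlib for the compress/uncompress steps are assumptions for illustration.

```python
import queue
import threading
import zlib

CHUNK = 512   # bytes handed from one stage to the next, as in the comment
DONE = None   # sentinel marking the end of the stream


def source(data, out):
    """First stage: emit the input in CHUNK-sized pieces."""
    for i in range(0, len(data), CHUNK):
        out.put(data[i:i + CHUNK])
    out.put(DONE)


def compress(inp, out):
    """Middle stage: compress each chunk as soon as it arrives."""
    compressor = zlib.compressobj()
    while True:
        chunk = inp.get()
        if chunk is DONE:
            break
        out.put(compressor.compress(chunk))
    out.put(compressor.flush())  # emit whatever is still buffered
    out.put(DONE)


def uncompress(inp, out):
    """Last stage: undo the compression, again chunk by chunk."""
    decompressor = zlib.decompressobj()
    while True:
        chunk = inp.get()
        if chunk is DONE:
            break
        out.put(decompressor.decompress(chunk))
    out.put(decompressor.flush())
    out.put(DONE)


def run_pipeline(data):
    """Wire the stages together with bounded queues and collect the output."""
    q1, q2, q3 = (queue.Queue(maxsize=4) for _ in range(3))
    stages = [
        threading.Thread(target=source, args=(data, q1)),
        threading.Thread(target=compress, args=(q1, q2)),
        threading.Thread(target=uncompress, args=(q2, q3)),
    ]
    for stage in stages:
        stage.start()
    received = []
    while True:
        chunk = q3.get()
        if chunk is DONE:
            break
        received.append(chunk)
    for stage in stages:
        stage.join()
    return b"".join(received)


payload = bytes(range(256)) * 20
print(run_pipeline(payload) == payload)
```

The bounded queues play the role of the pipe buffer: a fast stage blocks when the queue is full instead of running arbitrarily far ahead of a slow one.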
In CORBA/COM/GNORBA/Java-RMI/c-RPC/SOAP/HTTP-REST/ODBC, your messages are 'remoteable' function calls, which serialize complex parameters; much more advanced than a single serial pipe/file-handle. They also allow synchronous returns. These methodologies inherently have 'waiting' worker threads.. So it goes without saying that you're programming in an MT environment.
This class of Remote-Procedure-Calls is mostly for centralization of code or central-synchronization. You can't block on a CPU mutex that's on another physically separate machine.. But if you RPC to a central machine with a single-variable mutex, then you can.. DB locks are probably more common these days, but it's the exact same concept - remote calls to a central locking service.
Another benefit in this class of IPC (Inter Process Communication) is that a stage or segment of the problem is handled on one machine.. But a pool of workers exists on each machine.. So while one machine is blocking, waiting for a peer to complete a unit of work, there are other workers completing their stage.. At any given time on every given CPU there is a mixture of pending and processing threads. So while a single task isn't completed any faster, a collection of tasks takes full advantage of every CPU and physical machine in the pool.
The above RPC type models involve explicit division of labor. Another class are true opaque messages.. JMS, and even UNIX's 'ipcs' Message Queues. In Java it's JMS. The idea is that you have the same workers as before, but instead of having specific UNIQUE RPC URI's (addresses), you have a common messaging pool with a suite of message-types and message-queue-names. You then have pools of workers that can live ANYWHERE which listen to their queues and handle an array of types of pre-defined messages (defined by the application designer). So now you can have dozens or hundreds of CPUs, threads, machines all symmetrically passing asynchronous messages back and forth.
To my knowledge, this is the most scalable type of problem.. You can take most procedural problems and break them up into stages, then define a message-type as the explicit name of each stage, then divide up the types amongst different queues (which would allow partitioning/grouping of computational resources), then receive-message/process-message/forward-or-reply-message. So long as the amount of work far exceeds the overhead of message passing, you can very nicely scale with the amount of hardware you can throw at the problem.
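The receive-message/process-message/forward-message loop described above might look roughly like this in a single process. It is a sketch under heavy assumptions: the message types "parse", "square" and "result", the handler functions, and the use of in-memory queues in place of a real message broker such as JMS are all invented for illustration.

```python
import queue
import threading


# Hypothetical stages: each handler returns (next_message_type, payload).
def handle_parse(payload):
    return ("square", int(payload))


def handle_square(payload):
    return ("result", payload * payload)


HANDLERS = {"parse": handle_parse, "square": handle_square}


def worker(queues, name):
    """Serve one named queue: receive, process, forward."""
    while True:
        message = queues[name].get()
        if message is None:  # shutdown signal for this worker
            return
        msg_type, payload = message
        next_type, result = HANDLERS[msg_type](payload)
        # In this sketch the queue name is simply the message type.
        queues[next_type].put((next_type, result))


def run(inputs, workers_per_stage=3):
    """Push inputs through the staged queues and return the sorted results."""
    inputs = list(inputs)
    queues = {name: queue.Queue() for name in ("parse", "square", "result")}
    threads = [
        threading.Thread(target=worker, args=(queues, name))
        for name in HANDLERS
        for _ in range(workers_per_stage)
    ]
    for thread in threads:
        thread.start()
    for text in inputs:
        queues["parse"].put(("parse", text))
    # Workers finish in no particular order, so sort what comes back.
    results = sorted(queues["result"].get()[1] for _ in inputs)
    # Every message is done by now; tell each worker to stop.
    for name in HANDLERS:
        for _ in range(workers_per_stage):
            queues[name].put(None)
    for thread in threads:
        thread.join()
    return results


print(run(["3", "1", "2"]))
```

Adding capacity to a slow stage is just a matter of starting more workers on that stage's queue; nothing else in the program needs to know how many there are.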
Re: (Score:2)
Unix pipes are a very primitive example of a dataflow language [wikipedia.org].
Re: (Score:3, Insightful)
Re: (Score:2)
When I came up through my CS degree, object-oriented programming was new. Programming was largely a series of sequentially ordered instructions. I haven't programmed in many years now, but if I wanted to write a parallel program I would not have a clue.
But why should I?
What is needed are new, high-level programming languages that figure out how to take a set of instructions and best interface with the available processing hardware on their own. This is where the computer smarts need to be focused today, IMO.
Crikey, when was your CS degree? Mine was a long time ago, yet I still learned parallel programming concepts (using the occam [wikipedia.org] language).
Re: (Score:2)
Uh, what a crap (Score:4, Informative)
But not if posted by The Ignorant.
What if the developer gives the character movement tasks its own thread, but it can only be rendered at 400 fps. And the developer gives the 3D world drawer its own thread, but it can only be rendered at 60 fps. There's a lot of waiting by the audio and character threads until everything catches up. That's called synchronization.
If a student of mine wrote this, a Fail would be the immediate consequence. How can 400 fps be 'only'? And why is threading bad, if the character movement is ready after 1/400 second? There is not 'a lot of waiting'; instead, there are a lot of cycles to calculate something else. And 'waiting' is not 'synchronisation'.
[The audio-rate of 7000 fps gave the author away; and I stopped reading. Audio does not come in fps.]
While we all agree on the problem of synchronisation in parallel programming, and maybe especially in the gaming world, we should not allow uninformed blurb on Slashdot.
Re: (Score:2)
How can 400 fps be 'only'?
You are responding to the following (hypothetical) statement:
but it can be rendered at only 400 fps
Which is different from the one written:
but it can only be rendered at 400 fps
See the difference?
yawn (Score:2)
Maybe nVidia will popularize parallel programming, maybe not. But I don't see any "shake up" or breakthroughs there.
CUDA is limiting, not liberating (Score:5, Informative)
Re: (Score:2)
The EETimes article is much better (Score:4, Informative)
Blog spam. Link to actual article. Nvidia loss? (Score:3, Interesting)
Nvidia is showing signs of being poorly managed. CUDA [cuda.com] is a registered trademark of another hi-tech company.
The underlying issue is apparently that Nvidia will lose most of its mid-level business when AMD/ATI and Intel/Larrabee begin shipping integrated graphics. Until now, Intel integrated graphics has been so limited as to be useless in many mid-level applications. Nvidia hopes to replace some of that loss with sales to people who want to use their GPUs to do parallel processing.
No one is going to "solve" the problem (Score:2)
Multi-threaded programming is a fundamentally hard problem, as is the more general issue of maximally-efficient scheduling of any dynamic resource. No one idea, tool or company is going to "solve" it. What will happen is that lots of individual ideas, approaches, tools and companies will individually address little parts of the problem, making it incrementally easier to produce efficient multi-threaded code. Some of these approaches will work together, others will be in opposition, there will be engineer
Reminds me of the OLD stories I used to hear... (Score:4, Interesting)
As I recall:
The processor, as it was sending the data to the bus, would have to tell the memory to get ready to read data through these cables. The "cables hack" was necessary because the cable path was shorter than the data bus path, and the memory would get the signal just a few mS before the data arrived at the bus.
These were fun stories to hear but now seeing what development challenges we face in parallel programming multi-core processors gives me a whole new appreciation for those old timers. These are old problems that have been dealt with before, just not on this scale. I guess it is true what they say, history always repeats itself.
Re: (Score:2, Insightful)
Re: (Score:2)
Look at the early OpenGL registry extension specifications - vendors couldn't even agree on what vector arithmetic instructions to implement.
Re: (Score:3, Insightful)
However, having mentioned Microsoft... If *someone* doesn't want OpenGL to succeed it is them... If and when OpenGL 3.0 ever appears, I bet there will be some talk of some "unknown party" threatening patent litigation...
Destroying OpenGL is of paramount importance to Microsoft,
Re: (Score:3, Informative)
But you can't have a 12GHz, at that speed light goes about ONE INCH per clock cycle in a vacuum, anything else is slower, signals in silicon are a lot slower.
An inch is a long way on a CPU. A Core 2 die is around 11mm along the edge, so at 12GHz a signal could go all of the way from one edge to the other and back. It uses a 14-stage pipeline, so every clock cycle a signal needs to travel around 1/14th of the way across the die, giving around 1mm. If every signal needs to move 1mm per cycle and travels at the speed of light, then your maximum clock speed is 300GHz.
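The arithmetic in the paragraph above can be checked in a few lines. This is only a back-of-the-envelope sketch: it uses the speed of light in a vacuum as an upper bound, and the function names are assumptions for illustration.

```python
C = 299_792_458.0  # speed of light in a vacuum, metres per second


def distance_per_cycle_mm(clock_hz):
    """How far light travels during one clock cycle, in millimetres."""
    return C / clock_hz * 1000.0


def max_clock_ghz(distance_mm):
    """Highest clock rate at which light still covers distance_mm per cycle."""
    return C / (distance_mm / 1000.0) / 1e9


# At 12 GHz, light covers roughly an inch (25.4 mm) per cycle.
print(round(distance_per_cycle_mm(12e9), 2), "mm per cycle at 12 GHz")

# An 11 mm die split across 14 pipeline stages is a bit under 1 mm per stage.
print(round(11 / 14, 2), "mm per pipeline stage")

# Covering 1 mm per cycle at light speed caps the clock near 300 GHz.
print(round(max_clock_ghz(1.0), 1), "GHz")
```

Real signals on a chip travel well below this speed, so the practical limit is lower than the figure computed here.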
Of course, as you say, electric signals travel a fair bit slower in silicon than photons do
Re: (Score:2)