NVIDIA Shaking Up the Parallel Programming World 154
An anonymous reader writes "NVIDIA's CUDA system, originally developed for their graphics cores, is finding its way into other massively parallel computing applications. As a result, it might not be a CPU designer that ultimately winds up solving the massively parallel programming challenges, but rather a video card vendor. From the article: 'The concept of writing individual programs which run on multiple cores is called multi-threading. That basically means that more than one part of the program is running at the same time, but on different cores. While this might seem like a trivial thing, all kinds of issues arise. Suppose you are writing a gaming engine: the locations of the characters in the 3D world must be coordinated with their movements, and both with the audio. All of that has to be synchronized. What if the developer gives the character-movement task its own thread, but it can only run at 400 fps? And the developer gives the 3D-world renderer its own thread, but it can only run at 60 fps? There's a lot of waiting by the audio and character threads until everything catches up. That's called synchronization.'"
NVidia is doing that? an insult to INMOS... (Score:5, Interesting)
Many moons ago, when most slashdotters were nippers, a British company named INMOS provided an extensible hardware and software platform [wikipedia.org] that solved the problem of parallelism, in many ways similar to CUDA.
Ironically, some of the first demos I saw using transputers were raytracing demos [classiccmp.org].
The problem of parallelism and the available solutions are quite old (more than 20 years), but it's only now, as the limits of single-core scaling are being reached, that we see the true need for them. The true pioneer is not NVIDIA; there were others long before them.
Re:NVidia is doing that? an insult to INMOS... (Score:3, Interesting)
Happy Days at UKC.
Re:New programming tools needed (Score:3, Interesting)
There are a couple of approaches that work well. If you use a functional language, you can use monads to indicate side effects, and the compiler can implicitly parallelise the parts that are free of side effects. If you use a language like Erlang or Pict, based on a CSP or pi-calculus model, you split your program into logically independent chunks with a message-passing interface between them, and the compiler or runtime can schedule them independently.
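A minimal Python sketch of the first approach (the names `pure` and `parallel_map` are made up for illustration): once a function is known to be side-effect free, its applications can be evaluated in any order, which is exactly what lets a scheduler run them in parallel. Here a thread pool stands in for what a functional-language compiler would do implicitly.

```python
from concurrent.futures import ThreadPoolExecutor

def pure(x: int) -> int:
    # No side effects: safe to evaluate in any order, on any core.
    return x * x + 1

def parallel_map(values):
    # The pool plays the role of the compiler/runtime scheduler that
    # parallelises the side-effect-free parts automatically.
    with ThreadPoolExecutor(max_workers=4) as pool:
        return list(pool.map(pure, values))  # result order is preserved
```

`pool.map` preserves input order, so the result is deterministic regardless of how the calls are scheduled.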
couldn't resist a quick Inmos story... (Score:5, Interesting)
More investment needed in e.g Erlang (Score:4, Interesting)
The original Inmos Transputer was designed to solve such problems and relied on fast inter-processor links; AMD's HyperTransport bus is a modern derivative.
So I disagree with you. The processing hardware is not so much the problem. If GPUs are small, cheap and address lots of memory, so long as they have the necessary instruction sets they will do the job. The issue to focus on is still interprocessor (and hence interprocess) links. This is how hardware affects parallelism.
I have worked on and off with multiprocessor systems since the early 80s, and it has always been fastest and most effective to rely on data channels rather than horrible kludges like shared memory with mutex locks. The code can be made clean and can be tested in a wide range of environments. I am probably too near retirement now to work seriously with Erlang, but it looks like a sound platform.
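To illustrate the channel-over-mutex point, here is a hedged Python sketch (all names invented): the counter is owned by exactly one thread and updated only through a channel, so the program contains no lock at all, yet many clients can update it concurrently.

```python
import queue
import threading

def counting_server(channel: queue.Queue, reply: queue.Queue) -> None:
    # The counter lives in exactly one thread; nobody else touches it,
    # so there is no mutex anywhere in this program.
    total = 0
    while True:
        msg = channel.get()
        if msg is None:          # end-of-stream sentinel
            reply.put(total)
            return
        total += msg

def parallel_count(n_clients: int, per_client: int) -> int:
    channel, reply = queue.Queue(), queue.Queue()
    server = threading.Thread(target=counting_server, args=(channel, reply))
    server.start()
    clients = [threading.Thread(
                   target=lambda: [channel.put(1) for _ in range(per_client)])
               for _ in range(n_clients)]
    for c in clients:
        c.start()
    for c in clients:
        c.join()
    channel.put(None)            # all clients finished: ask for the total
    total = reply.get()
    server.join()
    return total
```

Because only the server thread ever reads or writes `total`, the channel itself provides all the synchronization.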
Re:More investment needed in e.g Erlang (Score:2, Interesting)
You might be interested in some work I did evaluating Erlang on a 16 core SMP machine:
http://jkndrkn.livejournal.com/205249.html [livejournal.com]
Quick summary: Erlang is slow, though using the Array module for data-structure manipulation can help matters. Erlang could still be useful as a communications layer or monitoring system for processes written in C.
Re:New programming tools needed (Score:5, Interesting)
Consider this parallel programming pseudo-example:
find | tar | compress | remote-execute 'remote-copy | uncompress | untar'
This is a 7-process FULLY parallel pipeline (non-blocking at any stage: every 512 bytes of data passed from one stage to the next gets processed immediately). It can run on 2 physical machines with 4 processing units each, 8 units in total.
Granted, it's hard to construct a UNIX pipe that doesn't block. The following variation blocks on the xargs; it has less overhead than separate tar/compress stages, but is single-threaded at that point:
find name-pattern | xargs grep -l contents-pattern | tar-gzip | remote-execute 'remote-copy | untar-unzip'
Here the messages being passed are serialized/linearized data. But that's the power of UNIX.
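The same streaming idea can be sketched in Python with one thread per stage and queues as the pipes (the stage functions below are toy stand-ins for tar/compress/etc.): each stage consumes a chunk the moment it arrives, so all stages run concurrently just like the shell pipeline above.

```python
import queue
import threading

def stage(func, inbox: queue.Queue, outbox: queue.Queue) -> None:
    # Process each chunk as soon as it arrives; never wait for the whole stream.
    for chunk in iter(inbox.get, None):
        outbox.put(func(chunk))
    outbox.put(None)             # propagate end-of-stream to the next stage

def run_pipeline(chunks, funcs):
    pipes = [queue.Queue() for _ in range(len(funcs) + 1)]
    threads = [threading.Thread(target=stage, args=(f, pipes[i], pipes[i + 1]))
               for i, f in enumerate(funcs)]
    for t in threads:
        t.start()
    for c in chunks:             # feed the head of the pipeline
        pipes[0].put(c)
    pipes[0].put(None)
    out = list(iter(pipes[-1].get, None))
    for t in threads:
        t.join()
    return out

# Three toy stages standing in for tar | compress | uncompress:
result = run_pipeline([b"abc", b"def"],
                      [lambda b: b.upper(), lambda b: b[::-1], lambda b: b.lower()])
# result == [b"cba", b"fed"]
```

The sentinel `None` plays the role of end-of-file on a UNIX pipe, shutting the stages down in order.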
In CORBA/COM/GNORBA/Java-RMI/C-RPC/SOAP/HTTP-REST/ODBC, your messages are 'remoteable' function calls, which serialize complex parameters; much more advanced than a single serial pipe/file-handle. They also allow synchronous returns. These methodologies inherently have 'waiting' worker threads, so it goes without saying that you're programming in an MT environment.
This class of Remote Procedure Calls is mostly for centralization of code or central synchronization. You can't block on a CPU mutex that lives on a physically separate machine. But if you RPC to a central machine that holds a single mutex variable, then you can. DB locks are probably more common these days, but it's the exact same concept: remote calls to a central locking service.
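A toy sketch of that central locking service in plain Python (queues standing in for the RPC transport; all names hypothetical): the server grants the lock to one client at a time and does not even look at the next request until the current holder releases, which is exactly the mutual exclusion a DB lock provides.

```python
import queue
import threading

def lock_server(requests: queue.Queue) -> None:
    # The server owns the lock. Grant it to one client at a time; block on
    # the release channel before servicing the next request.
    while True:
        msg = requests.get()
        if msg is None:                   # shutdown sentinel
            return
        grant, release = msg              # this client's private reply channels
        grant.put("locked")               # "your RPC returned: you hold the lock"
        release.get()                     # wait until the client releases

def run(n_clients: int = 4, iters: int = 50) -> int:
    requests = queue.Queue()
    counter = {"n": 0}                    # the shared resource being protected

    def client() -> None:
        for _ in range(iters):
            grant, release = queue.Queue(), queue.Queue()
            requests.put((grant, release))
            grant.get()                   # acquire: blocks until granted
            counter["n"] += 1             # critical section
            release.put("done")           # release

    server = threading.Thread(target=lock_server, args=(requests,))
    server.start()
    clients = [threading.Thread(target=client) for _ in range(n_clients)]
    for c in clients:
        c.start()
    for c in clients:
        c.join()
    requests.put(None)
    server.join()
    return counter["n"]
```

Every increment happens while exactly one client holds the grant, so the final count is exact.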
Another benefit of this class of IPC (Inter-Process Communication) is that a stage or segment of the problem is handled on one machine, but a pool of workers exists on each machine. So while one worker is blocked, waiting for a peer to complete a unit of work, other workers are completing their stages. At any given time on every given CPU there is a mixture of pending and processing threads. So while a single task isn't completed any faster, a collection of tasks takes full advantage of every CPU and physical machine in the pool.
The above RPC-type models involve explicit division of labor. Another class is true opaque messages: JMS in Java, and even UNIX's System V message queues (see ipcs). The idea is that you have the same workers as before, but instead of specific UNIQUE RPC URIs (addresses), you have a common messaging pool with a suite of message types and message-queue names. You then have pools of workers that can live ANYWHERE, which listen to their queues and handle an array of pre-defined message types (defined by the application designer). So now you can have dozens or hundreds of CPUs, threads and machines all symmetrically passing asynchronous messages back and forth.
To my knowledge, this is the most scalable model. You can take most procedural problems and break them up into stages, define a message type as the explicit name of each stage, divide the types amongst different queues (which allows partitioning/grouping of computational resources), then receive-message/process-message/forward-or-reply-message. So long as the amount of work far exceeds the overhead of message passing, you can scale very nicely with the amount of hardware you can throw at the problem.
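A rough Python sketch of the typed-message-queue model (queue names and message types invented for illustration): workers pull messages of their type from named queues and forward results downstream; adding more workers per queue scales the stage without changing the code.

```python
import queue
import threading

# One named queue per stage; in a real system the workers could live anywhere.
queues = {"parse": queue.Queue(), "compute": queue.Queue(), "results": queue.Queue()}

def worker(queue_name: str) -> None:
    # Handle pre-defined message types until a shutdown sentinel arrives.
    for mtype, payload in iter(queues[queue_name].get, None):
        if mtype == "parse":                       # stage 1: parse raw input
            queues["compute"].put(("compute", int(payload)))
        elif mtype == "compute":                   # stage 2: the actual work
            queues["results"].put(("result", payload * 2))

def run(raw_items):
    workers = [threading.Thread(target=worker, args=(name,))
               for name in ("parse", "compute") for _ in range(2)]  # 2 per queue
    for w in workers:
        w.start()
    for item in raw_items:
        queues["parse"].put(("parse", item))
    # Collect exactly one result per input (order is nondeterministic, so sort).
    results = sorted(queues["results"].get()[1] for _ in raw_items)
    for name in ("parse", "compute"):
        for _ in range(2):                         # one sentinel per worker
            queues[name].put(None)
    for w in workers:
        w.join()
    return results
```

Partitioning message types across queues is what lets you group computational resources per stage, as described above.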
Re:CUDA is limiting, not liberating (Score:1, Interesting)
That said, for a truly data parallel calculation, CUDA blows any affordable multicore solution out of the water. By removing some of the flexibility, GPUs can spend their transistor budget on more floating point units and bigger memory buses.
OpenMP is nice, but it will be a while before multicore CPUs can offer you 128 (single precision) floating point units fed by a 70 GB/sec memory bus.
(I'm willing to forgive more of CUDA's limitations because putting a $350 card into a workstation now gives us a 10x performance improvement in a program we use very frequently. Speed-ups in the range of 10x to 40x are pretty common for the kind of data-parallel tasks CUDA is ideal for. If you only see a 2x or 3x improvement, you are probably better off with OpenMP and/or SSE.)
Yes, I read your paper (Score:3, Interesting)
That, in a nutshell, is why I suggested that investment in Erlang would be a good idea. It's better to start with the right approach and optimise it than to go off into computer-science blue sky and try to design a perfect language for parallel programming on GPUs, which practically nobody will ever really use.
Blog spam. Link to actual article. Nvidia loss? (Score:3, Interesting)
Nvidia is showing signs of being poorly managed. CUDA [cuda.com] is a registered trademark of another hi-tech company.
The underlying issue is apparently that Nvidia will lose most of its mid-level business when AMD/ATI and Intel/Larrabee begin shipping integrated graphics. Until now, Intel integrated graphics has been so limited as to be useless in many mid-level applications. Nvidia hopes to replace some of that loss with sales to people who want to use their GPUs to do parallel processing.
Re:just hype and commercialism (Score:1, Interesting)
It's definitely not just hype. In our company, we're using it to speed up some image processing algorithms, which can now be applied in real time by just utilizing the <$100 video card in the PC. We are quite excited about this, as we would otherwise have had to invest in expensive special-purpose hardware accelerators (which are usually obsolete by the time they're designed, so you spend the rest of their lifetime paying off hardware which has already lost its edge).
Perhaps the CUDA model is in itself a little clumsy, but the fact that it opens up commodity hardware to do some impressive work is very nice.
Computer Programming: An Introduction for the Scientifically Inclined [curly-brace.com]
Reminds me of the OLD stories I used to hear... (Score:4, Interesting)
As I recall:
The processor, as it was sending the data to the bus, would have to tell the memory to get ready to read data through these cables. The "cables hack" was necessary because the cable path was shorter than the data bus path, so the memory would get the signal just a few ms before the data arrived on the bus.
These were fun stories to hear but now seeing what development challenges we face in parallel programming multi-core processors gives me a whole new appreciation for those old timers. These are old problems that have been dealt with before, just not on this scale. I guess it is true what they say, history always repeats itself.
Re:NVidia is doing that? an insult to INMOS... (Score:1, Interesting)
Transputers were essentially an embodiment of Tony Hoare's CSP (communicating sequential processes), which is quite general and powerfully expressive. CUDA, sketched below in occam-like notation, has no message passing; it is shared-memory in nature.
SEQ block.num = 0 FOR desired.num.of.blocks
  INT global_mem:
  PAR cta = 0 FOR num_ctas
    INT local_mem:
    PAR warp_group = 0 FOR warps_groups_in_cta
      SEQ barrier = 0 FOR num.barriers[warp_group]
        PAR warp = 0 FOR num_warps[warp_group]
          PAR thread = 0 FOR 16
            thread.id
            proc[barrier](block.num, thread.id)
With no barriers (common in many problems), this reduces to...
SEQ block.num = 0 FOR desired.num.of.blocks
  INT global_mem:
  PAR cta = 0 FOR num_ctas
    INT local_mem:
    PAR warp = 0 FOR num_warps.in.cta
      PAR thread = 0 FOR 16
        thread.id
        proc(block.num, thread.id)
The procedure can use barriers, but cannot communicate (in the CSP sense) other than through shared local memory. In many ways CUDA is more like SIMD than CSP, and can be thought of as arrays of cooperating parallel SIMD threads (cooperative thread arrays, or CTAs).
The primary difference is that the flexibility of the general CSP model puts quite a bit of overhead into the scalar part of the implementation, which must be replicated for parallelism. In the CUDA model, the SIMD-thread parallel model amortizes the scalar parts of the implementation over multiple threads, reducing the overhead cost per thread.
The process rendezvous model in CSP is quite general, but it needs to be integrated into thread scheduling, making thread scheduling quite heavyweight (a linked list in the transputer implementation). With the CTA/CUDA model, threads are highly structured and non-dynamic. Threads are dispatched in preconfigured arrays, not per-thread, and are not dynamically creatable (yet), so they are very efficient to dispatch. Of course, after they are running they can still be temporarily stalled independently (sort of like hyperthreading in CPU cores stalling on register data dependencies, or like the transputer when you communicate in a loop). This is the difference between a multi-thread model (the transputer processing several threads efficiently) and a many-thread model (CUDA processing thousands of threads efficiently) in practical implementations.
In short, CUDA is a very structured model that allows for high efficiency. CSP is a very generic model with lots of flexibility.
In many ways, INMOS was a pioneer, but as with most pioneers, they got lots of arrows in their back (occam, a secret assembly language, a proprietary VLSI design process, etc.). They eventually corrected most of these (the icc C compiler, the transputer assembly language book, a port to Verilog, etc.), but by then the industry had passed them by (not to mention that they stuck with the three-register evaluation stack model way, way, way too long)...