
NVIDIA Shaking Up the Parallel Programming World

An anonymous reader writes "NVIDIA's CUDA system, originally developed for their graphics cores, is finding uses in other massively parallel computing applications. As a result, it might not be a CPU designer that ultimately winds up solving the massively parallel programming challenges, but rather a video card vendor. From the article: 'The concept of writing individual programs which run on multiple cores is called multi-threading. That basically means that more than one part of the program is running at the same time, but on different cores. While this might seem like a trivial thing, there are all kinds of issues which arise. Suppose you are writing a game engine in which the locations of the characters in the 3D world, their movements, and the audio must all be coordinated. What if the developer gives the character-movement task its own thread that runs at 400 updates per second, and gives the 3D world renderer its own thread that can only draw at 60 fps? There's a lot of waiting by the audio and character threads until everything catches up. Managing that waiting is called synchronization.'"
  • by master_p ( 608214 ) on Saturday May 03, 2008 @06:27AM (#23283212)

    Many moons ago, when most slashdotters were nippers, a British company named INMOS provided an extensible hardware and software platform [wikipedia.org] that solved the problem of parallelism, in many ways similar to CUDA.

    Ironically, some of the first demos I saw using transputers were raytracing demos [classiccmp.org].

    The problem of parallelism and the solutions available are quite old (more than 20 years), but it is only now that limits are being reached that we see the true need for them. NVIDIA is not the true pioneer; there were others long before them.

  • by ratbag ( 65209 ) on Saturday May 03, 2008 @07:24AM (#23283352)
    That takes me back. My MSc project in 1992 was visualizing 3D waves on Transputers using Occam. Divide the wave into chunks, give each chunk to a Transputer, pass the edge cases between the Transputers, and let one of them look after the graphics. I seem to recall there were lots of INs and OUTs. A friend of mine simulated bungee jumps using similar code, with a simple bit of finite element analysis chucked in (the rope changed colour based on the amount of stretch).

    Happy Days at UKC.
  • by TheRaven64 ( 641858 ) on Saturday May 03, 2008 @07:46AM (#23283414) Journal
    There's only so much that a compiler can do. If you structure your algorithms serially then a compiler can't do much. If you write parallel algorithms then it's relatively easy for the compiler to turn them into parallel code.

    There are a couple of approaches that work well. If you use a functional language, you can use monads to indicate side effects, and the compiler can implicitly parallelise the parts that are free of side effects. If you use a language like Erlang or Pict, based on a CSP or pi-calculus model, you split your program into logically independent chunks with a message-passing interface between them, and the compiler or runtime can schedule them independently.

  • by Fallen Andy ( 795676 ) on Saturday May 03, 2008 @07:54AM (#23283432)
    Back in the early 80's I was working in Bristol UK for TDI (who were the UCSD p-system licensees) porting it to various machines... Well, we had one customer who wanted a VAX p-system so we trotted off to INMOS's office and sat around in the computer room. (VAX 11/780 I think). At the time they were running Transputer simulations on the machine so the VAX p-system took er... about 30 *minutes* to start. Just for comparison an Apple ][ running IV.x would take less than a minute. Almost an hour to make a tape. (About 15 users running emulation I think). Fond memories of the transputer. Almost bought a kit to play with it... Andy
  • by Kupfernigk ( 1190345 ) on Saturday May 03, 2008 @08:26AM (#23283538)
    The approach used by Erlang is interesting, as it depends entirely on message passing between processes to achieve parallelism and synchronisation. To get real-time performance, the message passing must be very efficient. Messaging approaches are well suited to parallelism where the parallel processes are themselves CPU- and data-intensive, which is why they work well for cryptography and image processing. From this point of view alone, a parallel architecture using GPUs with very fast intermodule channels looks like a good bet.

    The original Inmos Transputer was designed to solve such problems and relied on fast inter-processor links; AMD's HyperTransport bus is a modern derivative.

    So I disagree with you. The processing hardware is not so much the problem. If GPUs are small and cheap and can address lots of memory, they will do the job, so long as they have the necessary instruction sets. The issue to focus on is still the interprocessor (and hence interprocess) links. This is how hardware affects parallelism.

    I have worked on and off with multiprocessor systems since the early 80s, and it has always been fastest and most effective to rely on data channels rather than horrible kludges like shared memory with mutex locks. The code can be made clean and can be tested in a wide range of environments. I am probably too near retirement now to work seriously with Erlang, but it looks like a sound platform.

  • by jkndrkn ( 1092917 ) on Saturday May 03, 2008 @09:06AM (#23283678)
    > and always it has been fastest and most effective to rely on data channels rather than horrible kludges like shared memory with mutex locks.

    While shared-memory tools like UPC and OpenMP are gaining ground (especially with programmers), I too feel that they are a step backwards. Message-passing languages, especially Erlang, are better designed to cope with the unique challenges of computing on a large parallel computer due to their excellent fault tolerance features.

    You might be interested in some work I did evaluating Erlang on a 16 core SMP machine:

    http://jkndrkn.livejournal.com/205249.html [livejournal.com]

    Quick summary: Erlang is slow, though using the Array module for data structure manipulation can help matters. Erlang could still be useful as a communications layer or monitoring system for processes written in C.
  • by maraist ( 68387 ) * <{michael.maraist ... mail.n0spam.com}> on Saturday May 03, 2008 @09:51AM (#23283880) Homepage
    Consider that if you've ever done UNIX programming, you've been doing MT programming all along - just by a different name: multi-processing. Pipelines are, IMO, the best implementation of parallel programming (and UNIX is FULL of pipes). You take a problem and break it up into wholly independent stages, then multi-process or multi-thread the stages. If you can split the problem up using message passing, then you can farm the work out to decoupled processes on remote machines, and you get farming / clustering. Once you have the problem truly clustered, multi-threading is just a cheaper implementation of multi-processing (less overhead per worker, fewer physical CPUs, etc.).

    Consider this parallel programming pseudo-example:

    find | tar | compress | remote-execute 'remote-copy | uncompress | untar'

    This is a 7-process, FULLY parallel pipeline (meaning non-blocking at every stage - every 512 bytes of data passed from one stage to the next gets processed immediately). It can work with 2 physical machines that have 4 processing units each, for a total of 8 parallel threads of execution.

    Granted, it's hard to construct a UNIX pipeline that doesn't block. The following variation blocks on the xargs stage; it has less overhead than separate tar/compress stages, but is single-threaded at that point:

    find name-pattern | xargs grep -l contents-pattern | tar-gzip | remote-execute 'remote-copy | untar-unzip'

    Here the messages being passed are serialized/linearized data. But that's the power of UNIX.

    In CORBA/COM/GNORBA/Java-RMI/c-RPC/SOAP/HTTP-REST/ODBC, your messages are 'remoteable' function calls, which serialize complex parameters; much more advanced than a single serial pipe/file-handle. They also allow synchronous returns. These methodologies inherently have 'waiting' worker threads, so it goes without saying that you're programming in an MT environment.

    This class of Remote Procedure Calls is mostly for centralization of code or central synchronization. You can't block on a CPU mutex that's on another, physically separate machine. But if you RPC to a central machine that holds a single mutex variable, then you can. DB locks are probably more common these days, but it's the exact same concept - remote calls to a central locking service.

    Another benefit of this class of IPC (Inter-Process Communication) is that a stage or segment of the problem is handled on one machine, but a pool of workers exists on each machine. So while one machine is blocking, waiting for a peer to complete a unit of work, there are other workers completing their stage. At any given time on every given CPU there is a mixture of pending and processing threads. So while a single task isn't completed any faster, a collection of tasks takes full advantage of every CPU and physical machine in the pool.

    The above RPC-type models involve explicit division of labor. Another class is true opaque messages: JMS in Java, and even UNIX's 'ipcs' Message Queues. The idea is that you have the same workers as before, but instead of specific UNIQUE RPC URIs (addresses), you have a common messaging pool with a suite of message types and message-queue names. You then have pools of workers that can live ANYWHERE, listening to their queues and handling an array of pre-defined message types (defined by the application designer). So now you can have dozens or hundreds of CPUs, threads, and machines all symmetrically passing asynchronous messages back and forth.

    To my knowledge, this is the most scalable model. You can take most procedural problems and break them up into stages, define a message type as the explicit name of each stage, divide the types amongst different queues (which allows partitioning/grouping of computational resources), and then receive-message / process-message / forward-or-reply-message. So long as the amount of work far exceeds the overhead of message passing, you scale very nicely with the amount of hardware you can throw at the problem. (A minimal sketch of this worker/queue loop follows.)
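
    A minimal sketch of that receive/process loop, assuming nothing beyond standard C++ threads. Task, worker(), and the single named queue are invented for illustration - this is the shape of the model, not any real JMS API:

    // Hypothetical worker pool draining one message queue.
    #include <condition_variable>
    #include <cstdio>
    #include <mutex>
    #include <queue>
    #include <thread>
    #include <vector>

    struct Task { int stage; int payload; };       // a message type plus its data

    std::queue<Task> tasks;                        // one named queue; real systems have many
    std::mutex m;
    std::condition_variable cv;
    bool done = false;

    void worker(int id) {
        for (;;) {
            std::unique_lock<std::mutex> lk(m);
            cv.wait(lk, [] { return !tasks.empty() || done; });
            if (tasks.empty()) return;             // queue drained and shut down
            Task t = tasks.front(); tasks.pop();
            lk.unlock();
            // receive-message / process-message step
            std::printf("worker %d: stage %d, payload %d\n", id, t.stage, t.payload);
        }
    }

    int main() {
        std::vector<std::thread> pool;
        for (int i = 0; i < 4; ++i) pool.emplace_back(worker, i);
        for (int p = 0; p < 16; ++p) {             // post asynchronous messages
            { std::lock_guard<std::mutex> lk(m); tasks.push({0, p}); }
            cv.notify_one();
        }
        { std::lock_guard<std::mutex> lk(m); done = true; }
        cv.notify_all();
        for (auto& t : pool) t.join();
    }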
  • by Anonymous Coward on Saturday May 03, 2008 @10:10AM (#23283984)
    You make a good point: The data-parallel computing model used in CUDA is very unfamiliar to programmers. You might read the spec sheet and see "128 streaming processors" and think that is the same as having 128 cores, but it is not. CUDA inhabits a world somewhere between SSE and OpenMP in terms of task granularity. I blame part of this confusion on Nvidia's adoption of familiar sounding terms like "threads" and "processors" for things which behave nothing like threads and processors from a multicore programming perspective. Managing people's expectations is an important part of marketing a new tech, and hype can lead to anger.

    That said, for a truly data parallel calculation, CUDA blows any affordable multicore solution out of the water. By removing some of the flexibility, GPUs can spend their transistor budget on more floating point units and bigger memory buses.

    OpenMP is nice, but it will be a while before multicore CPUs can offer you 128 (single precision) floating point units fed by a 70 GB/sec memory bus. :)

    (I'm willing to forgive more of the limitations of CUDA because now putting a $350 card into a workstation gives us a 10x performance improvement in a program we use very frequently. Speed-ups in the range of 10x to 40x are pretty common for the kind of data parallel tasks that CUDA is ideal for. If you only see a 2x or 3x improvement, you are probably better off with OpenMP and/or SSE.)
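
    For flavor, here is a minimal CUDA sketch of the kind of data-parallel task those speed-ups come from - one lightweight thread per array element, no locks, no inter-thread communication. It's a SAXPY; the sizes and names are illustrative, not from TFA:

    // y = a*x + y over a million elements, one thread per element.
    #include <cstdio>
    #include <cstdlib>
    #include <cuda_runtime.h>

    __global__ void saxpy(int n, float a, const float* x, float* y) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;    // global element index
        if (i < n) y[i] = a * x[i] + y[i];                // no sharing between threads
    }

    int main() {
        const int n = 1 << 20;
        size_t bytes = n * sizeof(float);
        float *hx = (float*)malloc(bytes), *hy = (float*)malloc(bytes);
        for (int i = 0; i < n; ++i) { hx[i] = 1.0f; hy[i] = 2.0f; }
        float *dx, *dy;
        cudaMalloc(&dx, bytes); cudaMalloc(&dy, bytes);
        cudaMemcpy(dx, hx, bytes, cudaMemcpyHostToDevice);
        cudaMemcpy(dy, hy, bytes, cudaMemcpyHostToDevice);
        saxpy<<<(n + 255) / 256, 256>>>(n, 3.0f, dx, dy); // 4096 blocks of 256 threads
        cudaMemcpy(hy, dy, bytes, cudaMemcpyDeviceToHost);
        printf("y[0] = %f\n", hy[0]);                     // expect 5.0
        cudaFree(dx); cudaFree(dy); free(hx); free(hy);
    }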
  • by Kupfernigk ( 1190345 ) on Saturday May 03, 2008 @11:07AM (#23284304)
    It doesn't surprise me in the slightest. Erlang is designed from the ground up for pattern matching rather than computation, because it was designed for use in messaging systems - telecoms, SNMP, now XMPP. Its integer arithmetic is arbitrary precision, which prevents overflow in integer operations at the expense of performance. Its floating point is limited. My early work on a 3-way system used hand coded assembler to drive the interprocess messaging using hardware FIFOs, for Pete's sake, and that was as high performance as you could get - given the huge limitations of trying to write useful functions in assembler.

    That, in a nutshell, is why I suggested that investment in Erlang would be a good idea. It's better to start with the right approach and optimise it than to go off into computer-science blue sky and try to design a perfect language for parallel GPU programming - which practically nobody will ever really use.

  • by Futurepower(R) ( 558542 ) on Saturday May 03, 2008 @11:49AM (#23284520) Homepage
    Avoid the blog spam. This is the actual article in EE Times: Nvidia unleashes Cuda attack on parallel-compute challenge [eetimes.com].

    Nvidia is showing signs of being poorly managed. CUDA [cuda.com] is a registered trademark of another hi-tech company.

    The underlying issue is apparently that Nvidia will lose most of its mid-level business when AMD/ATI and Intel/Larrabee begin shipping integrated graphics. Until now, Intel integrated graphics has been so limited as to be useless in many mid-level applications. Nvidia hopes to replace some of that loss with sales to people who want to use their GPUs for parallel processing.
  • by Anonymous Coward on Saturday May 03, 2008 @12:31PM (#23284774)

    It's definitely not just hype. In our company, we're using it to speed up some image processing algorithms, which can now run in real time just by utilizing the <$100 video card in the PC. We are quite excited about this, as we would otherwise have to invest in expensive special-purpose hardware accelerators (which are usually obsolete by the time they're designed, so you spend the rest of their lifetime paying off hardware that has already lost its edge).

    Perhaps the CUDA model is in itself a little clumsy, but the fact that it opens up commodity hardware to do some impressive work is very nice.

    Computer Programming: An Introduction for the Scientifically Inclined [curly-brace.com]

  • by JRHelgeson ( 576325 ) on Saturday May 03, 2008 @01:00PM (#23284920) Homepage Journal
    I live in Minnesota, home of the legendary Cray Research. I've met with several old timers that developed the technologies that made the Cray Supercomputer what it was. Hearing about the problems that multi-core developers are facing today reminds me of the stories I heard about how the engineers would have to build massive cable runs from processor board to processor board to memory board just to synchronize the clocks and operations so that when the memory was ready to read or write data, it could tell the processor board... half a room away.

    As I recall:
    The processor, as it was sending the data to the bus, would have to tell the memory to get ready to read data through these cables. The "cables hack" was necessary because the cable path was shorter than the data bus path, so the memory would get the signal just a few ms before the data arrived at the bus.

    These were fun stories to hear, but now seeing the development challenges we face in parallel programming multi-core processors gives me a whole new appreciation for those old timers. These are old problems that have been dealt with before, just not on this scale. I guess it's true what they say: history always repeats itself.
  • by Anonymous Coward on Saturday May 03, 2008 @03:21PM (#23285784)
    I worked at INMOS; now I work for NVIDIA (and I'm posting this anonymously). Believe me when I say, the way INMOS solved the problems of parallelism and the way NVIDIA solved them are not the same.

    Transputers were essentially an embodiment of Tony Hoare's CSP (communicating sequential processes), which is quite general and powerfully expressive. CUDA (rendered below in occam-like pseudocode) has no message passing; it is shared-memory in nature.

    SEQ block.num=0 FOR desired.num.of.blocks
        [...]INT global_mem:
        PAR cta=0 FOR num_ctas
            [...]INT local_mem:
            PAR warp_group=0 FOR warps_groups_in_cta
                SEQ barrier=0 FOR num.barriers[warp_group]
                    PAR warp=0 FOR num_warps[warp_group]
                        PAR thread=0 FOR 16
                            thread.id := warp*16+thread
                            proc[barrier](block.num, thread.id)

    With no barriers (common in many problems), this reduces to...

    SEQ block.num=0 FOR desired.num.of.blocks
        [...]INT global_mem:
        PAR cta=0 FOR num_ctas
            [...]INT local_mem:
            PAR warp=0 FOR num_warps.in.cta
                PAR thread=0 FOR 16
                    thread.id := warp*16+thread
                    proc(block.num, thread.id)

    The procedure can use barriers, but cannot communicate (in the CSP sense) other than through shared local memory. In many ways CUDA is more like SIMD than CSP, and can be thought of as arrays of cooperating parallel SIMD threads (hence "cooperating thread array", or CTA).
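
    Rendered the other way round, here is a hypothetical CUDA kernel for the same picture (the name and sizes are invented): threads within one block cooperate only through __shared__ memory and __syncthreads() barriers, and blocks never communicate with each other at all.

    // Per-block sum reduction; launch with 256 threads per block.
    __global__ void block_sum(const float* in, float* out) {
        __shared__ float local_mem[256];                 // the per-CTA "local_mem"
        int tid = threadIdx.x;
        local_mem[tid] = in[blockIdx.x * blockDim.x + tid];
        __syncthreads();                                 // the barrier between phases
        for (int s = blockDim.x / 2; s > 0; s >>= 1) {   // pairwise tree reduction
            if (tid < s) local_mem[tid] += local_mem[tid + s];
            __syncthreads();                             // every phase ends in a barrier
        }
        if (tid == 0) out[blockIdx.x] = local_mem[0];    // one result per block
    }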

    The primary difference is that the flexibility of the general CSP model puts quite a bit of overhead in the scalar part of the implementation, which has to be replicated for parallelism. In the CUDA model, the SIMD-thread parallel approach amortizes the scalar parts of the implementation over multiple threads, reducing the overhead cost per thread.

    The process rendezvous model in CSP is quite general, but it needs to be integrated into thread scheduling, which makes scheduling quite heavyweight (a linked list in the transputer implementation). In the CTA/CUDA model, threads are highly structured and not dynamic: they are dispatched in preconfigured arrays, not per-thread, and are not dynamically creatable (yet), so they are very efficient to dispatch. Of course, once running, they can still be stalled temporarily and independently (somewhat like hyperthreading in CPU cores on register data dependencies, or like the transputer when you communicate in a loop). For practical implementations, this is the difference between a multi-thread model (the transputer processing several threads efficiently) and a many-thread model (CUDA processing thousands of threads efficiently).

    In short, CUDA is a very structured model that allows for high efficiency; CSP is a very generic model with lots of flexibility.

    In many ways, INMOS was a pioneer, but as with most pioneers, they got lots of arrows in their back (occam, a secret assembly language, a proprietary VLSI design process, etc.). They eventually corrected most of these (the icc C compiler, the transputer assembly language book, a port to Verilog, etc.), but by then the industry had passed them by (not to mention that they stuck with the 3-operand stack model way-way-way too long)...
