
NVIDIA Shaking Up the Parallel Programming World 154

An anonymous reader writes "NVIDIA's CUDA system, originally developed for their graphics cores, is finding migratory uses into other massively parallel computing applications. As a result, it might not be a CPU designer that ultimately winds up solving the massively parallel programming challenges, but rather a video card vendor. From the article: 'The concept of writing individual programs which run on multiple cores is called multi-threading. That basically means that more than one part of the program is running at the same time, but on different cores. While this might seem like a trivial thing, there are all kinds of issues which arise. Suppose you are writing a gaming engine and there must be coordination between the location of the characters in the 3D world, coupled to their movements, coupled to the audio. All of that has to be synchronized. What if the developer gives the character movement tasks its own thread, but it can only be rendered at 400 fps. And the developer gives the 3D world drawer its own thread, but it can only be rendered at 60 fps. There's a lot of waiting by the audio and character threads until everything catches up. That's called synchronization.'"
This discussion has been archived. No new comments can be posted.

  • where's the MIT CS guys when you need them?
  • by mrbluze ( 1034940 ) on Saturday May 03, 2008 @05:50AM (#23283116) Journal

    'The concept of writing individual programs which run on multiple cores is called multi-threading. That basically means that more than one part of the program is running at the same time, but on different cores.

    Wow, I bet nobody on slashdot knew that!

    • by Lobais ( 743851 )
      Well, it's a hot copy from tfa.
      If you rtfa you'll notice that it's about "Nvidia's CUDA system, originally developed for their graphics cores, are finding migratory uses into other massively parallel computing applications."
      • Re: (Score:2, Informative)

        by Lobais ( 743851 )
        Oh, and CUDA btw. http://en.wikipedia.org/wiki/CUDA [wikipedia.org]

        CUDA ("Compute Unified Device Architecture"), is a GPGPU technology that allows a programmer to use the C programming language to code algorithms for execution on the graphics processing unit (GPU).
    • by aliquis ( 678370 )
      Yeah, I would have preferred it if the summary had included a little information about how CUDA helps and what it does better. As the summary is written now, it's like "Maybe a video card dude will fix it because they need to run more threads", not "This video card dude came up with a new language which made it much easier to handle multiple threads", or whatever.

      HOW does CUDA make it easier? I'm very confident it's not because Nvidia hardware contains lots of stream processors.

      Ohwell, guess I need to RTFA, an
      • There is no real detail in the article, so I'm dredging my memory for how CUDA works... It probably is because they are stream processors - i.e. a pool of vector processors that are optimised for SIMD. The innovation was that the pool could be split into several chunks working on separate SIMD programs. Rather than threads there are programmable barriers to control the different groups and explicit memory locking to ensure the cache is partitioned between the different groups.

        So to put it another way, the big thr
        • by aliquis ( 678370 )
          I don't see what the difference would be between synching shared memory access between threads and synching shared (partitioned) memory between programs running on stream processors, though.
          • If you're syncing shared memory access then you have to worry about memory consistency. This is basically the problem that you would have between separate caches in an SMP system. If you partition the memory then you have a simple form of locking.
            • by aliquis ( 678370 )
              I still don't get it. If I share the RAM I was expecting all the things which shared it to have the same "knowledge" about it. Like say it's representing an ingame map of the environment and everything knows what things are. No matter if it's the own-unit-have-moved, others-units-have-moved or the thread which draws the map.
              • Ok, so think of it like this. If I have 1GB of ram shared amongst 128 processors then I have two choices: one large image (shared) or multiple smaller images (partitioned). If the whole bank is shared then each memory access has to arbitrate for access to the whole bank. The memory ranges of each processor completely overlap so there is always a cost for arbitration to access the resource.

                If we partition the bank into two pools, with 64 processors accessing each pool then we have just cut the arbitration co
    • Re: (Score:3, Funny)

      by Kawahee ( 901497 )
      Slow down cowboy, not all of us are as cluey as you. It didn't come together for me until the last sentence!

      There's a lot of waiting by the audio and character threads until everything catches up. That's called synchronization
    • by volpe ( 58112 )
      Well, I, for one, certainly didn't know that. I used to think that multi-threading could be done on a single core!!
  • Where's the story? (Score:4, Informative)

    by pmontra ( 738736 ) on Saturday May 03, 2008 @05:51AM (#23283118) Homepage
    The article sums up the hurdles of parallel programming and says that NVIDIA's CUDA is doing something to solve them, but it doesn't say what. Even the short Wikipedia entry at http://en.wikipedia.org/wiki/CUDA [wikipedia.org] tells more about it.
    • Re: (Score:3, Insightful)

      by mrbluze ( 1034940 )
      No offence, but I'm perplexed as to how this rubbish made it past the firehose.
    • In my opinion it doesn't even summarize the hurdles properly. I'm not a game programmer, so I don't know if the article makes sense, but it left me with the following questions. Hopefully someone can clarify.

      -Why would character movement need to run at a certain rate? It sounds like the thread should spend most of its time blocked waiting for user input.

      -What's so special about the audio thread? Shouldn't it just handle events from other threads without communicating back? It can block when it doesn't
      • Re: (Score:3, Informative)

        by Yokaze ( 70883 )
        -Why would character movement need to run at a certain rate? It sounds like the thread should spend most of its time blocked waiting for user input.

        You usually have a game-physics engine running, which practically integrates the movements of the characters (character movement) or generally updates the world model (position and state of all objects). Even without input, the world moves on. The fixed rate is usually taken, because it is simpler than a varying time-step rate.
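A minimal sketch of the fixed-rate update described above, assuming an accumulator-style loop (a common pattern, not something the article or this comment spells out). The names `FIXED_DT`, `step_world`, and `run_frames` are hypothetical, and frame times are passed in as data so the example stays deterministic.

```python
FIXED_DT = 1.0 / 64.0  # assumed physics step; a power of two keeps the float arithmetic exact


def step_world(position, velocity, dt):
    """Advance one object by one fixed step (simple Euler integration)."""
    return position + velocity * dt


def run_frames(frame_times, position=0.0, velocity=1.0):
    """Consume variable frame times, but advance the world only in fixed steps."""
    accumulator = 0.0
    steps = 0
    for frame_time in frame_times:
        accumulator += frame_time
        # Run as many fixed steps as the elapsed time allows; keep the remainder.
        while accumulator >= FIXED_DT:
            position = step_world(position, velocity, FIXED_DT)
            accumulator -= FIXED_DT
            steps += 1
    return position, steps
```

The world model advances at the same rate whether the renderer delivers long or short frames; only the number of steps per frame varies.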

        -What's so special about the audio
        • Thanks for the reply, but I still don't understand why audio would be a synchronization issue. As you say, it needs a certain amount of CPU time or it'll stutter, but isn't that a performance issue?

          Also, the article would've done better just talking about the thread manager you mention. That makes more sense than the stuff about semaphores affecting performance positively (unless I misunderstood the sentence about the cache no longer being stale).

          And, uh, that drawer comment was a joke...
        • -How do semaphores affect SMP cache efficiency? Is the CPU notified to keep the data in shared cache?

          Not specially, they are simply a special case of the problem: How to access data
          Several threads may compete for the same data, but if they are accessing the same data in one cache-line, it will lead to lots of communication (thrashing the cache).

          I think you have this wrong. Sharing data in one cache line between processors is not always bad. In fact in multicores this c
    • CUDA is terrible.

      It does nothing to solve the synchronization issues that are the plague of multi-threaded programming, and it makes it all worse by having a very non-uniform memory access model (that hasn't even been abstracted).

      The problem with multi-threaded models is that they are fundamentally harder than a single-threaded model. CUDA does nothing to address this, and it makes it even harder by forcing the programmer to worry about what kind of memory they are using and forcing them to move data in an
  • Thats.. (Score:5, Funny)

    by mastershake_phd ( 1050150 ) on Saturday May 03, 2008 @05:52AM (#23283126) Homepage

    There's a lot of waiting by the audio and character threads until everything catches up. That's called synchronization.
    That's called wasted CPU cycles.
    • Like I was saying in another post, since everything per game object must be synchronized to the slowest procedure (video rendering of the object), the way to not waste CPU cycles is to spend them on AI.

      In essence, the faster your CPU (static on consoles), the more time you can devote to making your game objects smarter once you're done with the audio and visuals.
    • by aliquis ( 678370 )
      ... and bad planning / lack of effort / simplest solution.

      But we already know it's hard to split up all kinds of work evenly.

      Anyway, what does CUDA do to help with that?
    • That's called wasted CPU cycles.
      BOINC for PS3?
    • by Cordath ( 581672 ) on Saturday May 03, 2008 @07:58AM (#23283446)
      CUDA is an interesting way to utilize NVIDIA's graphics hardware for tasks it wasn't really designed for, but it's not a solution to parallel computing in and of itself. (more on that momentarily) A few people have gotten their nice high end Quadros to do some pretty cool stuff, but to date it's been limited primarily to relatively minor academic purposes. I don't see CUDA becoming big in gaming circles anytime soon. Let's face it, most gamers buy *one* reasonably good video card and leave it at that. Your video card has better things to do than handle audio or physics when your multi-core CPU is probably being criminally underutilized. Nvidia, of course, wants people to buy wimpy CPU's and then load up on massive SLI rigs and then do all their multi-purpose computation in CUDA. Not gonna happen.

      First of all, there are very few general purpose applications that special purpose NVIDIA hardware running CUDA can do significantly better than a real general purpose CPU, and Intel intends to cut even that small gap down within a few product cycles. Second, nobody wants to tie themselves to CUDA when it's built entirely for proprietary hardware. Third, CUDA still has a *lot* of limitations. It's not as easy to develop a physics engine for a GPU using CUDA as it is for a general purpose CPU.

      Now, I haven't used CUDA lately, so I could be way off base here. However, multi-threading isn't the real challenge to efficient use of resources in a parallel computing environment. It's designing your algorithms to be able to run in parallel in the first place. Most multi-threaded software out there still has threads that have to run on a single CPU, and the entire package bottlenecks on the single CPU running that thread even if other threads are free to run on other processors. This sort of bottleneck can only be avoided at the algorithm level. This isn't something CUDA is going to fix.
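The bottleneck described above is usually quantified with Amdahl's law. The comment does not name it, so bringing it in here is an assumption about what fits the argument; the function name `amdahl_speedup` is hypothetical.

```python
def amdahl_speedup(parallel_fraction, workers):
    """Upper bound on speedup when only part of the work can be spread over workers."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / workers)
```

Under this model, if half of a program has to run on a single CPU, then even 128 processors cannot make it twice as fast, which is why the fix has to happen at the algorithm level.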

      Now, I can certainly see why NVIDIA is playing up CUDA for all they're worth. Video game graphics rendering could be on the cusp of a technological singularity. Namely, ray tracing. Ray tracing is becoming feasible to do in real time. It's a stretch at present, but time will change that. Ray tracing is a significant step forward in terms of visual quality, but it also makes coding a lot of other things relatively easy. Valve's recent "Portal" required some rather convoluted hacks to render the portals with acceptable performance, but in a ray tracing engine those same portals only take a couple lines of code to implement and have no impact on performance. Another advantage of ray tracing is that it's dead simple to parallelize. While current approaches to video game graphics are going to get more and more difficult to work with as parallel processing rises, ray tracing will remain simple.

      The real question is whether NVIDIA is poised to do ray-tracing better than Intel in the next few product cycles. Intel is hip to all of the above, and they can smell blood in the water. If they can beef up the floating point performance of their processors then dedicated graphics cards may soon become completely unnecessary. NVIDIA is under the axe and they know it, which might explain all the recent anti-Intel smack-talk. Still, it remains to be seen who can actually walk the walk.
      • First of all, there are very few general purpose applications that special purpose NVIDIA hardware running CUDA can do significantly better than a real general purpose CPU, and Intel intends to cut even that small gap down within a few product cycles.

        That's not strictly true. Off the top of my head: Sorting, FFTs (or any other dense Linear Algebra) and Crypto (both public key and symmetric) cover quite a lot of range. The only real issue for these applications is the large batch sizes necessary to overcome the latency. Some of this is inherent in warming up that many pipes, but most of it is shit drivers and slow buses.

        The real question is what benefits will CUDA offer when the vector array moves closer to the processor? Most of the papers with the abo

      • by Barny ( 103770 )
        As an article earlier this month [bit-tech.net] pointed out they are in fact in the process of porting the CUDA system to CPUs.

        The advantage would be (assuming this is the wonderful solution it claims) that you run your task in the CUDA environment. If your client only has a pile of 1U racks then he can at least run it; if he replaces a few of them with some Tesla [nvidia.com] racks, things will speed up a lot.

        I did some programming at college, I do not claim to know anything about the workings of Tesla or CUDA, but it sure sounds rosy if

      • "I don't see CUDA becoming big in gaming circles anytime soon." ... until Ageia PhysX is ported to CUDA. Thus enabling every single G80 and higher card to also turn into a physics accelerator. Yeah, gamers won't go for that shit at all.

        "Third, CUDA still has a *lot* of limitations. It's not as easy to develop a physics engine for a GPU using CUDA as it is for a general purpose CPU."

        Guess we'll see.

        http://en.wikipedia.org/wiki/PhysX [wikipedia.org]
    • by mpbrede ( 820514 )

      There's a lot of waiting by the audio and character threads until everything catches up. That's called synchronization.
      That's called wasted CPU cycles.
      Actually, synchronization and waiting do not necessarily equate to wasted cycles. Only if the waiting thread holds the CPU to the exclusion of other tasks does it equate to waste. In other cases it equates to multiprogramming - a familiar concept.
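A minimal sketch of that point, assuming a blocking wait on a `threading.Event` stands in for the game's synchronization. The names `run`, `audio_thread`, and `frame_ready` are hypothetical. The waiting thread blocks rather than spinning, so the other thread is free to do work in the meantime.

```python
import threading


def run():
    frame_ready = threading.Event()  # hypothetical signal: "the renderer finished a frame"
    log = []

    def audio_thread():
        # wait() blocks until the event is set; the thread does not busy-loop.
        frame_ready.wait()
        log.append("audio: resumed after frame")

    worker = threading.Thread(target=audio_thread)
    worker.start()

    # While the audio thread is blocked, this thread does useful work.
    total = sum(range(1000))
    log.append(f"main: computed {total}")

    frame_ready.set()  # wake the waiting thread
    worker.join()
    return log
```

Replacing `frame_ready.wait()` with a `while not frame_ready.is_set(): pass` loop would be the wasteful variant, since the waiting thread would then hold a CPU while doing nothing useful.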
    • by Machtyn ( 759119 )

      That's called wasted CPU cycles.
      Just like what is happening when I scroll through these comments

      /it's funny, laugh.
    • Not really...normally, your process goes to sleep during this time. Your CPU spends its cycles doing other things.
      • Not really...normally, your process goes to sleep during this time. Your CPU spends its cycles doing other things.
        From the point of view of the game engine the cycles are wasted.
        • Well, your operating system can either schedule all those other background tasks that run when the time slice granted to the game's threads is up, or it can stop them mid flight. It doesn't matter too much. Plus, the game may have other threads running which are not blocked on the synchronization.
  • Topic is rather interesting, especially for game developers, among whom I sometimes lurk, but what's the point of simplifying descriptions and problems up to the point of being meaningless and useless?
    • Re: (Score:2, Insightful)

      by mrbluze ( 1034940 )

      but what's the point of simplifying descriptions and problems up to the point of being meaningless and useless?
      This isn't information, it's advertising. The target audience is teenagers with wealthy parents who will buy the NVIDIA cards for them.
  • This is just hype. It is well known that for real high-performance applications CUDA is compute-bound, i.e. a lot of bandwidth is wasted. CUDA is just another platform for niche applications, never to compete with commodity processors.
  • So make it all synchronize to the lowest fps, the video of course.. We are talking about one game object after all.

    In a real application, the audio/video must be calculated for many objects, and it is a static 30 or 60 fps video, and always static samples-per-second audio, perhaps CD quality at 44100 samples per second but likely less.

    This synchronization is not unsolved. Every slice of game time is divided between how many $SampleRate frames of audio divided by game objects producing audio, and how
    • This tells me nothing. Why would you want a game (Common single threaded-programmed application) to compete with your divx compression and ray tracing bryce3d application running in the background? Are they (Intel, AMD, IBM) all saying that we need to hook up 8 or 12 or 24 processor cores at 3ghz each to get an actual speed of 4ghz while each one waits around wasting processing cycles to get something to do? That is the lamest thing I've heard in a long time. I'd much rather have a SINGLE CORE Graphene p
      • Even though I think this is a very speculative and information free article, if you imagine it in the domain of the PS3 console for example, where any time a core is not doing anything useful it is wasting potential, I guess you could see where they're coming from.

        At least that is the idea I had while reading it, I wasn't thinking about running other cpu intensive PC apps at the same time as a game.
      • by rdebath ( 884132 )

        But you can't have a 12GHz, at that speed light goes about ONE INCH per clock cycle in a vacuum, anything else is slower, signals in silicon are a lot slower.

        So much slower that a modern single core processor will have a lot of "execution units" to keep up with the instructions arriving at the 3GHz rate. These instructions are handed off to the units in parallel and the results drop out of the units "a few" clock cycles later. This is good except when the result of UnitA is needed before UnitB can start.

        • Re: (Score:3, Informative)

          by TheRaven64 ( 641858 )

          But you can't have a 12GHz, at that speed light goes about ONE INCH per clock cycle in a vacuum, anything else is slower, signals in silicon are a lot slower.

          An inch is a long way on a CPU. A Core 2 die is around 11mm along the edge, so at 12GHz a signal could go all of the way from one edge to the other and back. It uses a 14-stage pipeline, so every clock cycle a signal needs to travel around 1/14th of the way across the die, giving around 1mm. If every signal needs to move 1mm per cycle and travels at the speed of light, then your maximum clock speed is 300GHz.
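The arithmetic above can be checked with a short sketch. It assumes signals move at the vacuum speed of light, as the comment does for its upper bound; the function names are hypothetical.

```python
C = 299_792_458.0  # speed of light in a vacuum, metres per second


def distance_per_cycle_mm(clock_hz):
    """How far light travels in a vacuum during one clock cycle, in millimetres."""
    return C / clock_hz * 1000.0


def max_clock_ghz(distance_mm):
    """Clock rate at which light covers distance_mm in exactly one cycle, in GHz."""
    return C / (distance_mm / 1000.0) / 1e9
```

At 12GHz the distance per cycle comes out just under 25mm, a little less than an inch and more than twice the 11mm die edge. A 1mm hop per cycle gives a ceiling just under 300GHz, in line with the figures in the comment.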

          Of course, as you say, electric signals travel a fair bit slower in silicon than photons do

          • I think that you've oversimplified a tad too much. You are assuming instant switching time on your gates. Sure light could propagate that fast in a vacuum, and electrons in a wire could do some comparable %. But a pipeline stage may have a combinatorial depth of several hundred gates and once you subtract their switching time signal propagation is a serious problem. The current range of Core2s has to use lots of fancy tricks (like asynchronous timing domains) to get around clock-skew at 3Ghz on a 11mm squar
  • by master_p ( 608214 ) on Saturday May 03, 2008 @06:27AM (#23283212)

    Many moons ago, when most slashdotters were nippers, a British company named INMOS provided an extensible hardware and software platform [wikipedia.org] that solved the problem of parallelism, in many ways similar to CUDA.

    Ironically, some of the first demos I saw using transputers were raytracing demos [classiccmp.org].

    The problem of parallelism and the solutions available are quite old (more than 20 years), but it's only now that limits are reached that we see the true need for it. But the true pioneer is not NVIDIA, because there were others long before them.

    • Re: (Score:3, Interesting)

      by ratbag ( 65209 )
      That takes me back. My MSc project in 1992 was visualizing 3D waves on Transputers using Occam. Divide the wave into chunks, give each chunk to a Transputer, pass the edge case between the Transputers and let one of them look after the graphics. Seem to recall there were lots of INs and OUTs. A friend of mine simulated bungee jumps using similar code, with a simple bit of finite element analysis chucked in (the rope changed colour based on the amount of stretch).

      Happy Days at UKC.
    • by Fallen Andy ( 795676 ) on Saturday May 03, 2008 @07:54AM (#23283432)
      Back in the early 80's I was working in Bristol UK for TDI (who were the UCSD p-system licensees) porting it to various machines... Well, we had one customer who wanted a VAX p-system so we trotted off to INMOS's office and sat around in the computer room. (VAX 11/780 I think). At the time they were running Transputer simulations on the machine so the VAX p-system took er... about 30 *minutes* to start. Just for comparison an Apple ][ running IV.x would take less than a minute. Almost an hour to make a tape. (About 15 users running emulation I think). Fond memories of the transputer. Almost bought a kit to play with it... Andy
  • by maillemaker ( 924053 ) on Saturday May 03, 2008 @06:29AM (#23283220)
    When I came up through my CS degree, object-oriented programming was new. Programming was largely a series of sequentially ordered instructions. I haven't programmed in many years now, but if I wanted to write a parallel program I would not have a clue.

    But why should I?

    What is needed are new, high-level programming languages that figure out how to take a set of instructions and best interface with the available processing hardware on their own. This is where the computer smarts need to be focused today, IMO.

    All computer programming languages, and even just plain applications, are abstractions from the computer hardware. What is needed are more robust abstractions to make programming for multiple processors (or cores) easier and more intuitive.
    • Re: (Score:1, Insightful)

      by Anonymous Coward
      Erlang?
    • I can agree with that. Any error that crashes 1 out of 20 or so concurrent threads, on multiple cores, using shared cache, is too complex for a mere human to figure out. After 30+ years programming single threaded applications, it will take a lot of new tools to make this happen.
    • Re: (Score:3, Interesting)

      by TheRaven64 ( 641858 )
      There's only so much that a compiler can do. If you structure your algorithms serially then a compiler can't do much. If you write parallel algorithms then it's relatively easy for the compiler to turn it into parallel code.

      There are a couple of approaches that work well. If you use a functional language, then you can use monads to indicate side effects and the compiler can implicitly parallelise the parts that are free from side effects. If you use a language like Erlang or Pict based on a CSP or a

    • by Kupfernigk ( 1190345 ) on Saturday May 03, 2008 @08:26AM (#23283538)
      The approach used by Erlang is interesting as it is totally dependent on message passing between processes to achieve parallelism and synchronisation. To get real time performance, the message passing must be very efficient. Messaging approaches are well suited to parallelism where the parallel processes are themselves CPU and data intensive, which is why they work well for cryptography and image processing. From this point of view alone, a parallel architecture using GPUs with very fast intermodule channels looks like a good bet.

      The original Inmos Transputer was designed to solve such problems and relied on fast inter-processor links, and the AMD Hypertransport bus is a modern derivative.

      So I disagree with you. The processing hardware is not so much the problem. If GPUs are small, cheap and address lots of memory, so long as they have the necessary instruction sets they will do the job. The issue to focus on is still interprocessor (and hence interprocess) links. This is how hardware affects parallelism.

      I have on and off worked with multiprocessor systems since the early 80s, and always it has been fastest and most effective to rely on data channels rather than horrible kludges like shared memory with mutex locks. The code can be made clean and can be tested in a wide range of environments. I am probably too near retirement now to work seriously with Erlang, but it looks like a sound platform.

      • Re: (Score:2, Interesting)

        by jkndrkn ( 1092917 )
        > and always it has been fastest and most effective to rely on data channels rather than horrible kludges like shared memory with mutex locks.

        While shared-memory tools like UPC and OpenMP are gaining ground (especially with programmers), I too feel that they are a step backwards. Message passing languages, especially Erlang, are better designed to cope with the unique challenges of computing on a large parallel computer due to their excellent fault tolerance features.

        You might be interested in some w
        • It doesn't surprise me in the slightest. Erlang is designed from the ground up for pattern matching rather than computation, because it was designed for use in messaging systems - telecoms, SNMP, now XMPP. Its integer arithmetic is arbitrary precision, which prevents overflow in integer operations at the expense of performance. Its floating point is limited. My early work on a 3-way system used hand coded assembler to drive the interprocess messaging using hardware FIFOs, for Pete's sake, and that was as hi
    • by maraist ( 68387 ) * <michael...marais ... l...n0spam...com> on Saturday May 03, 2008 @09:51AM (#23283880) Homepage
      Consider that if you've ever done UNIX programming, you've been doing MT programming all along - just by a different name.. Multi-Processing. Pipelines are, IMO, the best implementation of parallel programming (and UNIX is FULL of pipes). You take a problem and break it up into wholly independent stages, then multi-process or multi-thread the stages. If you can split the problem up using message-passing then you can farm the work out to decoupled processes on remote machines, and you get farming / clustering. Once you have the problem truly clustered, then multi-threading is just a cheaper implementation of multi-processing (less overhead per worker, fewer physical CPUs, etc).

      Consider this parallel programing pseudo-example

      find | tar | compress | remote-execute 'remote-copy | uncompress | untar'

      This is a 7 process FULLY parallel pipeline (meaning non-blocking at any stage - every 512 bytes of data passed from one stage to the next gets processed immediately). This can work with 2 physical machines that have 4 processing units each, for a total of 8 parallel threads of execution.

      Granted, it's hard to construct a UNIX pipe that doesn't block.. The following variation blocks on the xargs, and has less overhead than separate tar/compress stages but is single-threaded

      find name-pattern | xargs grep -l contents-pattern | tar-gzip | remote-execute 'remote-copy | untar-unzip'

      Here the messages being passed are serialized/linearized data.. But that's the power of UNIX.

      In CORBA/COM/GNORBA/Java-RMI/c-RPC/SOAP/HTTP-REST/ODBC, your messages are 'remoteable' function calls, which serialize complex parameters; much more advanced than a single serial pipe/file-handle. They also allow synchronous returns. These methodologies inherently have 'waiting' worker threads.. So it goes without saying that you're programming in an MT environment.

      This class of Remote-Procedure-Calls is mostly for centralization of code or central-synchronization. You can't block on a CPU mutex that's on another physically separate machine.. But if you make an RPC to a central machine with a single-variable mutex, then you can.. DB locks are probably more common these days, but it's the exact same concept - remote calls to a central locking service.

      Another benefit of this class of IPC (Inter Process Communication) is that a stage or segment of the problem is handled on one machine, but a pool of workers exists on each machine.. So while one machine is blocking, waiting for a peer to complete a unit of work, there are other workers completing their stage.. At any given time on every given CPU there is a mixture of pending and processing threads. So while a single task isn't completed any faster, a collection of tasks takes full advantage of every CPU and physical machine in the pool.

      The above RPC-type models involve explicit division of labor. Another class is true opaque messages.. JMS in Java, and even UNIX's 'ipcs' Message Queues. The idea is that you have the same workers as before, but instead of having specific UNIQUE RPC URIs (addresses), you have a common messaging pool with a suite of message-types and message-queue-names. You then have pools of workers that can live ANYWHERE which listen to their queues and handle an array of types of pre-defined messages (defined by the application designer). So now you can have dozens or hundreds of CPUs, threads, machines all symmetrically passing asynchronous messages back and forth.

      To my knowledge, this is the most scalable approach.. You can take most procedural problems and break them up into stages, then define a message-type as the explicit name of each stage, then divide up the types amongst different queues (which would allow partitioning/grouping of computational resources), then receive-message/process-message/forward-or-reply-message. So long as the amount of work far exceeds the overhead of message passing, you can very nicely scale with the amount of hardware you can throw at the problem.
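The receive-message/process-message/forward-message loop described above can be sketched roughly as follows. This is an in-process approximation only: Python's `queue.Queue` and threads stand in for a real message broker such as JMS or System V message queues, and the names `worker`, `run_pool`, and `handlers` are hypothetical.

```python
import queue
import threading


def worker(inbox, outbox, handlers):
    """Receive a message, process it according to its type, and forward the result."""
    while True:
        message = inbox.get()
        if message is None:  # sentinel: no more work for this worker
            break
        msg_type, payload = message
        outbox.put((msg_type, handlers[msg_type](payload)))


def run_pool(messages, handlers, n_workers=4):
    """Feed typed messages to a pool of identical workers and collect the replies."""
    inbox, outbox = queue.Queue(), queue.Queue()
    threads = [
        threading.Thread(target=worker, args=(inbox, outbox, handlers))
        for _ in range(n_workers)
    ]
    for t in threads:
        t.start()
    for m in messages:
        inbox.put(m)
    for _ in threads:
        inbox.put(None)  # one sentinel per worker
    for t in threads:
        t.join()
    # Completion order depends on scheduling, so sort the replies before returning.
    return sorted(outbox.get() for _ in messages)
```

Any worker can handle any message type, so adding workers needs no change to the senders. In a real deployment the queues would live in a broker and the workers on separate machines, which is where the message-passing overhead mentioned above comes in.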
      • Consider that if you've ever done UNIX programming, you've been doing MT programming all along - just by a different name.. Multi-Processing. Pipelines are, in IMO the best implementation of parallel programming (and UNIX is FULL of pipes).

        Unix pipes are a very primitive example of a dataflow language [wikipedia.org].

      • Re: (Score:3, Insightful)

        by philipgar ( 595691 )
        While you make some good points in your comment, there are parts that are off. First, UNIX pipes are not an effective way to parallelize an application. UNIX pipes provide a method that tends to be inefficient, and will involve much "needless" copying of data (from your application to the pipe, the OS will then read in the data and write it to the other process which will then likely read the data into its address space). Additionally, UNIX pipes work well for steady state, but tend to have problems with
      • by Effugas ( 2378 ) *
        It's an interesting example you raise. Let's take a look at your example:

        find | tar | compress | remote-execute 'remote-copy | uncompress | untar'

        find -- you're sweeping the file system and comparing against rules. Maybe IO-driven CPU, at best.
        tar -- You're appending a couple headers. No work.
        compress -- OK, here there's a CPU bound.
        remote-execute remote-copy -- Throwing stuff onto the network and pulling it off.
        uncompress -- OK, more CPU bound.
        untar -- Now you're adding files to the file system, but onl
    • When I came up through my CS degree, object-oriented programming was new. Programming was largely a series of sequentially ordered instructions. I haven't programmed in many years now, but if I wanted to write a parallel program I would not have a clue.

      But why should I?

      What is needed are new, high-level programming languages that figure out how to take a set of instructions and best interface with the available processing hardware on their own. This is where the computer smarts need to be focused today, IMO.

      Crikey, when was your CS degree? Mine was a long time ago, yet I still learned parallel programming concepts (using the occam [wikipedia.org] language).

      • I actually finished my degree in 2005, but I took all my CS classes from 1992-1997. I started college in 1988.

    • by Kjella ( 173770 )
      Yes, they could be better but the problem isn't going to go away entirely. When you run a single-threaded application A to Z you only need to consider sequence. When you try to make a multi-threaded application you have to not only tell it about sequence but also the choke points where the state must be consistent. There are already languages to make it a lot easier to fit into the "pool" design pattern where you have a pool of tasks and a pool of resources (threads) to handle it, which works when you got s
  • Uh, what a crap (Score:4, Informative)

    by udippel ( 562132 ) on Saturday May 03, 2008 @06:36AM (#23283240)
    "News for Nerds, Stuff that matters".
    But not if posted by The Ignorant.

    What if the developer gives the character movement tasks its own thread, but it can only be rendered at 400 fps. And the developer gives the 3D world drawer its own thread, but it can only be rendered at 60 fps. There's a lot of waiting by the audio and character threads until everything catches up. That's called synchronization.

    If a student of mine wrote this, a Fail would be the immediate consequence. How can 400 fps be 'only'? And why is threading bad, if the character movement is ready after 1/400 second? There is not 'a lot of waiting'; instead, there are a lot of cycles to calculate something else. And 'waiting' is not 'synchronisation'.
    [The audio-rate of 7000 fps gave the author away; and I stopped reading. Audio does not come in fps.]

    While we all agree on the problem of synchronisation in parallel programming, and maybe especially in the gaming world, we should not allow uninformed blurb on Slashdot.
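    To put numbers on the 400-vs-60 example: a rough, single-threaded sketch of running movement updates at 400 per second while drawing at 60 frames per second. The 1200-tick time base and every name here are assumptions chosen for illustration (1200 divides evenly by both rates); this is not how the article or any particular engine does it.

```python
TICKS_PER_SECOND = 1200                 # common time base for both rates
SIM_STEP = TICKS_PER_SECOND // 400      # 3 ticks per movement update
FRAME_STEP = TICKS_PER_SECOND // 60     # 20 ticks per rendered frame


def run(seconds):
    """Return (movement updates, rendered frames) after `seconds`."""
    sim_updates = 0
    frames = 0
    sim_time = 0
    end = seconds * TICKS_PER_SECOND
    for frame_time in range(FRAME_STEP, end + 1, FRAME_STEP):
        # Catch the movement simulation up to the moment this frame is drawn.
        while sim_time + SIM_STEP <= frame_time:
            sim_time += SIM_STEP
            sim_updates += 1
        frames += 1                     # "draw" using the latest state
    return sim_updates, frames
```

    Over one simulated second this gives 400 movement updates and 60 frames, i.e. six or seven updates per frame. Neither rate has to wait for the other; the only point where they must agree is the state handed over when a frame is drawn.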
    • Samples per second would be more accurate.
    • by maxume ( 22995 )
      I'm pretty sure it means "fixed at 400 fps" rather than "just 400 fps".
    • by Quixote ( 154172 ) *
      While I agree that the "article" was by a nitwit, I do have to quibble about something you wrote.

      How can 400 fps be 'only'?

      You are responding to the following (hypothetical) statement:
      but it can be rendered at only 400 fps

      Which is different from the one written:
      but it can only be rendered at 400 fps

      See the difference?

  • by nguy ( 1207026 )
    Except for being somewhat more cumbersome to program and less parallel than previous hardware, there is nothing really new about the nVidia parallel programming model. And their graphics-oriented approach means that their view of parallelism is somewhat narrow.

    Maybe nVidia will popularize parallel programming, maybe not. But I don't see any "shake up" or breakthroughs there.
  • Why not start again with a massively parallel GPU, skipping all the years of catch-up that will be necessary with multi-core CPUs? Make an OS for your chips...
  • NVidia is one of the major voices in the Khronos Group, the organization that promised to release the OpenGL 3.0 API over six months ago. The delay is embarrassing, and many are turning to DirectX.

    It occurs to me that NVidia may not want OpenGL to succeed. Maybe they're holding up OpenGL development to give CUDA a place in the sun. Does anyone else get the same impression?
    • by mikael ( 484 )
      Delays are mainly due to disagreements between different vendors rather than any one company wanting to slow the show down.

      Look at the early OpenGL registry extension specifications - vendors couldn't even agree on what vector arithmetic instructions to implement.
    • Re: (Score:3, Insightful)

      by johannesg ( 664142 )
      NVidia has every reason to want OpenGL to succeed - if it doesn't, Microsoft will rule supreme over the API to NVidia's hardware, and that isn't a healthy situation to be in. As it is, OpenGL gives them some freedom to do their own thing.

      However, having mentioned Microsoft... If *someone* doesn't want OpenGL to succeed, it is them... If and when OpenGL 3.0 ever appears, I bet there will be some talk of some "unknown party" threatening patent litigation...

      Destroying OpenGL is of paramount importance to Microsoft,
  • by njord ( 548740 ) on Saturday May 03, 2008 @08:53AM (#23283632)
    From my experience, CUDA was much harder to take advantage of than multi-core programming. CUDA requires you to use a specific model of programming that can make it difficult to take advantage of the full hardware. The restricted caching scheme makes memory management a pain, and the global synchronization mechanism is very crude - there's a barrier after each kernel execution, and that's it. It took me a week to port ('parallelize') some simple code I had written to CUDA, whereas it took me an hour or so to add the OpenMP statements to my 'reference' CPU code. Sorry Nvidia - there is no silver bullet. By making some parts of parallel programming easy, you make others hard or impossible.
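    The "barrier after each kernel execution" point can be illustrated with a rough CPU-side analogy. This is plain Python with a thread pool, not CUDA code; `launch` and `two_phase` are hypothetical names, and the only thing it is meant to show is that when the sole global synchronization point is the end of a kernel, any step that needs everyone else's results has to become a separate launch.

```python
from concurrent.futures import ThreadPoolExecutor


def launch(pool, kernel, n):
    """Run kernel(i) for every i in range(n).

    Collecting all results before returning plays the role of the
    global barrier at the end of a kernel execution.
    """
    return list(pool.map(kernel, range(n)))


def two_phase(data):
    n = len(data)
    with ThreadPoolExecutor(max_workers=4) as pool:
        # Phase 1: every element is processed independently.
        squared = launch(pool, lambda i: data[i] * data[i], n)
        # Phase 2 reads a neighbour's phase-1 result, so it cannot start
        # until phase 1 has finished everywhere: hence a second "launch".
        return launch(pool, lambda i: squared[i] + squared[(i + 1) % n], n)
```

    For the input [1, 2, 3], phase 1 produces [1, 4, 9] and phase 2 produces [5, 13, 10].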
    • by ameline ( 771895 )
      Mod parent up... His is one of the best on this topic.
    • Re: (Score:1, Interesting)

      by Anonymous Coward
      You make a good point: The data-parallel computing model used in CUDA is very unfamiliar to programmers. You might read the spec sheet and see "128 streaming processors" and think that is the same as having 128 cores, but it is not. CUDA inhabits a world somewhere between SSE and OpenMP in terms of task granularity. I blame part of this confusion on Nvidia's adoption of familiar sounding terms like "threads" and "processors" for things which behave nothing like threads and processors from a multicore pro
  • by Futurepower(R) ( 558542 ) on Saturday May 03, 2008 @11:49AM (#23284520) Homepage
    Avoid the blog spam. This is the actual article in EE times: Nvidia unleashes Cuda attack on parallel-compute challenge [eetimes.com].

    Nvidia is showing signs of being poorly managed. CUDA [cuda.com] is a registered trademark of another hi-tech company.

    The underlying issue is apparently that Nvidia will lose most of its mid-level business when AMD/ATI and Intel/Larrabee begin shipping integrated graphics. Until now, Intel integrated graphics has been so limited as to be useless in many mid-level applications. Nvidia hopes to replace some of that loss with sales to people who want to use their GPUs to do parallel processing.
    • Nvidia is showing signs of being poorly managed. CUDA is a registered trademark of another hi-tech company.

      Who cares? Medical equipment != parallel computation.

  • Multi-threaded programming is a fundamentally hard problem, as is the more general issue of maximally-efficient scheduling of any dynamic resource. No one idea, tool or company is going to "solve" it. What will happen is that lots of individual ideas, approaches, tools and companies will individually address little parts of the problem, making it incrementally easier to produce efficient multi-threaded code. Some of these approaches will work together, others will be in opposition, there will be engineer

  • by JRHelgeson ( 576325 ) on Saturday May 03, 2008 @01:00PM (#23284920) Homepage Journal
    I live in Minnesota, home of the legendary Cray Research. I've met with several old timers that developed the technologies that made the Cray Supercomputer what it was. Hearing about the problems that multi-core developers are facing today reminds me of the stories I heard about how the engineers would have to build massive cable runs from processor board to processor board to memory board just to synchronize the clocks and operations so that when the memory was ready to read or write data, it could tell the processor board... half a room away.

    As I recall:
    The processor, as it was sending the data to the bus, would have to tell the memory to get ready to read data through these cables. The "cables hack" was necessary because the cable path was shorter than the data bus path, and the memory would get the signal just a few mS before the data arrived at the bus.

    These were fun stories to hear but now seeing what development challenges we face in parallel programming multi-core processors gives me a whole new appreciation for those old timers. These are old problems that have been dealt with before, just not on this scale. I guess it is true what they say, history always repeats itself.
  • ... programs are still only as fast as their slowest link.
