


Five Nvidia CUDA-Enabled Apps Tested

crazipper writes "Much fuss has been made about Nvidia's CUDA technology and its general-purpose computing potential. Now, in 2009, a steady stream of launches from third-party software developers sees CUDA gaining traction in the mainstream. Tom's Hardware takes five of the most interesting desktop apps with CUDA support and compares the speed-up yielded by a pair of mainstream GPUs versus a CPU alone. Not surprisingly, depending on the workload you throw at your GPU, you'll see results ranging from average to downright impressive."
  • by mikiN ( 75494 )


    All fine and dandy, but...does it run Linux?

    • Re:Nice, but... (Score:5, Informative)

      by slummy ( 887268 ) <shawnuthNO@SPAMgmail.com> on Monday May 18, 2009 @06:58PM (#28004455) Homepage
      CUDA is a framework that will work on Windows and Linux.
      • Re: (Score:2, Informative)

        by mikiN ( 75494 )

        Queue mip-mapped, 8xAA, subpixel rendered, fogged, PhysX enhanced flyby of a 'Whoosh' passing over your head.

        The question was not whether CUDA runs _on_ Linux, but whether the GPU itself can run Linux.

        I can imagine that, if we had ever been given all the specs, a multi-function DSP card like IBM's Mwave could. It would probably even be able to read aloud console messages (besides being a graphics card and modem, it's also a sound card).

        • Queue mip-mapped, 8xAA, subpixel rendered, fogged, PhysX enhanced flyby of a 'Whoosh' passing over your head.

          What, this thing runs on AA batteries? Sweet.

          And as a side note, unless you were talking about a long line of whooshes, the word you were looking for is "cue".

    • Re:Nice, but... (Score:5, Informative)

      by gustgr ( 695173 ) <rondinaNO@SPAMgmail.com> on Monday May 18, 2009 @06:59PM (#28004463) Homepage

      I know you are trolling, but actually CUDA applications work better on Linux than on Windows. If you run a CUDA kernel on Windows that lasts longer than 5-6 seconds, your system will hang. The same would happen on Linux, but there you can simply disable the X server, or have one card providing your graphical display and another as your parallel co-processor.

      • by Anpheus ( 908711 )

        Are you certain this is the case?

        I'm curious because ATI/AMD appear to have solved that problem, in that I can run the Folding@Home GPU client and my displays still run. I'm running Windows 7 with Aero, so it's hitting the GPU not the CPU for my displays.

        • Re:Nice, but... (Score:4, Informative)

          by 3.1415926535 ( 243140 ) on Monday May 18, 2009 @07:35PM (#28004831)

          Folding@Home runs its computations in short bursts. gustgr is talking about a single computation kernel that takes more than 5-6 seconds.

        • Re:Nice, but... (Score:4, Informative)

          by Jah-Wren Ryel ( 80510 ) on Monday May 18, 2009 @07:35PM (#28004837)

          He's not talking about how long the app itself runs, but how long each subroutine that runs on the GPU takes before returning something to the app on the CPU side. If that subroutine takes too long to complete, Windows gets unhappy. I don't remember if it was a watchdog timer thing or a bus-locking thing or something else. I don't even know if it's been fixed or not.
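          The workaround people usually describe for that limit is to split a long GPU computation into many short kernel launches so each one finishes well under the watchdog deadline. A toy sketch of that batching logic (every name and number here is an illustrative assumption, not a CUDA API call):

```python
def plan_batches(total_items, items_per_second, watchdog_seconds=5.0, safety=0.5):
    """Split a long job into batches sized so each batch runs well under
    an assumed watchdog limit (the 5 s figure and safety margin are guesses)."""
    max_items = int(items_per_second * watchdog_seconds * safety)
    batches = []
    remaining = total_items
    while remaining > 0:
        n = min(max_items, remaining)
        batches.append(n)
        remaining -= n
    return batches

# At an assumed 1M items/s, each batch takes ~2.5 s, safely under ~5 s.
batches = plan_batches(total_items=10_000_000, items_per_second=1_000_000)
```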

          • by Anpheus ( 908711 )

            Thanks for the clarification, as well.

          • Re: (Score:3, Informative)

            by AmiMoJo ( 196126 )

            Presumably it's some kind of issue with CUDA because running code on ATI GPUs does not seem to have this problem. Also, multiple GPUs are supported by apps like Elcomsoft's Wireless Password Recovery on Windows.

            It should be fixable anyway, since modern GPUs are massively parallel and desktop work needs only a fraction of the available processing, even if it's just a case of setting a few stream processors aside.

      • Re: (Score:3, Informative)

        I know you are trolling...

        No, he's joking. Stop crying troll when there's not even a hint of troll, for God's sake.

        ...but actually CUDA applications work better on Linux than on Windows.

        Read carefully. He said "does it run Linux?", not "does it run on Linux?". Overused slashdot meme it might be, but the joke still went miles above your head.

        • I'll give him the benefit of the doubt on "does it run Linux" vs. "does it run on Linux". I read it the same way and didn't notice until I saw your comment.
    • by mikiN ( 75494 ) on Monday May 18, 2009 @07:01PM (#28004479)

      Well, everywhere else in the world, Linux runs the CUDA Toolkit [nvidia.com], so I can imagine that in Soviet Russia, a Beowulf cluster of Nvidia cards runs Linux.

  • The war begins. (Score:2, Interesting)

    by XPeter ( 1429763 ) *

    With NVIDIA slowly pushing its way into the CPU market (CUDA is the first step; in a few years I wouldn't be surprised if Nvidia started developing processors) and Intel trying to cut into Nvidia's GPU market share with Larrabee http://en.wikipedia.org/wiki/Larrabee_(GPU) [wikipedia.org], we'll see who can develop outside of their box faster. This is good news for AMD, since Intel will be more focused on Nvidia instead of being neck and neck with them in the processor market. Hey, maybe AMD will regain its power in the se

    • Re: (Score:2, Interesting)

      by David Greene ( 463 )

      It's going to be interesting to see how Larrabee and AMD's Fusion battle it out. With Larrabee, Intel is taking a tightly integrated approach. One can easily imagine that LRBni will be integrated into mainstream CPUs in the not-so-distant future, at which point Intel will argue that no one needs a GPU.

      AMD, on the other hand, is taking the approach of (relatively) loosely coupled specialized processors: the CPU for general-purpose/integer/branchy code and the GPU for graphics (and HPC?).

      Currently my

    • I'd honestly like to see the two work together to produce some sort of sickeningly powerful rendering setup.

      A processor which was good at preprocessing a scene for maximum performance on the GPU hardware and built-in support for multiple display adapters, plus an on-board chip which handles outputting the resulting images via the digital-link-du-jour.

      This sort of setup would mean that rather than having to update your GPUs every two years (you could just buy another one to run in parallel) - the graphics

    • There's no power in the netbook realm for AMD to regain as it never had any to begin with. The netbook market is 95% Intel and the rest is mainly VIA and a smattering of MIPS and ARM nobody seems to care about.
  • Tied to a card (Score:5, Insightful)

    by ComputerDruid ( 1499317 ) on Monday May 18, 2009 @07:07PM (#28004531)

    What I don't understand is why people hype a technology that is tied to a specific manufacturer of card. If Nvidia died tomorrow, we'd have a fair amount of code that's no longer relevant, unless there were some way to design cards that are CUDA-capable but not Nvidia.

    Also worth noting that I'd completely forgotten CUDA even ran on windows, as I've only heard it in the context of linux recently.

    • Re:Tied to a card (Score:5, Insightful)

      by gustgr ( 695173 ) <rondinaNO@SPAMgmail.com> on Monday May 18, 2009 @07:12PM (#28004585) Homepage

      OpenCL will hopefully help to set solid ground for GPU and CPU parallel computing, and since it is not technically very different from CUDA, porting existing applications to OpenCL should not be a challenge. Nowadays, with massively parallel hardware, the hardest part is making the algorithms parallel, not programming any specific device.

      • This of course assumes that OpenCL gains a foothold, has support from the hardware, and gets some software that really shows the improvements developers can get by using it.

        Without those it won't have enough traction/mindshare.

    • by egr ( 932620 )
      I think there was an open source alternative which is not tied to any card, but I forgot what its name was. And I never programmed for it, so I don't know how well it performs.
      • Re: (Score:2, Informative)

        by Caelius ( 1282378 )
        OpenCL is the open source CUDA alternative. http://en.wikipedia.org/wiki/OpenCL [wikipedia.org]
        • OpenCL is not open source; OpenCL is a specification for a CUDA-equivalent language and API. Drivers are still necessary, and will likely be produced by the makers of the graphics hardware (ATI, Nvidia, Intel). Open source drivers and compilers are certainly possible, but I wouldn't expect them to be equivalent to the closed source stuff for some time yet.

        • Re:Tied to a card (Score:4, Informative)

          by TheRaven64 ( 641858 ) on Monday May 18, 2009 @07:40PM (#28004879) Journal
          OpenCL is an open standard, but there is not yet an open source implementation. That said, OpenCL is very similar to GLSL, and there is already a GLSL front end for LLVM being worked on by Mesa and Tungsten Graphics, so extending it to support OpenCL should be relatively easy.
    • Re:Tied to a card (Score:4, Informative)

      by Anonymous Coward on Monday May 18, 2009 @07:44PM (#28004923)

      I hear this a lot in CUDA/GPGPU-related threads on slashdot, primarily from people who simply have zero experience with GPU programming. The bottom line is that in the present and for the foreseeable future, if you are going to try to accelerate a program by offloading some of the computation to a GPU, you are going to be tying yourself to one vendor (or writing different versions for multiple vendors) anyways. You simply cannot get anything approaching worthwhile performance from a GPU kernel without having a good understanding of the hardware you are writing for. nVidia has a paper [nvidia.com] that illustrates this excellently, in which they start off with a seemingly good "generic" parallel reduction code and go through a series of 7 or 8 optimizations -- most of them based on knowledge of the hardware -- and improve its performance by more than a factor of 30 versus the generic implementation.
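      For the curious, the tree-shaped reduction pattern that the paper progressively optimizes can be sketched sequentially: each pass halves the number of active elements, which is what maps onto the GPU's threads. This is a conceptual CPU-side sketch, not the actual CUDA kernel:

```python
def tree_reduce(values):
    """Pairwise (tree-shaped) sum, the pattern GPU reduction kernels
    parallelize: each pass halves the number of active elements."""
    data = list(values)
    stride = 1
    while stride < len(data):
        # On a GPU, every iteration of this inner loop runs as a thread.
        for i in range(0, len(data), 2 * stride):
            if i + stride < len(data):
                data[i] += data[i + stride]
        stride *= 2
    return data[0] if data else 0

print(tree_reduce(range(10)))  # 45
```

      The paper's speedups come from refining how this pattern touches memory and keeps threads busy, which is exactly the hardware-specific knowledge being discussed above.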

      Another thing to keep in mind is that CUDA is very simple to learn as an API -- if you're familiar with C you can pick up CUDA in an afternoon easily. The difficulty, as I said in the previous paragraph, is optimization; and optimizations that work well for a particular GPU in CUDA will (or at least should) work well for the same GPU in OpenCL.

      • That's the whole point of the OpenCL architecture: to let the compiler figure out the hardware-specific optimizations. If you want a cross-platform, GPU-independent mechanism to:

        [ _Booming_ _Monster_ _Truck_ _Voice_]
        Tap the hidden potential of your GPU! then you want OpenCL.
    • Re: (Score:2, Interesting)

      In general, it's not tied to a card. CUDA itself might be NVIDIA-dependent, but general-purpose GPU programming is not, and other manufacturers will have similar interfaces to GP-GPU programming, eventually.

      As for my own experience with it... everyone at work is going crazy over them. One of our major simulations implements a high-fidelity IR scene modeler. It used to take 2 seconds per frame on CPU-only. They re-wrote it with GPU and got it down to 12 ms.

      Anything that is highly parallelizable with low

    • That's where abstraction and specialization come into play. After defining your algorithm for independent use, specialize and optimize it to exploit current or future hardware. This gives you a fallback for the calculation, and greatly enhanced performance for the life and support of said hardware. And, as others have pointed out, it's a stepping stone to an OpenCL implementation, eventually giving you multiple vendors to rely on.

      If NVIDIA goes out of business or drops support in two years, how much more wor

    • Re: (Score:3, Interesting)

      by CAIMLAS ( 41445 )

      How is this different than AMD-v, which Intel licenses for their virtualization (or maybe I'm confusing it with a64, which Intel licenses)?

      Either way, if AMD "died tomorrow", the same thing would happen as would happen if Nvidia did: some other company, likely a previous competitor, would buy up the technology, and things would continue with barely a hiccup.

      A product or technology does not need to be open source or 'standards based' to gain wild adoption. Sometimes, a technology speaks for itself. After all

  • For folders (Score:4, Informative)

    by esocid ( 946821 ) on Monday May 18, 2009 @07:07PM (#28004539) Journal
    Fold@home [stanford.edu] can use CUDA in linux, but you have to compile the CUDA driver first.
  • SETI? (Score:4, Informative)

    by NiteMair ( 309303 ) on Monday May 18, 2009 @07:14PM (#28004609)

    Waste your GPU cycles on something more interesting than SETI...

    http://distributed.net/download/prerelease.php (ok, maybe that's less interesting...)

    And why limit this discussion to CUDA? ATI/AMD's STREAM is usable as well...


    • As of now, though, nvidia's CUDA has all of the hype, as well as a handful of applications developed for the platform.
  • The same way the DoD paid for the Cray supercomputers, gamers are paying for the GPUs. Science dropped by and said thanks.

  • For those out of work since the millennium bug, at long last FORTRAN is back: http://www.nvidia.com/object/cuda_what_is.html [nvidia.com]

    Can't wait for the APL support. Reorganising my keyboard keys in anticipation.

    • Re: (Score:1, Funny)

      by Anonymous Coward

      Back? You've never been in a Physics department, have you? Fortran was never gone.

  • h.264 encoding (Score:5, Informative)

    by BikeHelmet ( 1437881 ) on Monday May 18, 2009 @07:31PM (#28004787) Journal

    h.264 encoding didn't improve with more shaders for some of the results (like PowerDirector 7) because of the law of diminishing returns.

    I remember reading about x264 when quad-cores were becoming common. It mentioned that if quality is of the utmost importance, you should still encode on a single core. It splits squares of pixels between the cores; where those squares connect there can be very minor artifacts. It smooths these artifacts out with a small amount of extra data and post processing; the end result is a file hardly 1-2% bigger than if encoded on a single core, but encoded roughly 4x faster.

    Now, if we're talking about 32 cores, or 64, or 128, would the size difference be bigger than 1-2%? Probably. After a certain point, it would almost certainly not be worth it.
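    One way to put numbers on that guess: if each extra slice boundary costs a roughly fixed size penalty, the overhead grows linearly with core count. The 0.3% per boundary below is purely an assumed figure for illustration:

```python
def size_overhead_pct(cores, pct_per_split=0.3):
    """Assumed model: each extra slice boundary adds a fixed size penalty,
    so overhead grows linearly with the number of slices (cores)."""
    return pct_per_split * (cores - 1)

for cores in (1, 4, 32, 128):
    print(cores, "cores:", round(size_overhead_pct(cores), 1), "% larger")
```

    Under that (assumed) model, 4 cores land around the 1% the post mentions, while 128 cores would bloat the file by over a third, consistent with diminishing returns.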

    This is supported by Badaboom's results, where the higher resolution videos (with more encoded squares) seem to make use of more shaders when encoding, while most of the lower resolution vids do not. (indicating that some shaders may be lying idle)

    What I'm curious about, is could the 9800GTX encode two videos at once, while the 9600GT could only manage one? ;)

    I'm also curious why the 320x240 video encoded so quickly - but that could be from superior memory bandwidth, shader clockspeed, and some other important factor in h.264 encoding.

    Take it with a grain of salt; I'm not an encoder engineer; just regurgitating what I once read, hopefully accurately. ;)

    • Re: (Score:3, Informative)

      Data compression is an inherently serial operation. Parts of it can be done in parallel, but in general the way you compress the next bit is based on the patterns observed earlier.

      Say you wanted one core to start encoding at 0% and the other at 50% of the way into the movie. The core starting at 50% has to start compression without any of the learned patterns in the 0-50% range. In the example you gave, one core encodes half the screen and the other core encodes the other half. If they are running in parra
      • I know almost nothing about data compression beyond the readme for pkzip. Are there really enough learned patterns in a video stream that would make a >1% difference in filesize if compressed in independent chunks? As far as I can reason it out, independent chunks would act like you'd just inserted an extra keyframe at the splitpoints.

      • Re: (Score:3, Informative)

        Data compression is an inherently serial operation. Parts of it can be done in parallel, but in general the way you compress the next bit is based on the patterns observed earlier.

        Say you wanted one core to start encoding at 0% and the other at 50% of the way into the movie. The core starting at 50% has to start compression without any of the learned patterns in the 0-50% range. In the example you gave, one core encodes half the screen and the other core encodes the other half. If they are running in parallel, the second core can't use the learnt patterns of the first unless it wants to wait for the first core to finish its current frame (thereby making it non-parallel).

        So you have a tradeoff. You can run everything serially, or you can accept that you'll miss a few observed patterns here and there and run more in parallel.

        For usability (seeking through a video), no codec works from patterns learned across the whole file. The memory requirements would be astronomical (you'd have to store the entire file in RAM; good luck doing that with a Blu-ray).

        IIRC, the furthest back any codec looks is something like 24 frames.

      • Re: (Score:2, Informative)

        by Anonymous Coward

        For video encoding there is a ton of work that can be done in parallel. You can compute all of the dct's for all of the macroblocks in parallel. You can run your motion search for every block in parallel.

      • Re: (Score:3, Informative)

        by adolf ( 21054 )

        This is one of the most inane thought patterns I have yet to witness this week.

        The reason is simple: Fine, so you've split a process into chunks and distributed them across two or more cores. But it's not exactly like those cores are working in a vacuum; they all use the same RAM.

        As another reply has stated, codecs don't work quite how you describe -- they don't use the entire media as a reference, but at most a couple of dozen frames. But even if such mythological technology were really in use: There's

  • by Muerte23 ( 178626 ) on Monday May 18, 2009 @07:45PM (#28004925) Journal

    The Tesla 1060 is a video card with no video output (strictly for processing) that has something like 240 processor cores and 4 GB of DDR3 RAM. Just doing math on large arrays (1k x 1k) I get a performance boost of about a factor of forty over a dual core 3.0 GHz Xeon.

    The CUDA extension set has FFT functionality built in as well, so it's excellent for signal processing. The SDK and programming paradigm are super easy to learn. I only know C (and not C++) and I can't even make a proper GUI, but I can make my array functions run massively in parallel.

    The trick is to minimize memory movement between the CPU and the GPU, because that kills performance. Only the newest cards support "simultaneous copy and execute", where one thread can be reading new data onto the card, another can be processing, and a third can be moving results off the card.
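    That three-way overlap is a classic software pipeline: while chunk N is being computed, chunk N+1 is uploading and chunk N-1's results are downloading. A toy schedule of which chunk occupies each stage at each step (pure illustration; no real CUDA streams involved):

```python
def pipeline_schedule(n_chunks):
    """Per time step, which chunk occupies each of the three stages
    (upload, compute, download); None means that stage is idle."""
    steps = []
    for t in range(n_chunks + 2):  # two extra steps to drain the pipeline
        upload   = t     if t < n_chunks           else None
        compute  = t - 1 if 0 <= t - 1 < n_chunks  else None
        download = t - 2 if 0 <= t - 2 < n_chunks  else None
        steps.append((upload, compute, download))
    return steps

for step in pipeline_schedule(4):
    print(step)
```

    Once the pipeline fills, all three stages are busy every step, which is how the copy time gets hidden behind the compute time.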

    One way that the video people can maybe speed up their processing (disclaimer: I don't know anything about this) is to do a quick sweep for keyframes, and then send the video streams between keyframes to individual processor cores. So instead of each core gets a piece of the frame, maybe each core gets a piece of the movie.

    The days of the math coprocessor card have returned!

    • Re: (Score:2, Interesting)

      by Anonymous Coward
      We've run some signal processing on a Tesla card, and get roughly 500x improvement over (somewhat poorly written) code for a Core 2 Duo.
      ~8 hr on a Core 2 Duo
      ~1.5 hr on Core i7
      seconds on Tesla
      • Re: (Score:3, Informative)

        by Muerte23 ( 178626 )

        Well I didn't say my code was *well* written. Apparently there's a lot of trickery with copying global memory to cached memory to speed up operations. Cached memory takes (IIRC) one clock cycle to read or write, and global GPU memory takes six hundred cycles. And there's all this whatnot and nonsense about aligning your threads with memory locations that I don't even bother with.

    • Re: (Score:3, Informative)

      by parlancex ( 1322105 )
      Actually, what you are referring to is simultaneous DMA and kernel execution, which is available in every card with compute capability 1.1, meaning every card but the very first G80 series cards (8800 GTX and 8800 GTS). The GPU executes the DMA itself, pulling from memory that has been allocated aligned and page-locked, and this can be overlapped with kernel execution; it doesn't have anything to do with GPU or CPU threads. Transfers from non-page-locked memory are always synchronous and as
      • Re: (Score:2, Interesting)

        by Belisar ( 473474 )

        I assume that's what the parent meant.

        As an addendum, the newest CUDA 2.2 (with chip of the newest generation, i.e. GT200) actually has support for reading directly from (page-locked) host memory inside of GPU kernels... something I believe ATI cards have allowed for a while.

    • Is that for single or double precision work? Which Xeon exactly? Which compiler? How was the code written for the compiler? Which compiler flags?

      Although I don't dispute your claims, writing to get max performance out the newer xeons is *hard* and you need to be very careful. The 256 bit wide registers on the 54xx can be extremely handy for codes written the right way.

      I currently have a client that needs to run a lot of this [thegpm.org] and so far, I have the single cpu version running 10x faster than the parallel

  • I thought Nvidia was indicating they were going to move to supporting OpenCL, or are the simply planning to support multiple technologies?

    • Both, I'd guess. If someone releases some killer software for OpenCL they'd be mad not to support it - Apple are pushing it for OS X.

      On the other hand, if they do a deal with someone to write CUDA stuff, it's lock-in that you must buy an nvidia card.

      Either way they win...

      • by Trepidity ( 597 )

        They also have control over adding features to CUDA relatively rapidly as hardware gains new capabilities, which they can't easily do with OpenCL.

      • CUDA and OpenCL are not exclusive, they're at different layers in the driver stack. If you look at the NVIDIA slides, you'll see that C, OpenGL, DX11 Compute, and Fortran are all just frontend languages that compile to/run on top of CUDA.

    • I remember reading the OpenCL announcement (I like to pretend that I know what I'm talking about in programming matters) and Nvidia did indeed say that they would be supporting it.

  • by Doc Ruby ( 173196 ) on Monday May 18, 2009 @11:06PM (#28006457) Homepage Journal

    Those benchmarks show that even older ($120-140) nVidia GPU cards can really speed up some processing tasks, especially transcoding video. But what I think is even more exciting than just the acceleration from offloading CPU to GPU is using multiple GPU cards in a single host PC. Stuff a $1000 PC with $1120 in GPUs (like 8 $140 nVidia cards), and that's 1024 parallel cores, anywhere from 16x to 56x the performance at only just over double the price. PCI-e should make the data parallel fast enough to feed the cards. I bet that 8 $1000 cards stuffed into a $1000 PC would be something like 200x to 4000x for only 9x the price.
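    The arithmetic behind that scenario, using the poster's assumed prices and an assumed 128 cores per card, works out like this:

```python
cards, card_price, pc_price = 8, 140, 1000   # poster's assumed numbers
cores_per_card = 128                         # assumed per-card core count

total_cost = pc_price + cards * card_price   # $1000 PC + $1120 of GPUs
total_cores = cards * cores_per_card

print(total_cores, "cores at", round(total_cost / pc_price, 2), "x the base price")
```

    That is where the "1024 parallel cores at just over double the price" figure comes from; whether the workload actually scales across 8 cards is a separate question.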

    So what I want to see is benchmarks for whole render farms. I want to see HD video transcoded into H.264 and other formats simultaneously on the fly, in realtime, with true fast-forward, in multiple independent streams from the same master source. This stuff is possible now on a reasonable budget.

    • by adolf ( 21054 )

      Cool. Sign me up.

      Just one problem: Where can I find a $1000 PC with 8 available PCI Express x16 slots? The best machine I have at the moment only has three, and 8 won't even fit into a normal ATX case.

    • by TubeSteak ( 669689 ) on Tuesday May 19, 2009 @01:51AM (#28007713) Journal

      Those benchmarks show that even older ($120-140) nVidia GPU cards can really speed up some processing tasks, especially transcoding video. But what I think is even more exciting than just the acceleration from offloading CPU to GPU is using multiple GPU cards in a single host PC. Stuff a $1000 PC with $1120 in GPUs (like 8 $140 nVidia cards), and that's 1024 parallel cores, anywhere from 16x to 56x the performance at only just over double the price.

      Your passwords are no longer safe.
      It used to take days for a cluster of PCs to brute-force an 8+ character password.
      Now, with a big enough PSU, you can stuff a tower with graphics cards to get it done in hours.
      About the only common hash I can't find a CUDA-enabled brute forcer for is NTLM2.
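      Rough feasibility math behind that claim (the keyspace formula is standard; the hash rates are illustrative guesses, not benchmarks):

```python
def brute_force_hours(length, charset=94, hashes_per_second=700e6):
    """Worst-case hours to enumerate every password of the given length
    over the given character set, at an assumed hashing rate."""
    return charset ** length / hashes_per_second / 3600

cpu = brute_force_hours(8, hashes_per_second=10e6)    # assumed single-CPU rate
gpu = brute_force_hours(8, hashes_per_second=700e6)   # assumed single-GPU rate
print(round(cpu / gpu))  # the GPU's advantage is just the rate ratio: 70
```

      Even at an assumed GPU rate, the full 94-character keyspace for 8 characters stays enormous; "hours" becomes realistic for reduced character sets, weaker hashes, or several cards working in parallel.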

      • My password is probably safe. It might take hours to crack a single password, but what are the odds that it will be my password, of all the billions of them in use now, of all the dozens of passwords I use, each different?
