Five Nvidia CUDA-Enabled Apps Tested

crazipper writes "Much fuss has been made about Nvidia's CUDA technology and its general-purpose computing potential. Now, in 2009, a steady stream of launches from third-party software developers is helping CUDA gain traction in the mainstream. Tom's Hardware takes five of the most interesting desktop apps with CUDA support and compares the speed-up yielded by a pair of mainstream GPUs against a CPU alone. Not surprisingly, depending on the workload you throw at your GPU, you'll see results ranging from average to downright impressive."
  • Re:Nice, but... (Score:5, Informative)

    by slummy ( 887268 ) <shawnuthNO@SPAMgmail.com> on Monday May 18, 2009 @06:58PM (#28004455) Homepage
    CUDA is a framework that will work on Windows and Linux.
  • Re:Nice, but... (Score:5, Informative)

    by gustgr ( 695173 ) <gustgrNO@SPAMgmail.com> on Monday May 18, 2009 @06:59PM (#28004463)

    I know you are trolling, but actually CUDA applications work better on Linux than on Windows. If you run a CUDA kernel on Windows that lasts longer than 5~6 seconds, your system will hang. The same happens on Linux, but there you can simply disable the X server, or have one card provide your graphical display and another serve as your parallel co-processor.
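
    (For illustration only, not from the original comment: a minimal CUDA sketch of the usual workaround, with made-up names like process_chunk. Splitting one long-running kernel into many short launches keeps each launch well under the display driver's watchdog limit.)

    #include <cuda_runtime.h>

    // Stand-in kernel: processes one chunk of the data per launch.
    __global__ void process_chunk(float *data, int offset, int chunk_len)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < chunk_len)
            data[offset + i] *= 2.0f;              // placeholder for real work
    }

    int main(void)
    {
        const int n = 1 << 24, n_chunks = 64, chunk = n / n_chunks;
        float *d_data;
        cudaMalloc(&d_data, n * sizeof(float));

        // Many short launches instead of one multi-second kernel:
        // each launch returns quickly, so the watchdog never trips.
        for (int c = 0; c < n_chunks; ++c) {
            process_chunk<<<(chunk + 255) / 256, 256>>>(d_data, c * chunk, chunk);
            cudaDeviceSynchronize();
        }
        cudaFree(d_data);
        return 0;
    }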

  • For folders (Score:4, Informative)

    by esocid ( 946821 ) on Monday May 18, 2009 @07:07PM (#28004539) Journal
    Folding@home [stanford.edu] can use CUDA on Linux, but you have to compile the CUDA driver first.
  • SETI? (Score:4, Informative)

    by NiteMair ( 309303 ) on Monday May 18, 2009 @07:14PM (#28004609)

    Waste your GPU cycles on something more interesting than SETI...

    http://www.gpugrid.net/
    http://distributed.net/download/prerelease.php (ok, maybe that's less interesting...)

    And why limit this discussion to CUDA? ATI/AMD's STREAM is usable as well...

    http://folding.stanford.edu/English/FAQ-ATI

  • Re:Tied to a card (Score:2, Informative)

    by Caelius ( 1282378 ) on Monday May 18, 2009 @07:17PM (#28004649)
    OpenCL is the open source CUDA alternative. http://en.wikipedia.org/wiki/OpenCL [wikipedia.org]
  • Re:Nice, but... (Score:2, Informative)

    by mikiN ( 75494 ) on Monday May 18, 2009 @07:19PM (#28004669)

    Cue mip-mapped, 8xAA, subpixel-rendered, fogged, PhysX-enhanced flyby of a 'Whoosh' passing over your head.

    The question was not whether CUDA runs _on_ Linux, but whether the GPU itself can run Linux.

    I can imagine that, if we had ever been given all the specs, a multi-function DSP card like IBM's Mwave could. It would probably even be able to read aloud console messages (besides being a graphics card and modem, it's also a sound card).

  • h.264 encoding (Score:5, Informative)

    by BikeHelmet ( 1437881 ) on Monday May 18, 2009 @07:31PM (#28004787) Journal

    h.264 encoding didn't improve with more shaders for some of the results (like PowerDirector 7) because of the law of diminishing returns.

    I remember reading about x264 when quad-cores were becoming common. It mentioned that if quality is of the utmost importance, you should still encode on a single core. It splits squares of pixels between the cores; where those squares connect there can be very minor artifacts. It smooths these artifacts out with a small amount of extra data and post-processing; the end result is a file only 1-2% bigger than if it were encoded on a single core, but encoded roughly 4x faster.

    Now, if we're talking about 32 cores, or 64, or 128, would the size difference be bigger than 1-2%? Probably. After a certain point, it would almost certainly not be worth it.

    This is supported by Badaboom's results, where the higher-resolution videos (with more encoded squares) seem to make use of more shaders when encoding, while most of the lower-resolution vids do not (indicating that some shaders may be lying idle).
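
    (For illustration, hypothetical code rather than anything from Badaboom: if an encoder assigns one thread block per 16x16 macroblock, the number of blocks, and therefore how much of the GPU it can keep busy, scales with resolution.)

    #include <cstdio>
    #include <cuda_runtime.h>

    // Hypothetical per-macroblock kernel; the body doesn't matter here.
    __global__ void encode_macroblock(void) { }

    int main(void)
    {
        // One thread block per 16x16 macroblock: block count grows with resolution,
        // so low-res clips simply don't generate enough blocks to fill a big GPU.
        int res[][2] = { {320, 240}, {1280, 720}, {1920, 1080} };
        for (int r = 0; r < 3; ++r) {
            dim3 grid((res[r][0] + 15) / 16, (res[r][1] + 15) / 16);
            printf("%4dx%-4d -> %u blocks\n", res[r][0], res[r][1], grid.x * grid.y);
            encode_macroblock<<<grid, dim3(16, 16)>>>();   // 256 threads per block
        }
        cudaDeviceSynchronize();
        return 0;
    }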

    What I'm curious about is: could the 9800GTX encode two videos at once, while the 9600GT could only manage one? ;)

    I'm also curious why the 320x240 video encoded so quickly - but that could be down to superior memory bandwidth, shader clock speed, or some other factor that matters in h.264 encoding.

    Take it with a grain of salt; I'm not an encoder engineer; just regurgitating what I once read, hopefully accurately. ;)

  • Re:Nice, but... (Score:4, Informative)

    by 3.1415926535 ( 243140 ) on Monday May 18, 2009 @07:35PM (#28004831)

    Folding@Home runs its computations in short bursts. gustgr is talking about a single computation kernel that takes more than 5-6 seconds.

  • Re:Nice, but... (Score:4, Informative)

    by Jah-Wren Ryel ( 80510 ) on Monday May 18, 2009 @07:35PM (#28004837)

    He's not talking about how long the app itself runs, but how long each subroutine that runs on the GPU takes before returning something back to the app on the CPU side. If that subroutine takes too long to complete, Windows gets unhappy. I don't remember if it was a watchdog timer thing or a bus-locking thing or something else. I don't even know if it's been fixed or not.

  • Re:Tied to a card (Score:4, Informative)

    by TheRaven64 ( 641858 ) on Monday May 18, 2009 @07:40PM (#28004879) Journal
    OpenCL is an open standard, but there is not yet an open source implementation. That said, OpenCL is very similar to GLSL, and there is already a GLSL front end for LLVM being worked on by Mesa and Tungsten Graphics, so extending it to support OpenCL should be relatively easy.
  • Re:Tied to a card (Score:4, Informative)

    by Anonymous Coward on Monday May 18, 2009 @07:44PM (#28004923)

    I hear this a lot in CUDA/GPGPU-related threads on slashdot, primarily from people who simply have zero experience with GPU programming. The bottom line is that in the present and for the foreseeable future, if you are going to try to accelerate a program by offloading some of the computation to a GPU, you are going to be tying yourself to one vendor (or writing different versions for multiple vendors) anyways. You simply cannot get anything approaching worthwhile performance from a GPU kernel without having a good understanding of the hardware you are writing for. nVidia has a paper [nvidia.com] that illustrates this excellently, in which they start off with a seemingly good "generic" parallel reduction code and go through a series of 7 or 8 optimizations -- most of them based on knowledge of the hardware -- and improve its performance by more than a factor of 30 versus the generic implementation.
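
    (A sketch in the spirit of that paper rather than NVIDIA's actual code: one of its hardware-aware steps is having each block reduce its slice in shared memory with sequential addressing, doing the first add while loading from global memory.)

    #include <cuda_runtime.h>

    // Each block sums its slice of `in` into one partial result.
    __global__ void reduce_sum(const float *in, float *block_sums, int n)
    {
        extern __shared__ float sdata[];            // one float per thread
        unsigned int tid = threadIdx.x;
        unsigned int i   = blockIdx.x * blockDim.x * 2 + tid;

        // First add is performed during the load from global memory.
        float v = (i < n) ? in[i] : 0.0f;
        if (i + blockDim.x < n) v += in[i + blockDim.x];
        sdata[tid] = v;
        __syncthreads();

        // Sequential addressing keeps the active threads contiguous and
        // avoids shared-memory bank conflicts.
        for (unsigned int s = blockDim.x / 2; s > 0; s >>= 1) {
            if (tid < s) sdata[tid] += sdata[tid + s];
            __syncthreads();
        }
        if (tid == 0) block_sums[blockIdx.x] = sdata[0];
    }

    // Host side (threads must be a power of two; blocks = (n + 2*threads - 1) / (2*threads)):
    //   reduce_sum<<<blocks, threads, threads * sizeof(float)>>>(d_in, d_partials, n);
    // then sum the per-block partials on the CPU or with a second pass.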

    Another thing to keep in mind is that CUDA is very simple to learn as an API -- if you're familiar with C you can pick up CUDA in an afternoon easily. The difficulty, as I said in the previous paragraph, is optimization; and optimizations that work well for a particular GPU in CUDA will (or at least should) work well for the same GPU in OpenCL.
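
    (Again illustrative only: roughly everything you need from the runtime API for a first working program, a vector add, fits in one screenful.)

    #include <cstdio>
    #include <cstdlib>
    #include <cuda_runtime.h>

    __global__ void vec_add(const float *a, const float *b, float *c, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) c[i] = a[i] + b[i];
    }

    int main(void)
    {
        const int n = 1 << 20;
        const size_t bytes = n * sizeof(float);
        float *h_a = (float *)malloc(bytes), *h_b = (float *)malloc(bytes), *h_c = (float *)malloc(bytes);
        for (int i = 0; i < n; ++i) { h_a[i] = 1.0f; h_b[i] = 2.0f; }

        float *d_a, *d_b, *d_c;
        cudaMalloc(&d_a, bytes); cudaMalloc(&d_b, bytes); cudaMalloc(&d_c, bytes);
        cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);
        cudaMemcpy(d_b, h_b, bytes, cudaMemcpyHostToDevice);

        vec_add<<<(n + 255) / 256, 256>>>(d_a, d_b, d_c, n);
        cudaMemcpy(h_c, d_c, bytes, cudaMemcpyDeviceToHost);   // copy back (synchronizes)

        printf("c[0] = %f\n", h_c[0]);                         // expect 3.0
        cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
        free(h_a); free(h_b); free(h_c);
        return 0;
    }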

  • Re:Tied to a card (Score:4, Informative)

    by jared9900 ( 231352 ) on Monday May 18, 2009 @07:45PM (#28004929)

    But OpenCL is a specification, not an implementation. The only three implementations I'm currently aware of are Apple's (in Snow Leopard), the one AMD demoed back in March, and Nvidia's beta implementation. So far none of those are open source. If you're aware of an open source implementation, please let me know; I'm actually very interested in it, but have yet to locate one.

  • Re:Tom's Hardware (Score:1, Informative)

    by crazipper ( 1250580 ) on Monday May 18, 2009 @07:58PM (#28005079)
    I'll pass this feedback along to the design guys, but do you *really* want to scroll through 4,000 words and 50-some charts, rather than looking at just the pages you're interested in reading? Surely the length would be a bigger problem if there wasn't an index, right? TBH, I'm most focused on the editorial side of things.
  • Re:Nice, but... (Score:3, Informative)

    by bigstrat2003 ( 1058574 ) * on Monday May 18, 2009 @08:17PM (#28005265)

    I know you are trolling...

    No, he's joking. Stop crying troll when there's not even a hint of troll, for God's sake.

    ...but actually CUDA applications work better on Linux than on Windows.

    Read carefully. He said "does it run Linux?", not "does it run on Linux?". Overused slashdot meme it might be, but the joke still went miles above your head.

  • by Muerte23 ( 178626 ) on Monday May 18, 2009 @08:21PM (#28005299) Journal

    Well I didn't say my code was *well* written. Apparently there's a lot of trickery with copying global memory to cached memory to speed up operations. Cached memory takes (IIRC) one clock cycle to read or write, and global GPU memory takes six hundred cycles. And there's all this whatnot and nonsense about aligning your threads with memory locations that I don't even bother with.
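
    (A rough sketch of the trickery being described, with made-up names: stage a tile of global memory in on-chip shared memory once, then let every thread in the block reuse it, and have consecutive threads read consecutive addresses so the global loads coalesce.)

    #include <cuda_runtime.h>

    #define TILE 256

    // 3-point moving average: each block stages TILE elements (plus halo cells)
    // in shared memory, then every thread reads its neighbours from there
    // instead of paying the long global-memory latency three times.
    __global__ void smooth3(const float *in, float *out, int n)
    {
        __shared__ float tile[TILE + 2];
        int g = blockIdx.x * TILE + threadIdx.x;   // global index
        int l = threadIdx.x + 1;                   // local index (skip left halo)

        tile[l] = (g < n) ? in[g] : 0.0f;          // coalesced load into shared memory
        if (threadIdx.x == 0)
            tile[0] = (g > 0) ? in[g - 1] : 0.0f;                    // left halo
        if (threadIdx.x == TILE - 1)
            tile[TILE + 1] = (g + 1 < n) ? in[g + 1] : 0.0f;         // right halo
        __syncthreads();

        if (g < n)                                 // these reads hit shared memory only
            out[g] = (tile[l - 1] + tile[l] + tile[l + 1]) / 3.0f;
    }

    // Launch: smooth3<<<(n + TILE - 1) / TILE, TILE>>>(d_in, d_out, n);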

  • Re:Tom's Hardware (Score:3, Informative)

    by linhares ( 1241614 ) on Monday May 18, 2009 @08:33PM (#28005399)
    And seriously, are you talking GPGPU performance or the magical wonders of SETI@home, h.264, science funding, and so on? So many pages wasted... and of course, much worse: my time wasted on the poetry.

    If you absolutely need this kind of wandering off-topic to get more pages and more clicks to survive on the web, then I'm concerned your site may not last very long. I personally love the site, but these 15-page wanderings off the subject drive me fucking nuts.

  • Re:h.264 encoding (Score:3, Informative)

    by SpazmodeusG ( 1334705 ) on Monday May 18, 2009 @08:46PM (#28005521)
    Data compression is an inherently serial operation. Parts of it can be done in parallel, but in general the way you compress the next bit is based on the patterns observed earlier.

    Say you wanted one core to start encoding at 0% and the other at 50% of the way into the movie. The core starting at 50% has to start compressing without any of the patterns learned over the 0-50% range. In the example you gave, one core encodes half the screen and the other core encodes the other half. If they are running in parallel, the second core can't use the patterns the first has learned unless it wants to wait for the first core to finish its current frame (thereby making it non-parallel).

    So you have a tradeoff. You can run everything serially, or you can accept that you'll miss a few observed patterns here and there and run more in parallel.
  • by Gary W. Longsine ( 124661 ) on Monday May 18, 2009 @08:56PM (#28005623) Homepage Journal
    It's not really clear what you're looking for, possibly because you're looking for the wrong thing. It might help if you first spend an hour or three learning a little more about OpenCL, and reading up at various sites to see who's doing what.

    OpenCL is an Open Standard compute language which comprises:
    • a language extended from C99,
    • a platform (hardware + OpenCL-aware device driver), and
    • a compiler and runtime (which may decide where to send a compute task at run time).

    If you're writing an OpenCL-aware device driver for a GPU, you'll probably need to wait a bit for some open source examples. It's reasonably likely that there will be some included in Darwin [apple.com] (once updated for Snow Leopard).

    Look to the LLVM [llvm.org] project (sponsored heavily by Apple and others) for an open source compiler which will (if it doesn't already) know about OpenCL.

    It sounds like you might be looking for a higher-level API which allows you to use OpenCL more easily, or possibly for language bindings to Java or Python? I suspect you'll see those coming along once Apple ships Snow Leopard and people have a chance to kick the tires, then integrate LLVM into their tool chains, extend various higher-level APIs, bridge to Java, and whatnot.

    The earliest high level API to take easy and broad advantage of OpenCL will probably be from Apple, of course. They'll likely provide some nicely automatic ways to take advantage of OpenCL without programming the OpenCL C API directly. As a Cocoa programmer, you'll be using various high level objects, maybe an indexer for example, which have been taught new OpenCL tricks. You'll just recompile your program and it will tap the GPU as appropriate and if available. The Cocoa implementation is closed source, but people will see what's possible and emulate it in various open source libraries, on other platforms, for Java and other languages.

    Here's a good place to start: OpenCL - Parallel Computing on the GPU and CPU [ucdavis.edu]. Follow up with a google search.

  • Re:Nice, but... (Score:5, Informative)

    by Jah-Wren Ryel ( 80510 ) on Monday May 18, 2009 @09:01PM (#28005655)

    Uhh...Cray is still very much alive. And doing vectors. And threads. And multicore. All long before Intel/AMD.

    Seymour Cray was killed by a speeding redneck in a trans-am in 1996.

    The company currently known as Cray was formerly known as TERA, which bought the assets of Cray Research from SGI, which had acquired Cray Research after Seymour left to form Cray Computer (also now defunct).

    Seymour was never significantly involved in multi-core or multi-threaded processors or NUMA. In fact, he specifically avoided designs even hinting at that sort of complexity, because he felt that simplicity in design made it easier to fully utilize the maximum performance of the hardware.

  • Re:h.264 encoding (Score:5, Informative)

    by SpazmodeusG ( 1334705 ) on Monday May 18, 2009 @09:04PM (#28005685)
    Encoding from multiple different keyframes works when you can seek to any part of the input video, but it doesn't help with realtime encoding.

    If I'm encoding a signal in realtime from TV, I have to start encoding at 0% and work onwards. The only way to parallelize it is to split the individual frames up into boxes (as Badaboom does).
  • Re:h.264 encoding (Score:3, Informative)

    by electrosoccertux ( 874415 ) on Tuesday May 19, 2009 @12:08AM (#28006905)

    Data compression is an inherently serial operation. Parts of it can be done in parallel, but in general the way you compress the next bit is based on the patterns observed earlier.

    Say you wanted one core to start encoding at 0% and the other at 50% of the way into the movie. The core starting at 50% has to start compressing without any of the patterns learned over the 0-50% range. In the example you gave, one core encodes half the screen and the other core encodes the other half. If they are running in parallel, the second core can't use the patterns the first has learned unless it wants to wait for the first core to finish its current frame (thereby making it non-parallel).

    So you have a tradeoff. You can run everything serially, or you can accept that you'll miss a few observed patterns here and there and run more in parallel.

    For usability (seeking through a video), no codec works from a pattern learned across the whole file. The memory requirements to make use of this would be astronomical (you'd have to store the entire file in RAM; good luck doing that with a Blu-ray).

    IIRC, the furthest back any codec looks is something like 24 frames.

  • by parlancex ( 1322105 ) on Tuesday May 19, 2009 @12:11AM (#28006919)
    Actually, what you are referring to is simultaneous DMA and kernel execution, and this is available on every card with compute capability 1.1, which is actually every card except the very first G80-series cards (8800 GTX and 8800 GTS). The GPU executes the DMA itself, pulling from host memory that has been allocated as aligned and page-locked, and this can be overlapped with kernel execution; it doesn't have anything to do with GPU or CPU threads. Transfers from non-page-locked memory are always synchronous and as such can't be overlapped with kernel execution. But, generally, yes, host -> device memory bandwidth is usually the bottleneck for most CUDA applications. Applications that can perform a large amount of processing on the same data, provided that data fits in device memory all at once, are able to mitigate this, but that doesn't usually include supercomputing or general coprocessor-esque applications (transcoding).
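
    (An illustrative sketch, not from the parent: page-locked host buffers from cudaMallocHost plus cudaMemcpyAsync in separate streams let the copy for one chunk overlap kernel execution on another, on compute 1.1+ parts. The names and ping-pong scheme are made up for the example.)

    #include <cuda_runtime.h>

    __global__ void work(float *buf, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) buf[i] *= 2.0f;                  // stand-in for real processing
    }

    int main(void)
    {
        const int n = 1 << 20;
        const size_t bytes = n * sizeof(float);

        float *h_buf[2], *d_buf[2];
        cudaStream_t stream[2];
        for (int s = 0; s < 2; ++s) {
            cudaMallocHost(&h_buf[s], bytes);       // page-locked: required for async copies
            cudaMalloc(&d_buf[s], bytes);
            cudaStreamCreate(&stream[s]);
        }

        // Ping-pong between two streams: while one stream's kernel runs,
        // the other stream's upload can be in flight on the copy engine.
        for (int chunk = 0; chunk < 8; ++chunk) {
            int s = chunk & 1;
            // (a real app would refill h_buf[s] with new data here)
            cudaMemcpyAsync(d_buf[s], h_buf[s], bytes, cudaMemcpyHostToDevice, stream[s]);
            work<<<(n + 255) / 256, 256, 0, stream[s]>>>(d_buf[s], n);
            cudaMemcpyAsync(h_buf[s], d_buf[s], bytes, cudaMemcpyDeviceToHost, stream[s]);
        }
        cudaDeviceSynchronize();

        for (int s = 0; s < 2; ++s) {
            cudaFreeHost(h_buf[s]); cudaFree(d_buf[s]); cudaStreamDestroy(stream[s]);
        }
        return 0;
    }
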
  • Re:h.264 encoding (Score:2, Informative)

    by Anonymous Coward on Tuesday May 19, 2009 @12:16AM (#28006971)

    For video encoding there is a ton of work that can be done in parallel. You can compute all of the DCTs for all of the macroblocks in parallel. You can run your motion search for every block in parallel.

  • Re:h.264 encoding (Score:3, Informative)

    by adolf ( 21054 ) <flodadolf@gmail.com> on Tuesday May 19, 2009 @01:05AM (#28007391) Journal

    This is one of the most inane thought patterns I have yet to witness this week.

    The reason is simple: Fine, so you've split a process into chunks and distributed them across two or more cores. But it's not exactly like those cores are working in a vacuum; they all use the same RAM.

    As another reply has stated, codecs don't work quite how you describe -- they don't use the entire media as a reference, but at most a couple of dozen frames. But even if such mythological technology were really in use: There's no qualitative reason why something learned by process A cannot be shared with process B, and vice-versa. Therefore, the two processes can encode totally different segments of a given video, share what they've learned, and make similar and consistent tradeoffs.

    After that, you join the parts on an existing keyframe (which doesn't have to be exactly at 50% or whatever the ideal number happens to be), and call it a day.

  • Re:Nice, but... (Score:3, Informative)

    by AmiMoJo ( 196126 ) on Tuesday May 19, 2009 @05:21AM (#28008935) Homepage Journal

    Presumably it's some kind of issue with CUDA because running code on ATI GPUs does not seem to have this problem. Also, multiple GPUs are supported by apps like Elcomsoft's Wireless Password Recovery on Windows.

    It should be fixable anyway, since modern GPUs are massively parallel and desktop stuff needs only a fraction of the available processing power, even if it's just a case of setting a few stream processors aside.

  • by mdarksbane ( 587589 ) on Tuesday May 19, 2009 @09:48AM (#28010949)

    And as someone who has worked in GLSL (which is at a similar level of abstraction to OpenCL), I can say you'll still see major differences even between cards from the same vendor.

    I remember several minor tweaks in our code that gave 20% performance boosts on one card and 20% loss on another, and that was without ever actually getting into the assembler. Video games already often have largely different rendering paths for different cards when it comes to specific shader effects.
