Why 'Gaming' Chips Are Moving Into the Server Room
Esther Schindler writes "After several years of trying, graphics processing units (GPUs) are beginning to win over the major server vendors. Dell and IBM are the first tier-one server vendors to adopt GPUs as server processors for high-performance computing (HPC). Here's a high-level view of the hardware change and what it might mean to your data center. (Hint: faster servers.) The article also addresses what it takes to write software for GPUs: 'Adopting GPU computing is not a drop-in task. You can't just add a few boards and let the processors do the rest, as when you add more CPUs. Some programming work has to be done, and it's not something that can be accomplished with a few libraries and lines of code.'"
CUDA (Score:4, Informative)
I was interested in CUDA until I learned that even the simplest of "hello world" apps is still quite complex and quite low-level.
NVidia needs to make the APIs and tools for CUDA programming simpler and more accessible, with solid support for higher-level languages. Once that happens, we could see adoption skyrocket.
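For a sense of how much ceremony even the trivial case takes, here is roughly what a minimal CUDA "add two arrays" program looks like. This is my own sketch, not an SDK sample; every step shown (device allocation, two copies in, launch, copy out, cleanup) is mandatory:

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Kernel: each thread adds exactly one element.
__global__ void vecAdd(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

int main() {
    const int n = 1024;
    size_t bytes = n * sizeof(float);
    float *h_a = (float *)malloc(bytes);
    float *h_b = (float *)malloc(bytes);
    float *h_c = (float *)malloc(bytes);
    for (int i = 0; i < n; ++i) { h_a[i] = (float)i; h_b[i] = 2.0f * i; }

    // Allocate device memory and copy the inputs over.
    float *d_a, *d_b, *d_c;
    cudaMalloc(&d_a, bytes); cudaMalloc(&d_b, bytes); cudaMalloc(&d_c, bytes);
    cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, h_b, bytes, cudaMemcpyHostToDevice);

    // Launch one thread per element, 256 threads per block.
    vecAdd<<<(n + 255) / 256, 256>>>(d_a, d_b, d_c, n);

    // Copy the result back and clean up.
    cudaMemcpy(h_c, d_c, bytes, cudaMemcpyDeviceToHost);
    printf("c[10] = %f\n", h_c[10]);

    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    free(h_a); free(h_b); free(h_c);
    return 0;
}
```

Compare that to the three-line CPU loop it replaces, and you see why people balk.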
OpenCL (Score:3, Informative)
Sounds like a perfect job for OpenCL. When a program is rewritten for OpenCL, you can just drop in CPUs or GPUs and they get used.
Of course not! (Score:3, Informative)
It doesn't take a few libraries and lines of code... It takes a SHITLOAD of libraries and lines of code! - Lone Starr
Re:Libraries (Score:4, Informative)
CULA is a GPU-accelerated linear algebra library that utilizes the NVIDIA CUDA parallel computing architecture to dramatically improve the computation speed of sophisticated mathematics.
http://www.culatools.com/ [culatools.com]
Re:A whole new level of parallelism (Score:1, Informative)
Uh
OpenCL and CUDA supported branching from day one (with a performance hit). Before they existed, there was some (very little) usage of GPUs for general purpose computing and they used GLSL/HLSL/Cg, which supported branching poorly or not at all.
The tools that were recently added to CUDA (for the latest GPUs) are recursion and function pointers.
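For what it's worth, on Fermi-class cards (compute capability 2.0 and up) a `__device__` function really can call itself now. An illustrative sketch of my own:

```cuda
// Device-side recursion requires compute capability 2.0+,
// e.g. compile with: nvcc -arch=sm_20
__device__ int factorial(int n) {
    return (n <= 1) ? 1 : n * factorial(n - 1);
}

// Each thread computes the factorial of its own index + 1.
__global__ void factKernel(int *out) {
    out[threadIdx.x] = factorial(threadIdx.x + 1);
}
```

On earlier hardware the compiler simply rejects the recursive call.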
Re:OpenCL (Score:3, Informative)
Unfortunately, no. OpenCL does not map equally to different compute devices, and does not enforce uniformity of parallelism approaches. Code written in OpenCL for CPUs is not going to be fast on GPUs. Hell, OpenCL code written for ATI GPUs is not going to work well on nVidia GPUs.
Re:A whole new level of parallelism (Score:5, Informative)
Programmers of server applications are already used to multithreading, and they were making good use of systems with large numbers of processors even before the advent of virtualization.
But don't pay too much attention to the word "Server". Yes the machines that they're talking about are in the segment of the market referred to as "servers", as distinct from "desktops" or "mobile". But the target of GPU-based computing isn't "Servers" in the sense of the tasks you normally think of -- web servers, database servers, etc.
The real target is mentioned in the article, and it's HPC, aka scientific computing. Normal server apps are integer code, and depend more on high memory bandwidth and I/O, which GPGPU doesn't really address. HPC wants that stuff too, but they also want floating point performance. As much floating point math performance as you can possibly give them. And GPUs are way beyond what CPUs can provide in that regard. Plus a lot of HPC applications are easier to parallelize than even the traditional server codes, though not all fall in the "embarrassingly parallel" category.
There will be a few growing pains, but once APIs get straightened out and programmers get used to it (which shouldn't take too long for the ones writing HPC code), this is going to be a huge win for scientific computing.
Re:Libraries (Score:2, Informative)
It's not as complete as CULA, but MAGMA [utk.edu] is also available for free. nVidia also provides a CUDA-accelerated BLAS (CUBLAS), which is free.
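With CUBLAS, once your data is on the device a BLAS call is a one-liner. A sketch using the legacy (pre-handle) API that ships with the toolkit; `saxpy_on_gpu` is my own wrapper name, and error checking is omitted — check the CUBLAS manual for the exact signatures:

```cuda
#include <cublas.h>  // legacy CUBLAS API from the CUDA toolkit

// y = alpha * x + y, computed entirely on the GPU.
// d_x and d_y are device pointers from cudaMalloc/cublasAlloc.
void saxpy_on_gpu(int n, float alpha, float *d_x, float *d_y) {
    cublasSaxpy(n, alpha, d_x, 1, d_y, 1);
}
```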
As far as OpenCL goes, I don't think a good BLAS has appeared yet. The compilers are still sketchy (especially for ATI GPUs), and performance on nVidia GPUs lags behind CUDA.
Re:A whole new level of parallelism (Score:3, Informative)
I am one such programmer. Yet I also coded for an Nvidia Tesla C1060 board and found it much more straightforward to handle several thousand threads at once.
Not all types of threads are created equal. I usually explain CUDA to people as the "Zerg Rush" model of computing - instead of a couple, well-behaved, intelligent threads that try to be polite to each other and clean up their own messes, you throw a horde of a thousand little vicious, stupid threads at the problem all at once, and rely on some overlord to keep them in line.
Most of the guides explained it as, "Flops are free, bandwidth is expensive." This board had a 512-bit-wide memory bus with very high latency, and the reason you throw that many threads at it is to let the hardware cover up the latency: it can merge a huge number of memory reads/writes into one operation, and as soon as a thread is waiting on memory I/O it can swap another thread onto that same SP and let it compute. The hardware was divided into multiprocessors of 8 scalar processors (each with some scratchpad "shared" memory that could be accessed almost as fast as a register), and threads ran in lock-step groups of 32 (warps), issued across those 8 processors over several clocks; no recursion was allowed, and if one thread branched, the others would just wait around until it reached the same point.
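The practical upshot of that merging is that consecutive threads should touch consecutive addresses. A sketch of my own (kernel names are mine, not from any SDK) showing the pattern that works and the one that kills you:

```cuda
// Coalesced: thread i touches element i, so a group of neighboring
// threads' loads merge into one wide memory transaction.
__global__ void scaleCoalesced(float *data, float k, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= k;
}

// Strided: thread i touches element i * 16, so the same loads
// scatter across many separate transactions and effective
// bandwidth collapses.
__global__ void scaleStrided(float *data, float k, int n) {
    int i = (blockIdx.x * blockDim.x + threadIdx.x) * 16;
    if (i < n) data[i] *= k;
}
```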
Sure, that's a bit complex to optimize for, but it beats the hell out of conventional threading while trying to optimize for x86 SIMD. And if you manage to write it so it runs well on CUDA, it generally will scale effortlessly to whatever card you throw it at.
It's looking like OpenCL won't be much different, but I have yet to try it. I'm kind of eager, since apparently AMD/ATI's current cards, for the money, have a bit more raw power than Nvidia's.
Re:A whole new level of parallelism (Score:3, Informative)
That said, I do wonder whether there aren't some architectural limitations on GPUs, e.g. memory protection, and whether adding those features for general-purpose computing would cost us performance. In other words, are we just making some kind of cores-to-features tradeoff?
Re:CUDA (Score:2, Informative)
Indeed. With CUDA, DirectCompute, and OpenCL, nearly 100% of your code is boilerplate interfacing with the API. There needs to be a language where this stuff is a first-class citizen, not just something provided by an API.
If you use CUDA, OpenCL, or DirectCompute, it is: try the Kappa library. It has its own scheduling language that makes this much easier, and the next version, which is about to come out, goes much further still.
Re:CUDA (Score:2, Informative)
GCD combined with OpenCL makes it usable on a GPU, but that would be stupid. GPUs aren't really 'threaded' in any context that someone who hasn't worked with them would think of.
All the threads run simultaneously, and side by side. They all start at the same time and they all end at the same time in a batch (not entirely true, but it is if you want to actually get any boost out of it).
GCD is multithreading on a General Processing Unit, like your Intel CoreWhateverThisWeek processor. Code paths are run and scheduled on different cores as needed; they don't really run side by side, but they can run at the same time, which is practical and useful in a lot of cases.
OpenCL is multithreading on a graphics chip. It lets you do the same calculation over and over again, or on a very large data set, side by side. You can calculate 128 encryption keys in one pass, but you can't calculate one encryption key, the average of your monthly bills, and draw a circle. The graphics chip doesn't do random processing side by side; it runs a whole bunch of the same instructions side by side, and it goes to hell in a handbasket the INSTANT you break its ability to run all the 'threads' side by side, executing the same instruction in each at the same time.
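That lock-step point is concrete in CUDA too: a data-dependent branch inside a warp serializes both paths. A small sketch of my own showing what "breaking side-by-side execution" looks like:

```cuda
#include <math.h>

__global__ void divergent(float *x) {
    int i = threadIdx.x;
    // Even and odd lanes of the same warp take different paths, so the
    // hardware runs the first branch with half the lanes masked off,
    // then the second branch with the other half masked: you pay for
    // both paths in sequence.
    if (i % 2 == 0)
        x[i] = x[i] * x[i];
    else
        x[i] = sqrtf(x[i]);
}
```

If every thread in a warp takes the same path, there is no penalty; divergence only costs you within a warp.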
I really don't think you understand either standard GP multithreading or what GPUs are practically capable of doing.
Re:How much number-crunching is your server doing? (Score:3, Informative)
No, it's the difference between "efficiency" and what is claimed as "efficient" to get a paper published. That's a really bad citation for AES on GPUs as there is a line of prior work going back to Cook and Cryptographics. In fact that paper is a classic example of getting something into the literature that has already been done. The authors have submitted it to an unrelated conference and failed to cite the relevant work.
If we look at their best figures, we can throw away the claimed 15x speedup, since it doesn't account for memory transfer costs; the 5x speedup is more realistic. The GPU they use (an 8800 GTX) has 128 stream processors running at 1.35GHz, compared against a P4 running at 3GHz. Roughly speaking, we can compare the cycles spent on each platform as a measure of work done: 128 x 1.35GHz is about 172.8GHz of aggregate clock against 3GHz, so the graphics card's stream processors perform roughly 57x more clock cycles.
The central workload in AES for high-performance is completely memory bound. The cycles are just used to stage results from memory and perform XOR instructions. So the stream processors only execute the code 5x quicker with 57x more clocks and a huge memory bandwidth advantage that I can't be bothered to look up.
So no, 10x less output per clock is not "efficient" in my book. But if you publish your paper in a crappy unrelated conference then you will get away with it.
Re:Libraries (Score:3, Informative)
The CUDA dev kit includes libraries and examples for BLAS (CUBLAS) and FFT, several LAPACK routines have been implemented in several commercial packages (Jacket, CULA) and free software (MAGMA).
The OpenCL implementation in Mac OS X has FFT and there are libraries for BLAS (from sourceforge) and MAGMA gives you some type of LAPACK implementation.
I work with HPC systems based on nVIDIA GPUs in a research environment. It's still a lot of work (as all research/cluster programs are), but it's certainly doable and can accelerate some calculations substantially; it depends highly on the application and even more so on the coder.