
Why 'Gaming' Chips Are Moving Into the Server Room

Posted by timothy
from the expense-report-manipulation-++ dept.
Esther Schindler writes "After several years of trying, graphics processing units (GPUs) are beginning to win over the major server vendors. Dell and IBM are the first tier-one server vendors to adopt GPUs as server processors for high-performance computing (HPC). Here's a high-level view of the hardware change and what it might mean to your data center. (Hint: faster servers.) The article also addresses what it takes to write software for GPUs: 'Adopting GPU computing is not a drop-in task. You can't just add a few boards and let the processors do the rest, as when you add more CPUs. Some programming work has to be done, and it's not something that can be accomplished with a few libraries and lines of code.'"
This discussion has been archived. No new comments can be posted.


  • Re:CUDA (Score:5, Interesting)

    by Rockoon (1252108) on Thursday July 15, 2010 @04:01PM (#32918630)
    Indeed. With Cuda, DirectCompute, and OpenCL, nearly 100% of your code is boilerplate interfacing to the API.

    There needs to be a language where this stuff is a first-class citizen and not just something provided by an API.
  • by Dynetrekk (1607735) on Thursday July 15, 2010 @04:04PM (#32918680)
    I'm no expert, but from what I understand, it wouldn't be at all surprising. IBM has been regularly using their Power processors for supercomputers, and the architecture is (largely) the same. The Cell has some extra graphics-friendly floating-point units, but it's not entirely different from the CPUs IBM has been pushing for computation in the past. I'm not even sure if the extra stuff in the Cell is interesting in the supercomputing arena.
  • Re:CUDA (Score:3, Interesting)

    by cgenman (325138) on Thursday July 15, 2010 @04:11PM (#32918778) Homepage

    While I don't disagree that NVIDIA needs to make this simpler, is that really a sizeable market for them? Even presuming every college wants a cluster of 100 GPUs, they've still got about 10,000 students per college buying these things to game with.

    I wonder what the size of the server room market for something that can't handle IF statements really would be.

  • IIS 3D (Score:2, Interesting)

    by curado (1677466) on Thursday July 15, 2010 @04:14PM (#32918806)
    So.. webpages will soon be available in 3D with anti-aliasing and realistic shading?
  • by 91degrees (207121) on Thursday July 15, 2010 @04:18PM (#32918836) Journal
    So why a GPU rather than a dedicated DSP? They do pretty much the same thing, except a GPU is optimised for graphics. DSPs offer 32- or even 64-bit integers, have had 64-bit floats for a while now, allow more flexible memory write positions, and can use the previous results of adjacent values in calculations.
  • by jgagnon (1663075) on Thursday July 15, 2010 @04:23PM (#32918880)

    The problem with "programming for multiple cores/CPUs/threads" is that it is done in very different ways between languages, operating systems, and APIs. There is no such thing as a "standard for multi-thread programming". All the variants share some concepts in common but their implementations are mostly very different from each other. No amount of schooling can fully prepare you for this diversity.
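
    To illustrate the point about diversity: even within a single language's standard library, the "same" parallel job can be expressed through quite different mechanisms. A minimal Python sketch (hypothetical `parallel_map_*` helper names, not from any standard):

```python
import threading
from concurrent.futures import ThreadPoolExecutor

def square(x):
    return x * x

# Style 1: raw threads writing into a shared result list.
# No lock is needed here only because each thread owns one index.
def parallel_map_threads(fn, xs):
    results = [None] * len(xs)
    def worker(i, x):
        results[i] = fn(x)
    threads = [threading.Thread(target=worker, args=(i, x))
               for i, x in enumerate(xs)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results

# Style 2: the same job through the executor/futures API.
def parallel_map_futures(fn, xs):
    with ThreadPoolExecutor() as pool:
        return list(pool.map(fn, xs))

print(parallel_map_threads(square, [1, 2, 3]))  # [1, 4, 9]
print(parallel_map_futures(square, [1, 2, 3]))  # [1, 4, 9]
```

    Same concepts (spawn work, collect results), visibly different idioms and failure modes, and that's before crossing language or OS boundaries.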

  • RemoteFX (Score:2, Interesting)

    by JorgeM (1518137) on Thursday July 15, 2010 @04:26PM (#32918930)

    No mention of Microsoft's RemoteFX coming in Windows Server 2008 R2 SP1? RemoteFX uses the server GPU for compression and to provide 3D capabilities to the desktop VMs.

    Any company large enough for a datacenter is looking at VDI, and RemoteFX is going to be supported by all VDI providers except VMware. VDI, not the relatively niche case of massive calculations, will put GPUs in the datacenter.

  • Re:Notice in TFA (Score:2, Interesting)

    by binarylarry (1338699) on Thursday July 15, 2010 @04:29PM (#32918958)

    Not only that, but they posit that Microsoft's solution solves the issue of both Nvidia's proprietary-ness and the OpenCL board's "lack of action."

    Fuck this article, I wish I could unclick on it.

  • Re:Crysis 2... (Score:2, Interesting)

    by JorgeM (1518137) on Thursday July 15, 2010 @04:35PM (#32919042)

    I'd love this, actually. My geek fantasy is to be able to run my gaming rig in a VM on a server with a high end GPU which is located in the basement. On my desk in the living room would be a silent, tiny thin client. Additionally, I would have a laptop thin client that I could take out onto the patio.

    On a larger scale, think Steam but with the game running on a server in a datacenter somewhere which would eliminate the need for hardware on the user end.

  • by pslam (97660) on Thursday July 15, 2010 @04:52PM (#32919304) Homepage Journal

    I could almost EOM that. They're massively parallel, deeply pipelined DSPs. This is why people have trouble with their programming model.

    The only difference here is the arrays we're dealing with are 2D and the number of threads is huge (100s-1000s). But each pipe is just a DSP.

    OpenCL and the like are basically revealing these chips for what they really are, and the more general purpose they try to make them, the more they resemble a conventional, if massively parallel, array of DSPs.

    There's a lot of comments on this subject along the lines of "Why couldn't they make it easier to program?" Well, it always boils down to fundamental complexities in design, and those boil down to the laws of physics. The only way you can get things running this parallel and this fast is to mess with the programming model. People need to learn to deal with it, because all programming is going to end up heading this way.

  • by Anonymous Coward on Thursday July 15, 2010 @05:29PM (#32919758)

    The Cell is a PowerPC processor, which is intimately related to the Power architecture. Basically, PowerPC was an architecture designed by IBM, Apple, and Motorola for use in high-performance computing. It was based in part on what is now an older version of IBM's POWER architecture. In short, POWER was the "core" architecture, and additional instruction sets could be added at fabrication time -- kind of like Intel with their SSE extensions.

    This same pattern continued for a long time. IBM's POWER architecture basically took the PowerPC instruction set and implemented it in new, faster ways. Any interesting extensions might be folded into the newer PowerPC architecture revision, and the next generation of PowerPC-branded chips would inherit the "core" of the last POWER chip's implementation. Later, POWER was renamed to Power to align it with PowerPC branding.

    The neat thing is that the "core" instruction set is pretty powerful. You can run the same Linux binary on a G3 iMac, a Cell, a GameCube or Wii (in principle), or a supercomputing POWER7 or whatever IBM is up to now, as long as it doesn't need extensions. And you can do a lot of computation without extensions. The "base" is broad, unlike x86's strict hierarchy of modes. In some respects, this doesn't sound so neat, since the computing world has mostly settled on x86 for general-purpose computation, and so any new x86 chips will probably include a big suite of extensions to the architecture too. Intel, AMD, and IBM eventually converged on this same RISC-y CISC idea, though IBM/Apple/Motorola managed to expose less of the implementation through its architecture at first.

  • by pclminion (145572) on Thursday July 15, 2010 @06:21PM (#32920338)

    There's a lot of comments on this subject along the lines of "Why couldn't they make it easier to program?"

    Why should they? Just because not every programmer on the planet can do it doesn't mean there's nobody who can do it. There are plenty of people who can. Find one of these people and hire them. Problem solved.

    Most programmers can't even write single-threaded assembly code any more. If you need some assembly code written, you hire somebody who knows how to do it. I don't see how this is any different.

    As far as whether all programming will head this direction eventually, I don't think so. Most computational tasks are data-bound, and throughput is enhanced by improving the data backends, which are usually handled by third parties. We already don't know how the hell our own systems work. For the people who really need this kind of thing, you need to go out and learn it or find somebody who knows it. Expecting that the whole world can do it is crazy thinking.

  • Have you ever read up on Amdahl's law? [wikipedia.org]

    I'll see your Amdahl's Law, and raise you Gustafson's Law [wikipedia.org].
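
    For anyone weighing the two, both laws are one-liners. A quick sketch using the textbook formulas, where p is the parallel fraction of the work and n the number of processors:

```python
# Amdahl's law: speedup with a FIXED problem size. The serial
# fraction (1 - p) caps the achievable speedup at 1 / (1 - p).
def amdahl_speedup(p, n):
    return 1.0 / ((1.0 - p) + p / n)

# Gustafson's law: SCALED speedup when the problem size grows
# with n, so the serial fraction stays a constant share of time.
def gustafson_speedup(p, n):
    return (1.0 - p) + p * n

# With 95% parallel code on 1024 processors:
print(amdahl_speedup(0.95, 1024))     # ~19.6 (capped near 1/0.05 = 20)
print(gustafson_speedup(0.95, 1024))  # ~972.9
```

    Same workload fraction, wildly different conclusions, which is exactly why the two laws get traded in arguments like this one.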

  • by Anonymous Coward on Thursday July 15, 2010 @07:09PM (#32920912)

    You might find this [youtube.com] Google Tech Talk interesting..

  • by psilambda (1857088) <kappa&psilambda,com> on Thursday July 15, 2010 @09:19PM (#32922064)
    The article and everybody else are ignoring one large, valid use of GPUs in the data center: whether you call it business intelligence or OLAP, it needs to be in the data center and it needs some serious number crunching. There is not as much difference between this and scientific number crunching as most people might think. I have crunched numbers for financials at a major multinational, and I had the privilege of being the first to process the first full genome (complete genetic sequence, terabytes of data) for a single individual; the genomic analysis was actually much more integer-based than the financials. Based on my experience with both, I created the Kappa library for doing CUDA or OpenMP analysis in a datacenter, whether for business or scientific work.
  • by David Greene (463) on Friday July 16, 2010 @01:09AM (#32923228)

    The stream architecture of modern GPUs works radically differently than a conventional CPU.

    True if the comparison is to a commodity scalar CPU.

    It is not as simple as scaling conventional multi-threading up to thousands of threads.

    True. Many algorithms will not map well to the architecture. However, many others will map extremely well. Many scientific codes have been tuned over the decades to exploit high degrees of parallelism. Often the small data sets are the primary bottleneck. Strong scaling is hard, weak scaling is relatively easy.

    Certain things that you are used to doing on a normal processor have an insane cost in GPU hardware.

    In a sense. These are not scalar CPUs, and traditional scalar optimization, while important, won't utilize the machine well. I can't think of any particular operation that's greatly slower than on a conventional CPU, provided one uses the programming model correctly (and some codes don't map well to that model).

    For instance, the if statement.

    No. Branching works perfectly fine if you program the GPU as a vector machine. The reason branches within a warp (using NVIDIA terminology) are expensive is simply because a warp is really a vector. The GPU vendors just don't want to tell you that because either they fear being tied to some perceived historical baggage with that term or they want to convince you they're doing something really new. GPUs are interesting, but they're really just threaded vector processors. Don't misunderstand me, though, it's a quite interesting architecture to work with!
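
    The "a warp is really a vector" point can be made concrete with a mask-based sketch (plain Python standing in for the hardware; names here are illustrative, not any vendor's API). When lanes disagree on a branch, SIMT hardware effectively executes both sides and selects per-lane results with a predicate mask, which is why a divergent branch costs roughly the sum of both paths:

```python
def vector_branch(xs, cond, then_fn, else_fn):
    """Run an if/else across a 'warp' of lanes the way SIMT
    hardware does: evaluate BOTH paths for every lane, then
    select each lane's result with a predicate mask."""
    mask = [cond(x) for x in xs]             # per-lane predicate
    then_vals = [then_fn(x) for x in xs]     # all lanes run the 'then' path
    else_vals = [else_fn(x) for x in xs]     # all lanes run the 'else' path
    return [t if m else e
            for m, t, e in zip(mask, then_vals, else_vals)]

# Per lane: double even values, increment odd ones.
lanes = [1, 2, 3, 4]
print(vector_branch(lanes, lambda x: x % 2 == 0,
                    lambda x: x * 2, lambda x: x + 1))  # [2, 4, 4, 8]
```

    If every lane takes the same path, real hardware can skip the untaken side entirely, which is why uniform branches are cheap and divergent ones are not.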

  • by Anonymous Coward on Friday July 16, 2010 @02:52AM (#32923658)

    I've heard that many programmers have issues coding for 2- and 4-core processors. I'd like to see how they'll adapt to running hundreds of threads in parallel.

    If that's the paradigm they're operating in, it will probably fail spectacularly. Let me explain why.

    In the end, GPUs are essentially vector processors [wikipedia.org] (yes, I know that's not exactly how they work internally, but bear with me). You feed them one or more input vectors of data and one or two storage vectors for output, and they do the same calculation on every element of the input and store the results in the output. Think about what you need for pixel rendering: it's things like "apply a fixed affine transform to every pixel of the input image and store the results as the output image" or "add [alpha blend] these two images together and store the result." These are the kind of tasks vector processors like the old Crays were designed to implement efficiently; compilers implementing OpenMP [wikipedia.org] are also working within this kind of paradigm.

    Threads, in contrast to vector processing, are independent streams of execution. While you can use threads to split a loop into pieces, the normal thread pattern is something more like "wait for an event, and then respond to it appropriately." The real problem here is that because threads are independent tasks, memory sharing is hard (semaphores, spin locks, and all that) because you can't guarantee the behavior of any other thread.

    Clusters, finally, as a few people have mentioned (although perhaps never used), are different yet again. While each node in a cluster runs as an independent machine and thus conceptually resembles a thread, the nodes don't have a pool of shared memory (they may not even have shared disk space!). If I want to get data from node A to node B, I have to copy it over the network. Because the internal bandwidth of a cluster is so much lower than the memory bus of a shared-memory computer, you spend most of your time figuring out how to minimize the amount of data you have to copy between nodes and worrying about things like cluster topology. As a result, algorithms that scale well on a shared-memory machine may or may not scale well at all on a distributed cluster.

    So why bother? Because each design has its own strengths and weaknesses. Vector processors are great if you're doing a vector operation, but things like stream processing (e.g., compressing video data) don't vectorize particularly well. Threads are generic and flexible; so flexible that you can't really optimize the hardware for them. They also require discipline to avoid deadlocks and other related problems. Clusters, finally, are inexpensive and are ideally suited for "batch" tasks like web servers or databases where each thread really is an independent job, but for things like weather simulations (where lots of data has to be exchanged between nodes) they require very careful attention to the algorithms used, or the performance can tank as the size of the system gets large.
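
    The alpha-blend example above is easy to sketch as a vector operation (plain Python lists standing in for hardware vectors; a real vector unit applies the same arithmetic to every element in lockstep rather than looping):

```python
def alpha_blend(src, dst, alpha):
    """Blend two 'images' (flat vectors of pixel intensities):
    out[i] = alpha * src[i] + (1 - alpha) * dst[i] for every i.
    One uniform operation over the whole vector -- exactly the
    shape of work vector processors are built for."""
    return [alpha * s + (1.0 - alpha) * d for s, d in zip(src, dst)]

src = [0.0, 0.5, 1.0]
dst = [1.0, 1.0, 1.0]
print(alpha_blend(src, dst, 0.25))  # [0.75, 0.875, 1.0]
```

    There is no per-element branching and no dependence between elements, which is why this kind of kernel maps cleanly onto a GPU while an event-driven, lock-heavy threaded program does not.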
