Supercomputing Hardware

Ask Slashdot: Parallel Cluster In a Box? 205

QuantumMist writes "I'm helping someone with accelerating an embarrassingly parallel application. What's the best way to spend $10K to $15K to get the maximum number of simultaneous threads of execution? The focus is on threads of execution as memory requirements are fairly low, e.g. ~512MB in memory at any given time (maybe up to 2 to 3X that at the very high end). I've looked at the latest Tesla card, as well as the four-Teslas-in-a-box solutions, and am having trouble justifying the markup for what's essentially 'double precision FP being enabled, some heat improvements, and ECC which actually decreases available memory (I recognize ECC's advantages though).' Spending close to $11K for the four Teslas in a 1U setup seems to be the only solution at this time. I was thinking that GTX cards can be replaced for a fraction of the cost, so should I just stuff four or more of them in a box? Note, they don't have to pay the power/cooling bill. Amazon is too expensive for this level of performance, so we can't go to the cloud via EC2. Any parallel architectures out there at this price point, even for $5K more? Any good manycore offerings that I've missed, e.g. somebody who can stuff a ton of ARM or other CPUs/GPUs in a server (cluster in a box)? It would be great if this could be easily addressed via a PCI or other standard interface. Should I just stuff four GTX cards in a server and replace them as they die from heat? Any creative solutions out there? Thanks for any thoughts!"
This discussion has been archived. No new comments can be posted.

  • by Anonymous Coward on Saturday December 03, 2011 @01:41PM (#38250916)

    If the off-the-shelf GTX cards work, you'd have 8 * Xeon + 8 * NVIDIA GPUs in 3U, all entirely parallel (i.e. 8 separate machines) to avoid the main CPUs being any kind of bottleneck. Stock each node with 2GB of RAM on the cheap and some cheaper SATA drives, and you'd likely end up under $10k for the whole thing and have an 8-node cluster you can use for other tasks later.

    I've noticed that "embarrassingly parallel" tasks, if you take the low-hanging fruit too far, end up running into some other unforeseen bottleneck. Hence my suggestion of something faux-bladeish instead.
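As an illustration of the "8 separate machines" idea above, here is a minimal Python sketch of one way to split an embarrassingly parallel job list across independent nodes with no coordination between them. The `shard` function and the job IDs are hypothetical, not something from the thread; it only assumes every node sees the same ordered list of work items and knows its own index.

```python
def shard(items, node_index, node_count):
    """Return the slice of work that one node should process."""
    if not 0 <= node_index < node_count:
        raise ValueError("node_index must be in range(node_count)")
    # Strided slicing: node k takes items k, k + node_count, k + 2 * node_count, ...
    return items[node_index::node_count]


if __name__ == "__main__":
    work = list(range(20))  # hypothetical job IDs
    for node in range(8):   # 8 nodes, as in the comment above
        print(node, shard(work, node, 8))
```

Because each node derives its share from its own index alone, no node has to talk to another, which is what keeps a shared resource from becoming the bottleneck.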

  • PS3 (Score:3, Interesting)

    by History's Coming To ( 1059484 ) on Saturday December 03, 2011 @01:41PM (#38250918) Journal
    PlayStation 3s have proved a cost-efficient way of setting up large-scale parallel processing systems. Of course you'll have to find your way around Sony's blocks on the OtherOS system, and you'll need to keep it off the internet or firewalled in some way, but you essentially get cheap processing subsidised by the games that you don't need to buy.
  • by Anonymous Coward on Saturday December 03, 2011 @01:48PM (#38250974)

    You can easily build a 64-core 1U system with Opterons using the quad-socket setup, or a 128-core one using the quad-socket-with-extension setup, that will only run you about 5k. These are 128 general-purpose cores at 2GHz+, so you don't have to change the program to run on them, and you don't need to obfuscate things as you would when programming for and dealing with GPUs... Or you can wait for Knights Corner, or get the Tile64s.

  • by Anonymous Coward on Saturday December 03, 2011 @02:05PM (#38251094)

    You can get 48 real AMD Magny-Cours CPU cores with full DP floating point support and ~64GB ECC memory in a box for under 10K (EUR!) from e.g. Tyan and Supermicro.
    I run my embarrassingly parallel stuff on that, and it works great. Depending on your application, 64 Bulldozer cores, which come in the same package for only slightly more money, may or may not perform better. I have not seen many real-world applications in which one GPU is actually faster than 12 to 16 server-class CPU cores.
    Of course this depends a lot on whether you have done the GPU porting already or are just planning to, which you unfortunately don't state in your post.

  • Definitely GPU. (Score:5, Interesting)

    by pla ( 258480 ) on Saturday December 03, 2011 @02:06PM (#38251106) Journal
    Others have pointed it out, but if you can run this on a GPU, you don't need to look any further than that.

    Specifically, check out some of the Bitcoin mining rigs people have built, like 4x Radeon 6990s in a single box. For comparison, a single 6990 easily beats a top-of-the-line modern CPU by a factor of 50 (as in, not 50%, but 5000%). You can build such a box for well under $5k.
  • by Arakageeta ( 671142 ) on Saturday December 03, 2011 @02:41PM (#38251380)

    What does SLI give you in CUDA? The newer GeForce cards support direct GPU-to-GPU memory copies, assuming they are on the same PCIe bus (NUMA systems might have multiple PCIe buses).

    My research group built a 12-core/8-GPU system last year for about $10k.

    The system has a theoretical peak of ~9.1 TFLOPS, single precision (simultaneously maxing out all CPUs and GPUs). I wish the GPUs had more individual memory (~1.25GB each), but we would have quickly broken our budget had we gone for Tesla-grade cards.
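For readers wondering where a "theoretical peak" figure like the one above comes from, the usual arithmetic is execution units times clock rate times floating-point operations per unit per cycle, summed over every device. The sketch below shows that calculation in Python. The device figures in it are placeholders chosen for round numbers and are NOT the specs of the machine described in this comment, which the post does not give.

```python
def peak_gflops(units, clock_ghz, flops_per_cycle):
    """Theoretical peak in GFLOPS: units x clock (GHz) x FLOPs per unit per cycle."""
    return units * clock_ghz * flops_per_cycle


def system_peak_tflops(devices):
    """Sum per-device peaks; each entry is (count, units, clock_ghz, flops_per_cycle)."""
    total = sum(count * peak_gflops(units, clock, fpc)
                for count, units, clock, fpc in devices)
    return total / 1000.0


if __name__ == "__main__":
    # Placeholder figures, NOT the hardware from the comment above.
    devices = [
        (8, 500, 1.0, 2),  # 8 GPUs: 500 shader cores, 1.0 GHz, multiply-add counted as 2 FLOPs
        (2, 6, 3.0, 8),    # 2 CPUs: 6 cores each, 3.0 GHz, 8 single-precision FLOPs per cycle
    ]
    print(f"{system_peak_tflops(devices):.2f} TFLOPS")
```

A peak computed this way assumes every unit issues its maximum number of operations on every cycle, so real applications should be expected to land well below it.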

  • Re:AMD (Score:4, Interesting)

    by sneakyimp ( 1161443 ) on Saturday December 03, 2011 @07:01PM (#38253236)

    I wonder if QuantumMist must take into account the cost of development. To say that the application is "embarrassingly parallel" and at the same time that "memory requirements are decently low" suggests that s/he has an existing application that has been run on some box, and perhaps betrays a bit of ignorance about the nature of parallelism. Last time I checked, more threads required more memory. If the plan is to get the maximum number of threads possible, the amount of memory required could vary enormously. Additionally, the nature of the parallelism is not discussed. What does each thread do? If it's not something a GPU does then GPUs are not going to help. Also, will a GPU even fit in a 1U box that already contains a server? I doubt it.
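To make the "more threads required more memory" point concrete, here is a rough Python sketch of how installed RAM caps the number of workers. It assumes the worst case, that each worker holds its own private working set; the question does not say whether the ~512MB figure is per thread or for the whole application. The 64 GB box and the `max_workers_by_memory` helper are illustrative assumptions.

```python
def max_workers_by_memory(total_ram_mb, per_worker_mb, reserve_mb=0):
    """How many workers fit if each one holds its own per_worker_mb working set."""
    if per_worker_mb <= 0:
        raise ValueError("per_worker_mb must be positive")
    usable = max(total_ram_mb - reserve_mb, 0)
    return usable // per_worker_mb


if __name__ == "__main__":
    ram_mb = 64 * 1024  # assumed 64 GB machine
    for footprint_mb in (512, 1024, 1536):  # ~512 MB up to 3x, per the question
        print(footprint_mb, max_workers_by_memory(ram_mb, footprint_mb))
```

Under these assumptions the cap drops from 128 workers at 512 MB each to 42 at the 3x high end, which is why the per-thread footprint matters as much as the core count.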

    In my very limited experience in writing multithreaded code, I have found that simply increasing the number of threads spawned doesn't necessarily equate to better performance. On the contrary, spawning too many can bring your application to a halt as an enormous number of threads vie for limited resources (network, disk, memory) and your application gets nothing done because it's too busy context switching between a huge number of resource-starved threads that do nothing while the threads that hold the resources never get scheduled to do valuable work.
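One common way to avoid the oversubscription described above is to queue all the work but cap how many threads run at once. The sketch below does this with Python's standard `concurrent.futures.ThreadPoolExecutor`; `work_item` and `run_bounded` are hypothetical stand-ins, and defaulting the cap to the core count is only a starting assumption to tune against the real workload.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def work_item(n):
    """Stand-in for one independent task (hypothetical)."""
    return n * n


def run_bounded(items, max_workers=None):
    """Run work_item over items with at most max_workers threads alive."""
    workers = max_workers or os.cpu_count() or 1
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() queues every item, but only `workers` of them run concurrently;
        # results come back in input order.
        return list(pool.map(work_item, items))


if __name__ == "__main__":
    print(run_bounded(range(10), max_workers=4))
```

In CPython, threads suit I/O-bound tasks; for CPU-bound work the same pattern applies with `ProcessPoolExecutor`, since the interpreter lock keeps threads from running Python bytecode in parallel.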

    I'd also like to point out that simply buying GPUs doesn't mean your application will suddenly spawn an ability to take advantage of even one GPU. The software development effort required to add GPU detection and utilization could easily chew up that $10-15k budget in no time.

    If QuantumMist already has this application written and it's running but NOT GPU-enabled, then the best approach might be to just get the hottest multi-socket traditional CPU machine s/he can afford, built on a dual LGA 1366 mobo or quad G34 mobo. Or, depending on the nature of this parallelism, it might be better to budget for some CUDA software development and a machine with a couple of GPUs.