Ask Slashdot: Parallel Cluster In a Box? 205

Posted by timothy on Saturday December 03, 2011 @01:31PM from the must-fit-in-a-briefcase-too dept.

QuantumMist writes "I'm helping someone with accelerating an embarrassingly parallel application. What's the best way to spend $10K to $15K to receive the maximum number of simultaneous threads of execution? The focus is on threads of execution as memory requirements are decently low e.g. ~512MB in memory at any given time (maybe up to 2 to 3X that at the very high end). I've looked at the latest Tesla card, as well as the four Teslas in a box solutions, and am having trouble justifying the markup for what's essentially 'double precision FP being enabled, some heat improvements, and ECC which actually decreases available memory (I recognize ECC's advantages though).' Spending close to $11K for the four Teslas in a 1U setup seems to be the only solution at this time. I was thinking that GTX cards can be replaced for a fraction of the cost, so should I just stuff four or more of them in a box? Note, they don't have to pay the power/cooling bill. Amazon is too expensive for this level of performance, so can't go cloud via EC2. Any parallel architectures out there at this price point, even for $5K more? Any good manycore offerings that I've missed? e.g. somebody who can stuff a ton of ARM or other CPUs/GPUs in a server (cluster in a box)? It would be great if this could be easily addressed via a PCI or other standard interface. Should I just stuff four GTX cards in a server and replace them as they die from heat? Any creative solutions out there? Thanks for any thoughts!"


AMD (Score:2, Informative)

by Anonymous Coward writes: on Saturday December 03, 2011 @01:35PM (#38250862)

Why not use AMD and OpenCL?

Nothing special (Score:2, Informative)

by Anonymous Coward writes: on Saturday December 03, 2011 @01:40PM (#38250910)

Just put a bunch of GTX cards into a nice, big server case with enough fans. You are hardly going to find a cheaper alternative.
When choosing cards, look for tests like this one:
http://www.behardware.com/articles/840-13/roundup-a-review-of-the-super-geforce-gtx-580s-from-asus-evga-gainward-gigabyte-msi-and-zotac.html
The IR thermal photos are great for picking out a well-cooled card.
Also use software to control the card fans and keep them running at 100% fan speed.
Noisy? Yes. But who cares, unless you plan on putting it in your bedroom.
You can easily keep these cards at ~70C under full load.

Need more information (Score:4, Informative)

by pem ( 1013437 ) writes: on Saturday December 03, 2011 @01:49PM (#38250984)

If, for example, it's embarrassingly parallel DSP operations, you might try some dedicated DSP engines, or even some Xilinx FPGAs.

U of I (Score:4, Informative)

by TheGreatOrangePeel ( 618581 ) writes: on Saturday December 03, 2011 @01:58PM (#38251046) Homepage

Try getting in touch with the folks doing parallel processing research or the people with NCSA at U of I. I imagine one or both would have a few tips for you assuming they're open to doing that kind of collaboration.
- http://parallel.illinois.edu/
- http://www.ncsa.illinois.edu/

Re:PS3 (Score:2, Informative)

by Anonymous Coward writes: on Saturday December 03, 2011 @02:01PM (#38251068)

I wouldn't give Sony a dollar of my business if they had the cure for cancer and I was a week away from death.

commodity HPC depends on your code (Score:5, Informative)

by Haven ( 34895 ) writes: on Saturday December 03, 2011 @02:11PM (#38251140) Homepage Journal

In HPC we call it "pleasantly parallel," nothing is embarrassing about it! =]
If your code:
-scales to OpenCL/CUDA easily.
-does not require high concurrent memory transfers
-is fault tolerant (ie a failed card doesn't hose a whole day/week of runs)
-can use single precision flops
Then you can use commodity hardware like the gtx series cards. I'd go with the gtx 560ti (GF114 gpu).
Make nodes with:
quad core processors (amd or intel)
whatever ram is needed (8GB minimum)
2 x gtx560ti (448) run in SLI (or the 560ti dual from EVGA)
Basically a scaled down Cray XK6 node. http://www.cray.com/Assets/PDF/products/xk/CrayXK6Brochure.pdf [cray.com]
It all depends on your code.

Re:Nothing special (Score:5, Informative)

by TWX ( 665546 ) writes: on Saturday December 03, 2011 @02:13PM (#38251156)

It would have been nice if he'd given us more information about the form factor he needs to put this into. Since the client isn't paying the electric or cooling bill, I have to assume that it's colocated, so there might be some real rack unit restrictions that prevent this from working well. It would also have been nice to know the storage demands, as there are tradeoffs in front-accessible drive arrays for cooling and airflow purposes. Most of the cases with tons of hot-swap drives in front lack good front ventilation. If he only needs a few drives, that opens him up to a simple 3U or 4U chassis with a mostly open-grille front, which makes airflow a lot less restricted.

Re:3k - 64cores + 54+GB of ram. (Score:2, Informative)

by Anonymous Coward writes: on Saturday December 03, 2011 @02:13PM (#38251158)

NewEgg. The 4 socket and extension boards are below 1k together. And the low-avg speed 16 core Opterons are about 300-400, so 350*8 + 700 (board+extension) = 3.5k. The other 1.5k are power, 1333MHz RAM, and the 1U container.
You can of course spend a lot more if you want the fastest Opterons, but the return goes down quickly; the 2.2GHz parts are fast, cheap, 16-core CPUs.
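
A minimal sketch of the arithmetic above, for anyone who wants to plug in other prices. The function name and parameters are illustrative, not the poster's; the defaults are the figures quoted in this comment ($350 as the midpoint CPU price, $700 for board plus extension, $1.5k for power, RAM, and chassis).

```python
def build_cost(cpu_price=350, cpu_count=8, boards=700, other=1500):
    """Return (cpu_and_board_subtotal, total) in USD for the quoted build."""
    subtotal = cpu_price * cpu_count + boards
    return subtotal, subtotal + other


subtotal, total = build_cost()
print(f"CPUs + boards: ${subtotal}, whole box: ${total}")
```

Passing a different `cpu_price` shows how sensitive the total is to the per-CPU figure.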

How does it parallelize? (Score:5, Informative)

by darkjedi521 ( 744526 ) writes: on Saturday December 03, 2011 @02:15PM (#38251170)

How does the app parallelize? Is each process/thread dependent on every other process/thread or is it 1000 processes flying in close formation that all need to complete at the same time but don't interact with each other? How embarrassingly parallel is embarrassingly parallel? Is that 512MB requirement per process or the sum of all processes?
GPUs might not be the right solution for this. GPUs are excellent for parallelizing some operations but not others. Have you done any benchmarks? Throwing lots of CPU at the problem may be the right solution depending on the algorithms used and how well they can be adapted for a GPU, if they can be adapted for a GPU.
For the $10K-$15K USD range, I'd look at Supermicro's offerings. You have options ranging from dual socket 16 core AMD systems with 2 Teslas to quad socket AMD systems to quad socket Intel solutions to dual socket Intel systems with 4 Tesla cards.
Do some testing of your code in various configurations before blindly throwing hardware at the problem. I support researchers who run molecular dynamics simulations. I've put together some GPU systems, and testing showed that for the calculations they are doing, the portions of their code that could be offloaded to the GPU accounted for at most 10% of the execution time, with the remainder being operations that the software packages could only do on the CPU.
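
The 10% figure can be put in perspective with Amdahl's law, which is a standard result rather than something cited in this comment: if a fraction p of the run time can be offloaded and that part is sped up by a factor s, the overall speedup is 1 / ((1 - p) + p / s). A small sketch, with an illustrative function name:

```python
def amdahl_speedup(offload_fraction, accel_factor):
    """Overall speedup when only `offload_fraction` of the run time is accelerated."""
    if not 0.0 <= offload_fraction <= 1.0:
        raise ValueError("offload_fraction must be between 0 and 1")
    if accel_factor <= 0:
        raise ValueError("accel_factor must be positive")
    return 1.0 / ((1.0 - offload_fraction) + offload_fraction / accel_factor)


# 10% offloadable, with a hypothetical 10x GPU and an infinitely fast one.
for factor in (10, float("inf")):
    print(f"accel {factor}: overall {amdahl_speedup(0.10, factor):.3f}x")
```

Even with an infinitely fast GPU, the 90% that stays on the CPU caps the overall gain at about 1.11x, which is why benchmarking first matters.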

Re:PS3 (Score:5, Informative)

by Anonymous Coward writes: on Saturday December 03, 2011 @02:15PM (#38251174)

PlayStation 3s have proved a cost efficient way of setting up large scale parallel processing systems. Of course you'll have to find your way around Sony's blocks on the OtherOS system, and you'll need to keep it off the internet or firewalled in some way, but you essentially get cheap processing subsidised by the games that you don't need to buy.
Back-of-the-envelope comparison of PS3 and GTX:
A cluster of three PS3s: 920 GFLOPS. Price: about $800.
A PC with 3 GTX 460 cards: 2200 GFLOPS. Price: about $800.
Each of those GTX cards also has significantly more memory than the PS3, and is cheaper to develop for.
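
The comparison above works out as follows. This is only a sketch using the GFLOPS and price figures quoted in this comment, which are rough peak numbers; the names in the code are illustrative.

```python
# (GFLOPS, price in USD) as quoted in the comment above.
SYSTEMS = {
    "cluster of three PS3s": (920, 800),
    "PC with 3 GTX 460 cards": (2200, 800),
}


def gflops_per_dollar(gflops, price_usd):
    """Quoted GFLOPS bought per dollar spent."""
    return gflops / price_usd


for name, (gflops, price) in SYSTEMS.items():
    print(f"{name}: {gflops_per_dollar(gflops, price):.2f} GFLOPS per dollar")
```

At the same price, the GTX box delivers roughly 2.4 times the quoted throughput of the PS3 cluster.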

Beowulf clusters (Score:4, Informative)

by G3ckoG33k ( 647276 ) writes: on Saturday December 03, 2011 @02:25PM (#38251248)

Yes, I haven't seen any references here or anywhere else either lately.
From http://en.wikipedia.org/wiki/Beowulf_cluster [wikipedia.org]: "The name Beowulf originally referred to a specific computer built in 1994 by Thomas Sterling and Donald Becker at NASA. [...] There is no particular piece of software that defines a cluster as a Beowulf. Beowulf clusters normally run a Unix-like operating system, such as BSD, Linux, or Solaris, normally built from free and open source software. Commonly used parallel processing libraries include Message Passing Interface (MPI) and Parallel Virtual Machine (PVM). Both of these permit the programmer to divide a task among a group of networked computers, and collect the results of processing. Examples of MPI software include OpenMPI or MPICH. There are additional MPI implementations available. Beowulf systems are now deployed worldwide, chiefly in support of scientific computing."
Apparently, Beowulf clusters may still be around; it is just that they don't go by that name any longer. I wonder what the latest buzzword for essentially the same thing would be?

Re:3k - 64cores + 54+GB of ram. (Score:5, Informative)

by dch24 ( 904899 ) writes: on Saturday December 03, 2011 @02:28PM (#38251282) Journal

Just took a look. They have 4 choices listed for a 16-core Opteron:
- AMD Opteron 6262 HE Interlagos 1.6GHz Socket G34 85W 16-Core Server Processor OS6262VATGGGU - OEM $539.99
- AMD Opteron 6272 Interlagos 2.1GHz Socket G34 115W 16-Core Server Processor OS6272WKTGGGUWOF $539.99
- AMD Opteron 6274 Interlagos 2.2GHz Socket G34 115W 16-Core Server Processor OS6274WKTGGGUWOF $659.99 out of stock
- AMD Opteron 6274 Interlagos 2.2GHz Socket G34 115W 16-Core Server Processor OS6274WKTGGGU - OEM $659.99 out of stock
I'm going to keep looking, but I don't see any in the 300-400 range.

Don't buy GTX's (Score:5, Informative)

by MetricT ( 128876 ) writes: on Saturday December 03, 2011 @02:56PM (#38251536)

We have several racks full, purchased because "they're cheaper than Tesla's".
Except the Teslas have, as pointed out, ECC memory and better thermal management, and the GTXs have several useful features (like the GPU load level in nvidia-smi) disabled.
The lack of the former causes the compute nodes to crash regularly. What you save on cards, you'll lose in salary for someone to nursemaid them. The latter makes it harder to integrate them into a scheduler environment (we're using Torque).
Yes, this is primarily marketing discrimination, and there probably isn't $10 worth of real difference between the two. I hope the marketing droid who thought that scheme up burns. It's a total aggravation, but paying for Teslas is worthwhile.

Do your homework before going GPU (Score:4, Informative)

by PatDev ( 1344467 ) writes: on Saturday December 03, 2011 @04:36PM (#38252304)

As someone who has done some GPU programming (specifically CUDA), be aware that there is more to the GPU parallelism model than just "lots of threads". Many embarrassingly parallel problems translate very poorly to CUDA. The primary things to consider are:

1. GPUs are *data parallel*. This means that you need to have an algorithm in which each and every thread will be executing the same instruction at the same time (just on different data). For a cheap way to evaluate it, if you can't speed up your program by vectorizing it then the GPU won't help. Of course, you can have divergent "threads" on GPUs, but as soon as you do you've lost all benefit of using a GPU, and have essentially turned your GPU into an expensive but slow computer.

2. Moving data onto or off of the GPU is *slow*. So if you can leave all the data on the GPUs and none of the GPUs need to communicate with each other, then this will work well. If the threads need to frequently globally sync up, you're going to be in trouble.

That said, if you have the right kind of data parallel problem, GPUs will blow everything else out of the water at the same price point.
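
Point 1 can be illustrated with a toy cost model in plain Python. This is a sketch under simplifying assumptions, not a GPU simulator: threads are grouped into lockstep groups of 32 (the NVIDIA warp width), and a group pays for every distinct branch its threads take, one after another. The function names and cost units are made up for illustration.

```python
WARP_SIZE = 32  # threads that execute in lockstep on NVIDIA GPUs


def warp_cost(branches, branch_cost):
    """A lockstep group runs each distinct branch its threads take, serially."""
    return sum(branch_cost[b] for b in set(branches))


def total_cost(thread_branches, branch_cost, warp_size=WARP_SIZE):
    """Sum the cost of every lockstep group (total work, not wall-clock time)."""
    return sum(
        warp_cost(thread_branches[i:i + warp_size], branch_cost)
        for i in range(0, len(thread_branches), warp_size)
    )


BRANCH_COST = {"a": 10, "b": 10}  # made-up cost units per branch

uniform = ["a"] * 64                                            # no divergence
interleaved = ["a" if t % 2 == 0 else "b" for t in range(64)]   # diverges in every group
grouped = ["a"] * 32 + ["b"] * 32                               # diverges only across groups

for name, layout in [("uniform", uniform), ("interleaved", interleaved), ("grouped", grouped)]:
    print(name, total_cost(layout, BRANCH_COST))
```

In this model the interleaved layout costs twice as much as the other two, because every group has to walk both branches. Sorting work so that neighbouring threads take the same path removes the penalty.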

Re:PS3 (Score:4, Informative)

by QuantumRiff ( 120817 ) writes: on Saturday December 03, 2011 @06:11PM (#38252888)

Actually, you can run up to 16 PCIe slots in an external chassis for heavy processing:
http://www.dell.com/us/business/p/poweredge-c410x/pd [dell.com]
