Ask Slashdot: Parallel Cluster In a Box?
QuantumMist writes "I'm helping someone accelerate an embarrassingly parallel application. What's the best way to spend $10K to $15K to get the maximum number of simultaneous threads of execution? The focus is on threads of execution since memory requirements are decently low, e.g. ~512MB in memory at any given time (maybe up to 2 to 3X that at the very high end). I've looked at the latest Tesla card, as well as the four-Teslas-in-a-box solutions, and am having trouble justifying the markup for what's essentially 'double precision FP being enabled, some heat improvements, and ECC which actually decreases available memory (I recognize ECC's advantages though).' Spending close to $11K for the four Teslas in a 1U setup seems to be the only solution at this time. GTX cards can be bought and replaced for a fraction of that cost, so should I just stuff four or more of them in a box? Note, they don't have to pay the power/cooling bill. Amazon is too expensive for this level of performance, so we can't go cloud via EC2. Are there any parallel architectures out there at this price point, even for $5K more? Any good manycore offerings that I've missed, e.g. somebody who can stuff a ton of ARM or other CPUs/GPUs in a server (cluster in a box)? It would be great if this could be easily addressed via a PCI or other standard interface. Should I just stuff four GTX cards in a server and replace them as they die from heat? Any creative solutions out there? Thanks for any thoughts!"
AMD (Score:2, Informative)
Re:AMD (Score:5, Insightful)
CUDA has been around a while; figuring it out isn't such a rough learning curve.
Overall I'm a little suspicious of someone looking to use a GPU for more threads on a problem. Going the GPU route is a really committed step, and the programming gets a new level of complicated. Using multiple cards has some odd issues in CUDA, e.g. if you exceed the card index it defaults to card 0 rather than crashing. There are more places to screw up with a GPU: transferring memory, getting blocks, threads, and warps organized (done properly it hides all sorts of latency in calculations; done poorly it's worse than a CPU), and avoiding memory contention (the memory scheme isn't bad, but it needs to be understood).
So in most cases I'd first start with this chart http://www.cpubenchmark.net/cpu_value_available.html [cpubenchmark.net] and tell them to cut their teeth on a GPU with a smaller (cheaper) test case.
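To make the card-index gotcha concrete, here's a minimal host-side sketch (plain CUDA runtime API; the device index and error handling are just illustrative) that checks the device count before selecting a card, instead of letting an out-of-range index silently fall back to card 0:

    #include <cstdio>
    #include <cuda_runtime.h>

    // Pick a GPU defensively: validate the requested index against the actual
    // device count instead of relying on the runtime's silent fallback.
    static bool selectGpu(int requested)
    {
        int count = 0;
        if (cudaGetDeviceCount(&count) != cudaSuccess || count == 0) {
            std::fprintf(stderr, "No CUDA-capable device found\n");
            return false;
        }
        if (requested < 0 || requested >= count) {
            std::fprintf(stderr, "Device %d out of range (0..%d)\n", requested, count - 1);
            return false;
        }
        return cudaSetDevice(requested) == cudaSuccess;
    }

    int main()
    {
        if (!selectGpu(2))      // e.g. the third card in a multi-GPU box
            return 1;
        // ... allocate memory and launch kernels on the selected device ...
        return 0;
    }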
Re: (Score:2)
Because it's new, and finding someone who's done it to get some pointers is really hard.
CUDA has been around a while; figuring it out isn't such a rough learning curve.
On the downside, you're stuck with NVidia GPUs forever (or until they decide to drop CUDA, although I'll admit that's unlikely).
Re: (Score:2, Insightful)
That's why you would use OpenCL instead. It's a bit newer and still a little rough around the edges, but it works on CPUs and GPUs, and on Windows or *nix.
Re:AMD (Score:4, Interesting)
I wonder if QuantumMist must take into account the cost of development. To say that the application is "embarrassingly parallel" and at the same time that "memory requirements are decently low" suggests that s/he has an existing application that has been run on some box, and perhaps betrays a bit of ignorance about the nature of parallelism. Last time I checked, more threads required more memory. If the plan is to get the maximum number of threads possible, the amount of memory required could vary enormously. Additionally, the nature of the parallelism is not discussed. What does each thread do? If it's not something a GPU does well, then GPUs are not going to help. Also, will a GPU even fit in a 1U box that already contains a server? I doubt it.
In my very limited experience in writing multithreaded code, I have found that simply increasing the number of threads spawned doesn't necessarily equate to better performance. On the contrary, spawning too many can bring your application to a halt as an enormous number of threads vie for limited resources (network, disk, memory) and your application gets nothing done because it's too busy context switching between a huge number of resource-starved threads that do nothing while the threads that hold the resources never get scheduled to do valuable work.
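To illustrate the point (a generic sketch in plain C++11, not anything specific to the submitter's app): cap the worker count at the hardware thread count and have workers pull independent items from a shared counter, rather than spawning one thread per item:

    #include <algorithm>
    #include <atomic>
    #include <cstddef>
    #include <thread>
    #include <vector>

    // Process numItems independent work items with a bounded pool of workers
    // instead of one thread per item, so the box isn't drowned in context switches.
    void runBounded(std::size_t numItems, void (*doItem)(std::size_t))
    {
        unsigned workers = std::max(1u, std::thread::hardware_concurrency());
        std::atomic<std::size_t> next{0};
        std::vector<std::thread> pool;
        for (unsigned w = 0; w < workers; ++w) {
            pool.emplace_back([&] {
                // Each worker repeatedly claims the next unclaimed item.
                for (std::size_t i = next++; i < numItems; i = next++)
                    doItem(i);
            });
        }
        for (auto& t : pool) t.join();
    }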
I'd also like to point out that simply buying GPUs doesn't mean your application will suddenly spawn an ability to take advantage of even one GPU. The software development effort required to add GPU detection and utilization could easily chew up that $10-15k budget in no time.
If QuantumMist already has this application written and it's running but NOT GPU-enabled, then the best approach might be to just get the hottest multi-socket traditional CPU machine s/he can afford, built on a dual LGA 1366 mobo [newegg.com] or quad G34 mobo [newegg.com]. Or, depending on the nature of this parallelism, it might be better to budget for some CUDA software development and a machine with a couple of GPUs.
Beowulf cluster! (Score:5, Funny)
Why not a beowulf clust---
I'm sorry, I just can't. I searched the ~35 posts, browsing at -1, and no reference to a Beowulf cluster anywhere, let alone Natalie Portman or Grits.
Slashdot! You're slipping! I lament the days when even our trolls were amusing and somewhat topical to the discussion at hand! We've fallen so far!
Beowulf clusters (Score:4, Informative)
Yes, I haven't seen any references here or anywhere else either lately.
From http://en.wikipedia.org/wiki/Beowulf_cluster [wikipedia.org]: "The name Beowulf originally referred to a specific computer built in 1994 by Thomas Sterling and Donald Becker at NASA. [...] There is no particular piece of software that defines a cluster as a Beowulf. Beowulf clusters normally run a Unix-like operating system, such as BSD, Linux, or Solaris, normally built from free and open source software. Commonly used parallel processing libraries include Message Passing Interface (MPI) and Parallel Virtual Machine (PVM). Both of these permit the programmer to divide a task among a group of networked computers, and collect the results of processing. Examples of MPI software include OpenMPI or MPICH. There are additional MPI implementations available. Beowulf systems are now deployed worldwide, chiefly in support of scientific computing."
Apparently, Beowulf clusters may still be around; it's just that they don't go by that name any longer. I wonder what the latest buzzword for essentially the same thing would be?
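To make the MPI part of that quote concrete, here's roughly what the Beowulf-style pattern looks like (a minimal, generic sketch using standard MPI calls; the workload is a stand-in, not the submitter's actual job): every rank grinds through its own slice of the work and rank 0 gathers the results:

    #include <cstdio>
    #include <mpi.h>

    int main(int argc, char** argv)
    {
        MPI_Init(&argc, &argv);
        int rank = 0, size = 1;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        // Each node chews on its own slice of the (independent) work items...
        long long total = 1000000, local = 0;
        for (long long i = rank; i < total; i += size)
            local += (i % 2);              // stand-in for the real per-item computation

        // ...and rank 0 collects the partial results.
        long long sum = 0;
        MPI_Reduce(&local, &sum, 1, MPI_LONG_LONG, MPI_SUM, 0, MPI_COMM_WORLD);
        if (rank == 0) std::printf("result: %lld\n", sum);
        MPI_Finalize();
        return 0;
    }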
Re: (Score:2)
Do they just not call it anything nowadays because it's expected to be some variant, or because it's so mainstream?
How to setup a Beowulf cluster (Score:2)
Re: (Score:2)
Beowulf cluster? Is that some newfangled grid computing system?
So yeah, the guy with the 665546 UID tells us all about the old days. Come on!
Re: (Score:2)
I wanted a better handle.
Re: (Score:2, Insightful)
Why not use AMD and OpenCL?
Sure, use two AMD 6990s with 3072 stream processors each, for a total of 6144 ALUs per box (with DP FP support) under OpenCL 1.1.
Cost: about $2500 per box! $700 per card plus $1000 for a CPU system with a 1000W PSU.
Nothing special (Score:2, Informative)
Just put a bunch of GTX cards into a nice, big server case with enough fans. You're hardly going to find a cheaper alternative.
When choosing cards, look for tests like this one:
http://www.behardware.com/articles/840-13/roundup-a-review-of-the-super-geforce-gtx-580s-from-asus-evga-gainward-gigabyte-msi-and-zotac.html
The IR thermal photos are great for choosing a well-cooled card.
Also use software to control the card fans and keep them running at 100% speed.
Noisy? Yes. But who cares, unless you plan putting it in your b
Re:Nothing special (Score:5, Informative)
It would have been nice if he'd given us more information about the form factor he needs to put this into. Since the client isn't paying the electric or cooling bill, I have to assume that it's colocated, so there might be some real rack-unit restrictions that prevent this from working well. It also would have been nice to know the storage demands, as there are tradeoffs in front-accessible drive arrays for cooling and airflow purposes. Most of the cases with tons of hot-swap drives in front lack good front ventilation. If he only needs a few drives, that opens him up to a simple 3U or 4U chassis with a mostly open grille of a front to make airflow a lot less restrictive.
Re: (Score:2)
Re: (Score:2)
Just put a bunch of GTX cards into a nice, big server case with enough fans. You're hardly going to find a cheaper alternative.
That's actually pretty hard to do as you need a motherboard with lots of multiple-lane PCIe connections.
Re: (Score:2)
Re: (Score:2)
Re: (Score:2)
Re: (Score:2)
I recall it is possible to fit an x16 card in an x1 slot (obviously at x1 performance), but this requires the card be hacked. Literally. With a hacksaw. All the power and essential control lanes are at the front, and if 15 of the 16 data lanes are not connected then the card will simply not use them.
Impractical. GPU cards have issues with bandwidth to the host anyway; cut it to x1 and you will be much better off with a plain multicore system.
Re: (Score:2)
SuperMicro MicroCloud w/ 8 NVidia GPUs? (Score:2, Interesting)
If the off-the-shelf GTX cards work, you'd have 8 Xeons + 8 NVIDIA GPUs in 3U, all entirely parallel (i.e. 8 separate machines) to avoid the main CPUs being any kind of bottleneck. Stock each node with 2GB of RAM on the cheap and some cheaper SATA drives, and you'd likely end up under $10k for the whole thing and have an 8-node cluster you can use for other tasks later.
I've noticed that "embarrassingly parallel" tasks, if you take the low-hanging fruit too far, end up running into some other unforeseen bottl
Re: (Score:2)
Unless he is heavily space constrained, he should probably take your advice on specs; but in 1 or 2U cases where getting a double-wide, full profile, PCIe x16 card installed will be easier.
PS3 (Score:3, Interesting)
Re: (Score:2, Informative)
I wouldn't give Sony a dollar of my business if they had the cure for cancer and I was a week away from death.
Re: (Score:2)
If people buy lots of subsidised PS3s and then DON'T buy the games, they are worse off.
Re: (Score:2)
4 years of losses and they're still around. Talk to me when they actually disappear.
Re: (Score:2)
Re: (Score:2)
Yes, I'm well aware of the concept - we'll buy lots of their hardware to do something other than support them, thus hurting them because they lose $10/unit.
The problem is this activity still adds to their perceived marketshare and boosts their efforts and also reduces stock for items they're building anyways, and it also hurts their competition by reducing their revenue, demand, and perceived marketshare.
Buy someone else's hardware, and support them, rather than reducing Sony's potential losses. After all,
Re:PS3 (Score:5, Informative)
Back-of-the-envelope comparison of PS3 and GTX:
A cluster of three PS3s: 920 GFLOPS. Price: about $800.
A PC with 3 GTX 460 cards: 2200 GFLOPS. Price: about $800.
Each of those GTX cards also has significantly more memory than the PS3, and is cheaper to develop for.
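Taking those back-of-the-envelope numbers at face value, that's roughly 920 / 800 ≈ 1.15 GFLOPS per dollar for the PS3 cluster versus 2200 / 800 ≈ 2.75 GFLOPS per dollar for the three-GTX box: call it about 2.4x the peak throughput per dollar, before you even count the memory and tooling advantages.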
Re: (Score:2)
Re:PS3 (Score:4, Informative)
Actually, you can run up to 16 PCIe slots in an external chassis for heavy processing:
http://www.dell.com/us/business/p/poweredge-c410x/pd [dell.com]
Re: (Score:2)
A cluster of three PS3s: 920 GFLOPS. Price: about $800.
Less than that, because they would have to buy older used PS3s (CECHA/CECHB/CECHE models) and they'd need to have pre-3.21 firmware. Difficult, but probably cheaper.
A PC with 3 GTX 460 cards: 2200 GFLOPS. Price: about $800.
Wouldn't the 460s alone cost about $500, let alone a motherboard that can handle three of them, and a good power supply and cooling? I think that PC total is estimating a bit low.
Perhaps the poster could do both, because some calculations might work better on the PS3s and some on the 460s. Even in Folding@home, there are still calculations the PS3'
Re: (Score:2)
Cards:
http://www.newegg.com/Product/Product.aspx?Item=N82E16814162058&nm_mc=OTC-Froogle&cm_mmc=OTC-Froogle-_-Video+Cards-_-Galaxy-_-14162058 [newegg.com]
x3 = $360.
Motherboard:
http://www.newegg.com/Product/Product.aspx?Item=N82E16813128495 [newegg.com]
$114
Power supply:
http://www.newegg.com/Product/Product.aspx?Item=N82E16817152044 [newegg.com]
$144
CPU can be less than $50 if he really doesn't need the cpu to do much of anything.
So far I'm at $668. Probably have to buy a box to put it in for $50.
So now i'm at $718. What shall I buy with m
Re: (Score:2)
Perfect, right on budget!
Re: (Score:2)
Hard drive (you might want an SSD for performance) and a DVD drive.
Re: (Score:2)
Umm, the PS3 has a theoretical performance of 2 TFLOPS. EACH.
Re: (Score:2)
But actual performance is apparently drastically lower:
http://en.wikipedia.org/wiki/PlayStation_3_hardware [wikipedia.org]
PlayStation 3's Cell CPU achieves a maximum of 230.4 GFLOPS in single precision floating point operations and 100 GFLOPS double precision.
Re: (Score:2)
It is only lower because of hypervisor restrictions.
Unfettered, in single precision and leveraging the entire system (including the GPU), you can get around 1.3-1.5 TFLOPS in practice.
The issue, again, is the hypervisor.
Re:PS3 -- sure, if you like your CPUs from 2006. (Score:2)
I think the time of PS3 clusters has passed. The Cell processor was released back in 2006! IBM released a few upgraded processors, mostly improving double-precision performance, but those systems are really cost prohibitive.
Assuming you can deal with PCIe latency, GPUs are the way to go.
Re: (Score:2)
Re: (Score:2)
In theory there is no difference between theory and practice, but in practice there is.
Re: (Score:2)
PlayStation 3s have proved a cost efficient way of setting up large scale parallel processing systems. Of course you'll have to find your way around Sony's blocks on the OtherOS system, and you'll need to keep it off the internet or firewalled in some way, but you essentially get cheap processing subsidised by the games that you don't need to buy.
It does have a conspicuously high price/performance ratio, but if you use it for a cluster, you won't be able to play any games. I'm pretty certain Sony locks PS3 clusters out of their gaming network, for reasons unknown to anyone but themselves.
can you write GPU code? (Score:5, Insightful)
Do you or they know how to program a GPU?
If it's really embarrassingly parallel, EC2 spot instances and the GNU program 'parallel' will work quite nicely.
But if coding changes are required then the hardware is the least of your expenses.
Re: (Score:2)
Exactly. Unless the user has some experience in CUDA/Compute shaders/OpenCL, just shoving cards in there doesn't really solve the problem.
Die of heat? (Score:3)
> Should I just stuff four GTX cards in a server and replace them as they die from heat?
It'd be more cost-efficient to improve the air flow or add liquid cooling. Yay mineral oil baths.
3k - 64cores + 54+GB of ram. (Score:4, Interesting)
You can easily build a 64-core 1U system with Opterons using a quad-socket setup, or 128 cores using the quad-socket-plus-extension setup; that will only run you about $5k. These are general-purpose cores, 2GHz+; you don't have to change the program to run on them, and you don't need to contort things the way you would when programming and dealing with GPUs... Or you can wait for Knights Corner, or get the Tile64s.
Re: (Score:2)
It's easy to get an embarrassing amount of processing power if you go with white-box equipment. I have 8 8-way 1U servers with 32 GB of RAM serving a heavy, database-driven app. The amount of stuff that gets done with that relatively small, value-priced cluster is impressive.
Re: (Score:2)
I'd love to see the full spec of those machines.
Re: (Score:2, Informative)
NewEgg. The 4-socket and extension boards are below $1k together, and the low-to-average speed 16-core Opterons are about $300-400, so 350*8 + 700 (board + extension) = $3.5k. The other $1.5k goes to power, 1333MHz RAM, and the 1U chassis.
You can of course spend a lot more if you want the fastest Opterons, but the return diminishes quickly; the 2.2GHz parts are fast, cheap, 16-core CPUs.
Re:3k - 64cores + 54+GB of ram. (Score:5, Informative)
I'm going to keep looking, but I don't see any in the 300-400 range.
Re: (Score:2)
Just took a look. They have 4 choices for a 16-core Opteron listed:
AMD Opteron 6262 HE Interlagos ...
It's worse than that. The submitter is talking about doing single-precision floating point. Interlagos only has one floating-point unit for every two integer cores. So, for his purposes, it's only 8 cores per CPU.
Need more information (Score:4, Informative)
Re: (Score:2)
A GPU will spank a dedicated DSP chip at just about everything, even the highest-end TIs and TigerSHARCs. Both DSPs and GPUs are designed to haul data out of memory and do vector multiplication on it, but the GPU has a heck of a lot more of both memory bandwidth and processing grunt.
A big FPGA card, or FPGA array system like a Copacobana, might be quicker assuming I/O limitations aren't a problem for the algorithm to be run. But FPGA hardware for HPC isn't really a commodity so it's awfully expensive - you
U of I (Score:4, Informative)
Re: (Score:2)
MOD PARENT UP. Parallel processing is tricky stuff and performance depends on so many things -- not just the cost of a bunch of GPUs.
Do you need it in a box? (Score:2)
If it's really embarrassingly parallel, just run it on whatever CPUs you have hanging about or can scrounge cheaply. As long as the application is written portably they don't even need to be the same architecture or operating system, although that would help with deployment. The only reason to try to scrunch everything in one box would be if you have space limitations.
many AMD CPUs unless the GPU port is done already (Score:2, Interesting)
You can get 48 real AMD Magny-Cours CPU cores with full DP floating-point support and ~64GB of ECC memory in a box for under 10K (EUR!) from e.g. Tyan and Supermicro.
I run my embarrassingly parallel stuff on that, and it works great. Depending on your application, 64 Bulldozer cores, which come in the same package for only slightly more money, may perform better or not. I have not seen many real-world applications in which one GPU is actually faster than 12 to 16 server-class CPU cores.
Of course this depends a lot o
Definitely GPU. (Score:5, Interesting)
Specifically, check out some of the BitCoin mining rigs [bitcointalk.org] people have built, like 4x Radeon 6990s in a single box. For comparison, a single 6990 easily beats a top-of-the-line modern CPU by a factor of 50 (as in, not 50%, but 5000%). You can build such a box for well under $5k.
commodity HPC depends on your code (Score:5, Informative)
In HPC we call it "pleasantly parallel," nothing is embarrassing about it! =]
If your code:
-scales to OpenCL/CUDA easily.
-does not require high concurrent memory transfers
-is fault tolerant (i.e. a failed card doesn't hose a whole day/week of runs)
-can use single-precision flops
Then you can use commodity hardware like the GTX-series cards. I'd go with the GTX 560 Ti (GF114 GPU).
Make nodes with:
quad-core processors (AMD or Intel)
whatever RAM is needed (8GB minimum)
2 x GTX 560 Ti (448 cores) run in SLI (or the 560 Ti dual from EVGA)
Basically a scaled down Cray XK6 node. http://www.cray.com/Assets/PDF/products/xk/CrayXK6Brochure.pdf [cray.com]
It all depends on your code.
We built a ~9.1 TFLOPS system for $10k last year. (Score:5, Interesting)
What does SLI give you in CUDA? The newer GeForce cards support direct GPU-to-GPU memory copies, assuming they are on the same PCIe bus (NUMA systems might have multiple PCIe buses).
My research group built this 12-core/8-GPU system last year for about $10k: http://tinyurl.com/7ecqjfj [tinyurl.com]
The system has a theoretical peak ~9.1 TFLOPS, single precision (simultaneously maxing out all CPUs and GPUs). I wish the GPUs had more individual memory (~1.25GB each), but we would have quickly broken our budget had we gone for Tesla-grade cards.
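For anyone curious what the direct GPU-to-GPU copies look like in code, here's a rough sketch (CUDA 4.x runtime API, assuming Fermi-class cards where peer access is supported; the device numbers are just for illustration):

    #include <cuda_runtime.h>

    // Copy a buffer directly from GPU 0 to GPU 1 when peer access is available,
    // falling back to a staged copy through pinned host memory otherwise.
    cudaError_t copyBetweenGpus(void* dstOn1, const void* srcOn0, size_t bytes)
    {
        int canAccess = 0;
        cudaDeviceCanAccessPeer(&canAccess, 1, 0);   // can device 1 reach device 0?
        if (canAccess) {
            cudaSetDevice(1);
            cudaDeviceEnablePeerAccess(0, 0);        // second argument (flags) must be 0
            return cudaMemcpyPeer(dstOn1, 1, srcOn0, 0, bytes);
        }
        // Fallback: device 0 -> host -> device 1 (slower, but always works).
        void* staging = nullptr;
        cudaMallocHost(&staging, bytes);
        cudaSetDevice(0);
        cudaMemcpy(staging, srcOn0, bytes, cudaMemcpyDeviceToHost);
        cudaSetDevice(1);
        cudaError_t err = cudaMemcpy(dstOn1, staging, bytes, cudaMemcpyHostToDevice);
        cudaFreeHost(staging);
        return err;
    }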
rent a botnet (Score:4, Funny)
Re: (Score:2)
Throw up a web site, advertise it on overclocker forums and whatnot, and hold a competition...
A race with $15000 in prize money. The runners are scored on how many "work units" they complete. Work units are distributed randomly and multiple people receive the same units so there is result verification. 1st place gets $5000, 2nd place gets $4000, 3rd place gets $3000, 4th place gets $2000, and 5th place gets $1000.
Get volunteers for a botnet (Score:2)
Heck, you'd be surprised how many projects [berkeley.edu] have gotten people to volunteer to run such things. All you have to do is provide good uptime and statistics and people will come running! (Though a good project description helps too.)
How does it parallelize? (Score:5, Informative)
How does the app parallelize? Is each process/thread dependent on every other process/thread, or is it 1000 processes flying in close formation that all need to complete at the same time but don't interact with each other? How embarrassingly parallel is embarrassingly parallel? Is that 512MB requirement per process or the sum across all processes?
GPUs might not be the right solution for this. GPUs are excellent for parallelizing some operations but not others. Have you done any benchmarks? Throwing lots of CPU at the problem may be the right solution depending on the algorithms used and how well they can be adapted for a GPU, if they can be adapted for a GPU.
For the $10K-$15K USD range, I'd look at Supermicro's offerings. You have options ranging from dual socket 16 core AMD systems with 2 Teslas to quad socket AMD systems to quad socket Intel solutions to dual socket Intel systems with 4 Tesla cards.
Do some testing of your code in various configurations before blindly throwing hardware at the problem. I support researchers who run molecular dynamics simulations. I've put together some GPU systems, and after testing it was discovered that for the calculations they are doing, the portions of their code that could be offloaded to the GPU accounted for at most 10% of the execution time, with the remainder being operations that the software packages could only do on the CPU.
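For what it's worth, that 10% figure is just Amdahl's law at work: with an offloadable fraction p and an offload speedup of s, the overall speedup is 1 / ((1 - p) + p/s), so with p = 0.10 even an infinitely fast GPU caps the whole run at 1 / 0.9, or about 1.11x.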
Re: (Score:2)
Re: (Score:2)
Finally someone talking sense. You go darkjedi.
Passive cooled GPU (Score:2)
Don't use high-end GTX cards; twice as many lower-end passively-cooled GPU cards will provide more than the equivalent performance at far lower cost and failure rate. If your application really benefits more from additional threads than from single-thread execution speed, this is the way to go. Most GPGPU clusters that aren't built using Tesla use this approach.
need a better characterization of the workload (Score:2)
Big FP bandwidth on a Tesla doesn't do much for you if you only need integer execution. Maybe you'd be better off with a 4-CPU Xeon box, or a Bulldozer, or a 64-core ARM. Really, you want to find a way to benchmark your particular software on a variety of potential CPU targets, and then do a price comparison.
Why, mini-cluster, of course! (Score:3)
http://www.mini-itx.com/projects/cluster/?p [mini-itx.com]
The example at the URL above is quite old, but it's a good starting point. Just use a dozen cheap mini-ITX boards with -- let's say -- an Intel Core i5, and voilà! Probably the cheapest way to go, and also much easier to program than using CUDA and nVidia. Hook the whole thing into a gigabit switch.
I'll let the experts debate the best CPU for the job, but AMD should also have some nice products on offer.
Don't buy GTX's (Score:5, Informative)
We have several racks full, purchased because "they're cheaper than Teslas."
Except the Teslas have, as pointed out, ECC memory and better thermal management, and the GTXs have several useful features (like the GPU load level in nvidia-smi) disabled.
The former (no ECC and worse thermal management) causes the compute nodes to crash regularly; what you save on cards, you'll lose in salary for someone to nursemaid them. The latter makes it harder to integrate into a scheduler environment (we're using Torque).
Yes, this is primarily marketing discrimination, and there probably isn't $10 worth of real difference between the two. I hope the marketing droid who thought that scheme up burns. It's a total aggravation, but paying for Teslas is worthwhile.
Re: (Score:3)
Plenty of hacks to enable the GPU load level; there are probably several already out there as-is. The ECC memory is a different beast, though.
Embarrassingly parallel problems... (Score:2)
... do not require embarrassingly parallel solutions.
They require math and algorithm design to make the solution *nonembarrassing*.
To give you an example: a typical FFT can, with easy math, cut its number of calculations by a factor of four. With a little care, you can halve the number of calculations again.
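For a rough sense of why the math matters: a naive DFT takes on the order of N^2 complex multiplies, while a radix-2 FFT needs roughly (N/2)*log2(N). At N = 4096 that's about 16.8 million versus about 24.6 thousand, which dwarfs anything you'd gain by just throwing more cores at the naive version.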
Start with the math. Then look at the solution.
Last of all, consider cloudware. It's out there. Let's see... on my Android, I have "sourceLair". Yeah, that's one.
Once you have the cloudware solution in hand, *then
Re: (Score:2)
Ah, generalizations. Of course, you have no idea what he's working on.
Re: (Score:3)
Yes I do. He's extending the calculations begun by Lewis Carroll in the imaginary space (through the looking glass), to see the effects as the ultimate limit increases.
What's
1+1+1+1+1+1+1+1+1+1+1+1+1+1+1+1+1+1+ 1+1+1+1+1+1+ 1+1+1+1+1+1+ 1+1+1+1+1+1+
1+1+1+1+1+1+1+1+1+1+1+1
As I said, embarrassingly parallel. Get 7 computers working on it in parallel, with 1 for backup:
What's 1+1+1+1+1+1 (after some calculation, 6)
So that all is 42.
the ultimate answer is
1+1+1+1+1+1+1+1+1+1+1+1+1+1+1+1
Re: (Score:2)
And your Fourier transform algorithm to solve it faster is?
Immerse it (Score:2)
Go old school and immerse the entire machine in a tub of mineral oil?
Re: (Score:2)
Go old school and immerse the entire machine in a tub of mineral oil?
The best stuff to use is synthetic plasma (as in blood plasma). It's rather expensive though. [citation needed]
Try looking at the cheap end... (Score:3)
I've played this parallel cost analysis game several times, and if you don't need high-bandwidth communication between the threads, I usually come up with the Google solution: a big farm of cheap machines. AMD chips start looking good compared to Intel because you're not after a single thread finishing as fast as possible, you're after as many FLOPS per $ as you can get. We even did the analysis for an extreme Apple fanboi: Mac Pros vs. Mac Minis back in 2007, and a stack of 25 Minis came out way more powerful than the 3 or 4 Pros you could get for the same money.
Re: (Score:2)
A Mac Mini Server gets you a quad-core Intel i7 (double that number of threads if you enable hyper-threading) for $999. Turn them on their side and you can stack 11 of them in the width of a standard 19" rack (6U high or so). That's 44 cores (or perhaps 88 threads) for under $11,000.
Other pluses: 900W power consumption when running at 100% utilization; idle is much, much lower. Comes with dual hard drives that can be mirrored for reliability. Gigabit Ethernet and 4 USB ports are available. When your work w
Do your homework before going GPU (Score:4, Informative)
1. GPUs are *data parallel*. This means that you need to have an algorithm in which each and every thread will be executing the same instruction at the same time (just on different data). For a cheap way to evaluate it, if you can't speed up your program by vectorizing it then the GPU won't help. Of course, you can have divergent "threads" on GPUs, but as soon as you do you've lost all benefit to using a GPU, and have essentially turned your GPU into an expensive but slow computer.
2. Moving data onto or off of the GPU is *slow*. So if you can leave all the data on the GPUs and none of the GPUs need to communicate with each other, then this will work well. If the threads need to frequently globally sync up, you're going to be in trouble.
That said, if you have the right kind of data parallel problem, GPUs will blow everything else out of the water at the same price point.
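As a toy illustration of what "data parallel" means here (a generic SAXPY kernel, not the submitter's workload): every thread runs the identical instruction stream on its own element, and the only host-device traffic is one bulk copy in and one out:

    #include <cuda_runtime.h>
    #include <vector>

    // y[i] = a * x[i] + y[i] -- the textbook data-parallel kernel: no branching,
    // no inter-thread communication, one element per thread.
    __global__ void saxpy(int n, float a, const float* x, float* y)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) y[i] = a * x[i] + y[i];
    }

    int main()
    {
        const int n = 1 << 20;
        std::vector<float> hx(n, 1.0f), hy(n, 2.0f);

        float *dx, *dy;
        cudaMalloc(&dx, n * sizeof(float));
        cudaMalloc(&dy, n * sizeof(float));
        // One bulk copy in and one bulk copy out: keep the PCIe traffic coarse-grained.
        cudaMemcpy(dx, hx.data(), n * sizeof(float), cudaMemcpyHostToDevice);
        cudaMemcpy(dy, hy.data(), n * sizeof(float), cudaMemcpyHostToDevice);

        saxpy<<<(n + 255) / 256, 256>>>(n, 3.0f, dx, dy);

        cudaMemcpy(hy.data(), dy, n * sizeof(float), cudaMemcpyDeviceToHost);
        cudaFree(dx);
        cudaFree(dy);
        return 0;
    }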
TI DSP cards? (Score:3)
There are some high-powered PCI cards filled with TI DSPs that you can get. Here's an article describing some of them. [theregister.co.uk] In terms of power efficiency per unit of work, the DSPs blow the doors off the main processor and the GPUs. Each DSP on the chip can do 16 single-precision or 4 double-precision floating-point operations per cycle, at around 1GHz, and they're programmable in C/C++.
Relevant quote:
Buy 5 of these and you're only at 550W, $10,000 and 5 TFLOPs.
Re: (Score:2)
Re: (Score:2)
Well, like I said in my followup "reply to self", it really does depend on the nature of the task. We don't have enough information to go on. The TI DSP cards do fill an interesting niche, though, and are a nice counterpoint to the Tesla cards in many applications.
Really, you need to just get some demo tools for a couple platforms, do some benchmarks, and see how each platform feels. You'd be silly to drop $10,000 - $15,000 on a server without first running some benchmarks on a smaller version of what yo
GTX really is less reliable (Score:2)
Have you ever written CUDA code before? (Score:2)
Depends on lots of things (Score:2)
You mention GPU, but can you get the solution up and running as quickly as the CPU solution? Optimised multi-GPU solutions are not that easy, as the programmer has to do all the heavy lifting.
Does the code vectorise? If it does, then I'd be tempted to go with as many dual-socket Intel machines as you can. Are you able to use the Intel compiler (leveraging the MKL, IPP and IMF as much as possible)? This assumes that communication is low. You are not going to have the cash for a low latency, high
Calxeda (Score:2)
Just because you mentioned ARM, perhaps you should look into Calxeda. I have no idea if their solution is well suited to your problem; it's a whole bunch of 32-bit cores in one box. Someone else already has a similar arrangement using Intel Atom.
Amazon is not too expensive (Score:2)
You may be able to buy hardware more cheaply, but you're not going to beat Amazon on overall cost once you take even minimal maintenance, power, server-room space, etc. into account. You may be able to save money over EC2 by putting in your own labor; just realize that this can be a lot of work.
Re: (Score:2)
He specifically said it's too expensive. RTFQ :P
Re: (Score:2)
Re: (Score:2)
1. Lower failure rates mean lower, not higher, maintenance expenditures.
2. The more robust, general (non-GPU-based) system can handle those failures better because the workload is distributed over a greater number of cores
3. The more robust system can also handle any future workload that doesn't translate easily into a GPU-based solution
4. Electricity was specifically not an issue - it was someone else's cost - which you failed to realize because you always post stupidity. Who's to say that maintenance
Re: (Score:2)
For a dim bulb like you, I'll make it simple: pretend I have a huge warehouse that I need illuminated for the next year. I can buy a few very expensive LED arrays (and still end up with shadows), or a ton of cheap fluorescents, and get complete coverage.
As per the article, I don't care about electr
Re: (Score:2)
Re: (Score:2)
> "what if scenarios do not require creation."
They most certainly do! Your creating what-if scenarios that have no basis in TFA was just as lame as your attempt to say it wasn't practical because of the higher cost of electrical consumption, when TFA made it clear that electricity use wasn't a consideration. Trying for stupid post of the year award?
You just don't like that a woman caught you on your original mistake (not noticing that the original article specifically said to ignore electrical consu
Re: (Score:2)
So, not only did you miss the part where they said that electricity use wasn't an issue - you also missed the posters concerns about thermal load. Did you read ANYTHING except the headlin
Re: (Score:2)
Re: (Score:2)
$19k doesn't sound like much of a bargain.
Re: (Score:2)
Well, yes. If I really wanted to be cool about it, I might consider going to Radio Shack and buying an Arduino. Then use the 4 outputs, plus a couple of shift registers, to make something that could program an 80C51XA. Then design my algorithm to go on those, plugged together such that they'd outperform even an Nvidia.
Or, even cooler, I might program the 80c51XAs in parallel, one being the calculations chip, and one handling all the i/o from one unit to the other. Then I could write a massively parallel p
Re: (Score:2)
I just noticed there's no "spam" option for modding posts. /. should add that.