AMD Demonstrates "Teraflop In a Box" 182
UncleFluffy writes "AMD gave a sneak preview of their upcoming R600 GPU. The demo system was a single PC with two R600 cards running streaming computing tasks at just over 1 Teraflop. Though a prototype, this beats Intel to ubiquitous Teraflop machines by approximately 5 years." Ars has an article exploring why it's hard to program such GPUs for anything other than graphics applications.
well, it shouldn't be (Score:4, Funny)
OK, yes, bad pun, bad spelling, you can "-1 get a real sense of humor" me now.
Compatibility (Score:2)
Even if Nvidia's CUDA is as hard as the Ars Technica article suggests, I still hope AMD either makes their chips binary compatible, or makes a compiler that works for CUDA code.
Re:Compatibility (Score:5, Interesting)
Re: (Score:2)
Re:Compatibility (Score:5, Informative)
Even if Nvidia's CUDA is as hard as the Ars Technica article suggests, I still hope AMD either makes their chips binary compatible, or makes a compiler that works for CUDA code.
From what I saw at the demo, the AMD stuff was running under Brook [stanford.edu]. As far as I've been able to make out from nVidia's documentation, CUDA is basically a derivative of Brook that has had a few syntax tweaks and some vendor-specific shiny things added to lock you in to nVidia hardware.
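For the curious, the shared programming model both expose looks roughly like this: you write a small kernel that computes one output element purely from its own inputs, and the runtime maps it across the whole array. Here's a toy sketch in CUDA terms (my own illustration with made-up names and sizes, not code from the demo or from either vendor's SDK):

// Toy sketch of the stream-kernel model: each thread produces exactly one
// output element from its own inputs, with no dependence on other outputs.
// All identifiers and sizes here are hypothetical.
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

__global__ void saxpy(int n, float a, const float *x, const float *y, float *out)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // one thread per element
    if (i < n)
        out[i] = a * x[i] + y[i];                    // independent multiply-add
}

int main(void)
{
    const int n = 1 << 20;
    const size_t bytes = n * sizeof(float);

    float *hx = (float *)malloc(bytes), *hy = (float *)malloc(bytes), *ho = (float *)malloc(bytes);
    for (int i = 0; i < n; ++i) { hx[i] = 1.0f; hy[i] = 2.0f; }

    float *dx, *dy, *dout;
    cudaMalloc(&dx, bytes); cudaMalloc(&dy, bytes); cudaMalloc(&dout, bytes);
    cudaMemcpy(dx, hx, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dy, hy, bytes, cudaMemcpyHostToDevice);

    saxpy<<<(n + 255) / 256, 256>>>(n, 3.0f, dx, dy, dout);
    cudaMemcpy(ho, dout, bytes, cudaMemcpyDeviceToHost);

    printf("out[0] = %f\n", ho[0]);                  // expect 5.0
    cudaFree(dx); cudaFree(dy); cudaFree(dout);
    free(hx); free(hy); free(ho);
    return 0;
}

Brook writes the same thing with stream types and implicit indexing instead of explicit thread IDs, but the shape of the computation is identical, which is why moving between the two is mostly a syntax exercise.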
Re: (Score:2)
Hey, thanks -- I was wondering if something like that existed! I'm actually about to start working on a computer vision-related research project that might be well-suited to running on a GPU, and was trying to figure out what technology to use to write it. I think Brook might be it.
Re: (Score:2)
ubiquitous (Score:5, Insightful)
Look up 'ubiquitous' before you whine about how far behind Intel might seem to be.
Though having one demonstration will help spur the demand, and the demand will spur production, I still think it'll be five years before everybody's grandmother will have a Tf lying around on their checkbook-balancing credenza, and every PHB will have one under their desk warming their feet during long conference calls.
Re: (Score:2)
Look up 'ubiquitous' before you whine about how far behind Intel might seem to be.
Sorry, late night submission. I'll claim an error of verb tense rather than adjective usage: "this will beat" rather than "this beats". This silicon is shipping high-end in a couple of weeks, so it'll be mid-range this time next year and integrated on the motherboard the year after that (or thereabouts). Another year or two for the regular PC replacement cycle to churn that through, and it should be widespread by the time
Re: (Score:2)
Not misleading at all (Score:2)
I mean, the PS3 does 2 Teraflops! OMG, they're like 20 years ahead of Intel, who are so RUBBISH.
And what would be the theoretical floppage of, say, an Intel Core 2 Extreme with 2 x nVidia GTXs in a dual SLI arrangement using CUDA? I'm willing to bet it would be somewhat higher than this setup.
Re:Not misleading at all (Score:5, Interesting)
Maybe soon, but I thought it wasn't _now_!
Re: (Score:2)
Excellent point! Expect to see a nVidia/Intel partnership in 5, 4, 3, 2...
Re:Not misleading at all (Score:4, Insightful)
Re: (Score:3, Informative)
Sarcasm suits you well.
While Intel and nVidia may both be independently reinventing the wheel right now, neither seems to be getting very far very fast. Intel's vid
Re: (Score:3, Insightful)
Step 1 (Score:3, Funny)
Step 2 (Score:5, Funny)
Re: (Score:3, Informative)
1 Teraflop you say? (Score:3, Funny)
Re: (Score:2)
Re:1 Teraflop you say? (Score:5, Funny)
That's TWELFTY BAJILLION BogoMIPS. Per fortnight.
Re: (Score:2)
Re: (Score:2)
So "teraflop" is a unit of computational acceleration? Cool.
Re: (Score:2)
Re: (Score:2)
Does dual-core give you BOGOFMIPS?
Never thought of that (Score:3, Interesting)
Re: (Score:3, Informative)
It is up to date and contains a lot of related information.
WP
Re: (Score:2)
Re:Never thought of that (Score:5, Informative)
It's still in beta AFAIK, but it has been in development for quite some time.
OOOoooo (Score:5, Interesting)
It might be hard, but then again, it might be worthwhile. For instance (I'm a ham radio operator) I ran into a sampling shortwave radio receiver the other day. Thing samples from the antenna at 60+ MHz, thereby producing a stream of 14-bit data that can resolve everything happening below 30 MHz, or in other words, the entire shortwave spectrum and longwave and so on basically down to DC.
Now, a radio like this requires that the signal be processed: first you separate it from the rest, then you demodulate it, then you apply things like notch filters (or you can do that prior to demodulation, which is very nice); you build an automatic gain control to handle amplitude swings; you provide a way to vary the bandwidth and move the filter skirts (low and high) independently... you might like to produce a "panadapter" display of the spectrum around the signal of interest, where there is a graph that lays out signal strengths for a defined distance up and down the spectrum... you might want to demodulate more than one signal at once (say, a FAX transmission into a map on the one hand, and a voice transmission of the weather on the other). And so on - I could really go on for a while.
The thing is, as with all signal processing, the more you try to do with a real-time signal, the more resources you have to dedicate. And this isn't audio, or at least not at the early stages; a 60+ MHz stream of data requires quite a bit more in terms of how fast you have to do things to it than does an audio stream at, say, 44 kHz.
But signal processing typically uses fairly simple math; a lot of it, but you can do a lot without having to resort to real craziness. A teraflop of processing that isn't even happening on the CPU is pretty attractive. You'd have to get the data to it, and I'm thinking that would be pretty resource intensive, but between the main CPU and the GPU you should have enough "ooomph" left over to make a beautiful and functional radio interface.
There is an interesting set of tasks in the signal processing space; forming an image of what is going on under water from sound (not sonar... I'm talking about real imaging) requires lots and lots of signal processing. Be a kick to have it in a relatively standard box, with easily replaceable components. Maybe you could do the same thing above-ground; after all, it's still sound and there are still reflections that can tell you a lot (just observe a bat.)
The cool thing about signal processing is that a lot of it is like graphics, in a way; generally, you set up some horrible sequence of things to do to your data, and then thrash each sample just like you did the last one.
Anyway, it just struck me that no matter how hard it is to program, it could certainly be useful for some of these really resource intensive tasks.
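To make the "same horrible sequence of things to every sample" point concrete, here's a rough toy sketch of how one early stage of such a receiver - mixing the sampled antenna stream down to baseband, low-pass filtering, and decimating - might look as a GPU kernel in CUDA. The rates, tap values, and names are all made up; this is a sketch of the shape of the work, not anything from a real radio.

/* Toy sketch of one receiver stage (mix to baseband + low-pass + decimate)
   as a GPU kernel. Every output sample is produced by the identical recipe
   from its own window of inputs. Rates, taps, and names are hypothetical. */
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <cuda_runtime.h>

#define NTAPS 64          /* low-pass FIR length                */
#define DECIM 16          /* keep one of every 16 mixed samples */

__constant__ float d_taps[NTAPS];    /* filter coefficients */

__global__ void ddc(const float *in, int n_out, float phase_step,
                    float *out_i, float *out_q)
{
    int k = blockIdx.x * blockDim.x + threadIdx.x;   /* one thread per output sample */
    if (k >= n_out) return;

    float acc_i = 0.0f, acc_q = 0.0f;
    int base = k * DECIM;
    for (int t = 0; t < NTAPS; ++t) {                /* mix with the NCO and filter in one pass */
        float ph = phase_step * (float)(base + t);   /* a real NCO would keep the phase wrapped */
        float s  = in[base + t];
        acc_i +=  s * cosf(ph) * d_taps[t];
        acc_q += -s * sinf(ph) * d_taps[t];
    }
    out_i[k] = acc_i;
    out_q[k] = acc_q;
}

int main(void)
{
    const int n_out = 1 << 16;
    const int n_in  = n_out * DECIM + NTAPS;         /* enough input for every window */
    const float fs = 61.44e6f, f_lo = 10.0e6f;       /* hypothetical sample rate and tuned frequency */
    const float phase_step = 2.0f * 3.14159265f * f_lo / fs;

    float taps[NTAPS];
    for (int t = 0; t < NTAPS; ++t) taps[t] = 1.0f / NTAPS;   /* crude boxcar low-pass */
    cudaMemcpyToSymbol(d_taps, taps, sizeof(taps));

    float *hin = (float *)malloc(n_in * sizeof(float));
    for (int i = 0; i < n_in; ++i)                   /* fake "antenna" input: a carrier at f_lo */
        hin[i] = cosf(phase_step * (float)i);

    float *din, *di, *dq;
    cudaMalloc(&din, n_in * sizeof(float));
    cudaMalloc(&di, n_out * sizeof(float));
    cudaMalloc(&dq, n_out * sizeof(float));
    cudaMemcpy(din, hin, n_in * sizeof(float), cudaMemcpyHostToDevice);

    ddc<<<(n_out + 255) / 256, 256>>>(din, n_out, phase_step, di, dq);

    float i0, q0;
    cudaMemcpy(&i0, di, sizeof(float), cudaMemcpyDeviceToHost);
    cudaMemcpy(&q0, dq, sizeof(float), cudaMemcpyDeviceToHost);
    printf("first baseband sample: I=%.3f Q=%.3f\n", i0, q0);  /* the carrier mixes down near DC */

    cudaFree(din); cudaFree(di); cudaFree(dq);
    free(hin);
    return 0;
}

Each output sample costs NTAPS multiply-accumulates for I and Q, so at tens of megasamples per second the arithmetic adds up fast - which is exactly where a teraflop-class card starts to look interesting.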
Re:OOOoooo (Score:5, Insightful)
Re:OOOoooo (Score:5, Insightful)
Simple: they aren't available. PCs don't typically come with DSPs. But they do come with graphics, and if you can use the GPU for things like this, it's a nice dovetail. For someone like that radio manufacturer, there's no need to force the consumer to buy more hardware. It's already there.
Re: (Score:3, Interesting)
Get started here [fpga4fun.com] and find some example DSP cores here [opencores.org].
Re: (Score:2)
If you were going to go to that kind of trouble, why not buy a chip (or entire board) designed to be a DSP? Why go the FPGA route? Not trying to be nasty, I assume you have a reason for suggesting this, I just don't know what it is.
Re: (Score:2)
Re: (Score:2)
Really? I haven't seen PC-insertable FPGA dev boards that are capable of clocking anything like as high as a modern GPU (i.e. typically ~800MHz) for sub-$1000. If you can point me in the direction of a reasonably-priced
Re: (Score:2)
For a processor, the minimum clock speed required is
(rate of incoming data) * (# of instructions to process a unit of data) / (average number of instructions per clock cycle, aka IPC)
For a nicely pipelined hardware design, you could theoretically get away with a clock rate equal to the rate of incoming data, or even less, if you can process more than one unit of data per clock and have a separate, higher-clocked piece capturing the inpu
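Plugging some made-up numbers into that formula, just to get a feel for the scale (guesses for illustration, not measurements of any real design):

/* Hypothetical numbers: a 65 MSPS ADC stream, ~40 instructions of work
   per sample, and a sustained IPC of 2. */
#include <stdio.h>

int main(void)
{
    const double sample_rate      = 65e6;  /* samples per second      */
    const double insns_per_sample = 40.0;  /* instructions per sample */
    const double ipc              = 2.0;   /* instructions per clock  */

    const double min_clock_hz = sample_rate * insns_per_sample / ipc;
    printf("minimum clock ~ %.2f GHz\n", min_clock_hz / 1e9);  /* ~1.30 GHz */
    return 0;
}

A nicely pipelined FPGA design, as noted above, sidesteps that multiplier by doing the per-sample work in dedicated hardware on every clock.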
Re: (Score:2)
Re: (Score:2)
We use computers to do things they really aren't the best at doing, but we use the computer because it is so flexible at doing so many things, and cheaply. A DSP in a specialized box may be better for a specific single task, but the economies of scale come into play.
Re:OOOoooo (Score:5, Insightful)
Re:OOOoooo (Score:5, Informative)
Maybe there aren't any DSPs available at low cost, if you aren't a hardware designer:
400 MHz DSP $10.00 http://www.analog.com/en/epProd/0,,ADSP-BF532,00.
14-bit, 65 MSPS ADC $30.00 http://www.analog.com/en/prod/0,,AD6644,00.html [analog.com]
Catching non-designers talking smack
Re: (Score:2)
Re: (Score:2, Insightful)
NOTE:
Not the cost of the units, but the cost of doing anything useful with them. For a person NOT integrating the parts into mass-produced items, it's only suitable for people doing something simple as a hobby, or for learning. I would *guess* that building anything to solve a problem in practice would take an incredibly large amount of time and skill, both of whi
Re:Not sonar? (Score:4, Insightful)
You use ambient sound instead of radiating a signal yourself, and you try to resolve the entire environment, rather than just the sound emitting elements in the environment. This makes you a lot harder to detect; it also makes resolving what is going on a lot more difficult. Hence the need for lots of CPU power. In the water or in the air. Passive sonar - at least typically - is intended to resolve (for instance) a ship or a weapon that is emitting noise. But the sea is emitting noise all the time - waves, fish burping, whale calls, shrimp clicking - all kinds of noise, really. Using that noise as the detecting signal is the trick, and it isn't very similar to normal sonar, in terms of what kind of computations or results are required. Classic sonar gives you a range and bearing; this kind of thing is aimed at giving you an actual picture of the environment. It's a lot harder to do, but man, is it cool.
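As for how: I can't speak to anyone's real system, but one basic building block is cross-correlating pairs of receivers to estimate the time delays between them, and that kind of sum is embarrassingly parallel - one GPU thread per candidate lag. A toy sketch with made-up sizes and synthetic data, not anyone's imaging code:

/* Toy sketch: cross-correlating two hydrophone channels over a range of
   lags, one GPU thread per lag (time-delay estimation between receivers).
   Sizes, names, and data are hypothetical. */
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

__global__ void xcorr(const float *a, const float *b, int n, int max_lag, float *out)
{
    int lag = blockIdx.x * blockDim.x + threadIdx.x;   /* one thread per lag */
    if (lag > max_lag) return;

    float acc = 0.0f;
    for (int i = 0; i + lag < n; ++i)
        acc += a[i] * b[i + lag];                      /* independent sum per lag */
    out[lag] = acc;
}

int main(void)
{
    const int n = 1 << 16, max_lag = 1024;
    const size_t nb = n * sizeof(float), lb = (max_lag + 1) * sizeof(float);

    float *ha = (float *)malloc(nb), *hb = (float *)malloc(nb), *ho = (float *)malloc(lb);
    for (int i = 0; i < n; ++i) {
        ha[i] = (float)rand() / (float)RAND_MAX - 0.5f;  /* stand-in for ambient noise           */
        hb[i] = (i >= 37) ? ha[i - 37] : 0.0f;           /* second channel hears it 37 samples later */
    }

    float *da, *db, *dout;
    cudaMalloc(&da, nb); cudaMalloc(&db, nb); cudaMalloc(&dout, lb);
    cudaMemcpy(da, ha, nb, cudaMemcpyHostToDevice);
    cudaMemcpy(db, hb, nb, cudaMemcpyHostToDevice);

    xcorr<<<(max_lag + 1 + 255) / 256, 256>>>(da, db, n, max_lag, dout);
    cudaMemcpy(ho, dout, lb, cudaMemcpyDeviceToHost);

    int best = 0;
    for (int l = 1; l <= max_lag; ++l)
        if (ho[l] > ho[best]) best = l;
    printf("peak correlation at lag %d (expect 37)\n", best);   /* delay maps to direction/geometry */

    cudaFree(da); cudaFree(db); cudaFree(dout);
    free(ha); free(hb); free(ho);
    return 0;
}

Scale that across many receiver pairs and frequency bands, then turn the delay estimates into geometry, and the compute budget explodes - hence the appeal of having this kind of horsepower in a standard box.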
Re: (Score:2)
Lots more peaceful without the noise of sucking on a tank.
Re: (Score:2)
Re: (Score:2)
Re: (Score:2)
Re: (Score:2)
Sure, and you can go a lot deeper too...but I still find the lack of gear and quiet of freediving much more relaxing. Not that scuba sometimes isn't fun.
Re: (Score:2)
Interesting, but how do you theoretically do all that? Using hydrophone/microphone arrays? And what kind of processing does it involve? Cross-correlations? I'd be interested in the technical details (I've been writing DSP programs for a couple of years now)
Re: (Score:2)
Yes, I have. Great pointers; thanks.
Re: (Score:3, Interesting)
I think he's talking about something more along the lines of what they're calling a 3D/4D ultrasound. That doesn't mean much unless you've recently had a child, so here's an example from GE [gehealthcare.com] (requires flash). For a non-flash example, just google for 4d ultrasound [google.com] and try a few of the links...
The images are not in color, and sometimes you lose detail as an elbow (thi
The first rule of teraflop club... (Score:5, Insightful)
And the second rule of teraflop club...
Don't mention the wattage...
Back here in the real world where we PAY FOR ELECTRICITY, we're waiting for some nice FLOPS/watt. Keep trying, guys.
And they announced this some time ago, didn't they?
Also (Score:3, Interesting)
Re:The first rule of teraflop club... (Score:5, Informative)
We've run several PC clusters and IBM mainframes that didn't have 1 TF of capacity. You don't want to know how much power went into them. Yes, our modern blade-based clusters are more condensed, but they're still power hogs for dual and quad core systems.
Blue Gene is considered to be a power-efficient cluster and the fastest [top500.org], but it still draws 7 kW per rack of 1024 CPUs [ibm.com]. At 4.71 TF per rack, even Blue Gene pulls about 1.5 kW per teraflop.
Yes, it's a pair of video cards and not a general-purpose CPU, but your average user doesn't have the ability to program and use a Blue Gene-style solution either. They just might get some real use out of this with a game physics engine that taps into this computing power.
This is cool.
Re: (Score:3, Informative)
But for ~$500, it's what's going to be used.
Re: (Score:2)
It isn't that they are hard to use for more... (Score:4, Informative)
The actual framework for doing this is relatively simple, although it certainly did help that I have a background in OpenGL and DirectX Graphics (so I've done shader work before); however, again, progress is removing those caveats as well. Generic GPU programming toolsets are imminent; the only problem is that ATI has no interest in its toolsets working with nVidia hardware, and nVidia has even less interest in its toolset(s) running on ATI hardware. Something we'll just have to learn to deal with.
BTW, DirectX10 will make this a little easier as well with changes to how you have to pipeline data in order to operate on it in a particular fashion.
Nitpick (Score:4, Informative)
Re: (Score:2)
I don't think so. You can either use 1 teraFLOPS, 2 teraFLOPS, 3 teraFLOPS (in the same way you say 1 MHz, 2 MHz, 3 MHz), where I am not using capitals for emphasis but as the way the letters should be written, or you can use 1 teraflop, 2 teraflops, 3 teraflops (in the same way you say 1 snafu, 2 snafus, 3 snafus). The thing is that "FLOPS" is an acronym (i.e. an abbreviation formed fro
Re:Nitpick (Score:4, Funny)
Worthless Preview (Score:3, Insightful)
It also included some pictures of the cooling solution that will completely dominate the card. Not that a picture of a microchip with "R600" written on it would be a lot better I guess. Although the pictures are fuzzy and hard to see, it looks like it might require two separate molex connections just like the 8800s.
Re: (Score:2)
Aren't G5 PowerPC Macs rated at 1 TF already? (Score:2)
But then maybe the issue depends on the notion of what is "ubiquitous" and Macs don't qualify. I dunno, but I'm sure someone on
dave
Re: (Score:2)
IIRC the best case for Altivec is 8 flops/cycle (fused multiply/add of 4 32-bit floats), so a quad G5 at 2.5GHz would have a maximum of 80 GFlops. With perfectly scheduled code you could get some additional ops out of the integer and FP units, but not close to a teraflop.
HTX (Score:2)
Re: (Score:2)
How long until AMD starts releasing multi-core chips with multiple/mixed CPU/GPU cores, joined by a virtual inter-core HT bus, and all wired into main memory? (and optionally a bank of GDDR)
I could use it to program my automatic toaster (Score:2, Funny)
Wait... from some of the other comments about electricity usage, I might be able to do away with the heating coils and use the circuits themselves to toast. That would really be an environmental plus. Wonder how it would affect the taste of the bread?
Re: (Score:2)
It might be kinda cool to get "Intel Inside" burnt onto a panini sandwich...
dave
Re: (Score:2)
General Purpose Programmers (Score:4, Informative)
"Anything other" is "general purpose", which they cover at GPGPU.org [gpgpu.org]. But the general community of global developers hasn't gotten hooked on the cheap performance yet. Maybe if someone got an MP3 encoder working on one of these hot new chips, the more general purpose programmers would be delivering supercomputing to the desktop on these chips.
Re: (Score:2)
Re: (Score:2)
There are many more people coding multistream MP3 servers, but still no port to GPGPU.
Video servers follow the same logic. But video decoders at the client will get better economics from many thousands/millions of ASICs in the mass market, rather than the few thousand servers a year that the market will
Re: (Score:2)
Re: (Score:2)
http://www.geomerics.com/ [geomerics.com]
No, Ars didn't say why. Here's why. (Score:5, Informative)
Ars has an article exploring why it's hard to program such GPUs for anything other than graphics applications.
No, Ars has an article blithering that it's hard to program such GPUs for anything other than graphics applications. It doesn't say anything constructive about why.
Here's a reasonably readable tutorial on doing number-crunching in a GPU [uni-dortmund.de]. The basic concepts are that "Arrays = textures", "Kernels = shaders", and "Computing = drawing". Yes, you do number-crunching by building "textures" and running shaders on them. If your problem can be expressed as parallel multiply-accumulate operations, which covers much classic supercomputer work, there's a good chance it can be done fast on a GPU. There's a broad class of problems that work well on a GPU, but they're generally limited to problems where the outputs from a step have little or no dependency on each other, allowing full parallelism of the computations of a single step. If your problem doesn't map well to that model, don't expect much.
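To make the "parallel multiply-accumulate" case concrete, here's a small sketch (my own illustration in CUDA terms, not from the linked tutorial) of a problem that maps well: a dense matrix-vector product, where every output element is an independent multiply-accumulate over its own row. A recurrence like y[i] = a*y[i-1] + x[i], by contrast, forces each output to wait on the previous one and maps poorly.

/* Sketch of the "good fit" case: every output element is an independent
   multiply-accumulate over its own row. Hypothetical example. */
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

__global__ void matvec(const float *A, const float *x, float *y, int rows, int cols)
{
    int r = blockIdx.x * blockDim.x + threadIdx.x;   /* one thread per output row */
    if (r >= rows) return;

    float acc = 0.0f;
    for (int c = 0; c < cols; ++c)
        acc += A[r * cols + c] * x[c];               /* multiply-accumulate */
    y[r] = acc;
}

int main(void)
{
    const int rows = 1024, cols = 1024;
    const size_t ab = (size_t)rows * cols * sizeof(float);
    const size_t vb = cols * sizeof(float), yb = rows * sizeof(float);

    float *hA = (float *)malloc(ab), *hx = (float *)malloc(vb), *hy = (float *)malloc(yb);
    for (int i = 0; i < rows * cols; ++i) hA[i] = 1.0f;
    for (int i = 0; i < cols; ++i) hx[i] = 2.0f;

    float *dA, *dx, *dy;
    cudaMalloc(&dA, ab); cudaMalloc(&dx, vb); cudaMalloc(&dy, yb);
    cudaMemcpy(dA, hA, ab, cudaMemcpyHostToDevice);
    cudaMemcpy(dx, hx, vb, cudaMemcpyHostToDevice);

    matvec<<<(rows + 255) / 256, 256>>>(dA, dx, dy, rows, cols);
    cudaMemcpy(hy, dy, yb, cudaMemcpyDeviceToHost);

    printf("y[0] = %.1f (expect %.1f)\n", hy[0], 2.0f * cols);

    cudaFree(dA); cudaFree(dx); cudaFree(dy);
    free(hA); free(hx); free(hy);
    return 0;
}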
Added caveat: (Score:2)
Re: (Score:2)
As for your specific comments about the classes of problems that do or don't map well onto a GPU, I've covered those issues in previous posts on the topic. The post you're trying to criticize wasn't about the kinds of problems that you can and can't solve efficiently with GP
Re: (Score:3, Informative)
Re: (Score:2)
Chip in a Box (Score:5, Funny)
It's easy to do, just follow these steps:
One: Cut a hole in a box
Two: Stick your chip in that box
Three: Make her open the box
And that's the way you do it
It's my chip in a box
What about using it for Graphics? (Score:2)
Just how much of X and OpenGL could they offload on this card?
What about Theora, Ogg, Speex, or DivX encoding and decoding?
I know it is a radical idea but since they are optimized for graphics and graphics like operations why not use them for that?
Re: (Score:2)
Re: (Score:2)
SuperCell (Score:3, Informative)
The Cell itself is notoriously hard to code for. If just some extra effort can target the nVidia, that's TWO TeraFLOPS in a $500 box. A huge leap past both AMD and Intel.
Re: (Score:2)
The problem is that you can't just say, "I can multiply two floating-point numbers in time X, and therefore my speed is 1/X." You have to actually get that data to and from some sort of useful location. High-performance computing is bounded by memory bandwidth these days, not clock speed. The article summary mentions streaming, but I can find no reference to that in the actual article itself.
Consider digital SLR cameras, decent dSLR can take a picture in 1/160
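A quick back-of-the-envelope on why the bandwidth matters more than the peak number (rough figures, purely illustrative):

/* To sustain 1 TFLOPS of simple streaming multiply-adds, where each
   fused multiply-add (2 flops) reads two 4-byte floats and writes one,
   you need terabytes per second of traffic unless data is reused on-chip. */
#include <stdio.h>

int main(void)
{
    const double flops         = 1e12;          /* target: 1 TFLOPS  */
    const double fma_per_sec   = flops / 2.0;   /* 2 flops per FMA   */
    const double bytes_per_fma = 3 * 4.0;       /* 2 reads + 1 write */
    printf("required bandwidth ~ %.1f TB/s\n",
           fma_per_sec * bytes_per_fma / 1e12); /* ~6 TB/s */
    return 0;
}

Nothing in a 2007-era PC moves data anywhere near that fast, which is why the usable fraction of a theoretical teraflop depends on how much arithmetic you can do per byte fetched.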
Re: (Score:3, Informative)
Re: (Score:2)
Well...duh (Score:5, Insightful)
See, in the early days the FPU was a separate chip (anyone remember buying an 80387 to plug into their mobo?). Writing code to use the FPU was also a complete pain in the ass, because you had to use assembly, with all the memory management and interrupt handling headaches inherent in that. FPUs from different vendors weren't guaranteed to have completely compatible instruction sets. Because it was such a pain in the ass, only highly special-purpose applications made use of FPU code. (And it's not that computer scientists hadn't thought up appropriate abstractions to make writing floating point easy. Compilers just weren't spitting out FPU code.)
Then, things began to improve. The FPU was brought on die, but as an optional component (think 486SX vs 486DX). Languages evolved to support FPUs, hiding all the difficulty under suitable abstractions so programmers could write code that just worked. More applications began to make use of floating point capabilities, but very few required an FPU to work.
Finally, the FPU was brought on die as a bog-standard part of the CPU. At that point, FPU capabilities could be taken for granted, and an explosion of applications requiring an FPU to achieve decent performance ensued (see, for instance, most games). And writing FPU code is now no longer any more difficult than declaring type float. The compiler handles all the tricky parts.
I think GPGPU will follow a similar trajectory. Right now, we're in phase one. Using a GPU for general-purpose computation is such an incredible pain that only the most specialized applications are going to use GPGPU capabilities. High-level languages haven't really evolved to take advantage of these capabilities yet. And yes, it's not as though computer scientists don't have appropriate abstractions that would make coding for GPGPU vastly easier. Eventually, GPGPU will become an optional part of the CPU. Eventually high-level languages (in addition to the C family, perhaps FORTRAN or Matlab or other languages used in scientific computing) will be extended to use GPGPU capabilities. Standards will emerge, or where hardware manufacturers fail to standardize, high-level abstraction will sweep the details under the rug. When this happens, many more applications will begin to take advantage of GPGPU capabilities. Even further down the road, GPGPU capabilities will become bog standard, at which point we'll see an explosion of applications that need these capabilities for decent performance.
Granted, the curve for GPGPU is steeper because this isn't just a matter of different instructions, but a change in memory management as well. But I think this kind of transition can and will eventually happen.
Future plans (Score:5, Funny)
So I take it that AMD will be ready for Vista's successor?
ubiquitous (Score:2)
I get a kick out of the tags people assign ... (Score:2, Insightful)
Wake me up when (Score:2)
Re: (Score:3, Funny)
Re: (Score:2)
Screw it, I prefer BSD anyway.
Re: (Score:2)
I'm soo ready for the weekend I'm starting at the end of words, then jumping back to the beginning to type the rest of them.
Re: (Score:2)
Can I buy a motherboard with this Tflop technology integrated?
Apples and oranges. I suspect fanboyism.
Re: (Score:2)
Intel is likely only in the graphics business so they can offer OEMs a complete package: motherboard with integrated everything, all made by Intel.
Re: (Score:2)
I have no idea how it ended up here. I didn't have this story open yet when posting this. Ohh well. Shit happens..lol