IBM to use Cell in Blade Servers
taskforce writes "IBM announced on Wednesday that it would be putting versions of its Cell processor inside its increasingly popular low-power blade servers by this summer. From the article: 'For Cell to gain wide acceptance, IBM needs to spur outside programmers to write software that takes advantage of Cell's prowess. That could prove more challenging than usual because Cell's architecture is so different.
IBM hopes this summer's release of the Cell-based servers kick-starts work by third-party programmers.'" Also covered in a PCPro article.
Where have I heard this before? (Score:3, Insightful)
Deja vu? [wikipedia.org]
Big Difference Between Itanium and Cell (Score:1)
The Cell processor, on the other hand, offers such a giant increase in performance (for some applications) that you will see people investing time and money to take advantage of it. In addition, with Toshiba, Sony, and IBM all having product plans, and thus the related volume and ecosystem surrounding development tools, etc., I think the Cell is positioned far better than Itanium to succeed.
Re:Big Difference Between Itanium and Cell (Score:4, Funny)
D'oh.
Re:Big Difference Between Itanium and Cell (Score:3, Funny)
-nB
Re:Big Difference Between Itanium and Cell (Score:5, Informative)
The Itanium differs in that it requires instructions to be passed to the CPU as "bundles". Any of the instructions in a bundle can be executed in any order, but they are all from the same application. Thus, to extract speed from the Itanium, the compiler is forced to find parallelism within functions. This is very difficult since most programming is fairly sequential. The Cell, on the other hand, allows you to execute different tasks, and so puts this control back in the hands of the programmer instead of piling extra work on the compiler.
Itanium was (is) a great idea from a compiler-theory perspective, but it doesn't work out all that well (yet) in the real world.
Re:Big Difference Between Itanium and Cell (Score:3, Informative)
So while you do still have to program differently for a cell with 8+1 cores than you would for a computer with 9 Power processors, it's still not like being stuck with just 9 DSPs.
Re:Big Difference Between Itanium and Cell (Score:2, Funny)
"multi-core DSPs" WITH CRIPPLED FPUs!!! (Score:5, Informative)
Actually, the bigger difference is in how the architecture changed. The Cell processor is more along the lines of multi-core DSPs.
Standard computer graphics are RGB color at 24-bits per pixel [2^24 = 16777216], i.e. about 16 million colors.
Standard thinking in the graphics bidness is: if our triangles will only be displayed in 24 bits' worth of color, then why do we need to perform triangle arithmetic in anything higher than maybe 32 bits' worth of floating point?
Hence floating point calculations are 24-bit in the ATi world, and 32-bit in the nVidia and Playstation3/Cell world.
Boy, I hope they're upping that floating-point width for these "server" chipsets, because 32-bit single-precision floats are essentially worthless for even something as trivial as computing interest on a bank statement.
On the other hand, a "Cell" server CPU with a 128-bit FPU would be something to drool over. The problem, though, is that transistor counts on FPUs tend to increase as n^2, so each time you double the FPU bit-count [to 64 bits, then to 128 bits], your transistor count goes through the roof.
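For what it's worth, you can see the single-precision problem from your desk. A quick sketch, using Python's struct module to round-trip values through IEEE 754 single precision (the dollar figures here are made up for illustration):

```python
import struct

def f32(x):
    """Round-trip a Python float through IEEE 754 single precision."""
    return struct.unpack('f', struct.pack('f', x))[0]

balance = 16_777_216.00   # dollars; 2**24 is the edge of float32's exact-integer range
print(f32(balance + 0.01) == f32(balance))   # True -- the cent is silently lost
print(f32(0.10))                             # 0.10000000149011612, not a dime
```

Past 2**24 the float32 spacing exceeds a cent, so a large enough balance literally cannot record interest.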
Re:"multi-core DSPs" WITH CRIPPLED FPUs!!! (Score:2, Insightful)
No floating point involved -- at all...
Now for 3D Graphics, coordinates may be represented in floating point. But during rendering, the values are converted to 8-bit integer values for Red, Green, and Blue components of each pixel.
And financial calculations are computed using INTEGER arithmetic....
A lot of things that might appear to require floating point can often be implemented using integer arithmetic.
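A minimal sketch of the integer approach for the interest example (the rate, the basis-point representation, and the round-half-up rule are all illustrative choices, not anyone's actual banking code):

```python
def add_interest(balance_cents: int, rate_bp: int) -> int:
    """Apply interest given in basis points (1 bp = 1/10000),
    rounding half up to the nearest cent. All integer, so exact."""
    interest = balance_cents * rate_bp          # exact integer product
    return balance_cents + (interest + 5000) // 10000

print(add_interest(1_000_000, 525))  # $10,000.00 at 5.25% -> 1052500 cents
```

No value ever leaves the integers, so there is no representation error to accumulate over millions of accounts.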
Well inform it then. (Score:2)
That has to be one of the most ill-informed comments that I have seen on this site in the past 6 months.
The bulk of the post is approximately three sentences; there's a further addendum which asserts that transistor counts on FPUs do not merely double as the bit count on the floats doubles [rather, the transistor count increases at a much faster rate].
If anything asserted here is factually false, then please take the time to correct it:
Re:Big Difference Between Itanium and Cell (Score:2)
Re:Big Difference Between Itanium and Cell (Score:2)
Yes, and I think that FORTRAN code performs quite well on Itanium. The problem is that C code, with its almost unrestricted use of pointers, doesn't lend itself easily to that sort of optimisation. If you have a chunk of code with lots of pointer references, the compiler will need to make some pretty big deductions about where those pointers could be pointing before it can hope to parallelise anything.
If you can do threads, then you can do Cell (Score:3, Insightful)
If you put the onus on the programmers, this chip won't get widespread acceptance.
If you can write a PC program that uses 10 threads, then you can write a program that uses the Cell processor's PPC and 7 DSPs. Trouble is that most computer science education in universities doesn't cover practical use of threads.
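The parent's claim, sketched in miniature: split one job across a fixed pool of workers, much as you would hand independent chunks to the SPEs. (The thread count and workload here are arbitrary; real Cell code would use the SPE runtime, not OS threads.)

```python
import threading

def worker(chunk, out, i):
    out[i] = sum(x * x for x in chunk)   # each thread squares-and-sums its slice

data = list(range(1000))
n = 8                                    # one chunk per "SPE"
chunks = [data[i::n] for i in range(n)]  # strided split into independent slices
out = [0] * n
threads = [threading.Thread(target=worker, args=(c, out, i))
           for i, c in enumerate(chunks)]
for t in threads: t.start()
for t in threads: t.join()
print(sum(out) == sum(x * x for x in range(1000)))   # True
```

The hard part, as with Cell, is not the thread API; it's carving the problem into slices that really are independent.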
Re:Where have I heard this before? (Score:5, Informative)
Nice quip, but the realities of the situation are completely different. My take on EPIC nee IA-64 when it was first publicly announced was surprise at an architecture that actually encouraged ultra-complex processor control logic. This, when prevailing trends tended to find ways to manage or reduce that complexity, or at least provide unambiguous chip-compiler synergy. Put another way, Intel made design choices that made the hardware itself very challenging to build and properly synergize with a compiler to achieve high total performance. Intel had certainly shown their chops at this sort of high-complexity chip controller design in the x86 line, but the move still seemed brazen from an outsider's perspective. History now shows that they certainly had trouble going down that path...
Cell, however, is basically a bog-stock PowerPC with DSP engines at its disposal. Think Altivec/MMX/SSE type units on steroids. This approach provides computing power that isn't applicable to all tasks, but is generally proven to perform well for applications that require high-performance mathematical processing. Incidentally, that's precisely the target market that IBM's stated they're after with Cell-based servers. Moreover, Cell's scalability model and hardware complexities are much more manageable.
To really leverage Cell's power from the software side will require some or all of 1) good compiler and toolchain support, 2) good library support, and 3) dedicated development effort for the specific application. IBM has the expertise and motivation to provide 1 and 2, and developers in the supercomputing world tend to get really good at 3. When your *highly optimized* supercomputer app may take on the order of a year to run, big emphasis tends to be put on making it run fast. Months of work to save years of time.
It still remains to be seen how this effort will play out in the marketplace, but variants of Cell's basic approach are working right now in many, many devices.
Re:Where have I heard this before? (Score:2)
Re:Where have I heard this before? (Score:2)
You misspelled "HP". Itanium was originally going to be the new PA-RISC chip, (originally named 'PA-WideWord'). HP approached Intel when it became apparent that they wouldn't produce the volume of chips to make it profitable to upgrade their fab (which they would have to do to produce a chip of Itanium's complexity). So, enter Intel ca. 1994. Sun produced a version of Solaris for the new chip, IBM and SCO played together nicely (al
Re:Where have I heard this before? (Score:2)
Actually, the Power PC Unit (PPU) in a Cell is a highly simplified, streamlined PowerPC and nothing at all like the PowerPCs you'll find in a G5 Mac. While it runs at a higher clock rate, it's missing lots of stuff like out-of-order execution and advanced branch prediction, and has a much simpler load-store unit. For example, on Cell there are huge penalties for load-hit-store but on current gen PowerPCs there is a unit
Cell will live long, but Niagara may not. (Score:2, Interesting)
Since the Cell is now integrated into the apparatus of the best-funded military in the world, the Cell will live essentially forever. For the same reason, Ada (i.e. the computer language) will live forever even though few people in industry use the language.
By the way, Cell is also IBM's answer to Sun's Niagara. For years, Sun touted Niagara as a new revolution in computing: Niagara is supposedly the fi
Re:Cell will live long, but Niagara may not. (Score:2)
Noteworthy Information (Score:5, Informative)
Take a peek at http://www.research.ibm.com/cell/patents_and_publications.html [ibm.com] to see the patents and whitepapers for Cell technology. One interesting point is the Online Game Prototype white paper on there.
Sun to use new chips (Score:5, Funny)
Re:Sun to use new chips (Score:1)
Oh wait...
I get it! +1, Funny for you.
Re:Sun to use new chips (Score:1)
Re:Sun to use new chips: DragonBall (Score:2, Informative)
http://www.freescale.com/webapp/sps/site/taxonomy
if you knew this, then fwoosh went the joke over my head
Re:Sun to use new chips: DragonBall (Score:2)
But, with dragonballs, you can resurrect things, right?
Re:Sun to use new chips (Score:2)
Not sure if you were joking, but Dragonball was already used. Motorola uses Dragonball for its 68k embedded line. I never got a Dragonball Z sticker cool enough that I wanted to stick on my Palm IIIxe.
How about a free optimizing compiler (Score:5, Insightful)
It being command-line compatible with (or simply a back-end of) an existing compiler like gcc is even better.
Add a port of a good OS, and your platform is suddenly incredibly attractive to developers.
Re:How about a free optimizing compiler (Score:2)
-Rick
Re:How about a free optimizing compiler (Score:3, Insightful)
Re:How about a free optimizing compiler (Score:5, Insightful)
Get a good compiler and general-purpose OS up and running fast (which, by the way, I'm sure IBM is doing), and you'll see many more people writing special-purpose code where they need it.
Good point. Unfortunately ... (Score:5, Interesting)
In my opinion, this thing will run games well, but that's about it. I've seen two presentations so far by IBM about the Cell processor (at (micro-)architecture conferences). Both times, the question on everybody's mind was "How do you program these things?". The answer was pretty much a hand-wavy "oh hmmm, well, blah blah blah manual"
Re:Good point. Unfortunately ... (Score:2)
Re:Good point. Unfortunately ... (Score:2)
Re:Good point. Unfortunately ... (Score:2)
Re:Good point. Unfortunately ... (Score:3, Informative)
Re:Good point. Unfortunately ... (Score:2)
For scientific computing, I think the Cell's advantages will heavily depend on how many vectorizable loops are in the existing code. For example, in Molecular Dynamics (MD) code, one frequently calculates the forces between each pair of atoms (basically solving a = F/m over and over). Systems with N atoms have N*(N-1)/2 pairs, so when N is large that loop requires significant computation (this excludes cut-off distances and other tricks). MD code is used to (try
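A toy version of that pairwise loop, just to show its shape: O(N^2/2) force evaluations, then a = F/m per atom. The 1-D 1/r^2 repulsion here is made up for illustration; real MD uses proper potentials and the cut-off tricks mentioned above.

```python
def pair_forces(pos, mass):
    """Toy 1-D pairwise repulsion: visit all N*(N-1)/2 pairs, then a = F/m.
    Assumes distinct positions (no divide-by-zero handling)."""
    n = len(pos)
    force = [0.0] * n
    for i in range(n):
        for j in range(i + 1, n):                    # each pair visited once
            r = pos[j] - pos[i]
            f = (1.0 if r > 0 else -1.0) / (r * r)   # made-up 1/r^2 repulsion
            force[i] -= f                            # Newton's third law:
            force[j] += f                            # equal and opposite
    return [fi / m for fi, m in zip(force, mass)]    # accelerations

acc = pair_forces([0.0, 1.0, 3.0], [1.0, 1.0, 2.0])
```

The inner j-loop is exactly the kind of independent, data-parallel arithmetic an SPE is built for, which is why MD codes are frequently cited as a good fit.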
Re:Good point. Unfortunately ... (Score:2)
Re:Good point. Unfortunately ... (Score:2)
Two words, one algorithm: MapReduce [google.com].
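For anyone who hasn't seen it, the shape of that model in miniature: a word count with an explicit map phase and reduce phase (plain Python, no framework; the documents are made up):

```python
from collections import defaultdict
from functools import reduce

docs = ["cell cell blade", "blade server"]

# Map: each document independently emits (key, 1) pairs -- parallelizable.
mapped = [(w, 1) for d in docs for w in d.split()]

# Reduce: merge counts per key.
def merge(acc, kv):
    acc[kv[0]] += kv[1]
    return acc

counts = reduce(merge, mapped, defaultdict(int))
print(dict(counts))   # {'cell': 2, 'blade': 2, 'server': 1}
```

The map phase has exactly the "small, independent chunks" property the grandparent wants, which is why MapReduce keeps coming up in these discussions.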
Re:How about a free optimizing compiler (Score:2, Informative)
Re:How about a free optimizing compiler (Score:4, Informative)
Second, optimizing compilers tend to optimize only small parts of linear code. Simply put, this comes down to filtering binaries and replacing inefficient code sequences with more efficient ones. Depending on the quality of the compiler core, this typically gains a few percent, occasionally some 25%, but that's nowhere near what Cell could offer, namely (theoretically) 800%.
The problem is refactoring the problem to run in
- small chunks,
- independently (parallel)
- and on a specialized processor.
A compiler can help only modestly with the last point. In any non-trivial case, this means reanalyzing the problem and reimplementing the solution from the start, making different tradeoffs. That is why people say Cell is difficult.
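On a trivial case, that refactoring looks like this: the same computation written first as one long dependent loop, then as independent partial results over small chunks, with only a cheap combine step left sequential. (The chunk size is arbitrary; a real problem rarely decomposes this cleanly, which is the whole point of the parent post.)

```python
data = list(range(10000))

# Sequential formulation: one long loop, each step depends on the last.
total = 0
for x in data:
    total += x

# Parallel formulation: independent partial sums over small chunks.
# Each chunk could be shipped to a separate processor; only the final
# combine is inherently sequential.
size = 256
partials = [sum(data[i:i + size]) for i in range(0, len(data), size)]
print(sum(partials) == total)   # True
```

Summation happens to be associative, so the split is valid; most real algorithms need actual redesign before such a split exists at all.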
IMHO, the benefits of code optimization will be close to irrelevant for almost any successful application on Cell over the coming years. And while Moore's law has provided us with bigger and faster hardware, we programmers are still mostly empty-handed when it comes to program translation for parallel architectures.
We need a paradigm shift, not an optimizing compiler.
Here you go. (Score:2)
Yellow Dog Linux runs on Cell. (Link [linuxdevices.com]; this is the same military product that is linked to in a Register article further up in the thread.) It's being marketed for semi-embedded uses, like in medical imaging systems, sonar and radar, etc., apparently.
Free Optimizing Compiler:
I have no idea whether there are any compiler optimizations for it in GCC, I suspect not, though. However there is a version of the IBM XL C compiler for it, available here [ibm.com] (no idea if registration is required, I didn't attempt to
Linux on Cell (Score:5, Insightful)
Anyone know of any specific server apps?
cell cells itself (Score:3, Funny)
She said, "Come on, juh know jouwant it!"
Sun has 'em beat (Score:5, Interesting)
I'm not quite sure what IBM is planning to do, but Sun has started a contest [java.net] to see who can build the coolest program that takes advantage of their new Coolthreads technology. The prize is a cool $50,000, so Sun seems to be serious about this. The results of the contest may very well prove whether the new parallel technologies have a future or not.
Re:Sun has 'em beat (Score:2, Insightful)
If Sun were really serious, they'd put a $500,000 team on it to develop something themselves. Paying for 1/3 - 1/2 a man-year of development is not that serious.
Re:Sun has 'em beat (Score:2, Interesting)
Re:Sun has 'em beat (Score:2)
Re:Sun has 'em beat (Score:4, Informative)
Yes. A Cell's SPUs are not PowerPC processors, so you can't run the same code on the PowerPC front end as you do on the SPUs. Not only that, but Cell and Niagara are designed for totally different things. Cell is designed for floating-point intensive apps with pretty poor general purpose capabilities, while a Niagara has 1 floating point unit shared between all 8 cores and 32 threads, but they're all good at the branchy sort of thing servers usually run.
I think these Cell servers will be more useful for things like render farms. They'll be essentially useless as generic servers for web or database duty.
Azureus or another BitTorrent program (Score:2)
Cell and T1 not targeting the same space (Score:2, Informative)
The Cell is designed for image processing and other high-volume number crunching.
The design decisions both companies made were heavily influenced by their target markets for these specific processors, and those target markets are very different.
These are apples and oranges.
Who woulda thunk it? (Score:4, Funny)
Your organs are specialized, too. (Score:5, Interesting)
It's fun to bash the Cell as a general purpose CPU when no one has actually suggested it's designed for that.
All of the above being true, it remains to be seen what gains IBM's POWER/Cell system actually offers above present architectures -- RISC was the next big thing, too, until Intel internalized part of it into the x86 architecture.
Flyover landscape graphics demos are a shopworn rabbit pulled out of a threadbare hat: convert fractals into craggy vertical displacements with extremely primitive lighting/mapping. Show me an architecture that can *realtime* render Incredibles-caliber cloth/hair simulations and I'll get a hard-on while ATI and nVidia executives slit their wrists.
Re:Your organs are specialized, too. (Score:2)
You mean specialized processors like FPUs, 3d audio accelerators, 3d video accelerators (and the sub-processing units contained in video accelerators), encryption and TCP offload engines, WinModems, MPEG encoder/decoders, and platform management controllers?
Yeah, they'll have a real hard time adjusting... In 1982.
Re:Your organs are specialized, too. (Score:2)
WinModems made the processor do the real work; they were the cheap, crappy ones that sucked as a modem.
Re:Your organs are specialized, too. (Score:3, Informative)
Re:Your organs are specialized, too. (Score:2)
You don't see them, not so much because they're hard (they *are* hard), but because they're made even harder by the shortcuts taken by the typical 2D accelerator. A lot of a scene you see in "real time" on modern video hardware is pre-rendered in the form of texture maps and bump maps. Rather than shift paradigm, we have iteratively increased the number of transformations per frame that can be done on these pre-
Re:Your organs are specialized, too. (Score:2)
Re:Your organs are specialized, too. (Score:2)
Not really, developers have been using co-processors for years -- numeric (a la Weitek or 8087), DSP, odd-wad AI and "dataflow" boxes. And I imagine the early attempts will follow a similar pattern: present the functionality of the co-pro wrapped neatly in a library, then just call the library routines. Presto, your code is automatic
hardware abstraction? (Score:1)
Re:hardware abstraction? (Score:1, Offtopic)
You know, that's a really good idea [wikipedia.org]!
Re:hardware abstraction? (Score:2)
Re:hardware abstraction? (Score:3, Insightful)
Every parallel architecture I've ever programmed for had nice APIs for offloading and directing tasks to the various available processing units. There shouldn't be much 'hand-optimization' involved in the sense you're implying.
Developers who write code that takes advantage of GPUs in modern gaming PCs are already familiar with this style of programming, and the ones that understand the architecture instead of memorizing the APIs or program out of a cook
Re:hardware abstraction? (Score:4, Informative)
But you can probably count on your fingers the number of developers who are using GPUs for anything other than rendering pixels, or at most some simple vectorizable simulations like water or cloth.
Taking an arbitrary program and turning it into something that would run well on a GPU (or a Cell SPU) usually requires a significant redesign of the algorithms and data structures as compared to what you would naively and straightforwardly do in C...or it won't get anywhere near peak performance and may even run slower. It's certainly possible to do, but you won't be re-using any of that originally written code, and it's a different way of thinking from what 95% of programmers are used to. I'm speaking from experience as someone who earns his living by being in the remaining 5%.
As the original poster said: you hand optimize (and design) your program for the cell.
Re:hardware abstraction? (Score:2)
And for good reason. GPUs are designed to render pixels, not do other stuff.
Taking an arbitrary program and turning it into something that would run well on a GPU (or a Cell SPU)
I don't understand why you think I'm saying that those two things are equivalent. Taking an arbitrary program and turning it into something that would run well on a GPU would be unusual. You're talking abou
Re:hardware abstraction? (Score:2)
IBM already has these tools available (Score:2, Informative)
PS3 release date? (Score:5, Insightful)
Re:PS3 release date? (Score:2)
There is always the chance that the RAM, GPU, Blu-Ray drive, or something else would end up in short supply.
Re:PS3 release date? (Score:3)
My guess is finished software.
Re:PS3 release date? (Score:2)
Exciting (Score:2)
Re:Exciting (Score:5, Funny)
i didn't need to know that.
Re:Exciting (Score:2)
I work in blade development. (Score:5, Informative)
Re:I work in blade development. (Score:2)
That's not uncommon for Pentium blades either. The socket increases the width, and that space is better used for cooling.
Re:I work in blade development. (Score:2)
Re:I work in blade development. (Score:2, Interesting)
Actually, they are similar to a number of DSPs and other discrete solutions from the past. For example:
The TMS320DM64x [ti.com] series of DSPs from TI, which has an ARM9 and a number of DSPs on it.
The TMS320DM54x and 55x [ti.com] series of DSPs from TI, which has an ARM7 and a number of DSPs on it.
And a discrete version in the CSPI MAP 1310/11 [vita.com], which had a PPC and multiple multi-core DSP chips on it as early as 1997.
Smaller blade chassis? (Score:3, Informative)
This doesn't mean make a desktop out of a blade, because as I understand it, so far the JS20s (IBM's PPC 970 blade) don't even have video cards. You have to set them up over the serial port, and run them over the network.
But does anybody have a development sized unit you don't need a server rack and new power circuits for?
Re:Smaller blade chassis? (Score:3, Informative)
Re:Smaller blade chassis? (Score:2)
Well, that's not useful...
I guess it's time to start a blade chassis case mod
Re:Smaller blade chassis? (Score:2)
Re:Smaller blade chassis? (Score:2)
My ideal "blade" system mounts in standard 19" racks. 2 or 3 complete systems in 1U, 48VDC powered from another 1U transformer. Let me stack 1-8 of these in my rack and don't make me pay for the damn IBM/HP/etc chassis.
My system also has no storage inside the blades. Just give me 4 network interfaces per "blade", with at least 2 optionally capable of providing
Wow (Score:2)
But wait (Score:4, Funny)
*looks bright*
Why SPEs? (Score:4, Interesting)
Why didn't IBM just pack in a lesser number of PPEs? The PPE already seems to be a very lightweight general purpose processing core, unless I'm missing something. It is about the same size as an SPE. So why not just put 9 PPEs on a Cell chip instead of 1 PPE and 8 SPEs?
If you had 9 PPEs on the chip, any multithreaded code (servers for example) would see massive benefits without having to rewrite it to try to find aspects of the program that could run on what is effectively a DSP. While everybody else was fooling around with 2-core processors, they'd have a 9-core processor on the market. Sure, slower per-core, but 9 of them, with that number going up in the future.
Or am I missing something here?
Re:Why SPEs? (Score:2)
These are not for general purpose computing; that is what the Power5 and the Power6 will be for. Think DSP, render farms, or simulation, not web or database servers.
You could create a system with a Power5 blade to do database and general purpose type stuff and have that feed multiple Cell blades to do rendering and or DSP.
A render farm jumps to mind but I could see it being used for military functions like Sigint, Radar, and Sonar or any number of scientific simulations.
Not every com
compute per silicon-area/watt/$ (Score:4, Informative)
Re:compute per silicon-area/watt/$ (Score:2)
I refer you to this image:
http://images.anandtech.com/reviews/cpu/cell/ppehighlight.jpg [anandtech.com]
Perhaps you mean the PPE and its supporting hardware, such as the cache? That'd ideally be shared among multiple PPEs.
If you look closely at the PPEs, a huge amount of their real estate seems to go to what looks like their 256KB of cache. Cache takes up a lot of space. Since the PPEs wouldn't each have dedicated cache, they're stil
Re:Why SPEs? (Score:2)
That's my point. While I do not deny that the raw number-crunching power of the SPEs is useful for certain tasks, I do think that their uses are fairly limited.
Take the PS3 for example. From what developers are saying, the vast majority of what they have to do is limited to the single PPE. They have managed to find a few things that can run on the SPEs, but not many. Physics engines are one th
Not until (Score:2)
Nobody would be crazy enough to do that!
Re:Not until (Score:2)
Two Tutorials (Score:3, Interesting)
1. A Cell program that solves linear equations Ax=b efficiently using the SPEs. This would help those with data-intensive problems.
2. A Cell program that speeds up depth-first search (a la SAT, GRAPH COLORING, MAX-CLIQUE) by using the SPEs. This would help those programming CPU-intensive problems.
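For tutorial 1, the scalar baseline such a tutorial would start from might look like this: naive Gaussian elimination with partial pivoting. The SPE version would vectorize the inner row-update loop; the 2x2 system here is purely illustrative.

```python
def solve(A, b):
    """Solve Ax = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [bi] for row, bi in zip(A, b)]          # augmented matrix
    for k in range(n):
        p = max(range(k, n), key=lambda i: abs(M[i][k]))  # pick pivot row
        M[k], M[p] = M[p], M[k]
        for i in range(k + 1, n):
            f = M[i][k] / M[k][k]
            for j in range(k, n + 1):                     # vectorizable inner loop
                M[i][j] -= f * M[k][j]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):                        # back substitution
        x[i] = (M[i][n] - sum(M[i][j] * x[j]
                              for j in range(i + 1, n))) / M[i][i]
    return x

x = solve([[2.0, 1.0], [1.0, 3.0]], [3.0, 5.0])   # 2x+y=3, x+3y=5
```

The row updates for a given pivot are independent of each other, which is what makes this loop nest a natural demo for the SPEs.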
Tutorials 3 and 4 (Score:2, Interesting)
There were a couple that would be really helpful:
1. An implementation of zlib for the SPE architecture, with a speed comparison to the PPE. (Hopefully, the SPE is very fast...)
2. Examples of direct SPE-to-SPE streaming.
Give me an ATX board (Score:4, Insightful)
Sold for under $100, and they're making money off it while spreading the love that will increase the developer market for the Cell architecture.
It goes like this. Make a new architecture. Release a good compiler for free, with awesome documentation and sample programs and libraries. Allow people to buy evaluation boards for low prices. Once you get people hooked enough, sell the chips themselves at high prices. It's the Microchip (tm) model. Their chips don't really do much for the high costs (compared to Atmel, TI, etc.), but since everyone knows how to work them, they sell sell sell. Rabbit Semiconductor, however, is trying hard to get into the market, and their dev tools are cheap. It'll take time.
IBM can't release a couple of PDFs and one tough software suite and expect the world to jump on it. There's a reason why there's so much momentum behind the Power architecture, and the Cell is different.
Nice names (Score:2)
Re:Increasingly Popular?! (Score:2)
Re:Increasingly Popular?! (Score:2)
Thanks.
Re:Increasingly Popular?! (Score:4, Funny)
Re:Increasingly Popular?! (Score:2)
Re:IBM and OSS (Score:2)