
IBM Releases Cell SDK

derek_farn writes "IBM has released an SDK running under Fedora Core 4 for the Cell Broadband Engine (CBE) processor. The software includes many GNU tools, but the underlying compiler does not appear to be GNU-based. For those keen to start running programs before they get their hands on actual hardware, a full system simulator is available. The minimum system requirement specification has obviously not been written by the marketing department: 'Processor - x86 or x86-64; anything under 2GHz or so will be slow to the point of being unusable.'"
  • by Anonymous Coward on Thursday November 10, 2005 @12:39PM (#13998526)
    Yup, it is.
  • by frankie ( 91710 ) on Thursday November 10, 2005 @12:41PM (#13998540) Journal
    ...the Cell processor is an upcoming PowerPC variant that will be used in the PlayStation 3. It's great at DSP but terrible at branch prediction, and would not make a very good Mac. If you want to know full tech specs, Hannibal is da man [arstechnica.com].
  • by stienman ( 51024 ) <adavis&ubasics,com> on Thursday November 10, 2005 @12:52PM (#13998654) Homepage Journal
    The Cell processor is essentially a multi-core chip. It has, IIRC, one "master" CPU, and then multiple slave CPUs on the same die.

    A modern desktop computer has one master CPU, then several smaller CPUs each running their own software. Graphics, Sound, CD/DVD, HD, not to mention all the CPUs in all the peripherals.

    But the analogy ends there. The Cell has certain limitations and wouldn't be able to operate very efficiently as a full computer system without other processors. I believe the PS3 has a separate GPU, for instance. And it doubtless has many other microcontrollers managing the rest of the system.

    -Adam
  • by plalonde2 ( 527372 ) on Thursday November 10, 2005 @12:55PM (#13998686)
    You are wrong. These SIMD processors do loops just fine. There's a hefty hit for a mis-predicted branch, but the branch hint instruction works wonders for loops.

    The reason you want to unroll loops is because of various other delays. If it takes 7 cycles to load from the local store to a register, you want to throw a few more operations in there to fill the stall slots. Unrolling can provide those operations, as well as reduce the relative importance of branch overheads.
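
    A minimal C sketch of the branch-hint idea above (this leans on GCC's __builtin_expect; whether spu-gcc turns this particular hint into the SPE's branch-hint instruction is an assumption, not something the SDK promises):

    #define LIKELY(x) __builtin_expect(!!(x), 1)

    /* Sum n floats (assumes n > 0); the loop-back branch is hinted
       as almost always taken, which is exactly the well-predicted
       back-edge the parent comment describes. */
    float sum(const float *data, int n)
    {
        float acc = 0.0f;
        int i = 0;
        do {
            acc += data[i++];
        } while (LIKELY(i < n));
        return acc;
    }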

  • by morgan_greywolf ( 835522 ) on Thursday November 10, 2005 @01:15PM (#13998895) Homepage Journal
    That looks more like syntactic sugar to me. How is that different? More importantly, how would that translate differently into assembler code? You pretty much will wind up with the same thing, that is: "do your thang, increment the accumulator, if the accumulator equals the count, jump to do your thang."

    gcc and other compilers have options such as -funroll-loops, which will unroll loops (no matter how they were specified) for you if the count can be determined at compile time. So you wind up with "Do your thang, do your thang, do your thang, do your thang ... Do your thang". You get the idea.
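
    For illustration, a hypothetical loop whose trip count is known at compile time; building with gcc -O2 -funroll-loops -S and reading the generated assembly shows the body replicated and the counter gone:

    /* Constant trip count: the compiler can emit
       "do your thang" sixteen times with no branch at all. */
    int sum16(const int *data)
    {
        int acc = 0;
        for (int i = 0; i < 16; i++)
            acc += data[i];
        return acc;
    }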

  • Re:GNU toolchain (Score:4, Informative)

    by Have Blue ( 616 ) on Thursday November 10, 2005 @01:24PM (#13999002) Homepage
    IBM may have run into the same problems with the Cell that they did with the PowerPC 970: the chip breaks some fundamental assumptions GCC makes, and to add the best optimization possible it would be necessary to modify the compiler more drastically than the GCC leads would allow (to keep GCC completely platform-agnostic).
  • by ashSlash ( 96551 ) on Thursday November 10, 2005 @01:27PM (#13999028)
    It's "per se".
  • by tomstdenis ( 446163 ) <tomstdenis AT gmail DOT com> on Thursday November 10, 2005 @01:31PM (#13999075) Homepage
    GCC can unroll all loops if you want, including those with variable iteration counts. In those cases it uses a variant of Duff's device [well, on x86 anyway].

    As for the other posters, the real reason you want to unroll loops is basically to avoid the cost of managing the loop, e.g.

    a simple loop like

    for (a = i = 0; i < b; i++) a += data[i];

    On x86 this would amount to:

    mov ecx,b
    loop:
    add eax,[ebx]
    add ebx,4
    dec ecx
    jnz loop

    So you have 50% efficiency at best. Now if you unroll it to:

    mov ecx,b
    shr ecx,1
    loop:
    add eax,[ebx]
    add eax,[ebx+4]
    add ebx,8
    dec ecx
    jnz loop

    You now have 5 instructions for two iterations. That's down from the 8 you would have had before, and so on, e.g.

    mov ecx,b
    shr ecx,2
    loop:
    add eax,[ebx]
    add eax,[ebx+4]
    add eax,[ebx+8]
    add eax,[ebx+12]
    add ebx,16
    dec ecx
    jnz loop

    That does 7 opcodes for 4 iterations [down from the 16 required previously, i.e. more than twice as efficient].

    Tom
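
    For reference, a sketch of the Duff's device variant mentioned above: the switch jumps into the middle of the unrolled body so that iteration counts which are not a multiple of the unroll factor still come out right (classic idiom; assumes count > 0):

    void copy8(int *to, const int *from, int count)
    {
        int n = (count + 7) / 8;          /* number of 8-way passes */
        switch (count % 8) {
        case 0: do { *to++ = *from++;     /* fall through all eight copies */
        case 7:      *to++ = *from++;
        case 6:      *to++ = *from++;
        case 5:      *to++ = *from++;
        case 4:      *to++ = *from++;
        case 3:      *to++ = *from++;
        case 2:      *to++ = *from++;
        case 1:      *to++ = *from++;
                } while (--n > 0);
        }
    }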
  • Re:GNU toolchain (Score:4, Informative)

    by Wesley Felter ( 138342 ) <wesley@felter.org> on Thursday November 10, 2005 @01:32PM (#13999098) Homepage
    The SDK includes both GCC and XLC. GCC's autovectorization isn't the greatest, but Apple and IBM have been working on it. I think if you want fast SPE code you'll end up using intrinsics anyway.
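
    As a rough sketch of what intrinsics-based SPE code looks like (vec_float4, spu_splats, and spu_madd are the names documented for the SDK's spu_intrinsics.h; treat the details as illustrative rather than verbatim SDK sample code):

    #include <spu_intrinsics.h>

    /* y[i] = a * x[i] + y[i], four floats per operation */
    void saxpy(vec_float4 *y, const vec_float4 *x, float a, int nvec)
    {
        vec_float4 va = spu_splats(a);        /* replicate a into all 4 lanes */
        for (int i = 0; i < nvec; i++)
            y[i] = spu_madd(va, x[i], y[i]);  /* fused multiply-add */
    }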
  • Cell Hardware... (Score:4, Informative)

    by GoatSucker ( 781403 ) on Thursday November 10, 2005 @01:35PM (#13999133)
    From the article:
    How does one get a hold of a real CBE-based system now? It is not easy: Cell reference and other systems are not expected to ship in volume until spring 2006 at the earliest. In the meantime, one can contact the right people within IBM [ibm.com] to inquire about early access.

    By the end of Q1 2006 (or thereabouts), we expect to see shipments of Mercury Computer Systems' Dual Cell-Based Blades [mc.com]; Toshiba's comprehensive Cell Reference Set development platform [toshiba.co.jp]; and of course the Sony PlayStation 3 [gamespot.com].

  • by Animats ( 122034 ) on Thursday November 10, 2005 @02:25PM (#13999729) Homepage
    The "cell" processors have fast access to local, unshared memory, and slow access to global memory. That's the defining property of the architecture. You have to design your "cell" program around that limitation. Most memory usage must be in local memory. Local memory is fast, but not large, perhaps as little as 128KB per processor.

    The cell processors can do DMA to and from main memory while computing. As IBM puts it, "The most productive SPE memory-access model appears to be the one in which a list (such as a scatter-gather list) of DMA transfers is constructed in an SPE's local store so that the SPE's DMA controller can process the list asynchronously while the SPE operates on previously transferred data." So the cell processors basically have to be used as pipeline elements in a messaging system.

    That's a tough design constraint. It's fine for low-interaction problems like cryptanalysis. It's OK for signal processing. It may or may not be good for rendering; the cell processors don't have enough memory to store a whole frame, or even a big chunk of one.

    This is actually an old supercomputer design trick. In the supercomputer world, it was not too successful; look up the nCube and the BBN Butterfly, both of which were a bunch of non-shared-memory machines tied to a control CPU. But the problem was that those machines were intended for heavy number-crunching on big problems, and those problems didn't break up well.

    The closest machine architecturally to the "cell" processor is the Sony PS2. The PS2 is basically a rather slow general purpose CPU and two fast vector units. Initial programmer reaction to the PS2 was quite negative, and early games weren't very good. It took about two years before people figured out how to program the beast effectively. It was worth it because there were enough PS2s in the world to justify the programming headaches.

    The small memory per cell processor is going to be a big hassle for rendering. GPUs today let the pixel processors get at the frame buffer, dealing with the latency problem by having lots of pixel processors. The PS2 has a GS unit which owns the frame buffer and does the per-pixel updates. It looks like the cell architecture must do all frame buffer operations in the main CPU, which will bottleneck the graphics pipeline. For the "cell" scheme to succeed in graphics, there's going to have to be some kind of pixel-level GPU bolted on somewhere.

    It's not really clear what the "cell" processors are for. They're fine for audio processing, but seem to be overkill for that alone. The memory limitations make them underpowered for rendering. And they're a pain to program for more general applications. Multicore shared-memory multiprocessors with good caching look like a better bet.

    Read the cell architecture manual. [ibm.com]
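
    A sketch of the double-buffered DMA pattern IBM describes above, using the MFC calls from the SDK's spu_mfcio.h (mfc_get, mfc_write_tag_mask, mfc_read_tag_status_all); the chunk size and the process() function are made up for illustration:

    #include <spu_mfcio.h>

    #define CHUNK 4096

    static char buf[2][CHUNK] __attribute__((aligned(128)));

    extern void process(char *chunk, int size);   /* hypothetical work function */

    void stream(unsigned long long ea, int nchunks)
    {
        int cur = 0, i;
        mfc_get(buf[cur], ea, CHUNK, cur, 0, 0);      /* kick off the first transfer */
        for (i = 0; i < nchunks; i++) {
            int next = cur ^ 1;
            if (i + 1 < nchunks)                      /* prefetch the next chunk */
                mfc_get(buf[next], ea + (i + 1) * (unsigned long long)CHUNK,
                        CHUNK, next, 0, 0);
            mfc_write_tag_mask(1 << cur);             /* wait only on the current tag */
            mfc_read_tag_status_all();
            process(buf[cur], CHUNK);                 /* compute while the next DMA runs */
            cur = next;
        }
    }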

  • by Anonymous Coward on Thursday November 10, 2005 @02:36PM (#13999882)
    Actually, it's 256KB, one for each SPU. That's 2MB total. Not bad if you ask me.
  • by hr raattgift ( 249975 ) on Thursday November 10, 2005 @03:59PM (#14000905)
    Perhaps something like writing in tail recursive style to help out an optimising compiler?...


    You have this backwards. Optimizing compilers will turn tail-recursive style source into "normal" loops.

    You can write a loop recursively, so that:
    int foo() {
        int x = 8;
        int b = 1;
        while (x > 0) {
            b <<= 1;
            --x;
        }
        return b;
    }
    becomes
    int foo_helper(int x, int b);

    int foo() {
        return foo_helper(8, 1);
    }
    int foo_helper(int x, int b) {
        if (x <= 0)
            return b;
        else
            return foo_helper(x - 1, b << 1);
    }
    Recursion in foo_helper is in the tail position. That is, foo_helper only calls itself as the final operation before returning.

    Compiling this naively involves a function call per recursion, which on most architectures results in pushing data onto the stack. However, because we are doing tail-recursion, we can do a tail call elimination optimization.

    How this works is that the "return" before the recursion is taken to mean that any automatic variables are dead, any stack space used for the arguments is reusable, and the recursive call is really a jump.

    That is, when foo_helper calls itself, it really does an argument rewrite and jump, which in effect "pretends" that foo_helper was called with different arguments in the first place.

    In other words, tail call elimination turns recursive loops into iterative loops.

    Writing in "tail-recursive style" just means making sure your recursion is done in tail position (i.e., attached to a "return"). Some compilers for a variety of languages can identify recursion which is not done in the tail position, and reorder the recursion into tail position (and then the tail calls are eliminated into iterative loops). However, many compilers can't, and many more don't do tail-call elimination at all. :-(

    Once you've optimized recursive loops into iterative ones, you can optimize iterative loops however you like, including partially or fully unrolling them.

    In summary, recursion is a way of looping, but function calls are not free. In particular, they usually consume stack space. If you only return the result of your recursion, then you are tail-recursing. Tail recursion can be turned into code which does not incur function-call overhead.
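
    To make "argument rewrite and jump" concrete, here is roughly what tail call elimination turns foo_helper into (a sketch, not any particular compiler's output):

    int foo_helper(int x, int b)
    {
        for (;;) {                /* the recursive call became a jump to the top */
            if (x <= 0)
                return b;
            b = b << 1;           /* rewrite the arguments in place... */
            x = x - 1;            /* ...instead of pushing a new stack frame */
        }
    }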

  • by taracta ( 217357 ) on Thursday November 10, 2005 @04:03PM (#14000969)
    I think too much emphasis is being placed on "slow" access to system memory for the Cell processor, when it is "slow" only relative to the SPUs' local memory. Please remember that system memory for the Cell is about 8 times faster than the memory in today's high-end PCs, with lower latency. XDR is by far the best memory type available; unfortunately, nobody likes Rambus the company. So when you are talking about access to system memory, please keep in mind that the Cell processor has about the same memory bandwidth as top-of-the-line graphics cards, and probably lower latency. Don't you wish your PC had the bandwidth of a top-of-the-line graphics card?
  • by frostfreek ( 647009 ) on Thursday November 10, 2005 @04:19PM (#14001166)
    > It's not really clear...

    There was a Toshiba demo showing 8 Cells: 6 used to decode forty-eight HDTV MPEG-4 streams simultaneously, 1 for scaling the results to display, and one left over. A spare, I guess?

    This reminds me of the Texas Instruments 320C80 processor: 1 RISC general-purpose CPU plus four DSP-oriented CPUs, each with a 4KB chunk of on-chip memory. 256KB would be fantastic after the experience of programming for the C80, and will be plenty of memory to work on a tile of framebuffer.

    1. DMA tile -> local RAM
    2. render to local...
    3. ???
    4. Profit!

    Whoops, where was I going with that, again?

  • Not a PPC Processor (Score:2, Informative)

    by MJOverkill ( 648024 ) on Thursday November 10, 2005 @04:36PM (#14001356)

    Once again, the Cell is not a PPC processor. It is not PPC-based. The Cell going into the PlayStation 3 has a POWER-based PPE (Power Processing Element) that is used as a controller, not a main system processor. Releasing an SDK for Macs would not give any advantage over an x86-based SDK because you are still emulating another platform.

    Wiki [wikipedia.org]

  • by Anonymous Coward on Thursday November 10, 2005 @06:14PM (#14002394)
    "The PS3 has a GPU from nVidia in it - the Cell won't be doing the rendering itself, so it's free to do things like AI and physics calculations."

    WTF?

    Just what the world needs, another clown from the peecee world talking about the PS3.

    There is no 'GPU' in the PS3. The entire Cell+RSX unit is used to render. The RSX would be best described as the PS3's rasterizer, but even that isn't entirely accurate, since a large amount of the pixel painting/modification is done by the SPEs. Physics and graphics data is unified and processed on the Cell side of the system, although the RSX does have vertex transform capabilities itself.

    PS3 rendering is best described as a hybrid rendering system where rendering is load balanced between the internal components on the fly depending on the unique characteristics of the scene and world data being processed.

    So, no, there isn't a NVidia 'GPU' in the PS3...

  • by Wesley Felter ( 138342 ) <wesley@felter.org> on Thursday November 10, 2005 @09:30PM (#14004040) Homepage
    "Power Architecture" is PowerPC.

    What is Power Architecture technology? [ibm.com]

    "Power Architecture is an umbrella term for the PowerPC® and POWER4(TM) and POWER5(TM) processors produced by IBM, as well as PowerPC processors from other suppliers."
  • by pbohrer ( 930124 ) <pbohrer.us@ibm@com> on Thursday November 10, 2005 @10:18PM (#14004312)
    The simulator is actually maintained on a number of different platforms within IBM. Since the rest of the SDK team (xlc, cross-dev gcc, samples & libs, etc.) chose Fedora Core 4 on x86 as a means of enabling the largest number of people, we didn't want to confuse people by supplying the simulator on a variety of platforms for which the rest of the SDK is not supported. This was somewhat of a big-bang release of quite a bit of software to enable exploration of Cell. Now that we have this released and the open source side of the SDK is available on the web, I am sure people will have no problem adapting that build environment to be hosted on Linux/PPC. In support of that, we will be providing a Linux/PPC version of the Cell simulator soon on alphaWorks.
  • by Animats ( 122034 ) on Saturday November 12, 2005 @05:49PM (#14016995) Homepage
    That's not what Sony is saying:

    SCEA press release:

    SONY COMPUTER ENTERTAINMENT INC. AND NVIDIA ANNOUNCE JOINT GPU DEVELOPMENT FOR SCEI'S NEXT-GENERATION COMPUTER ENTERTAINMENT SYSTEM [playstation.com].

    TOKYO and SANTA CLARA, CA
    DECEMBER 7, 2004
    "Sony Computer Entertainment Inc. (SCEI) and NVIDIA Corporation (Nasdaq: NVDA) today announced that the companies have been collaborating on bringing advanced graphics technology and computer entertainment technology to SCEI's highly anticipated next-generation computer entertainment system. Both companies are jointly developing a custom graphics processing unit (GPU) incorporating NVIDIA's next-generation GeForce(TM) and SCEI's system solutions for next-generation computer entertainment systems featuring the Cell* processor".
