Hardware

Cray SV1 Named Best Supercomputer for 2001

zoombat writes "The BBC reported that the Cray SV1 product line won the Readers' Choice Award for Best Supercomputer for 2001 by the readers of Scientific Computing & Instrumentation magazine. These beasts have some pretty remarkable stats, including a 300 MHz CPU clock, up to 192 4.8 GFLOPS CPUs or 1229 1.2 GFLOPS CPUs, and up to a terabyte of memory. And they sure know how to paint 'em real nice. Of course, we all know how "scientific" the Readers' Choice Awards are..."
This discussion has been archived. No new comments can be posted.


  • Re:No. (Score:2, Informative)

    by the gnat ( 153162 ) on Sunday August 12, 2001 @09:34AM (#2109985)
    I won't waste my breath.

    I will. Crays are vector supercomputers, which is something entirely different from your garden-variety Intel or RISC chip. There are several different types of computer you need to consider in the sort of comparison you're making:

    - Vector supercomputers. This includes Cray, and some by Fujitsu and Hitachi (perhaps NEC as well, but I think those are MIPS-based).
    - Massively parallel shared-memory supercomputers. The IBM SP2 and SGI Origin 2000/3000 come to mind. You take two of these, plug them into each other, and get one computer twice the size with (I think) virtually no loss of bandwidth. I'm pretty sure these can also be connected just for high-bandwidth communications, but the real advantage is in shared memory. Cray makes these too, and SGI's MPPs are largely based on Cray technology (hence the "CrayLink" on Origins).
    - Distributed computers. Beowulf is just a set of patches (primarily to Linux) to make distributed-memory programming easier (e.g. utilizing multiple ethernet cards for higher bandwidth). You still have to write programs specially to take advantage of the machine.

    The difference lies primarily in programming techniques. You cannot run a simple multithreaded program that would saturate an SP2 or Origin on a Beowulf cluster. You'd have to rewrite it with PVM or something. PVM is not difficult, but it's not transparent (a rough message-passing sketch follows at the end of this comment). Some Fortran 90 compilers will do automatic parallelization, but not for a distributed-memory system.

    Basically, there's a hell of a lot more difference between a Cray and a Beowulf cluster than between the Beowulf cluster and the SETI@home network.

    -Nat
    (disclaimer: I am not a supercomputer programmer, but a lot of the people I work with are. I do know something about parallel code, however.)
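    A minimal sketch of the distributed-memory, message-passing style mentioned above, using MPI rather than PVM purely for illustration (the MPI calls are standard; the work loop, sizes, and names are made up):

    /* minimal sketch: distributed-memory sum with explicit message passing.
       MPI is used here instead of PVM purely as an illustration. */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        int rank, size;
        double local = 0.0, global = 0.0;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        /* each node computes a partial sum out of its own local memory... */
        for (int i = rank; i < 1000000; i += size)
            local += (double)i;

        /* ...then the partial results are combined by explicit communication;
           nothing is shared, everything that crosses nodes must be sent */
        MPI_Reduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

        if (rank == 0) printf("sum = %f\n", global);
        MPI_Finalize();
        return 0;
    }

    On a shared-memory machine the same loop could simply be multithreaded over one address space; here every value that crosses a node boundary has to be communicated explicitly, which is exactly the programming difference being described.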
  • by Nater ( 15229 ) on Sunday August 12, 2001 @07:39AM (#2113677) Homepage
    Ye olde 8086 is much like the canonical 1 cycle = 1 instruction CPU that you described. Since the minimum number of transistors needed to execute an instruction is pretty much fixed (but occasionally somebody somewhere figures out a way to reduce the number by a few), and the amount of time it takes for the signals to pass through a sequence of transistors is basically fixed (although better materials and smaller transistors can improve this), a 1 cycle = 1 instruction design really just isn't capable of running at a high clock speed (MHz).

    There are several ways to improve speed. The direction Intel went with their chips (and many other vendors as well) is pipelining. Pipelining is when you take that fixed number of transistors and break it into groups based on when they do their work. A 2-stage pipeline is one where the instruction logic is separated into two steps. A 3-stage pipeline is three steps, and so on. A sequence of four instructions in a 3-stage pipeline executes like this:

    1) The first instruction is loaded and executed in the first stage (one clock cycle)

    2) The next instruction is loaded and executed in the first stage while the first instruction is executed in the second stage (one clock cycle)

    3) The third instruction executes in the first stage, the second instruction executes in the second stage, and the first instruction executes in the third stage (one clock cycle)

    4) The fourth instruction executes in the first stage, the third instruction executes in the second stage, and the second instruction executes in the third stage (one clock cycle)

    5) The fourth instruction executes in the second stage and the third instruction executes in the third stage (one clock cycle)

    6) The fourth instruction executes in the third stage (one clock cycle)

    So, as you can see, once the pipeline is filled, one instruction completes every clock cycle, but each instruction takes three cycles to complete (a tiny cycle-count sketch follows this comment). Neat trick, eh? There are a lot of hairy details to take care of between stages, and pipelined processors can get very complicated very fast, particularly if you're trying to implement an instruction set that wasn't designed for a pipelined architecture (e.g. the x86 instruction set).

    Cray went a different way. A Cray processor uses vector instructions to process a lot of data with one instruction. Compare this to the pipeline, where multiple instructions are in progress during any single clock cycle. A vector processor, on the other hand, has large sets of registers which are referenced as a vector, and has instructions that can fill an entire vector from a particular chunk of memory, add two vectors and store the results in a third, multiply, divide, negate, whatever, a vector at a time. And then of course there is an instruction to store the contents of a vector into a particular chunk of memory.

    Pipelining has the marketing advantage that if you make your pipeline long enough (the Pentium 4 is a 20-stage pipeline) then the stages take less time to execute and you can bump up the clock speed.

    Vector architecture does not have this marketing advantage, but vector machines are historically superior for certain applications and data sets (like modeling meteorological data).
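    A tiny cycle-count sketch of the walkthrough above, under the idealized assumption that a k-stage pipeline with no stalls finishes n instructions in (k - 1) + n cycles (fill once, then one completes per cycle); the numbers printed come from that toy model, not from measurements:

    /* idealized pipeline model: fill cost of (stages - 1) cycles, then one
       instruction completes per cycle. real pipelines stall on hazards,
       branches, and memory, so this is only the best case. */
    #include <stdio.h>

    static long pipelined_cycles(long stages, long instructions) {
        return (stages - 1) + instructions;
    }

    int main(void) {
        /* 3 stages, 4 instructions: 6 cycles, matching steps 1-6 above */
        printf("3-stage, 4 instructions:     %ld cycles\n", pipelined_cycles(3, 4));
        /* a deep pipeline barely changes throughput when nothing goes wrong */
        printf("20-stage, 1000 instructions: %ld cycles\n", pipelined_cycles(20, 1000));
        return 0;
    }

    The same arithmetic also shows why a flush hurts a deep pipeline: every misprediction pays the fill cost all over again.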
  • No. (Score:2, Informative)

    by KupekKupoppo ( 266229 ) on Sunday August 12, 2001 @04:01AM (#2118811)
    As fun as it is to try to tie everything to Beowulf clusters, it's not applicable and not necessary to bring up with every post. FWIW, not all tasks lend themselves well to being done in a distributed environment. Of course, that's been mentioned a few thousand times here before, so I won't waste my breath.
  • ... a 300 Mhz CPU clock, up to 192 4.8 GFLOPS CPUs or 1229 1.2 GFLOPS CPUs, and up to a terabyte of memory...

    That's a lot of GFLOPS :-), and a LOT of RAM.
    I'm not an expert in CPUs, but I've picked up a few things that may help you.

    There are several ways of making a CPU fast. You can (the very popular way) increase the clock frequency, thus doing more operations per second. One clock cycle roughly equals one "cpu instruction" (sometimes instructions take more than one cycle, depending on what kind they are). This is the popular way to make a CPU sellable; inexperienced PC buyers sometimes simply focus on "How many MHz does this hard drive have?" :-)
    The second way is closely connected to this: simply execute more than one instruction per clock cycle. This is working in parallel, a more complicated solution that helps in some types of operations, but not others. Some problems are not good for parallelizing.
    A CPU has something called a pipeline [some have more than one, i.e. parallel processing]; you can compare it to an assembly line in a modern factory. More pipes = parallel computing. For some reason, a short pipe [fewer steps until done] gives faster execution but lower clock frequencies, maybe because of heat or something. Could anyone fill me in here? Anyhow, a cpu like the G4 [motorola/apple] has a rather short pipe, 4 or 5 steps. The P4 [intel] has a rather long one, 20 or so. This is why a G4 doesn't reach the same MHz as the P4, but still can compete in raw computing power.

    You can also increase performance in a CPU by adding special instruction sets the programmer can call, and then optimizing those instruction sets. The Pentium++, for example, is a rather simple processor wrapped in a huge number of add-on instruction sets, like MMX, SSE, SSE2 (and many many more). The wrapper hardware translates these advanced CPU calls into the basic instructions the core CPU can actually understand. (A small intrinsics sketch follows this comment.)

    Hope I clarified some things, and if I missed something or got something wrong, please correct me :-)
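    A small sketch of the add-on instruction set idea above, using the standard SSE intrinsics that C compilers expose (the arrays and sizes are made up for illustration):

    /* minimal sketch: adding four floats at a time with SSE intrinsics.
       array contents and sizes are hypothetical; the intrinsics are standard. */
    #include <xmmintrin.h>
    #include <stdio.h>

    int main(void) {
        float a[8] = {1, 2, 3, 4, 5, 6, 7, 8};
        float b[8] = {10, 20, 30, 40, 50, 60, 70, 80};
        float c[8];

        for (int i = 0; i < 8; i += 4) {
            __m128 va = _mm_loadu_ps(&a[i]);   /* load 4 floats */
            __m128 vb = _mm_loadu_ps(&b[i]);
            __m128 vc = _mm_add_ps(va, vb);    /* one instruction, 4 additions */
            _mm_storeu_ps(&c[i], vc);
        }

        for (int i = 0; i < 8; i++) printf("%g ", c[i]);
        printf("\n");
        return 0;
    }

    It's the same idea as a vector machine, just with much shorter vectors: four floats per instruction here, versus 64-element vector registers on a Cray.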

  • by green pizza ( 159161 ) on Sunday August 12, 2001 @07:23AM (#2132584) Homepage
    If your app requires lots of vector crunching, the SV1 [cray.com] is one hellofa machine that'll keep you more than happy. The specs (mentioned above) are staggering... up to 1 TB of RAM, up to 1229 CPUs, air and/or water cooled.

    However, it's not alone. There are some other pretty mighty machines out there. The NEC SX-5 [cray.com] has faster RAM and more powerful vector CPUs than the SV1, but does not scale as large. The SGI Origin 3000 [sgi.com] series is not vector, but rather of a (somewhat) more traditional CPU design. It's available with up to 512 CPUs and 1 TB of RAM. Unlike both the SV1 and SX-5, the Origin can be ordered with graphics (which turns it into an Onyx).

    Then, there's the upcoming Cray SV2 [cray.com], which will be a combination of massively parallel & vector processing. Up to several thousand CPUs and a staggering RAM throughput of 250 GB/sec per bank!! (The Origin 3000 mentioned above has a total system bandwidth of 716 GB/sec... but that's the entire machine. The SV2 will have more than that with just three banks of RAM alone.)

    Some of these machines are single image systems (in the case of the Origin 3000, SX-5 and the up-to-32-CPU SV1)... meaning they are one single machine, not a cluster. Most run very specific OSes made just for their hardware, with the possible exception of the Origin. SGI's big Origin and Onyx 3000 machines run IRIX 6.5, the same OS that runs on a $150 e-bay special SGI Indy workstation. Kinda cool. The compilers and math libraries are also heavily tuned and generally come with lots of example code and performance tips. When my university purchased a 96 CPU Origin 2000 a few years ago, SGI included a *box* of binders and CDs from some past performance computing seminars they had held. Our university still holds a support contract for the Origin, and thus we're still getting significant compiler and library updates.

    Sort of belittles dual-bank PC2600 DDR-SDRAM (2x 2.6 Gigabyte/sec = 5.2 Gigabyte/sec) and Myrinet (1 Gigabit/sec = 125 Megabyte/sec interconnect), doesn't it?

    Of course... a 16 node x86 cluster doesn't cost $500K - $50M either...
  • by Anonymous Coward on Sunday August 12, 2001 @05:03AM (#2137180)
    There's twice the amount of cache, which helps to reduce bus congestion. Most scientific code consists of several tight loops and can be made to use the cache very efficiently by paying attention to the order in which the variables are accessed within the loop. (A small loop-ordering sketch follows.)
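    A rough sketch of the loop-ordering point above: a C array is stored row by row, so walking it with the rightmost index innermost stays inside the cache lines just loaded, while the other order keeps striding across memory (array name and size are made up for illustration):

    /* illustrative only: same arithmetic, very different cache behaviour. */
    #include <stdio.h>

    #define N 1024
    static double a[N][N];

    int main(void) {
        double sum = 0.0;

        /* cache-friendly: innermost index walks contiguous memory */
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                sum += a[i][j];

        /* cache-hostile: strides N*sizeof(double) bytes between accesses,
           so nearly every reference misses the cache */
        for (int j = 0; j < N; j++)
            for (int i = 0; i < N; i++)
                sum += a[i][j];

        printf("%f\n", sum);
        return 0;
    }

    Both loops do the same work; only the access order, and therefore the cache behaviour, differs.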
  • If it's anything like the older Crays (SV1 stands for "scalable vector"; iirc it's sort of a mix of vector and traditional CPUs).. then it gets its speed from the vectorized nature of the cpu and, more importantly, the problem at hand.

    i was told in a CS course that the arch of the cray vector units is basically the same as the cray 1... the speeds have changed, the process has changed, the external pieces have gotten much faster.. but at the core, the cray vector machines are very fast at the following type of thing:

    given a vector of a given length

    do foo to every element in that vector

    _very_ efficiently

    to see how this operates a bit better, consider how a normal cpu might do the following

    for i = 1 to 64

    begin

    blah[i] = blah[i] + 1

    end

    that would end up getting compiled perhaps into something like this on a traditional cpu:

    loop:

    load blah[i]

    increment blah[i]

    save blah[i]

    increment i

    if i <= 64, goto loop

    what we're seeing is that for 1 element, we do a load, an ALU op, a store, an ALU op, and a conditional branch.

    conditional branches fuck cpus. badly. having loads and stores inside inner loops fucks cpus badly too.

    to see why, you need to understand pipelining, but basically i'll make it short and easy: the instruction cache of a cpu is always stuffing the pipeline with its "guess" of what the next instructions should be... and it's not until several of those 1.4ghz clock cycles later that you even know if you've got the right instruction... if you do, great.. if you don't, you're fucked and you flush the pipeline and start over.

    conditional branches fuck this all to hell because without optimization, you've got a 50% chance of filling your pipeline with the wrong instructions.. so on a p4 with a 20+ stage pipeline you're talking about throwing away some sizable portion of those instructions... and then refilling them... now, branch prediction really helps this a lot, but conditional branches are just one problem... the load/store units of cpus also typically introduce huge pipeline delays... i.e. you need to load blah[i] but that takes 2 or 3 cycles (even from cache!! don't even think about it if you need to go to main memory) so any instructions which use blah[i] must be scheduled at least 2-3 clock cycles afterwards...

    so without keen optimization and ideal software loads, suddenly your 1.4ghz chip is stalling 2-3 cycles per instruction all the time.. and it's only running like a 400mhz proc :)

    so, to make traditional cpus fast, pipelining and multiple EUs have been added. these have drawbacks (and i've listed some of pipelining's above).

    the "vector" approach is totally different. you actually have "vector" registers, and "vector instructions". the machine actually sets up "virtual" pipelines for you. so on a vector machine, the scenario above would be more like:

    vectorsize=64

    xv = xv + 1

    (assuming xv is the vector register with your 64 elements in it)

    what the cray hardware does is hook up the pieces of its cpu in a virtual pipeline that does something like this:

    foreach element of xv

    load

    inc

    save

    notice that the foreach construct looks like a loop, but it's not really; it's pipelined, so what actually gets sent through looks like this:

    load i

    inc i, load i + 1

    save i, inc i + 1, load i + 2

    save i+1, inc i + 2, load i + 3

    save i + 2, inc i + 3, load i + 4

    save i + 3, inc i + 4, load i + 5

    etc etc etc

    except for fill and drain, the load, inc, and save hardware units are always perfectly utilized. there is no branching or conditional logic involved.

    the example i've chosen is very trivial, and may be subject to huge factual or conceptual mistakes :) the cray's amazing speed only works in situations where the problem can be expressed in vector instructions, i.e. do the same thing to a fuckload of data in such a way that the cray's hardware can pipeline it efficiently.. (a strip-mining sketch follows this comment, showing how longer loops get fed to the vector unit in chunks)

    there are lots of interesting problems that the cray did _not_ handle well.. but for what it's worth, the vector processors in the cray 1 aren't significantly different in operation and instruction set from the SV1 of today.. by many measures, cray "got it right" originally. the SV1 of today might use normal BGA packaging on a CMOS-based process (the cray 1 used discrete ECL logic and point-to-point wiring - all strung together by little old minnesotan women), but the basic design is the same.

    also, the original cray 1 ran at 80mhz and could take up to 32mb of ram.... i.e. for the 1970s it was astonishing; it stayed faster than any desktop workstation until the mid 90s...

    note that the top500 list crays are usually the T3Es.. which are a totally different beast from the vector processors.. a T3E is just a bunch of alpha CPUs on a very fast interconnect.. sort of like a "custom cluster in a box".
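    To connect the vectorsize=64 idea above to a loop of arbitrary length, here is a sketch of strip mining, the usual trick a vectorizing compiler uses: chop the loop into chunks no longer than the hardware vector length (64 on the classic crays) and turn each chunk into one set of vector load/add/store operations. The names and the plain-C inner loop are stand-ins, not actual cray compiler output:

    /* illustrative strip mining: process an arbitrary-length array in chunks of
       at most 64 elements, the classic cray vector length. a real vector
       compiler emits vector load/add/store instructions for the inner loop;
       here plain C stands in for that. */
    #include <stdio.h>

    #define VL 64   /* hardware vector length (cray-style) */

    static void add_one(double *blah, int n) {
        for (int start = 0; start < n; start += VL) {
            int len = (n - start < VL) ? (n - start) : VL;  /* last, short chunk */
            /* conceptually: one vector load, one vector add, one vector store */
            for (int i = 0; i < len; i++)
                blah[start + i] = blah[start + i] + 1.0;
        }
    }

    int main(void) {
        double blah[200];
        for (int i = 0; i < 200; i++) blah[i] = i;
        add_one(blah, 200);
        printf("%g %g\n", blah[0], blah[199]);   /* prints 1 and 200 */
        return 0;
    }

    The per-element branching from the scalar version disappears into a handful of vector operations per chunk, which is where the speed comes from.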

  • by sprong ( 98748 ) on Sunday August 12, 2001 @07:22AM (#2144266)
    "For some reason, a short pipe [fewer operations until done] gives faster execution but lower clock frequencys, maybe because of heat or something. Could anyone fill me in here ?"

    Each stage in the pipeline lets the hardware work on the instruction a bit, to set up register access and whatnot. Quite a few of the steps in modern x86 processors are 'unwrapping' the CISC instruction and turning it into RISC-like operations. (This is a bit simplified.) The more steps there are, the shorter (less time) each step can be, letting the clock rate go up. Fewer steps means (generally) that each step needs more time, therefore limiting clock speed.

    Long pipelines have one drawback, though. Assume there's one instruction currently being executed. The next one, in memory, will be in the stage that's one back. The next instruction after that will be in the stage before THAT, and so on. This works most of the time, where you have many sequential steps in a row. However, if there's a branch, the pipeline has to be flushed; it'll take at least as many clock cycles as there are stages in the pipeline before any instructions start actually getting executed; there's a lag time while the instructions make their way from the start to the end of the pipeline. There may/will be overhead on top of that which can make the stall time greater than if there were no pipeline at all.

    So, back to yer original question: a high-MHz deep-pipelined chip can be slower than a lower-MHz shallow-pipelined chip IF there are a lot of branches in the program, because each branch will require a pipeline flush, which takes a lot of time to recover from. Branch prediction and speculative execution help out a lot here, but they're not 100 percent accurate, and also require a lot of silicon to deal with. (A toy branchy-versus-branchless sketch follows this comment.)

    All the extra real estate on the chip dedicated to the logic for deep pipelines could instead be dedicated to speeding up operations, or extra cache, or whatever. But x86 chips need fargin' deep pipelines these days to get high MHz numbers, or else each complicated CISC instruction would take a year or so to decode.
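    A toy illustration of the branch-cost point above: the same reduction written with a data-dependent branch and then branch-free. On a deeply pipelined chip the first version pays a misprediction penalty whenever the condition is hard to guess; the second never branches on the data at all. The data and threshold are made up, and this is a sketch, not a benchmark:

    /* toy sketch: data-dependent branch vs branch-free arithmetic. */
    #include <stdio.h>

    #define N 1000

    int main(void) {
        int data[N], sum_branchy = 0, sum_branchless = 0;
        for (int i = 0; i < N; i++)
            data[i] = (i * 37) % 100;        /* hard-to-predict values */

        /* branchy: the if is a conditional branch the pipeline must guess */
        for (int i = 0; i < N; i++)
            if (data[i] >= 50)
                sum_branchy += data[i];

        /* branch-free: the comparison yields 0 or 1, no control-flow guess */
        for (int i = 0; i < N; i++)
            sum_branchless += (data[i] >= 50) * data[i];

        printf("%d %d\n", sum_branchy, sum_branchless);  /* identical sums */
        return 0;
    }

    Compilers can sometimes make this transformation themselves (e.g. with conditional moves), but the pipeline-flush cost it avoids is exactly the one described above.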

