
World's Fastest Supercomputer To Be Built At ORNL

Posted by timothy
from the right-near-dollywood dept.
Homey R writes "As I'll be joining the staff there in a few months, I'm very excited to see that Oak Ridge National Lab has won a competition within the DOE's Office of Science to build the world's fastest supercomputer at Oak Ridge National Lab in Oak Ridge, Tennessee. It will be based on the promising Cray X1 vector architecture. Unlike many of the other DOE machines that have at some point occupied #1 on the Top 500 supercomputer list, this machine will be dedicated exclusively to non-classified scientific research (i.e., not bombs)." Cowards Anonymous adds that the system "will be funded over two years by federal grants totaling $50 million. The project involves private companies like Cray, IBM, and SGI, and when complete it will be capable of sustaining 50 trillion calculations per second."


  • by Anonymous Coward on Wednesday May 12, 2004 @09:46AM (#9125969)
    Installation is dependent on disk speed, not MIPS. This computer lends itself more to computational problems like factoring RSA keys or finding new primes.
  • Re:good stuff (Score:5, Informative)

    by adam872 (652411) on Wednesday May 12, 2004 @09:51AM (#9126014)
    Some problems are easily partitioned and distributed to separate nodes. In particular, code where the nodes do not need to talk to each other much is ripe for clusters, as the interconnect speed is less important. Therefore, you can build a commodity cluster fairly cheaply.

    For other problems, where interprocess/node communication is high or very high, you need a high-speed interconnect (like NUMAflex in SGI's machines) to get the scalability you need as the number of processors/nodes and the size of the data set increase. The big systems like Crays, the bigger SGIs, and the IBM Power series have those high-speed interconnects and will allow you to scale more efficiently than the clusters. They cost a lot more though :)

    A good book to read on the subject of HPC is High Performance Computing by Severance and Dowd (O'Reilly). It's a little old now, but it covers a lot of the concepts you need to know about building a truly HPC system (architecture as well as code).
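
A rough way to see the trade-off described above is a toy timing model. This is only a sketch: the function names, the message counts, and the two latency figures (50 microseconds for a commodity network, 2 microseconds for a proprietary interconnect) are illustrative assumptions, not measurements of any machine mentioned here.

```python
def step_time(work_s, nodes, msgs_per_step, latency_s):
    """Time for one step: compute shrinks with node count, messaging does not.

    Hypothetical model: every node waits for msgs_per_step messages in sequence.
    """
    compute = work_s / nodes
    communicate = msgs_per_step * latency_s
    return compute + communicate


def speedup(work_s, nodes, msgs_per_step, latency_s):
    """Speedup over a single node that needs no communication at all."""
    return work_s / step_time(work_s, nodes, msgs_per_step, latency_s)


if __name__ == "__main__":
    COMMODITY = 50e-6   # assumed latency per message, seconds
    FAST = 2e-6         # assumed latency per message, seconds
    for msgs in (1, 100, 1000):
        slow = speedup(1.0, 64, msgs, COMMODITY)
        fast = speedup(1.0, 64, msgs, FAST)
        print(f"{msgs:5d} msgs/step: commodity {slow:5.1f}x, fast {fast:5.1f}x")
```

With few messages per step both interconnects scale about equally well. As the message count grows, the fixed messaging cost on the slow network swamps the shrinking compute time.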
  • Re:Maybe it's me. (Score:5, Informative)

    by henryhbk (645948) on Wednesday May 12, 2004 @09:51AM (#9126020) Homepage
    Yes, DOE is the Federal Government's Department of Energy. Oak Ridge is a large federal govt. lab.
  • 2 Years? (Score:3, Informative)

    by XMyth (266414) on Wednesday May 12, 2004 @09:58AM (#9126072) Homepage
    I don't think Crays that were built 5 years ago are considered obsolete by anyone's standards.

    Clusters solve different jobs than supercomputers. Sometimes they bleed into one another, but there are some things supercomputers will always be better at (because of higher memory bandwidth for one thing).

  • by freelunch (258011) on Wednesday May 12, 2004 @10:01AM (#9126089)
    They were listed as part of the solution.

    Oak Ridge has done extensive evaluations of recent IBM, SGI and Cray technology. Though I am still looking forward to data on IBM's Power5.

    Cray X1 Eval
    SGI Altix Eval
  • 3D torus topology (Score:5, Informative)

    by elwinc (663074) on Wednesday May 12, 2004 @10:05AM (#9126124)
    I checked out the topology of the Cray X1; they call it an "enhanced 3D torus." A 3D torus would be if you made an NxNxN cube of nodes, connected all adjacent nodes (top, bottom, left, right, front, back), and then connected all the processors on one face through to the opposite face. I can't tell what an "enhanced" torus is. (Each X1 node, by the way, has four 12.8 gflop MSPs, and each MSP has eight 32-stage, 64-bit floating point pipelines.)

    So each node is directly connected to six adjacent nodes. Contrast this with the Thinking Machines Connection Machine CM2 topology, which had 2^N nodes connected in an N-dimensional hypercube. So each node in a 16384-node CM2 was directly connected to 14 other nodes (16384 = 2^14). There's a theorem that you can always embed a lower-dimensional torus in an N-dimensional hypercube, so the CM2 had all the benefits of a torus and more. This topology was criticized because you never needed as much connectivity as you got in the higher node-count machines, so the CM2 was in effect selling you too much wiring.

    Thinking Machines changed the topology to fat trees in the CM5. One of the cool things about the fat tree is it allows you to buy as much connectivity as you need. I'm really surprised that it seems to have died when Thinking Machines collapsed. On the other hand, any kind of 3D mesh is probably pretty good for simulating physics in 3D. You can have each node model a block of atmosphere for a weather simulation, or a little wedge of hydrogen for an H-bomb simulation. But it might be useful to have one more dimension of connection for distributing global results to the nodes.
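
The two topologies in this comment can be compared by listing a node's direct neighbors. A minimal sketch, assuming torus nodes are addressed by (x, y, z) coordinates and hypercube nodes by integers; both function names are made up for illustration and say nothing about how Cray or Thinking Machines actually addressed their nodes.

```python
def torus_neighbors(x, y, z, n):
    """Six direct neighbors of node (x, y, z) in an n x n x n torus (n >= 3).

    The modulo provides the wraparound link from one face to the opposite face.
    """
    return [
        ((x + 1) % n, y, z), ((x - 1) % n, y, z),
        (x, (y + 1) % n, z), (x, (y - 1) % n, z),
        (x, y, (z + 1) % n), (x, y, (z - 1) % n),
    ]


def hypercube_neighbors(node, dims):
    """Direct neighbors in a dims-dimensional hypercube of 2**dims nodes.

    Two nodes are linked when their addresses differ in exactly one bit.
    """
    return [node ^ (1 << bit) for bit in range(dims)]


if __name__ == "__main__":
    print(torus_neighbors(0, 0, 0, 4))          # always six links per node
    print(len(hypercube_neighbors(0, 14)))      # links per node with 2**14 nodes
```

The torus keeps six links per node no matter how large the machine gets, while the hypercube adds one link per node every time the node count doubles, which is the "too much wiring" complaint.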

  • by stratjakt (596332) on Wednesday May 12, 2004 @10:08AM (#9126159) Journal
    Didn't Cray make some comparison about supercomputers vs clusters being like a tractor trailer vs a fleet of Honda Civics?

    The civics might be fine for couriers, but if you need to move - say - an elephant they're useless.

    Analogies suck, though, and I'm pretty sure I got that one wrong.
  • by Uhlek (71945) on Wednesday May 12, 2004 @10:08AM (#9126162)
    Clusters are not the be-all end-all of supercomputers. Clusters are really only effective if you have a problem that can be parallelized -- or split into multiple parts that can each be worked independently of one another and then merged into a single result. Factorization, rendering, etc. are all examples of easily parallelized operations.

    Certain operations, though, are highly dependent upon each previous result. Physics and chemical simulations are a good example. When you have situations like this, clusters don't do you a lot of good, since only one iteration can be worked on at a time -- leaving most of your cluster sitting there idle.
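
The usual way to put a number on the "idle cluster" point is Amdahl's law, which is a standard textbook model and not something the comment itself cites. A short sketch, where the parallel fractions are made-up examples:

```python
def amdahl_speedup(parallel_fraction, processors):
    """Amdahl's law: speedup when only part of the work can run in parallel."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / processors)


if __name__ == "__main__":
    for fraction in (1.0, 0.99, 0.5):
        print(fraction, round(amdahl_speedup(fraction, 1024), 1))
```

A job that is 50% serial can never run more than twice as fast, however many nodes are added, so most of a large cluster would indeed sit idle on it.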
  • by flaming-opus (8186) on Wednesday May 12, 2004 @10:08AM (#9126163)
    Two radically different designs will probably solve very different sorts of problems. Linpack is extremely good at giving a computer an impressive number. It's the sort of problem that fills up execution pipelines to their maximum. Blue Gene was originally designed to do protein-folding calculations. While many other tasks will work well on that machine, others will work very poorly.

    Blue Gene is a mesh of a LOT of microcontroller-class processors. The theory is that these processors give you the best performance per transistor. Thus you can run them at a moderate clock, get decent performance out of them, and cram a whole hell of a lot of them into a cabinet. It's a cool design, and I'm interested to see what it will be able to do once deployed. However, for the problems they have at ORNL, I'm sure the X1 was a better machine. Otherwise they would have bought IBM. They already have a farm of p690s, so they have a working relationship.
  • by compupc1 (138208) on Wednesday May 12, 2004 @10:36AM (#9126432)
    Supercomputers usually run some flavor of UNIX -- Unicos, IRIX, I think even Linux. In any case, they are specially built and designed for the supercomputer. Supercomputers are used for highly specialized scientific applications, and as such the programs would be specially written in Fortran, C, or assembly, and often specially optimized for the architecture.
  • by flaming-opus (8186) on Wednesday May 12, 2004 @10:39AM (#9126451)
    ORNL already has a 256-processor X1, a large IBM SP made of p690s, as well as a large SGI Altix. I imagine the 50 Tflops number will be a combined system with upgraded systems of all three types. They are obviously impressed with both the X1 and the Altix. The IBMs are no slouch though, and they are upgrading the interconnect, and IBM is just getting ready to launch a Power5 update.

    It's probably just spin to call the project "a computer" rather than "several computers". Deep in one of those ORNL whitepapers you see that they are planning to cluster together these three machines with a cluster filesystem. You throw in a clustered batch control system and you can kinda call it "a" supercomputer. Really it's a cluster, except each of the nodes may have a thousand processors. We'll have to wait and see what it really looks like.
  • by flaming-opus (8186) on Wednesday May 12, 2004 @10:49AM (#9126546)
    The SGI Altix runs a hacked-up version of Linux that's part 2.4 with a lot of backported 2.6 stuff as well as the Irix SCSI layer. They are migrating to a pure 2.6 OS soon. The IBM system runs AIX 5.2. The Cray runs Unicos, which is a derivative of Irix 6.5, though they seem to be moving to Linux also. I'm gonna guess that they run TotalView as their debugger. They use DFS as their network filesystem. They have published plans to hook all these systems up to the StorNext filesystem, which does Hierarchical Storage Management. MPI and PVM are likely important libraries for a lot of their apps.

    For these sorts of machines, one can buy utilities for data migration, backup, debugging, etc. However, the production code is written in-house, and that's the way they want it. Weather forecasting, for example, uses software called MM5, which has been evolving since the Cray-2 days, at least. A lot of this code is passed around between research facilities. It's not open source exactly, but the DOD plays nice with the DOE, etc.

    The basic algorithms have been around for a long time. In the early 90's, when MPPs and then clusters came onto the scene, a lot of work was done in structuring the codes to run on a large number of processors. Sometimes this works better than other times. Most of the work isn't in writing the code, but rather in optimizing it. Trying to minimize the synchronous communication between nodes is of great importance.
  • by Waffle Iron (339739) on Wednesday May 12, 2004 @10:54AM (#9126591)
    I'm sorry dude, but this machine is going to have more than 1 CPU in it, and the work will have to be split among the processors and run in parallel.

    The number of processors isn't as important as the memory architecture. Clusters of workstation-class machines have isolated memory spaces connected by I/O channels. Many non-clustered supercomputers have a single unified memory space where all processors have equal access to all of the memory in the system. This can be important for algorithms that heavily use intermediate results from all parts of the problem space.

    Even so, for a given number of FLOPS, a vector machine would generally require fewer CPUs than a cluster of general-purpose machines. This reduces the amount of splitting that has to be done to the problem in the first place.

  • Not quite... (Score:1, Informative)

    by chudmung (693605) on Wednesday May 12, 2004 @10:59AM (#9126633) Homepage
    ASCI Purple (IBM Power5) is capable of 100 teraflops. The Blue Gene/L machine is capable of 367 teraflops.

    This press release is almost 18 months old, btw... es/news/ pressreleases/2002/nov/asci_purple.html

    Maybe the headline "fastest -unclassified- supercomputer" would be more fitting.
  • by bigjocker (113512) * on Wednesday May 12, 2004 @11:00AM (#9126651) Homepage
    There are still a few computing problems that can't be efficiently split into a large number of subproblems that can be executed in parallel. For those cases, a cluster of small machines won't help.

    (Score:-10, Wrong)

    I'm sorry dude, but this machine is going to have more than 1 CPU in it, and the work will have to be split among the processors and run in parallel.

    (Score:-100, Wronger)

    Sorry, but you have it all wrong. The parent is right. The parent stated that there are problems that can't be split into smaller problems to be handled by a cluster of computers. A cluster is a set of computers that work independently of each other and can communicate at Ethernet speeds (10 / 100 / 1000 Mbit/s). There are problems that can't be solved using this approach, for example calculations where all processes must reuse the same data; with really big data sets the network connections become a bottleneck.

    For those kinds of problems (the usual example is a simulation of a nuclear explosion, a star system, etc.) you need a single machine with loads of processors sharing the same memory space. That's where supercomputers come into play.
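
To give a feel for the bottleneck mentioned above, here is a back-of-the-envelope calculation of how long it takes just to ship a shared data set to one node at the three Ethernet speeds listed. The 1 GB data set size is an assumed example, and the function ignores protocol overhead and contention, so real transfers would be slower.

```python
def transfer_seconds(data_bytes, link_mbit_per_s):
    """Ideal time to move data_bytes over a link, ignoring all overhead."""
    bits = data_bytes * 8
    return bits / (link_mbit_per_s * 1_000_000)


if __name__ == "__main__":
    ONE_GB = 1_000_000_000  # assumed data set size, bytes
    for speed in (10, 100, 1000):
        print(f"{speed:4d} Mbit/s: {transfer_seconds(ONE_GB, speed):6.1f} s")
```

If every node needs the same data at every iteration, this cost is paid again and again, which is why a shared memory space helps.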
  • by Jeremy Erwin (2054) on Wednesday May 12, 2004 @11:05AM (#9126694) Journal
    Big Mac was tested in a small 128 node configuration as a prelude to the full 1100 nodes.

    The 128 node cluster was benchmarked at ~80% efficiency, or ~1.6 Teraflops. The final cluster achieved a RMax of 10.28 TFlops, ~60% of the 17.6 TFLOP theoretical peak.

    A 6000 node cluster would be very difficult to manage.
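
The efficiency figures quoted in this comment are just measured performance divided by theoretical peak. A small sketch that redoes the arithmetic using only the numbers given above; the function name and the assumption that every node contributes an equal share of the peak are mine.

```python
def efficiency(rmax_tflops, rpeak_tflops):
    """Fraction of theoretical peak actually achieved on the benchmark."""
    return rmax_tflops / rpeak_tflops


if __name__ == "__main__":
    full = efficiency(10.28, 17.6)              # full 1100-node cluster
    peak_per_node = 17.6 / 1100                 # assumes identical nodes
    small = efficiency(1.6, 128 * peak_per_node)
    print(f"1100 nodes: {full:.1%}, 128 nodes: {small:.1%}")
```

The drop from the small configuration to the full one is the scaling loss the comment is pointing at.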
  • Re:Huh? (Score:3, Informative)

    by flaming-opus (8186) on Wednesday May 12, 2004 @11:16AM (#9126812)
    The important part of the statement is "sustaining". There are a lot of computers on the Top500 list that get peak numbers way ahead of their sustained numbers. An Army research center published a comparison of a Xeon cluster and the X1. For their codes (weather simulation, material sciences, air flow, etc.) the Xeon cluster's sustained performance was 5% of peak. The Cray was about 30% of peak. (This is probably due to the really awesome memory bandwidth of the Cray.)

    You're correct that these are just numbers, so let's talk about a real problem. The AHPCRC reported that a 32-processor Cray X1 (peak 400 gigaflops, 66 gigaflops realized) was able to simulate a weather model of the entire US with 33 vertical levels at 5 km resolution in just under 2 hours. Today these models are done at 10 km resolution with 20 levels. If you take this theoretical ORNL system and assume peak 60-80 TF, 40 sustained on easy codes, and 15 sustained on hard codes, then they might do a 2 km simulation with 45 layers in 1 hour.
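
One way to sanity-check an estimate like this is a crude cost model for a gridded weather run. The scaling rule below is an assumption of this sketch, not something from the AHPCRC report: work grows with the number of grid columns (1/dx squared), with the number of levels, and with the number of time steps (taken to grow as 1/dx). The 2-hour figure is used as a round upper bound for "just under 2 hours".

```python
def relative_cost(dx_km, levels, ref_dx_km, ref_levels):
    """Work relative to a reference run, assuming cost ~ levels / dx**3."""
    horizontal = (ref_dx_km / dx_km) ** 2   # more grid columns in x and y
    timestep = ref_dx_km / dx_km            # assumed: time step shrinks with dx
    vertical = levels / ref_levels
    return horizontal * timestep * vertical


def sustained_needed(ref_gflops, ref_hours, target_hours, cost_ratio):
    """Sustained rate needed to finish cost_ratio times the work in target_hours."""
    return ref_gflops * cost_ratio * ref_hours / target_hours


if __name__ == "__main__":
    ratio = relative_cost(2, 45, 5, 33)
    print(f"2 km / 45 levels is about {ratio:.1f}x the 5 km / 33 level run")
    print(f"needs about {sustained_needed(66, 2, 1, ratio):.0f} gigaflops sustained")
```

Under these assumptions the 2 km run would need a few sustained teraflops, so the guess in the comment looks comfortably within reach of a machine that sustains 15 TF, though a real model's cost depends on far more than grid size.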

  • Folding@Home URL (Score:3, Informative)

    by bradbury (33372) <Robert,Bradbury&gmail,com> on Wednesday May 12, 2004 @11:55AM (#9127384) Homepage
    Sorry, it looks like the URL has changed. The home page for Folding@Home is here.
  • by elwinc (663074) on Wednesday May 12, 2004 @01:43PM (#9129187)
    I believe the early Crays implemented 64-bit floating point. Not IEEE floating point; no NaN or Inf codes, but still full precision.

    I believe the speed was due to many factors. Here are a few.

    (0) 64-bit word and a ton of registers, including eight 64-word vector registers.

    (1) very fast memory - at a time when many folks were using magnetic cores, Cray was using multi-transistor static RAM (like in the on-board caches of today's CPUs).
    (2) load - store instruction set. Many of the ideas that became popular in 1990s era RISC computers were present in the Cray 1 instruction set. One of the key ones is to separate instructions that read and write main memory from those that operate on data. That way, a program can start fetching data several cycles before the data is needed, and hide the fetch delay.
    (3) 16 banks of memory - each bank can handle a fetch independently; another way of overcoming memory latency.
    (4) Freon cooling! - does this make Seymour the first overclocker?!
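
Point (3) can be illustrated with a toy model of interleaved memory. The sketch assumes the simplest possible mapping, where consecutive word addresses go to consecutive banks (bank = address mod 16); that mapping and the function name are assumptions for illustration, not a description of the actual Cray hardware.

```python
BANKS = 16  # number of independent memory banks, from point (3) above


def banks_touched(stride, accesses=BANKS, start=0):
    """Count distinct banks hit by a run of word references with a fixed stride."""
    return len({(start + i * stride) % BANKS for i in range(accesses)})


if __name__ == "__main__":
    for stride in (1, 2, 3, 8, 16):
        print(f"stride {stride:2d}: {banks_touched(stride):2d} banks in use")
```

A stride of 1 spreads sixteen references over all sixteen banks so their fetches can overlap, while a stride of 16 sends every reference to the same bank and they have to queue.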
  • by flaming-opus (8186) on Wednesday May 12, 2004 @03:23PM (#9130664)
    Well, 0-4 are all true.

    Comparing this to early Crays is a little difficult though. For the early Crays one advantage was vectors and the other was pipelines.

    Vector processors are cool because they tend to be much more tolerant of latency. You issue a load command, and it does loads until the vector register is full. That is equivalent to dozens of loads (and dozens of round-trip latencies to memory) on a scalar architecture. The same thing applies to the execution units. You tell the CPU ADD R1 R2 R3, and it pumps the first elements of the R2 and R3 registers through the ALUs and into R1, and keeps working until it gets through all of the elements in the vector. Later models supported chaining, which allowed the output from one of these operations to feed into the input of another operation. Vector CPUs are very good at keeping the ALUs busy.

    The other advantage of the early Crays was pipelining. YMP designs, for example, had multiple integer, FP, load/store, and reciprocal divide units. All of these (and the dispatch unit) were pipelined, allowing a much higher clock rate than traditional designs. Multi-pipeline designs are now the norm (PowerPC, Pentium, MIPS, etc.) but were pretty amazing at the time.

    The cooling, incidentally, was necessary at any clock rate. Early Crays (well, right on through to the T90) used bipolar transistors rather than CMOS. In this sort of logic you switch current rather than switching voltage. The net result is that the early Crays used a TON of electricity and needed massive cooling systems.
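
The vector ADD described in this comment can be mimicked in a few lines. This is a simplified sketch, not a Cray emulator: it assumes 64-element vector registers, counts one instruction each for the two loads, the add, and the store per strip, and ignores chaining and timing entirely. The function name is invented for the example.

```python
def vector_add(a, b, vlen=64):
    """Add two equal-length lists in strips of vlen elements.

    Returns the sums and the number of vector instructions issued
    (two loads, one add, one store per strip).
    """
    result, issued = [], 0
    for start in range(0, len(a), vlen):
        r2 = a[start:start + vlen]               # one vector load
        r3 = b[start:start + vlen]               # one vector load
        r1 = [x + y for x, y in zip(r2, r3)]     # one vector ADD R1 R2 R3
        result.extend(r1)                        # one vector store
        issued += 4
    return result, issued


if __name__ == "__main__":
    n = 1000
    sums, issued = vector_add(list(range(n)), list(range(n)))
    print(f"vector: {issued} instructions, scalar equivalent: {4 * n}")
```

Under this accounting a scalar machine issues four instructions per element, while the vector version issues four per strip of 64, which is where the latency tolerance comes from.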
  • by Richard Mills (17522) on Wednesday May 12, 2004 @05:24PM (#9132484)
    "Any tin-foil hats should be directed at Y-12. That's the DOD plant; X-10 is just DOE."

    You're right, but let me clarify something:

    The biggest weapons labs in the country are DOE, not DOD facilities. These are the "tri-labs": Los Alamos, Lawrence Livermore, and Sandia. They are operated by the DOE's NNSA (National Nuclear Security Administration).

    The other major DOE labs (including ORNL) are operated by the DOE's Office of Science. These are non-weapons labs. For you conspiracy theorists out there, it's pretty obvious that these are non-weapons labs. No guys standing around with M-16s, etc., as you would find at a place like Los Alamos. Much, much less security.
