Hardware

Linux Supercomputer Wins Weather Bid 115

Greg Lindahl writes "The Forecast Systems Laboratory, a division of NOAA, selected HPTi, a Linux cluster integrator, to provide a $15 million supercomputing system over the next 5 years. The computational core of this system is a cluster of Compaq Alphas running Linux, using Myrinet interconnect. Check out www.hpti.com for information on the company."
This discussion has been archived. No new comments can be posted.

Linux Supercomputer Wins Weather Bid

Comments Filter:
  • I work at a company that is working on a very complex artificial intelligence architecture, and for a variety of reasons it is written in Java (since the other most popular AI languages use VMs or are directly interpreted, expect the AI community at large to want good interpreters on Linux).

    We looked at putting together a Beowulf Linux cluster to run our software, which is very memory and processor intensive, but Linux could not do the job because JVMs on Linux are absolutely terrible. We wound up on WinNT (we couldn't afford Suns, but plan to upgrade when we can) because the JVMs were the best.

    Because people making large software systems are fed up with reengineering for new hardware, expect other people to start choosing Java for large, intensive applications that were previously written in C, Fortran, C++, etc.
    If Linux can't compete with other OSes for running large Java programs, these projects will not be able to consider Linux as their OS of choice (which we all WANTED to do here, we were very upset to go to NT).

    Right now the fastest Java environment we've found is Java 2 with HotSpot, running on NT (we're testing Solaris now, as we might be able to afford Suns soon). Can the Linux community do any better, or even as well? So far, no.
  • Regardless of what the G4 can do, it is important to remember that this is what a cluster of alphas can do today. The decision by any government agency takes time to make. Since this is a mission critical piece of equipment, I would have to believe that this is not vaporware.

    Perhaps in the future a cluster of G4's will be used. The gcc compiler should be generating more efficient code in the future as improvements are made. IIRC, Apple is using gcc in the development of the forthcoming MacOS X.

    Nonetheless, it is nice to see the federal government go this route.

  • Something is up with that link you gave. I know the K7 is superior to the P3, but if you compare the K7 vs the Alpha you will find the K7 is twice as fast. Hmmmm.

    I am also looking at the speed of the PowerPC G3 for a new PowerPC Linux box, and the standard P3 was almost twice as fast. I am a former Mac guy and I used to regard anything from Apple in benchmarking as fact, but either Apple is really lying (they probably are) or this test is biased. I think you should find out what this test was trying to prove. Something is really screwed up.
  • Nice answer. I would like to add to what you said. There are many ways to solve PDE's; finite difference and finite element are the two major ones. Both can take major advantage of parallel processing systems. Essentially (or is this simplistic?), one has to "solve" the PDE at each node in terms of the nodes it connects to (a toy sketch of this is at the end of this comment). After computing the solution for all the nodes, one then iterates and iterates. For finite difference, one has to compute until a stable solution is achieved for a particular configuration. For time-dependent models, one then starts all over again for the next time increment.

    The major controlling factor is the model. For fluid dynamics, approximations are made to make the problem solvable. Of course, the input parameters/data can play a major role. If the problem is chaotic, one has to run a whole bunch of scenarios to obtain a statistical model.

    My only dispute with what you said is that if the model is wrong, the results may be wrong. Running three models with limitations that yield the same result may not give you the right answer. Additionally, chaotic effects can lead to bad results.

    And to the idiot who commented about no advances in math, I would like to say that while the math (e.g., 1+1=2) may remain the same, the physical model may be different.
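    For the curious, here is a toy sketch of the kind of finite-difference iteration I'm describing (my own illustration, not anybody's production weather code; the grid size and tolerance are made up, and real models solve far messier equations than Laplace's):

    /* Toy Jacobi iteration for Laplace's equation on a square grid. */
    #include <math.h>
    #include <stdio.h>
    #include <string.h>

    #define N   64        /* grid points per side (made up)  */
    #define TOL 1e-6      /* convergence tolerance (made up) */

    int main(void)
    {
        static double u[N][N], unew[N][N];
        double diff, d;
        int i, j, iter = 0;

        /* boundary condition: hold the top edge at 1.0, the rest at 0.0 */
        for (j = 0; j < N; j++)
            u[0][j] = unew[0][j] = 1.0;

        do {                          /* "iterate and iterate" */
            diff = 0.0;
            for (i = 1; i < N - 1; i++)
                for (j = 1; j < N - 1; j++) {
                    /* relax each interior node toward the average of
                       its four connecting nodes */
                    unew[i][j] = 0.25 * (u[i-1][j] + u[i+1][j] +
                                         u[i][j-1] + u[i][j+1]);
                    d = fabs(unew[i][j] - u[i][j]);
                    if (d > diff)
                        diff = d;
                }
            memcpy(u, unew, sizeof u);
            iter++;
        } while (diff > TOL);         /* stop once a stable solution is reached */

        printf("converged after %d sweeps\n", iter);
        return 0;
    }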

  • by Anonymous Coward
    FSL's proposal stated that Computational Performance was the primary evaluation measure.

    Scientific, vector processor tuned codes are known to run fastest on the Alpha 21264 + Tsunami memory chipset, so it is the only choice for a no-compromise, fastest computer in the world solution.

    Take a look at the benchmark numbers (albeit limited) on http://www.hpti.com/clusterweb/ for some initial results.

    Now, on the choice of Myrinet... This is a more interesting question.

    Any takers?

    No_Target
  • by Anonymous Coward
    Your point is well taken in that there is a need for I/O balance in all supercomputing systems due to the need to save the results, particularly in those calculations that involve dynamic phenomena, like weather. The faster the computer, the faster results come out of it.

    An enabler for cluster effectiveness is the Fibre Channel Storage Area Network, a technology that allows multiple hosts to read _and_ write to the same file at the same time at very high bandwidth.

    In fact, the I/O bandwidth of a cluster in this context is still limited by the speed of the PCI busses on one node if you are serializing the I/O to that one node. If this is the case, the XP1000 will sustain about 250+ MB/s with three-four Fibre Channel Host Bus Adapters on its two independent PCI busses. If your software can distribute the I/O to multiple nodes, like FSL's parallel weather forecasting API can (SMS), then your I/O bandwidth is essentially limited by your budget for RAID systems, Fibre Channel Switches and HBAs.

    No_Target
  • Sometimes the task may be complicated enough to require the resources of the entire VM, so compiling wouldn't bring any improvement.

    However, native code is always an improvement over bytecode.

    Makes me wish Juice had been more successful... (Juice was/is the platform-independent binary format used for Oberon; its loader translated it quickly to native code). The current equivalent of Juice is ANDF.

    -Billy
  • For the Weather Research and Forecasting Model, you might want a look at these links:
    http://www-unix.mcs.anl.gov/~michalak/ecmwf98/final.html [anl.gov], Design of a Next-Generation Regional Weather Research and Forecast Model

    http://nic.fb4.noaa.gov:8000/research/wrf.98july17.html [noaa.gov], Dynamical framework of a semi-Lagrangian contender for the WRF model

    The design is a hybrid-parallel one, in which the model domain is a rectangular grid split up into tiles, with each tile assigned to a (potentially shared-memory-parallel) node with either message passing or HPF parallelism between tiles; each tile is then broken up into patches, with OpenMP-style parallelism on the node. The WRF is targeting resolutions better than 10 km in the horizontal and 10 mb in the vertical -- so a regional forecast can expect grid sizes on the order of 300x300 horizontal x 100 vertical x 30 sec temporal, with research applications an order of magnitude finer yet. Note that computational intensity scales with the fourth power of the resolution (because of the dt-scaling issue), whereas memory usage scales with the cube. So high resolution forecasts are very compute-intensive, and improving the resolution to what we really want can chew up all available compute capacity for the foreseeable future.
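    To put rough numbers on that scaling, here is a back-of-the-envelope calculation (my own sketch; the grid dimensions and time step are the illustrative figures above, and the 48-hour forecast length is an assumption of mine):

    /* Rough scaling for the grid described above.  Refining the
       resolution by a factor r in x, y and z multiplies the number of
       grid points (memory) by r^3; because dt has to shrink along with
       dx, the work grows by roughly r^4.                              */
    #include <stdio.h>

    int main(void)
    {
        double nx = 300.0, ny = 300.0, nz = 100.0; /* illustrative grid       */
        double dt = 30.0;                          /* seconds per time step   */
        double hours = 48.0;                       /* assumed forecast length */
        double points, steps, r;

        points = nx * ny * nz;
        steps  = hours * 3600.0 / dt;
        printf("grid points: %.2e   time steps: %.0f\n", points, steps);

        r = 10.0;   /* "an order of magnitude finer" for research runs */
        printf("memory grows by ~%.0fx, work by ~%.0fx\n",
               r * r * r, r * r * r * r);
        return 0;
    }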

    A few other thoughts:

    1. Not only are the Alpha 264s unmatched in terms of both floating point performance and memory bandwidth (although the next-generation PPC is very good in that regard also), they are also among the best at dealing with the data-dependencies and access-latencies which occur in real scientific codes.
    2. DEC^H^H^HCompaq probably has the best compiler technology of anybody out there commercially (IBM are also very good technically, but as Toon Moene of the Netherlands Met Office put it, "XLF was the first compiler I ever encountered that made you write a short novel on the command line in order to get decent performance.")
    3. Note for AC #68: State-of-the-art weather models are not spectral models. Spectral models are appropriate only for very coarse scales at which cloud effects are only crudely parameterized (and to some extent are only appropriate on vector-style machines, and not current microprocessor/parallel ones, because of the way they generate humongous vector-lengths). At the WRF scales, the flow is not weakly compressible! Note that the global data motion implied by the FFTs in hybrid spectral/explicit models is a way to absolutely kill scalability for massively parallel systems. Finally, spectral models do not support air quality forecasting, such as we are doing (see http://envpro.ncsc.org/projects/NAQP/ [ncsc.org]).
    4. Weather modeling is a problem which has exponentially-growing divergence of solutions (two "nearby" initial conditions lead to different solutions that diverge exponentially in time), so as coyote-san suggests, there is a tendency to run multiple "ensemble" forecasts, each of which is itself a computationally-intense problem. So far, I haven't managed to get the funding to develop a stochastic alternative (which will be a fairly massive undertaking -- any volunteers?). This means weather modeling can soak up all available CPU power for the (foreseeable)^2 future. At least the individual runs in ensemble forecasts are embarrassingly parallel.
    An aside to LHOOQtius ov Borg: have you tried the GNU Java compiler (now a part of the gcc system)? For the intensive apps, generating native machine code is much faster.

    Hi, Greg! Didn't know you were here!

  • The system is a Ruffian (21164a) at 633MHz w/ 256MB RAM, the installation is based on Red Hat 6.

    gcc is gcc version 2.95.1 19990816 (release). Compile time options: -O9 -mcpu=ev56

    ccc is Compaq C T6.2-001 on Linux 2.2.13pre6 alpha. Compile time options: -fast -noifo -arch ev56

    The benchmark consisted of running two scripts through the CGI version of PHP4 [php.net]. We compare user times as measured by time(1). The tests were run three times, the shown results are mean values. The scripts are available from the Zend homepage [zend.com]. PHP was configured with --disable-debug.


    Quicksort (script ran 50 times)

    ccc version: 27s
    gcc version: 30s

    Mandelbrot (script ran 50 times)

    ccc version: 35s
    gcc version: 39s


    The test shows that the code ccc produced was about 10% faster than gcc's. Other conclusions are left as an exercise to the reader.
  • Too bad Windows 2000 can't handle bad weather, otherwise it would have been the logical choice. ;)
  • Gotta admit I'm a little confused here. If you have a computationally intensive task, why would you ever want to run it through any VM or interpreter? Granted, LISP may run through an interpreter during the development phase, but you can always compile it to get better speed in the final product.

    Can you elaborate without giving away the company secrets?
  • I'm crying foul on the moderations I've been given on this story. It's true that the government finds ways to mess things up, e.g. crypto laws, software patents, etc.

    M2 has seemed to make moderations a bit more accurate, but I don't see it working out for me here. Unless somebody actually goes to the page [berkeley.edu] and sees what I'm talking about -- "Alpha" in ten hours, and the EV series are cranking out units faster than LensCrafters...

    I didn't make up those "CPU's". They are actually listed on the page! Please follow the link [berkeley.edu] and see for yourself.


    --
  • Linux clusters are also getting into business solutions!
    Check out: http://linuxtoday.com/stories/10157.html

    .signature not found
  • towerj + linux is faster [internetworld.com]
  • Did I mention another of my graduate classes was chaotic dynamics? :-)

    The very definition of "chaos" is high sensitivity to changes in the initial conditions. If a weather front appears in the same place (within the resolution of the data grid) on all 120-hour forecasts despite a reasonable variation in the initial conditions, you can be pretty sure it isn't in a chaotic realm and your forecasts will be fairly accurate.

    On the other hand, if a modest amount of variation in the initial conditions results in wildly different predictions, the system is obviously in a chaotic realm and you can't make decent predictions.

    As odd as it sounds, for something as large as a planetary atmosphere it's quite reasonable for parts of the system to be chaotic while other parts are boringly predictable. That's why they were starting to compare the predictions from different models, the same models with slightly different initial conditions, etc. That might give the appropriate officials enough information to decide to evacuate a coastline (at $1M/mile), or to hold off another 6 hours since the computers predict the storm will turn away.

    P.S., the models do make mistakes, but fewer than you might expect. It's been years since I've thought about it, but as I recall most models work in "isentropic" coordinates and are mapped to the coordinates that humans care about at the last step. The biggest problem has been the resolution of the grids; when I left I think the RUC model was just dropping to 60km; by now it's probably 40 or 30km. To get good mesoscale forecasts (which cover extended metro areas, and should be able to predict localized flooding) you probably need a grid with 5 or 10 km resolution.
  • > I don't know about now, but five years ago, a state-of-the-art code for weather forcasting used spectral approximations (Fourier or Chebychev expansion functions) in the X- and Y-directions (Latitude and Longitude, say) and some high-order ...


    Dude... I think you just compressed an entire episode of Star Trek into six sentences. :)
  • Usually, when one is investing in the kind of high end networking hardware necessary to make a clustered supercomputer, one uses FTP or NFS instead of floppies... Only an idiot would compile a program individually on each of 100 nodes of a cluster anyway.
  • SMS doesn't distribute I/O to multiple nodes for a single job. But the bandwidth of a single I/O node is sufficient for FSL's needs.
  • Absolutely. This is NOT a Beowulf cluster.

    Beowulf refers to the tools created at NASA Goddard CESDIS [nasa.gov]

    This cluster uses MPI and tools developed by the University of Virginia's Legion Project [virginia.edu]

    Beowulf has become, to some, a generic term for a Linux cluster, like Kleenex to tissues.

    Mark Vernon HPTi
  • FSL runs their RUC model globally with a 40km resolution today. They expect to run RUC globally with a 10km resolution on the new system. However, there is a lot of weather that wants even finer resolutions.
  • The fact that NOAA doesn't mention Linux in the press release means that NOAA doesn't care what the box is, if it meets the performance requirements.

    If SGI or IBM (the two other leading competitors) had won, the press release wouldn't have mentioned Irix or AIX either.

    HPTi could deliver 10,000 trained monkeys in a box if it met the performance requirements.

    The fact that a Linux solution could exceed the performance of an SGI or IBM supercomputing solution is important to the Linux community, but not directly to NOAA.

    Mark Vernon
    HPTi
  • To say that I'm favorably impressed by the performance of the Compaq ccc compiler would be a major understatement. IMHO, with the release of this compiler, they have just overcome the Intel price/performance issue.

    I've seen 280% speedups over gcc's best effort, more than justifying the 100% price premium of the hardware over (for instance) dual PIII boxen.

    If I was going to put in a number crunching cluster (and I may) AlphaLinux would be the best way for me to go, cutting 40% from my TCO over IntelLinux.

    Thanks Compaq!
  • >You build your NUMA box that has 1 fat highway, and it turns up like the subway systems in the metropolitan areas. The whole purpose of hypercube or 5-D torus is to have a shortest path to as many places as possible, instead of hopping onto that megapipe and making a stop at every node to see who wants to get off.



    Technically you are correct. What I wanted to illustrate though is that in big NUMA boxes, you have one copy of the kernel running all processors. With a Beowulf system, and a Cray T3E I believe, you have a local copy of the kernel on each node of one or two processors. This negates the SMP problems of Linux on multi-CPU machines.
  • its beta. if they were giving it away for free there would be no reason not to just make their own back end to gcc for the alpha.

    i still dont know why compaq's doing this...
  • Yes, but on your 32bit system it's using twice the clock ticks to do it relative to the 64 bit system. This is because it has to address two values, perform the operation, then recombine (a toy sketch of this is below). With a 64bit system you would cut your number of addresses in half for those 64bit ops that are currently being split and multichannelled.

    Granted, it's not a 2 to (1 + 1) performance ratio in the truest sense but the concept is valid if not the accuracy of my description.

    On top of that, the previous post said nothing about running on 32bit. Alpha and several other currently available systems are running 64bit today (and for the past several years). True, x86 is not 64bit. IA-64 is not really an x86 processor but the next generation from Intel. IA-64 will bring Intel more in line with what other chip manufacturers have been doing for extreme high end systems for years and will bring it to prominence on the desktop.
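    A crude way to picture the split-and-recombine point (my own toy illustration in C, not how any particular compiler actually arranges it; it assumes unsigned long is 32 bits on the machine in question):

    /* 64-bit addition done with 32-bit halves, roughly what a 32-bit
       machine has to arrange.  On a 64-bit CPU like the Alpha the
       whole thing is a single add instruction.                      */
    #include <stdio.h>

    struct u64 { unsigned long hi, lo; };  /* assumes 32-bit unsigned long */

    struct u64 add64(struct u64 a, struct u64 b)
    {
        struct u64 r;
        r.lo = (a.lo + b.lo) & 0xffffffffUL;                  /* low halves */
        r.hi = (a.hi + b.hi + (r.lo < a.lo)) & 0xffffffffUL;  /* plus carry */
        return r;
    }

    int main(void)
    {
        struct u64 x = { 0x00000001UL, 0xffffffffUL };  /* 0x1ffffffff */
        struct u64 y = { 0x00000000UL, 0x00000001UL };  /* 1           */
        struct u64 z = add64(x, y);
        printf("0x%08lx%08lx\n", z.hi, z.lo);     /* 0x0000000200000000 */
        return 0;
    }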

    D. Keith Higgs
    CWRU. Kelvin Smith Library

  • Compaq already has a compiler. It's very inexpensive to port it to a new OS; they even already had ELF from another project. It would be much more expensive for them to play with the gcc back-end.
  • You would need each node to boot an OS image and I would prefer the optimized one. Either way you need a diskette in each node to boot, or use special ethernet cards that boot from a central server. This would be bad because it would hurt performance on the bottleneck of the supercomputer, which would be the speed of the ethernet. The floppies would also need to contain the special messaging software. Again the ethernet would clog everything if it's from a central server. Besides, you only boot once and after it's booted the diskette is no longer used. The other method is to install Beowulf on each hard drive. This would take too long to install.
  • by winnt386 ( 2942 )
    I made a few spelling errors. I also meant hard drive on the second to last sentence. Sorry
  • Quicksort...
    Mandelbrot...
    Neither of which bears the slightest resemblance to the kinds of code presently found in weather models such as MM5, nor planned for the WRF.

    If you want to benchmark, then do a meaningful benchmark!

  • I am curious as to the what the determining factors were for selecting Alphas over Pentium-based systems.

    I've installed Linux once on an Alpha box and the BIOS is truly impressive, much better than PCs. But what are some of the other reasons? Wider data/cpu buses? Larger memory configurations?

    Anyone who actually uses Linux on Alphas is encouraged to reply.
  • The 21264 is just a better architecture all around. First of all, everything is 64 bit. Secondly, the FP is 10 times faster than the current P3. Thirdly, Compaq has recently released compilers for Linux that provide up to a 30% speed increase.

    Probably the best thing is that engineers like alphas, and they like linux.

    Pan
  • That may kick ass, but imagine a Beowulf cluster made out of... oh wait, it already IS a cluster. :) I guess they need to get to work on that internet tunnelling massive computing surface initiative if they want to make this computer part of a Beowulf cluster...
  • Good to see that using Linux as a tool, a company can provide a commercial grade super computer at what appears to be a very attractive cost/performance ratio.

    Along with the use of Linux in digital VCRs and other Internet appliances this goes a long way to validating Linux as a viable, and very flexible commercial platform.

    -josh
  • This kind of system would be great in helping to minimize the damage to life and property in the tornado-ravaged areas of the Midwest. Having recently witnessed a tornado for the first time (in downtown SLC no less) I have a new interest in tech like this.
  • by Troy Baer ( 1395 ) on Friday September 17, 1999 @07:13AM (#1675873) Homepage

    I've installed Linux once on an Alpha box and the BIOS is truly impressive, much better than PCs. But what are some of the other reasons? Wider data/cpu buses? Larger memory configurations?

    The big thing about the Alpha for people like NOAA (who run big custom number-crunching apps written in FORTRAN) is its stellar FP performance. A 500MHz 21264 Alpha peaks at 1 GFLOPS and can sustain 25-40% of that, because of the memory bandwidth available. A Pentium III Xeon at the same clock rate peaks at 500MFLOPS and can sustain 20-30% of that.

    That doesn't fly for everybody, though. Where I work, we have a huge hodgepodge of message-passed, shared-memory, and vector scientific codes, plus needs for some canned applications that aren't available on the Alpha. We picked quad Xeons for our cluster and bought the Portland Group's compiler suite to try to get some extra performance out of the Intel chips.

    --Troy
  • by Panaflex ( 13191 ) <`moc.oohay' `ta' `ognidlaivivnoc'> on Friday September 17, 1999 @07:13AM (#1675874)
    Oops!! Sorry! 10 times faster is wrong. True specs are: (from www.spec.org)

    (UP2000 21264 667MHz -Alpha Processor Inc)
    53.7 SPECfp95
    32.1 SPECint95

    The P3 is

    (SE440BX2 MB/550MHz P3 -intel)
    15.1 SPECfp95
    22.3 SPECint95
  • Although HPTi may believe in Linux as a clustering solution, it would appear that they have trusted their web page to IIS 4.0. It also seems that their web authoring tool is MS-based, judging from the occurrence of "?" where normal punctuation would be found.

    This is good news, but it only affirms the role of Linux in niche markets. It will be some time before it is accepted widely as a general purpose business or desktop solution.
  • by aheitner ( 3273 ) on Friday September 17, 1999 @07:17AM (#1675877)
    General Processor Info [berkeley.edu].

    Compare the SPECfp scores of high-end Intel and Alpha offerings. Take a look at a 600MHz PIII Xeon and a 667MHz Alpha 21264.

    The reason to choose Alpha should be obvious.
  • i hear all of the great tales of lore about beowulf clusters and their amazing speed yet i am forced to ask if it will perform as advertised. as i understand it, (and i may be way off here, so please correct me) beowulf clusters do not completely overcome the problems that linux has with multiple processors. of course this is something hoped to be fixed in later kernel releases, but does the noaa really have the time to bring down a system such as this for kernel recompiles? a very fast machine? yes. but will it ever live up to its full potential? i hope it does, but i still have to wonder.
  • by coyote-san ( 38515 ) on Friday September 17, 1999 @09:36AM (#1675880)
    I worked at FSL for several years, although on a different project. I knew people working on the weather models, and I took a class on parallel processing from the CU professor who shared the old Paragon supercomputer with NOAA. I even had an account on the Paragon briefly (for that class) after leaving NOAA.

    NOAA needs to solve partial differential equations (PDEs). A *lot* of PDEs. My class spent a lot of time on numerical solution methods, and my entire undergraduate class in the early 80's was covered in the first lecture of my graduate class a few years ago. My Palm Pilot, running multigrid analysis, could beat the pants off a Cray X-MP running the best known algorithm from 15 years ago.

    AI programs may not scale well, but the type of work done at NOAA *does*. Furthermore, the hot topic a few years ago was applying some ideas from chaos theory to weather forecasts - take a dozen systems, insert just a little bit of noise into the initial data (essentially, instrument noise in your observations), then let them all run. If all models show the same weather phenomena, you can be pretty sure that it will occur. If the models show wildly different results (e.g., Hurricane Floyd slams into Key West in one run, but NYC in the other) you know that you can't make any firm predictions. As an educated layman's guess, I expect that the reason the hurricane forecasts are so much better than just a few years ago is precisely this type of variational analysis.
  • by Troy Baer ( 1395 ) on Friday September 17, 1999 @09:40AM (#1675881) Homepage

    If the G4 can sustain >1gflops, then why not build a cluster of G4s running LinuxPPC?

    I'm not convinced the G4 can sustain 1 GFLOP/s in any kind of real calculation -- it simply doesn't have enough memory bandwidth. The G4 uses the standard PC100 memory bus, AFAIK. That's 64 bits wide running at 100MHz = 800MB/s peak. So without help from the caches, the absolute best you can do on *any* PC100-based system is 200 MFLOP/s using 32-bit FP or 100 MFLOP/s using 64-bit FP. In practice you can only sustain about 300-350 MB/s out of the PC100 memory bus, so things get even worse. The caches will help quite a bit (maybe a factor of 2-4), but I have trouble imagining the G4 being able to sustain over 500 MFLOP/s even on something small like Linpack 100x100 because of the limited bandwidth and latency of the PC100 bus. Other processors that have similar peak FP ratings have much higher memory bandwidths; we've benchmarked an Alpha 21264 (1 GFLOP/s peak, ~400 MFLOP/s sustained) at about 1 GB/s memory bandwidth (that's measured, not peak), and a Cray T90 CPU (1.8 GFLOP/s peak, ~700 MFLOP/s sustained) at 11-13 GB/s (again, measured not peak).
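    For those following along, the arithmetic behind those numbers (my own sketch; it assumes one 64-bit operand has to come from main memory per floating-point operation, which is roughly what a streaming kernel does once the data no longer fits in cache):

    /* Why PC100 memory caps sustained FP rates.  A daxpy-style loop
       y[i] += a*x[i] streams 2 loads + 1 store of 8 bytes each for
       every 2 flops, i.e. 12 bytes of traffic per flop; even at the
       optimistic 8 bytes/flop used below, an 800 MB/s bus allows only
       ~100 MFLOP/s in 64-bit arithmetic once the caches stop helping. */
    #include <stdio.h>

    int main(void)
    {
        double bus_bytes  = 8.0;    /* 64-bit PC100 data bus             */
        double bus_mhz    = 100.0;
        double peak_mb_s  = bus_bytes * bus_mhz;   /* 800 MB/s peak      */
        double meas_mb_s  = 325.0;  /* ~300-350 MB/s typically sustained */

        double per_flop64 = 8.0;    /* bytes per flop, 64-bit operands   */
        double per_flop32 = 4.0;    /* bytes per flop, 32-bit operands   */

        printf("peak bus %.0f MB/s -> %.0f MFLOP/s (64-bit), %.0f (32-bit)\n",
               peak_mb_s, peak_mb_s / per_flop64, peak_mb_s / per_flop32);
        printf("measured %.0f MB/s -> %.0f MFLOP/s (64-bit)\n",
               meas_mb_s, meas_mb_s / per_flop64);
        return 0;
    }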

    There's also the question of compilers. You have to have a compiler that recognizes vectorizable loops and generates the appropriate machine code to use the vector unit. Unless Motorola's feeling *really* magnanimous, I don't see that kind of technology making it into gcc (and g77, more importantly for scientific codes) any time soon. Otherwise, you're at the mercy of a commercial Fortran compiler vendor like Portland Group or Absoft. PGI hasn't shown any interest in PowerPC to this point, and Absoft currently does PPC compilers only for MacOS 8, not OSX or LinuxPPC.

    I'd love to be proven wrong on this, but based on my experience I don't see how you could do it.

    --Troy
  • >Gotta admit I'm a little confused here. If you have a computationally intensive task, why would you ever want to run it through any VM or interpreter?

    The answer is you *DON'T*. This is basically crap from the JAVA crowd trying to pretend that JAVA is actually something you'll actually want to use in the real world. The Amiga ARexx crowd used to run around pulling the same kind of stunts too. I wouldn't be too surprised to discover if in fact a large number of the JAVA advocates posting here also ran around advocating the use of ARexx for *everything* on the Amiga, no matter how silly it was.
  • I would suspect that there's a number of reasons why NOAA went with the solution they did, and not just merely because it's a fast set of machines running a fast operating system.

    Every six hours the National Weather Service sends out to all of its forecast offices around the country a series of models to help in local forecasting. Each model is based on a massive amount of information that comes in to their central office, and that information is used in preparing the next set of forecasts. Now, you would want a) a system that is capable of processing all of this information rapidly and reliably, with b) redundancy built in so that if a part of the system goes down, you're still able to digest and transmit those models. Using a cluster of systems gives you that backup redundancy, and using a stable operating system gives you that speed and reliability to churn out models reliably.

    The people at NOAA likely couldn't care less about advocacy in this respect. What they want is a system that they can use, that provides them the reliability and performance that is demanded, for a reasonable cost. $15 million for a distributed cluster that gives them a lot more bang for the buck is definitely money well spent. And remember, this IS your tax dollars at work, one of the few times you will ever see it spent for a truly worthwhile cause.

    -Tal Greywolf
  • A friend of mine tried Linux on Alpha and the performance was quite bad. After a posting to a newsgroup she found out that the gcc compiler is not optimized for Alpha. She had to buy an expensive C/C++ one for the Alpha box and then after a recompile the performance was great. I wonder how hard it was to get the cluster going with the compiler issue. I would hate to make 80 diskettes for all the machines because of licensing issues with the compiler. I heard Alpha Linux lacked some features of the standard Intel one. Is this true or was it referring to the unoptimized compiler that comes with Alpha Red Hat Linux?
  • Wrong, it's actually Microsoft's JVM; as a Java programmer, I can vouch for that too.

    MS also won the award for best VM at JavaOne.
  • Alphas have several rather large advantages over Intel boxes. First, the floating point performance has a theoretical peak of twice that of a similarly clocked Intel. Secondly, the memory bandwidth is significantly better. Also, Alphas have 64bit PCI slots; I have never seen an Intel motherboard with 64 bit PCI slots, though it seems to me they should exist. Anyway, the peak bandwidth of Myrinet is greater than 32 bit PCI can support, so your NIC becomes a message passing bottleneck without 64 bit PCI.

    There are various types of alphas available. As has already been mentioned, the 21264 (ev6) is the latest and greatest. Price/performance wise, however, you simply can't beat its older cousin, the 21164 (ev56). Volume sales have driven the cost of the 21164 down to right around the same cost as a similarly clocked Intel box.

    Someone mentioned the K7, or AMD Athlon, as being faster than an Alpha. Not true. It has exactly the same floating point peak, and has the same bus as the ev6. However, due to its x86 instruction set, software has access to only 8 floating point registers, which means achievable peak is going to be quite a bit lower for the Athlon than for the ev6 (you wind up continually reloading stuff from L1 that you can keep in registers on the ev6), as the sketch below tries to illustrate.
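    Here is the kind of hand-unrolled inner loop I mean (a generic sketch of my own, not taken from any weather code): with 32 FP registers the four partial sums and the streaming operands can all stay in registers, while with only 8 x87 register slots the compiler has much less room and ends up spilling to L1.

    /* Four-way unrolled dot product: more FP registers means more
       independent accumulators can be kept live without spilling. */
    #include <stdio.h>

    double ddot4(const double *x, const double *y, int n)
    {
        double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
        int i;

        for (i = 0; i + 3 < n; i += 4) {
            s0 += x[i]     * y[i];
            s1 += x[i + 1] * y[i + 1];
            s2 += x[i + 2] * y[i + 2];
            s3 += x[i + 3] * y[i + 3];
        }
        for (; i < n; i++)          /* clean up the remainder */
            s0 += x[i] * y[i];

        return (s0 + s1) + (s2 + s3);
    }

    int main(void)
    {
        double x[8] = { 1, 2, 3, 4, 5, 6, 7, 8 };
        double y[8] = { 1, 1, 1, 1, 1, 1, 1, 1 };
        printf("dot = %.1f\n", ddot4(x, y, 8));    /* 36.0 */
        return 0;
    }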
  • come on now! we all know that plot is irrelevant these days!!! its all about explosions and breasts.
  • ...Now despite what most people think, the bottleneck (in this example) is in fact the I/O...
    LL, tell me about the analysis of computational complexity of your problem... or have you even analyzed it?? To model a particular domain for a particular time period, assuming a fixed archival output frequency (e.g., "We are saving snapshots every 15 minutes for analysis and archiving"), your I/O requirements vary inversely with the cube of your spatial resolution, whereas your computational intensity varies inversely with the fourth power. If you have a system with both performing satisfactorily at 5 KM, then at 100 M, you need 50^3 times the I/O but 50^4 times the CPU. In other words, if you bring in a new system in which you've scaled everything up by the same factor, and you think you have enough I/O, then you're way underpowered in the CPU department (you need 50 times more than you've got!!).
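    The arithmetic, spelled out (my numbers, using only the 5 km -> 100 m example above):

    /* The 5 km -> 100 m refinement, spelled out. */
    #include <stdio.h>

    int main(void)
    {
        double r = 5000.0 / 100.0;      /* resolution factor = 50         */
        double io  = r * r * r;         /* I/O volume scaling:   125,000x */
        double cpu = r * r * r * r;     /* CPU work scaling:   6,250,000x */
        printf("I/O x%.0f, CPU x%.0f, CPU shortfall x%.0f\n",
               io, cpu, cpu / io);
        return 0;
    }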

  • yeah, the anti-java bigots here ALWAYS forget about towerj
  • what are the areas that typically require heavy-duty processing power? all i know of is weather modeling and graphics rendering...
  • Don't know about the latest Alpha based systems, but by the terms of some supercomputer apps, such as matrix algebra stuff (FE, CFD, etc.), the bigger DEC servers were nothing to write home about around 18-24 months ago when I was doing a lot of benchmarking.

    The peak total memory bandwidth available then was 2.4GB/sec in the AlphaServer 8400, and it really had an impact on big calcs - can't speak for SPECfp, but for a big matrix algebra calc you need (asymptotically approaching) 4 bytes/sec of memory bandwidth per "flop", and these systems just didn't cut it.

    I won't even speak about 32-bit Intel boxes - the 100MHz cache bus sucks enormous rocks, and the 4GB memory limit (3GB with NT, less with Linux IIRC) cuts it out of the big job league anyway. This is maybe OK if it's a node in a large MPP system, but these days you want to be able to bring 64GB or more of RAM to bear on a single problem.

    The question we used to hear from our engineering staff was along the lines of: "Hey, my desktop PC is n-zillion MHz, and it runs this tiny test calc almost as fast as the big machine, why don't we just get a lot of big twin Xeon PC's with XYZ graphics cards?". Or occasionally, the same thing in favour of SGI workstations - engineers love toys just like the rest of us.

    This is the classic misconception caused by benchmarks in the FE industry; a lot of test calcs will fit in the cache on a Xeon PC or an R10k or UltraSparc workstation, and show pretty acceptable performance, but the dropoff when you move to a larger problem size and start hitting RAM is sudden and dramatic.

    By comparison, if you look at real supercomputers, like the high end Crays or NEC SX series, memory bandwidths of 2 to 4 GB/sec *per processor* are the norm.

    The machine we ended up buying to replace a low-end vector Cray was - an HP V-Class.

    The PA-RISC has excellent scoreboarding and memory bus, and the Convex architecture keeps it well fed. We tested on the Convex S-Class hardware running at 180MHz with SPP-UX, and HP guaranteed that the delivered system running HP-UX would meet the clock over clock speedup ratio, which it did with room to spare. We saw well over 700 MFlops *sustained* per CPU on a 200MHz PA-8200 using rather nondescript FORTRAN, against a theoretical peak of 800.

    The picture with the newer PA-8500 machines is not so rosy, as the memory bandwidth does not seem to have been scaled up with the capabilities of the new CPUs, especially with double the number of CPUs per board. Nevertheless, as the previous posters' figures would indicate, I believe the sustained throughput still exceeds that of the latest Alpha based systems for certain types of job, and the price/performance is very good.

    Of course, for the rabidly religious, Linux is still not well supported on PA-RISC, and doesn't handle the high end hardware.
  • We use alphas for the following reasons:
    1) They scale very easily
    2) They process very quickly
    3) They are totally modular, so if something breaks it's very easily replaced.
    4) Pentium based servers haven't quite got the architecture to allow for multiprocessing and multiuser processes.

    It's good to see this happening, especially after Microsoft stopped NT on Alphas. This would traditionally have been their area. If this sort of thing continues Linux would get a lot of kudos and respectability, which can only be good.

    I keep thinking back to the Coca-cola/Pepsi war, and the moment Coke changed their formula. Maybe Microsoft have just done the same thing and lost a lot of the battle.

    IA64 is good, but it will be a long time before it gets the stability and respect that Alpha processors currently have.
  • They have absolutely no need for sorting data or for doing calculations?
  • An oddity of the alpha design is that with each new evolution of the chip, the clock speed is actually dropping, and the processing power is increasing. This design means that they don't have to spend so much on working out all the cooling required for the board and can concentrate on actually making the bus go fast.
  • Though a good number of the people who responded to this are obvious flamebaiters, I'll take a minute to follow-up anyway.

    1) We're not using Java to gain in performance, obviously, we're trying to optimize performance
    of a system already written in Java.

    2) Solaris x86 JVMs also sucked. In fact, when we made the NT decision, JVMs on Solaris SPARC AND Solaris x86 were slower than on NT. Extensive benchmarking was done, using both our software, and simple benchmark tests.

    3) Only one person suggested that maybe Linux does need a better JVM. It's ironic that the response is to attack our software (which you know nothing about), Java, and our intelligence, rather than to suggest that writing a good JVM would be useful... R&D folks are taking a liking to Java, and without a good JVM Linux will be unusable by a fair portion of the R&D community.

    4) Actually, one of our people is writing a better JVM, though obviously it will be of little use to any of you...

    5) Um, we don't need Beowulf "to run Java", we need a cluster or supercomputer to run the very complicated software we've written in Java.

    It's funny, rather than being interested in how to expand the horizons of Linux and maybe try to understand why someone would want to use a VM based language like Java, people just get all uppity. Your computing paradigm is challenged, time to get defensive...

    Whatever.

    We're doing fine without Linux, actually, I just thought maybe some other Linux folks would be interested in writing a decent JVM, but we'll do it ourselves...

  • You're obviously too biased and ignorant to understand, but actually VM based systems are very useful for some real world issues such as system portability (Java runs on lots of stuff, few portability issues except with AWT UI stuff), easier verification of program correctness (pointers screw that right up), possibility of supercompilation (can't do that properly with pointers, either), etc. There is also a development time issue for very large systems, as we did not have to write our own memory management schemes, and the issue of this version of the system having been written primarily by scientists first, not programmers, making Java a good choice for ease of use.

    Java compilers (and supercompilers, which would run prior to a compiler, actually) are being developed, and while the compilers may not speed things up much, supercompilers will.

    So, if the JVMs don't totally suck, Java is about as good as C++, and only 2-3 times slower than C.
    With JNI we could rewrite very computationally intensive parts of the program in C, as well. As things like TowerJ and HotSpot are ported to Linux and other platforms, speed-ups occur there, as well.

    All in all, if you're working in C++ you can get roughly the same performance from Java... (it will require a lot of tricks to get C level performance... maybe even the Java chip... but so what? most of the system doesn't need it... many systems don't...)
  • All your points are valid and I'll briefly explain the nitty gritty:

    1) global circulation models are actually done by people in the US; downscaling via nested regional models is limited to this part of the world, and if and when the system becomes operationalised, it is expected to be distributed. Think cooperating groups around the world sharing the CPU burden

    2) the 100m models are interfaced to streamflow and catchment models which cover only a comparatively small region set within the wider desert (rather uninteresting). Think sparse multi-resolutional hierarchy.

    3) further submodels are inherently linear in space/time; while the climate fields are calculated once, the bulk of the operational landscape runs the scientists are interested in are multiple ensembles, which require lots of memory, hence some rather painful use of staging and compression. Think conversion to streaming media rather than static files.

    If you're interested in more details, send me your email and I'll point you to some of my papers.

    Regards,
    LL
  • Wow, that is one fast machine. All for just weather! Sure, it's not the fastest machine out there, but 4 TFlops for finding out if it's going to rain on Saturday? heh. just joking. The mathematical models used in weather forecasting, and understanding the complexities of even a single supercell (which produces thunderstorms and tornadic activity), are mind-boggling.
  • AC said: "Either way, you could make Toy Story in about 10 minutes on this thing once it's up."

    Yes, but what kind of plot? Would it be Woody and Mr. Potatohead lost in a hurricane with a large number of penguins?

    Raw power is cool, but art takes a bit more than that.

  • Looks like the folks at NOAA are shooting to confirm what we already know, and what Microsoft is hoping to learn if they can ever get Windows ported to a 64bit system. A 64bit capable OS (Linux) on 64bit iron (Alpha) absolutely SCREAMS next to the identically clocked, same-in-all-other-respects 32bit system running the 32bit version of said OS.

    Remember those vast performance diffs between the 80386SX-16 and the 80386DX-16? That's what we got here.

    <Note-to-Microsoft> Nanny-nanny-nah-nah, our OS runs on IA-64 and yours won't. </Note-to-Microsoft>

    D. Keith Higgs
    CWRU. Kelvin Smith Library

  • For $15,000,000 to buy an Alpha Beowulf, it sounds like they might have 2,500 nodes with a 'decent' Alpha system. But if they go really high end, they'll have about 750 nodes (for the 'killer' $20,000 Alpha machines).

    That doesn't include the cost of the Myrinet cards and switches, racks, 3rd party software, support people, power, cooling, etc. Believe me, if you're paying $15M for a machine, part of it better be going for support personnel and infrastructure. The configuration's probably more like 250-500 nodes with a corresponding number of Myrinet cards and switch ports, 30-75 racks (8 nodes/rack if you're lucky), a *buttload* of power and air conditioning, and 2-5 onsite support people working in it full time.

    --Troy
  • Hold on folks, this isn't necessarily a beowulf. I could not find the word "Beowulf" on the HPTi page. (Maybe I didn't look hard enough though).

    Not every Linux cluster is a Beowulf. The fastest alpha Linux cluster [sandia.gov] in existence is not a Beowulf.

    Anyone know what they plan to use?
  • by Anonymous Coward
    All that is being said is that Linux is being used for one type of supercomputing task: weather forecasting models. Some people are joyful about that. But that does not imply they think Linux is a solution for everything. It is reasonable to infer, though, that since you are making an incorrect logical inference, your logic may be flawed in other areas of reasoning. I can't say for sure whether your JAVA/NT solution is the best solution for your application. But since we have already established that you are a person of flawed logic, I wouldn't place a lot of confidence in your decision to use NT.
  • visit the SETI@home CPU type statistics page [berkeley.edu]. -- Alpha EV6 and EV67's are rockin' ass^H^H^H, if not as much as the "Intel Puntium" or "PowderPC" chips...
    --
  • by LL ( 20038 ) on Friday September 17, 1999 @10:08AM (#1675914)
    Buying the hardware is only 15-30% of the total cost. Also, in a production environment, you should not be fixated on the CPU. The question should be, within the capital budget, what is the best combination of resources that maximises the effectiveness of achieving your mission.

    To give you some real-world experience, a group I'm working with is looking at continental-scale simulation at a 5km resolution with the aim of going down to 100m. Now despite what most people think, the bottleneck (in this example) is in fact the I/O, with estimated total requirements of 30 TBytes. Doing the sums shows that to keep up with the CPU (say hypothetically 1 run/24 hours), you would need an average throughput of 350 MByte/sec. Hardware that supports both this volume and capacity is NOT cheap. We would joke that we paid x million for the I/O and SGI would throw in the Cray for free :-).
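    The "doing the sums" step, for anyone who wants to check it (my arithmetic, using only the figures quoted above):

    /* 30 TB written per 24-hour run works out to roughly 350 MB/s of
       sustained throughput, which is where the figure above comes from. */
    #include <stdio.h>

    int main(void)
    {
        double tbytes  = 30.0;
        double seconds = 24.0 * 3600.0;
        double mbytes  = tbytes * 1.0e6;           /* TB -> MB (decimal) */
        printf("required sustained I/O: %.0f MB/s\n", mbytes / seconds);
        return 0;
    }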

    Now as for how an Alpha cluster could be used, it would fit very nicely into the dedicated batch box category. It has a very high CPU rate and some decent compiler optimisation. As such it would augment whatever existing environment exists, reducing the workload of the more expensive machines for development which generally have better tools (just you try debugging a multi-gigabyte core dump). The biggest problem nowadays is not the algorithms, but managing the data traffic to the CPUs and this is where Linux clusters are weak with relatively slow interconnects, unbalanced memory hierarchies, and cheaper but higher latency memory. You have to accept the disadvantages and shift jobs which are not suited for this architecture off. A bit of smarts goes a long way in stretching the budget.

    LL
  • Anybody else noticed that Linux is not mentioned anywhere in the NOAA press release, while it's prominently displayed in the integrator's?

    Is the NOAA afraid to say that they are basing a 15 million dollar investment on free software rather than on something from Microsoft/Sun/IBM/whatever?

  • I think another (better?) answer is that gcc/egcs doesn't have much in the way of DSP type stuff, where you do parallel computations. Alphas get performance inherently, as its FPUs are very good, and it does not have to d!ck with SIMD instructions - something that many compilers don't do well anyways - usually you have to call hand coded assembly to get good performance out of SIMD (= single instruction multiple data, where one instruction is executed on multiple sets of data - like MMX, KNI (SSE), AltiVec, etc)

    And the raw bandwidth of even the unreleased G4s trails that of three-year-old Alpha designs anyway, and now there's the switch-matrix arch that gets close to twice the new G4's theoretical bandwidth (EV6 500 -> ~2.6 GB/s, G4 (7400) -> ~0.8 GB/s). And that is the 'theoretical' figure; Alphas still get 1.3GB/s in sustained throughput, 50% more than the G4's theoretical.
  • by quade]CnM[ ( 66269 ) on Friday September 17, 1999 @10:39AM (#1675918) Homepage
    This is not true of massively parallel systems such as Beowulf. The problem with Linux and scalability is more of a hardware problem than a software problem. While you aren't going to put Linux on a Sun E10k anytime soon, it was never meant to be on such a large SMP machine. The Intel SMP architecture is flawed in design. All processors share the same bus. Therefore if one processor can sustain 300M/sec of transfer, and you have 4 processors, that 800M/sec bus is going to slow down. Now your processors are only 2/3 as efficient as they are in a single system. But you are probably going to be slower than this because most RAM has a sustained transfer rate of only 150-200M/sec, so you only use 1/2 of the processor.


    To fix this, you use 2 processor buses and 2 memory buses. You fill these up, and you get 4 processor buses and 4 memory buses. Now you need to connect these bus segments. You have several options. First, connect them within the same machine. This is what NUMA is. The other route is to put each bus in a separate machine, each machine running a copy of the kernel locally, and connect each box together with a fast network. This is what Beowulf is.

    To give you an example, think of a highway system. If you have a lot of traffic switching lanes (buses) constantly, then it would be best to build one big 20-lane highway (NUMA). But if all the traffic basically keeps in its own lane, without much need to switch lanes (inter-process communication), then it may be more economical to build 10 2-lane highways (Beowulf).


    In fact, isn't a Cray T3E more of a Beowulf-type cluster of closely knit machines than a NUMA? I think each node on a T3E runs a local copy of the micro-kernel.
  • You seem to know what you're talking about so I'm worried I may be missing something, but I don't see why Motorola needs to feel magnanimous to contribute optimisations for their chips to gcc. Wouldn't they just need good business sense? Anything that increases the value of their processors must be a good thing for them. Or is vectorizing loops so hard a problem that they'd spend more than they'd gain?
    --
  • by Anonymous Coward
    Heck, My research center has Vacuum tubes computer that is faster than ASCI Red + All the flavors of Blue (9000 PPro + 6000 MIPS + 2000 Power3) You see, the trick is in the implementation. If you take 1 wavelength of an analog signal, there could easily be 100,000,000 discrete levels(especially with a 10,KV plate voltage.) Fine tuning of the voltage differentiation amplifier would probably quadruple the speed even more. Now we only have to upgrade the holographic scanner for the punchcard readers.

    Forget about any of these digital OSes, we even implemented our own ANALinux, which used OS technology that was originally implemented for the quantum computers that are slow to come about. Except for the fact that the probability wave algorithm in the kernel was reimplemented with the electron wave method (more discrete.)

    We can't open source it yet, since the whole kernel runs via negative feedback, so it is constantly being upgraded. We could take a snapshot of the loaded kernel image by detaching all the ferrule doughnuts at the same time, but the source would all be in analog stream and useless unless you have another valve box.

    It easily interfaces with outside systems even though it is 100% analog inside due to the (ported) quantum kernel's interface, which utilizes the duality of the wave and sends discrete signals to outside the box. The only problem is the primitiveness of current technology. Since petabit networking has not been implemented, we basically watch the tube's change in brightness as I/O. Current internet access by outsiders is via our webcam pointed at the tubes.

    This OS is totally unhackable since nobody knows how to hack it. Input is via variosistors instead of toggle switches, so all the script gramps who hacked their way into Univacs would not know how to break in.

    So all you digiphiles, put your toys down and use the computer that works the way humans do.
  • Not mentioned by the NOAA official release. Why would they need to mention it? Since it's HPTi that is going to be doing the work, it wouldn't be NOAA's place to say how HPTi would do what they are being contracted to do. Linux is mentioned at least 5 times on the HPTi press release.
  • by Anonymous Coward
    Well, 10x could be true for the code these guys may be running. (SPEC is not everything; this is very important for memory-intensive code.) Take a look at STREAM (memory bandwidth bench):

    PIII ~ 300MB/s
    Alpha DS20 ~ 1300MB/s

    And since these systems use EV6 "buses", each processor gets all that bandwidth to itself in multiprocessor systems. But back to SPEC, here are some more numbers. Published results at www.specbench.org:

    (Compaq XP1000 667 MHz)
    65.5 SPECfp95
    37.5 SPECint95

    (Compaq GS140 700 MHz)
    68.1 SPECfp95
    39.1 SPECint95

    Informal results (www.novaglobal.com.sg) (these systems have better memory systems than those above):

    (AlphaServer DS20 667 MHz)
    72 SPECfp95
    38 SPECint95

    And you can get a well equipped system (DS10) from www.dcginc.com for only $3500.
  • What sort of software is it running? What exactly uses all this power (I know how fiendishly complex weather predictions are...I'm just curious what kind of software exists/is being developed for it...)?

  • by Chalst ( 57653 ) on Friday September 17, 1999 @07:58AM (#1675927) Homepage Journal
    When you talk of Linux's problems with multiple processors, I think that you are referring to its limited SMP capacity.

    SMP (Symmetric Multi- Processing) is fundamentally different to clustering, as all of the processors in an SMP configuration share the same memory bus, whilst in a cluster the machine architectures are distinct, and we use a high-speed network to exploit parallelism.

    See the Linux Parallel Processing HOWTO [purdue.edu] for more information.
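    For anyone who hasn't seen what "exploiting parallelism over a high-speed network" looks like in code, here is about the smallest possible MPI program (a generic sketch of mine, nothing specific to Myrinet or to FSL's codes). Each copy runs as a separate process with its own private memory, and data is only shared by explicit messages:

    /* Minimal MPI example: every rank computes a partial sum and rank 0
       collects the total via an explicit message exchange.             */
    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char **argv)
    {
        int rank, size, i;
        double local = 0.0, total = 0.0;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        /* each node works on its own slice of 0..999999 */
        for (i = rank; i < 1000000; i += size)
            local += (double)i;

        MPI_Reduce(&local, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

        if (rank == 0)
            printf("sum = %.0f (computed on %d processes)\n", total, size);

        MPI_Finalize();
        return 0;
    }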
  • This is going to count as flamebait.. but has anyone had any experience porting MPI code from Linux/Solaris clusters to NT? i.e. same hardware? And assuming the same compilers etc.. Part of the problem is until recently there have been (allegedly) really nice compilers for NT that have not been available for Linux. Also BLAS routines were native only for NT by Intel for ages. I think they have been ported over now.

    From my understanding, for most work it makes absolutely no difference, as the overhead of the OS should be negligible. In my experience with single processor jobs with large memory (say > 500 megs), Solaris tends to run smoother.


    I ask this because I am moving to a school soon that got bought out by Microsoft, and they have ported all their code to just this: NT clusters using MPI (from Microsoft grant money that is being dumped on all the schools; p.s. we got it here.. we just umm formatted the drives ;)) and I am *really* *really* not looking forward to coding on NT, but it could be a learning experience (of sorts)... but I'd be interested in hearing what I should probably expect. (shortcomings, advantages, etc??)
