Supercomputer Advancement Slows?

Posted by Soulskill
from the moore-flops-moore-problems dept.
kgeiger writes "In the Feb. 2011 issue of IEEE Spectrum online, Peter Kogge, an IEEE Fellow and professor of computer science and engineering at the University of Notre Dame, outlines why we won't see exaflops computers soon. To start with, consuming 67 MW (an optimistic estimate) is going to make a lot of heat. He concludes, 'So don't expect to see a supercomputer capable of a quintillion operations per second appear anytime soon. But don't give up hope, either. [...] As long as the problem at hand can be split up into separate parts that can be solved independently, a colossal amount of computing power could be assembled similar to how cloud computing works now. Such a strategy could allow a virtual exaflops supercomputer to emerge. It wouldn't be what DARPA asked for in 2007, but for some tasks, it could serve just fine.'"

  • ...of all the existing supercomputers.

  • Is that Crysis 2 isn't out yet. When it is, people will be all going out to buy their own supercomputer to run the game.

  • by mlts (1038732) * on Friday January 28, 2011 @12:44PM (#35033850)

    In the past, a lot of applications needed a true supercomputer built to solve them, be it basic weather modeling, rendering for ray tracing, etc.

    Now, most applications can be done on COTS hardware. Because of this, there isn't much of a push to keep building faster and faster computers.

    So, other than the guys who need top-of-the-line CPU cycles for very detailed models, such as the modelling used to simulate nuclear testing, there isn't really as big a push for supercomputing as there was in the past.

    • by vbraga (228124) on Friday January 28, 2011 @12:51PM (#35033972) Journal

      I don't know if this is true.

      Weather modeling is still done on supercomputers.

      Engineering applications need high-performance computing on a regular basis: geophysics (offshore oil, 4D seismic, ...), materials science (MD, ...), and others. There are also academic problems.

      I've seen a lot of new HPC centers being built or getting new equipment in the last few years (Rio de Janeiro, Brazil). From small CUDA clusters to heavy duty Cray systems (not in Rio, but nearby).

      • The #1 supercomputer in the world is one of those CUDA clusters in China. At the moment, nVIDIA is where it's at for HPC as I understand it.
        • benchmarks aren't real work. and sadly the tail is wagging the dog to a great extent as people design computers to be good at benchmarks, rather than being as good at a real workload as possible and designing the benchmark to resemble the workload. it's a contest of napoleon complexes.

          i'd judge an architecture not by their slot on the benchmarks lists, but by the number and complexity of real workloads they actually are used for.

          • by allenw (33234)
            ... and to make matters worse, top500 is based primarily on LINPACK. So top500 is really a measure of how fast something can do floating point with a distributed shared memory model and not much else. Most of the systems listed in the top500 would fail miserably at heavy IO loads, which is what most of the increasingly common Big Data problems need. It concerns me that manufacturers are building systems for one type of heavy duty computing based on top500 while ignoring the others.
    • In the past, a lot of applications needed a true supercomputer built to solve them, be it basic weather modeling, rendering for ray tracing, etc.

      Now, most applications can be done on COTS hardware

      It's true, many applications that needed supercomputers in the past can be done by COTS hardware today. But this does not mean there are no applications for bigger computers. As each generation of computers assumes the tasks done by the former supercomputers, new applications appear for the next supercomputer.

      Take weather modeling, for instance. Today we still can't predict rain accurately. That's not because the modeling itself is not accurate, but because the spatial resolution needed to predict rainfall b

      • by bberens (965711)
        I think as competition grows in the cloud computing market we'll see a lot more modeling being done on the cloud. There's a lot to be said about having your own supercomputer for sure, but if I can get it done at a fraction of the cost by renting off-peak hours on Amazon's cloud... I'm convinced the future is there, it'll just take us another decade to migrate off our entirely customized and proprietary environments we see today.
        • by dkf (304284)

          I think as competition grows in the cloud computing market we'll see a lot more modeling being done on the cloud. There's a lot to be said about having your own supercomputer for sure, but if I can get it done at a fraction of the cost by renting off-peak hours on Amazon's cloud... I'm convinced the future is there, it'll just take us another decade to migrate off our entirely customized and proprietary environments we see today.

          Depends on the problem. Some things work well with highly distributed architectures like a cloud (e.g., exploring a "space" of parameters where there's not vast amounts to do at each "point") but others are far better off with traditional supercomputing (computational fluid dynamics is the classic example, of which weather modeling is just a particular type). Of course, some of the most interesting problems are mixes, such as pipelines of processing where some stages are embarrassingly distributable and oth

        • by sjames (1099)

          The cloud can handle a small subset well, the embarrassingly parallel workloads. For other simulations, the cloud is exactly the opposite of what's needed.

          It doesn't matter how fast all the CPUs are if they're all busy waiting on network latency. 3 to 10 microseconds is a reasonable latency in these applications.
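To put those latency figures in perspective, here is a rough sketch of how many clock cycles go by while one message is in flight. The 3 to 10 microsecond range comes from the comment above; the 3 GHz clock and the function name are assumptions made purely for illustration.

```python
# Rough sketch: cycles a core sits idle while one network message is in flight.
# CLOCK_HZ is an assumed clock rate chosen for illustration only.
CLOCK_HZ = 3_000_000_000  # 3 GHz, hypothetical


def idle_cycles(latency_us: int, clock_hz: int = CLOCK_HZ) -> int:
    """Clock cycles that elapse during a latency given in whole microseconds."""
    return latency_us * clock_hz // 1_000_000


for latency_us in (3, 10):
    print(f"{latency_us} us -> {idle_cycles(latency_us)} cycles")
```

Under that assumed clock, every blocking exchange costs thousands of cycles, so a code that talks to its neighbors often spends most of its time waiting unless the interconnect is very fast.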

    • by dr2chase (653338)
      The problem, not immediately obvious, is that if you shrink the grid size in a finite-elements simulation (which describes very very many of them), you must also shrink the time step, because you are modeling changes in the physical world, and it takes less time for change to propagate across a smaller element. And at each time step, everyone must chat with their "neighbors" about the new state of the world. The chatting is what supercomputers do well, compared to a city full of gaming rigs with GPUs.
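The scaling described above can be sketched numerically. This assumes an explicit scheme on a 3-D grid where the time step has to shrink in proportion to the grid spacing (the usual CFL-style stability constraint); the function name and parameters are hypothetical.

```python
def relative_cost(refine: int, dims: int = 3) -> int:
    """Work relative to the baseline when grid spacing is divided by `refine`.

    The cell count grows by refine**dims, and the number of time steps grows
    by `refine` because the step must shrink along with the spacing
    (assumed explicit time-stepping scheme).
    """
    cells = refine ** dims
    steps = refine
    return cells * steps


for r in (1, 2, 4, 10):
    print(f"spacing / {r}: {relative_cost(r)}x the work")
```

The number of neighbor exchanges grows with the step count too, which is why interconnect latency matters more, not less, as the grid gets finer.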

    • "They" (we all know who "they" are) want a panexaflop (one trillion exaflop) machine to break today's encryption technology (for the children/security/public safety). Of course, after "they" spend umpteen billions (lotsa billions), some crypto nerd working on his mom's PC will take crypto to a whole new level, and off we go again!

    • by Darinbob (1142669)

      I think the problem here is in calling these "applications". Most super computers are used to run "experiments". Scientists are always going to want to push to the limits of what they can compute. They're unlikely to just think that because a modern desktop is as fast as a super computer a couple decades ago, that they are fine just running the same numbers they ran a couple decades ago too.

  • ..but doesn't history show us most things stall before the next large wave of advances?
    • by c0d3g33k (102699)

      No, not really. "Shit happened" is about all that history really shows us. With the correct set of selected examples, it could probably also show us that things stall and stagnate so something else can provide the next large wave of advances.

    • There's no passion in the flop race anymore...it's just business. You can get something outrageous like that, but who's gonna care about it over more pressing issues like storage space. It's just not fun anymore, the computer age is dead. We're now in the handheld age, get used to it folks. :D

      Somewhat unrelated to the article I suppose, and I never thought I'd say this but buying a new computer is...boring.

      There I said it.

      I remember a time when if you waited three years, and got a computer, the difference
  • Current supercomputers are limited by consumer technology. Adding cores is already running out of steam on the desktop. On servers it works well because we are using them mainly for virtualization. Eight and sixteen core CPUs will border on useless on the desktop unless some significant change takes place in software to use them.

  • I would be very willing to run something akin to Folding@Home where I get paid for my idle computing power. Why build a supercomputing cluster when, for some applications, the idle CPU power of ten million consumer machines is perfectly adequate? Yes, there needs to be some way to verify the work, otherwise you could have cheating or people trolling the system, but it can't be too hard a problem to solve.

    • Two problems:
      1. The value of the work your CPU can do is probably less than the extra power it'll consume. Maybe the GPU could do it, but then:
      2. You are not a supercomputer. Computing power is cheap - unless you're running a cluster of GPUs, it could take a very long time for you to earn even enough to be worth the cost of the payment transaction.

      What you are talking about is selling CPU time. It's only had one real application since the mainframe days, and that's in cloud computing as it offe
    • by ceoyoyo (59147) on Friday January 28, 2011 @01:17PM (#35034290)

      Because nobody uses a real supercomputer for that kind of work. It's much cheaper to buy some processing from Amazon or use a loosely coupled cluster, or write an @Home style app.

      Supercomputers are used for tasks where fast communication between processors is important, and distributed systems don't work for these tasks.

      So the answer to your question is that tasks that are appropriate for distributed computing are already done that way (and when lots of people are willing to volunteer, why would they pay you?).

    • by Anonymous Coward

      That kind of thing (grid computing) is only good for 'embarrassingly parallel' problems. You cannot solve large coupled partial differential equation problems that way because of the required communication. And most problems in nature are large coupled PDEs.

  • How long until "the cloud" becomes Skynet?
  • by tarpitcod (822436) on Friday January 28, 2011 @01:28PM (#35034432)

    These modern machines, which consist of zillions of cores attached over very low bandwidth, high latency links, are really not supercomputers for a huge class of applications. They only work if your application exhibits extreme memory locality, needs hardly any interconnect bandwidth, and can tolerate long latencies.

    The current crop of machines is driven mostly by marketing folks and not by people who really want to improve the core physics like Cray used to.


    Take any of these zillion dollar piles of CPUs and just try doing this:
    for ( x = 0; x < bounds; ++x )
            humungousMemoryStructure [ x ] = humungousMemoryStructure1 [ x ] * humungousMemoryStructure2 [ randomAddress ] + humungousMemoryStructure3 [ anotherMostlyRandomAddress ] ;

    It'll suck eggs. You'd be better off with a single liquid nitrogen cooled GaAs/ECL processor surrounded by the fastest memory you can get your hands on all packed into the smallest place you can and cooled with LN or LHe.
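A toy model, not a measurement, of why that loop hurts: it counts how often consecutive accesses land on a different cache line for a sequential walk versus scattered indices. The 64-byte line, 8-byte element, array size, and function name are all assumptions for illustration; real hardware adds prefetchers, TLBs, and interconnect hops that this ignores.

```python
import random

LINE_BYTES = 64  # assumed cache-line size
ELEM_BYTES = 8   # assumed element size (a double)


def line_switches(indices):
    """Count accesses that land on a different cache line than the previous one."""
    per_line = LINE_BYTES // ELEM_BYTES
    switches = 0
    prev_line = None
    for i in indices:
        line = i // per_line
        if line != prev_line:
            switches += 1
        prev_line = line
    return switches


n = 1 << 16
sequential = list(range(n))
rng = random.Random(0)  # fixed seed so the run is repeatable
scattered = [rng.randrange(n) for _ in range(n)]

print("sequential:", line_switches(sequential))
print("scattered: ", line_switches(scattered))
```

In this model the sequential walk needs a new line once every eight accesses, while the scattered one needs a new line on nearly every access, and on a cluster many of those lines live on another node.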

    Half the problem is that everyone measures performance for publicity with LINPACK MFLOPS. It's a horrible metric.

    If you really want to build a great new supercomputer, get a (smallish) bunch of smart people together like Cray did, and focus on improving the core issues. Instead of spending all your efforts on hiding latency, tackle it head on. Figure out how to build a fast processor and cool it. Figure out how to surround it with memory.


    Customers will still use commodity MPP machines for the stuff that parallelizes.
    Customers will still hire mathematicians, and have them look at ways to map things that seem inherently non-local into spaces that are local.
    Customers who have money and whom the mathematicians couldn't help will need your company and your GaAs/ECL or LHe cooled fastest SCALAR / Short Vector box in the world.

    • Well, yeah, if you deliberately design a program to not take advantage of the architecture it's running on, then it won't take advantage of the architecture it's running on. (This, btw, is one of the great things about Linux, but that's not really what we're talking about.)

      One mistake you're making is in assuming only one kind of general computing improvement can be occurring at a time (and there is some good, quality irony in that *grin*). Cray (and others) can continue to experiment on the edge of the t

      • by Anonymous Coward

        I hear you, but sure, you can harness a million chickens over slow links and reinvent the transputer or Illiac IV, but you're then constraining yourself to problems where someone can actually get a handle on the locality. If they can't, you're *screwed* if you want to actually improve your ability to answer hard problems in a fixed amount of time.

        You can even just take your problem, and brute force parallelize it and say 'wow lets run it for time steps 1..1000' and farm that out to your MPP or your cluste

        • You do realize that if you go off-node on your cluster even over infiniband the 1uS is about equal to a late 1960's core memory access time right?

          Sure, but having 1960 mag core access to entirely different systems is pretty good, I'd say. And it will only improve.

          It's a false dichotomy. There are some problems that clusters are bad at. That is true. The balancing factor that you are missing is that there are problems that single-proc machines are bad at, also. For every highly sequential problem we know, we also know of a very highly parallel one. There are questions that cannot be efficiently answered in a many-node environment, but there are

          • by sjames (1099)

            a single thread of execution is GUARANTEED to be slower than even the most trivially optimized multithreaded case.

            That is true if and only if the cost of multithreading doesn't include greatly increased latency or contention. Those are the real killers. Even in SMP there are cases where you get eaten alive with cache ping ponging. The degree to which the cache, memory latency, and lock contention matter is directly controlled by the locality of the data.

            For an example, let's look at this very simple loop:

            FOR i=1to100

            You might be tempted to pre-compute B[i]+c[i] in one thread and add in a[i-1] i

            • You might be tempted to pre-compute B[i]+c[i] in one thread and add in a[i-1] in another, but you then have 2 problems. First, if you aren't doing a barrier sync in the loop the second thread might pass the first and the result is junk, but if you are, you're burning more time in the sync than you saved. Next, the time spent in the second thread loading the intermediate value cold from either RAM or L1 cache into a register will exceed the time it would take to perform the addition.

              Given some time, I can easily come up with far more perverse cases that come up in the real world.

              ...of course there's going to be some kind of synchronization. The suggestion otherwise implies a lack of experience in the field; failure to plan sync before anything else is an undergrad mistake.

              I fail to see how the sync burns more time than you save by threading the computation. It seems to me that doing operation a and operation b in sequence will almost always be slower than doing them simultaneously with one joining the other at the end (or, better and a little trickier, a max-reference count for t

              • by sjames (1099)

                ...of course there's going to be some kind of synchronization. The suggestion otherwise implies a lack of experience in the field; failure to plan sync before anything else is an undergrad mistake.

                As is not realizing that synchronization costs. How fortunate that I committed none of those errors! Synchronization requires atomic operations. On theoretical (read cannot be built) machines in CS, that may be a free operation. On real hardware, it costs extra cycles.

                As for cache assumptions, I am assuming that linear access to linear memory will result in cache hits. That's hardly a stretch given the way memory and cache are laid out these days.

                If you are suggesting that handing off those subt

                • I never suggested that synchronization is free. However, a CAS or other (x86-supported!) atomic instruction would suffice, so you are talking about one extra cycle and a cache read (in the worst case) for the benefit of working (at least) twice as fast; you will benefit from extra cores almost linearly until you've got the entire thing in cache.

                  The cache stuff is pretty straightforward. More CPUs = more cache = more cache hits. Making the assumption that a[], b[], and c[] are contiguous in memory only i

                  • by sjames (1099)

                    This is ignoring the trivially shallow dependance of the originally proposed computation (there's a simple loop invariant) and making the assumption that a difficult computation is being done.

                    I put the dependence there because it reflects the real world. For example, any iterative simulation. I could prove a lot of things if I get to start with ignoring reality. You asserted that there existed no case where a single thread performs as well as multiple threads, a most extraordinary claim. It's particularly extraordinary given that it actually claims that all problems are infinitely scalable with only trivial optimization.
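A small sketch of the kind of loop-carried dependence under discussion. The loop body is missing from the earlier comment, so the recurrence a[i] = a[i-1] + b[i] + c[i] used here is an assumption reconstructed from the description of the proposed split, and the function names are hypothetical.

```python
def fused(a0, b, c):
    """Single pass: each a[i] needs a[i-1] from the previous iteration."""
    a = [a0]
    for i in range(len(b)):
        a.append(a[-1] + b[i] + c[i])
    return a


def split(a0, b, c):
    """Two stages, as in the proposed threading split."""
    # Stage 1: independent per element, so it could run in another thread.
    partial = [bi + ci for bi, ci in zip(b, c)]
    # Stage 2: still a chain; iteration i cannot start before i-1 finishes.
    a = [a0]
    for p in partial:
        a.append(a[-1] + p)
    return a


b = list(range(1, 101))
c = [2 * x for x in b]
assert fused(0, b, c) == split(0, b, c)
print(len(split(0, b, c)) - 1, "dependent additions remain in stage 2")
```

The split removes one addition from each link of the chain but leaves the chain the same length, so whether it pays off depends on what the hand-off between the two stages costs.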

                    CAS is indeed an atomic operation that could be used (i would have used a s

                    • by tarpitcod (822436)


                      It's all easy if you ignore:

                      Pipeline stalls
                      Dynamic clock throttling on cores
                      Interconnect delays
                      Timing skews

                      It's the same problems the async CPU people go through, except everyone is wearing rose-colored spectacles and acting like they're still playing with nice synchronous clocking.

                      The semantics become horrible once you start stringing together bazillions of commodity CPU's. Guaranteeing the dependencies are satisfied becomes non-trivial like you say even for a single multi-core x86 p

                    • by sjames (1099)

                      Agreed. I'm really glad MPP machines are out there; there is a wide class of jobs that they do handle decently well for a tiny fraction of the cost. In fact, I've been specifying those for years (mostly a matter of figuring out where the budget is best spent given the expected workload and estimating the scalability) but as you say, it is also important to keep in mind that there is a significant class of problem they can't even touch. Meanwhile, the x86 line seems to have hit the wall at a bit over 3GHz c

                    • by tarpitcod (822436)

                      Some of the new ARM cores are getting interesting. I do wonder how much market share ARM will win from x86. You're right about the DDR specs smelling like QAM. They are doing a great job at getting more bandwidth, but the latency sucks worse than ever. When it gets too much we will finally see processors distributed in memory and Cray 3/SSS here we come...

                      I keep thinking more and more often that Amdahl's 'wafer scale' processor needs to be revisited. If you could build a say 3 centimeter square LN2 coole

                    • by sjames (1099)

                      The key part there is getting the memory up to the CPU speed. On-die SRAM is a good way to do that. It's way too expensive for a general purpose machine, but this is a specialized application. A few hundred MB would go a long way, particularly if either a DMA engine or separate general purpose CPU was handling transfers to a larger but higher latency memory concurrently. By making the local memory large enough and manually manageable with concurrent DMA, it could actually hide the latency of DDR SDRAM.

                      For a

                    • by tarpitcod (822436)

                      I thought about this some more and came to the same conclusion re external memory. I was trying to weigh the relative merit of very fast very small (say 4K instructions) channel processors that can stream memory into the larger SRAM banks. The idea would be DMA on steroids. If you're going to build a DMA controller and have the transistor budget then replacing a DMA unit with a simple in-order fast core might be a win, especially if it was fast enough that you could do bit vector stuff / record packing and

                    • by sjames (1099)

                      The channel controllers are a good idea. One benefit to that is there need be no real distinction between accessing another CPU's memory and an external RAM other than the speed/latency. So long as all off-chip access is left to the channel controllers with the CPU only accessing its on-die memory, variable timing off chip wouldn't be such a big problem. Only the channel controller would need to know.

                      The SDRAM memory controller itself and all the pins necessary to talk to SDRAM modules can be external to t

  • That little Cray thing looks really nice. Nice work, whoever did it. Reminds me of '90s side-scrolling games for some reason.

  • by ka9dgx (72702) on Friday January 28, 2011 @01:39PM (#35034602) Homepage Journal

    I read what I thought were the relevant sections of the big PDF file that went along with the article. They know that the actual RAM cell power use would only be 200 kW for an exabyte, but the killer comes when you address it in rows, columns, etc... then it goes to 800 kW, and then when you start moving it off chip, etc... it gets to the point where it just can't scale without running a generating station just to supply power.

    What if instead of trying to address everything that way, they break up the computing and move it to the data... so that RAM is tied directly to the logic that would use it... it would waste some logic gates, but the power savings would be more than worth it.

    Instead of having 8 kbit rows... just a 16x4 bit look up table would be the basic unit of computation. Globally read/writable at setup time, but otherwise only accessed via single bit connections to neighboring cells. Each cell would be capable of computing 4 single bit operations simultaneously on the 4 bits of input, and passing them to their neighbors.

    This bit processor grid (bitgrid) is Turing complete, and should be scalable to the exaflop scale, unless I've really missed something. I'm guessing somewhere around 20 megawatts for first generation silicon, then more like 1 megawatt after a few generations.
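One possible reading of the cell described above, as a sketch: a 16-row table of 4-bit values, so that the same 4 input bits select all 4 output bits at once. The class, method, and helper names are hypothetical, the example configuration is arbitrary, and the wiring to neighboring cells is not modeled.

```python
class BitGridCell:
    """A 16x4-bit lookup table: 4 input bits select one row of 4 output bits."""

    def __init__(self, table):
        if len(table) != 16 or any(not 0 <= row < 16 for row in table):
            raise ValueError("table must hold 16 rows of 4-bit values")
        self.table = list(table)  # written once at setup time

    def step(self, inputs: int) -> int:
        """Map a 4-bit input (0..15) to a 4-bit output (0..15)."""
        return self.table[inputs & 0xF]


def make_table(f0, f1, f2, f3):
    """Build a table from four single-bit functions of the 4 input bits."""
    rows = []
    for x in range(16):
        bits = [(x >> k) & 1 for k in range(4)]
        out = 0
        for k, f in enumerate((f0, f1, f2, f3)):
            out |= (f(*bits) & 1) << k
        rows.append(out)
    return rows


# Example configuration (arbitrary): AND, OR, XOR of the two low input bits,
# plus the parity of all four.
cell = BitGridCell(make_table(
    lambda a, b, c, d: a & b,
    lambda a, b, c, d: a | b,
    lambda a, b, c, d: a ^ b,
    lambda a, b, c, d: a ^ b ^ c ^ d,
))
print(format(cell.step(0b0011), "04b"))
```

Each cell evaluates in a single table lookup with no program counter or address decoding beyond the 4 input bits, which is the property the power argument relies on.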

    • by mangu (126918)

      Non-von-Neumann supercomputers have been built, look at this hypercube topology.

      The problem is the software, we have such a big collection of traditional libraries that it becomes hard to justify starting over in an alternative way.

    • by Animats (122034) on Friday January 28, 2011 @02:07PM (#35035024) Homepage

      What if instead of trying to address everything that way, they break up the computing and move it to the data... so that RAM is tied directly to the logic that would use it.

      It's been tried. See Thinking Machines Corporation. Not many problems will decompose that way, and all the ones that will can be decomposed onto clusters.

      The history of supercomputers is full of weird architectures intended to get around the "von Neumann bottleneck". Hypercubes, SIMD machines, dataflow machines, associative memory machines, perfect shuffle machines, partially-shared-memory machines, non-coherent cache machines - all were tried, and all went to the graveyard of bad supercomputing ideas.

      The two extremes in large-scale computing are clusters of machines interconnected by networks, like server farms and cloud computing, and shared-memory multiprocessors with hardware cache consistency, like almost all current desktops and servers. Everything else, with the notable exception of GPUs, has been a failure. Even the Cell, the most widely deployed non-standard architecture ever, was only used in the PS3, and was more trouble than it was worth.

      • by Mechanik (104328)

        Even the Cell, the most widely deployed non-standard architecture ever, was only used in the PS3, and was more trouble than it was worth.

        I think you are forgetting about the Roadrunner supercomputer, which has 12,960 PowerXCell processors. It was #1 on the supercomputer Top 500 in 2008. It's still at #7 as of November 2010.

    • by ka9dgx (72702)

      All of the examples you all gave to this point are still conventional CPUs with differences in I/O routing.

      I'm proposing something with no program counter, no stack, etc... just pure logic computation.

      And no, it's not an FPGA because those all have lots of routing as well.

  • I guess Peter Kogge doesn't keep up with current events in the tech industry like this one.
  • I liked this computer before, when it was called a Beowulf cluster.

    • indeed. and when it was cheap(er) than supercomputers. and when supercomputer vendors looked down their noses at it.

  • by Waffle Iron (339739) on Friday January 28, 2011 @02:53PM (#35035782)

    That doesn't seem like a show stopper. In the 1950s, the US Air Force built over 50 vacuum tube SAGE computers for air defense. Each one used up to 3 MW of power and probably wasn't much faster than an 80286. They didn't unplug the last one until the 1980s.

    If they get their electricity wholesale at 5 cents/kWh, 67 MW would cost about $30,000,000 per year. That's steep, but probably less than the cost to build and staff the installation.
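The estimate above is easy to check. A minimal sketch, using the 67 MW figure from the summary and the 5 cents/kWh price from the comment; a constant load all year is an assumption, and the function name is hypothetical.

```python
POWER_MW = 67            # from the article summary
PRICE_CENTS_PER_KWH = 5  # wholesale price assumed in the comment
HOURS_PER_YEAR = 24 * 365


def annual_cost_dollars(power_mw: int, cents_per_kwh: int) -> float:
    """Yearly electricity bill for a constant load, in dollars."""
    kwh_per_year = power_mw * 1000 * HOURS_PER_YEAR
    return kwh_per_year * cents_per_kwh / 100


print(f"${annual_cost_dollars(POWER_MW, PRICE_CENTS_PER_KWH):,.0f} per year")
```

That works out to a little over $29 million, consistent with the "about $30,000,000" figure.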

    • by Peeteriz (821290)

      As the article says, it's not the power requirements but the heat that worries them.

      67 MW of heat spread out in 50 buildings is ok; 67 MW of heat in a shared-memory device that needs to be physically small and compact for latency reasons may make it impossible.

  • And while giant high end supercomputers may progress more slowly, we're slowly seeing a revolution in personal supercomputing, where everyone can have a share of the pie. Witness CUDA, OpenCL, and projects like GPU.NET (.NET for the GPU, and apparently easy to use, though expensive for now).

    Along with advancements such as multitasking in the next generation of GPUs (yes, they can't actually multitask yet, but when they do it'll be killer for a few reasons), and a shared memory with the CPU (by combining
  • His statements are both true and false. It's true that exaflops is a big challenge, however, research on supercomputers has not stopped. But there are other areas which are being looked at too. For example - algorithms. Whenever a new supercomputer is developed, parallel programmers try to modify or come up with new algorithms that take advantage of the architectures/network speeds to make things faster. Heck, there are some applications that have started looking at avoiding huge computations and instead goi
  • I've read the article (the WHOLE article) and the exaflop issue is generally posed in terms of power requirements in reference to current silicon technology and its most strictly related future advancements. The caveat of that is that not even IBM thinks exaflop computing can be achieved with current technology, that's why they are deeply involved with photonic CMOS, of which they have already made the first working prototype. Research into exaflop computing in IBM is largely based on that. You can't achiev
  • Check out the Limulus Project
  • Well said, sir.


  • by Yaos (804128) on Friday January 28, 2011 @06:02PM (#35038404)
    Why has nobody tried this before? They could easily plow through the data from SETI, fold proteins, or even have a platform for creating and distributing cloud based computing turnkey computing solutions! It's too bad that the cloud was not invented until a year or two ago, this stuff could have probably started out in 1999 if the cloud existed back then.
    • by WorBlux (1751716)
      Because the word cloud really doesn't refer to any new innovation. It's marketing, just a new term for an old idea. Cloud just means either a distributed or a non-trivial client-server computation over the public internet. It's been around forever. SETI already makes use of what could be described as cloud computing. The reason now rather than then is the ubiquity of broadband, machines with significant idle time, and an increase in the number of programmers who know how to split very large problems
    • by km_2_go (1404213)

      Mod parent up. For those whose head this has flown over, distributed supercomputing is not a new idea. It has been implemented, most famously with SETI@home, for quite a while. "Cloud" is merely a new word for an old concept.

  • I disclosed this sort-of-cogeneration idea before on the open manufacturing list so that no one could patent it, but for years I've been thinking that the electric heaters in my home should be supercomputer nodes (or doing other industrial process work), controlled by thermostats (or controlled by some algorithm related to expectations of heat needs).

    When we want heat, the processors click on and do some computing and we get the waste heat to heat our home. When the house is warm enough, they shut down. The
