Supercomputer Advancement Slows?
kgeiger writes "In the Feb. 2011 issue of IEEE Spectrum online, Peter Kogge, an IEEE Fellow and professor of computer science and engineering at the University of Notre Dame, outlines why we won't see exaflops computers soon. To start with, consuming 67 MW (an optimistic estimate) is going to make a lot of heat. He concludes, 'So don't expect to see a supercomputer capable of a quintillion operations per second appear anytime soon. But don't give up hope, either. [...] As long as the problem at hand can be split up into separate parts that can be solved independently, a colossal amount of computing power could be assembled similar to how cloud computing works now. Such a strategy could allow a virtual exaflops supercomputer to emerge. It wouldn't be what DARPA asked for in 2007, but for some tasks, it could serve just fine.'"
just make a cluster... (Score:2)
...of all the existing supercomputers.
The real reason.. (Score:1)
Is that Crysis 2 isn't out yet. When it is, people will be all going out to buy their own supercomputer to run the game.
Less a matter of can't than won't (Score:3)
In the past, there were a lot of applications that a true supercomputer had to be built to solve, be it basic weather modeling, rendering stuff for ray-tracing, etc.
Now, most of those applications can be handled by COTS hardware. Because of this, there isn't much of a push to keep building faster and faster computers.
So, other than the guys who need top-of-the-line CPU cycles for very detailed models, such as the modeling used to simulate nuclear testing, there isn't as big a push for supercomputing as there was in the past.
Comment removed (Score:5, Interesting)
Re: (Score:1)
Re: (Score:1)
Benchmarks aren't real work. And sadly the tail is wagging the dog to a great extent, as people design computers to be good at benchmarks rather than being as good as possible at a real workload and designing the benchmark to resemble the workload. It's a contest of Napoleon complexes.
I'd judge an architecture not by its slot on the benchmark lists, but by the number and complexity of real workloads it is actually used for.
Re: (Score:2)
More and more applications all the time (Score:3)
In the past, there were a lot of applications that a true supercomputer had to be built to solve, be it basic weather modeling, rendering stuff for ray-tracing, etc.
Now, most of those applications can be handled by COTS hardware
It's true, many applications that needed supercomputers in the past can be done on COTS hardware today. But that doesn't mean there are no applications for bigger computers. As each generation of computers assumes the tasks once done by supercomputers, new applications appear for the next supercomputer.
Take weather modeling, for instance. Today we still can't predict rain accurately. That's not because the modeling itself is not accurate, but because the spatial resolution needed to predict rainfall b
Re: (Score:2)
Re: (Score:2)
I think as competition grows in the cloud computing market we'll see a lot more modeling being done on the cloud. There's a lot to be said for having your own supercomputer, for sure, but if I can get it done at a fraction of the cost by renting off-peak hours on Amazon's cloud... I'm convinced the future is there; it'll just take us another decade to migrate off the entirely customized and proprietary environments we see today.
Depends on the problem. Some things work well with highly distributed architectures like a cloud (e.g., exploring a "space" of parameters where there isn't a vast amount to do at each "point"), but others are far better off with traditional supercomputing (computational fluid dynamics is the classic example, of which weather modeling is just a particular type). Of course, some of the most interesting problems are mixes, such as pipelines of processing where some stages are embarrassingly distributable and oth
Re: (Score:2)
The cloud can handle a small subset well, the embarrassingly parallel workloads. For other simulations, the cloud is exactly the opposite of what's needed.
It doesn't matter how fast all the CPUs are if they're all busy waiting on network latency. Three to ten microseconds is a reasonable latency in these applications.
Re: (Score:3)
Re: (Score:2)
"They" (we all know who "they" are) want a panexaflop (one trillion exaflop) machine to break todays encryption technology (for the children/security/public safety), of course after "they" spend umpteen billions (lotsa billions) some crypto nerd working on his mom's PC will take crypto to a whole new level, and off we go again!
Re: (Score:3)
I think the problem here is in calling these "applications". Most supercomputers are used to run "experiments". Scientists are always going to push to the limits of what they can compute. They're unlikely to think that, just because a modern desktop is as fast as a supercomputer from a couple of decades ago, they're fine running the same numbers they ran a couple of decades ago, too.
Re: (Score:3)
So wait. Your answer to "very expensive general-purpose machine" is "design many slightly less expensive single-purpose machines"? Your "factor of a hundred" performance improvement will likely be overshadowed by a "factor of a thousand" increase in economic cost.
Provide believable numbers or your argument is bullshit. You may be right, but your style of discourse requires concrete evidence to be at all convincing.
Re: (Score:2)
This is not confined to the computer industry (and not news, either).
See "The Logic of Failure", 1996 (with roots back to TANALAND, appr. 1980?)
CC.
slowing... (Score:1)
Re: (Score:2)
No, not really. "Shit happened" is about all that history really shows us. With the correct set of selected examples, it could probably also show us that things stall and stagnate so something else can provide the next large wave of advances.
Re: (Score:1)
Somewhat unrelated to the article, I suppose, and I never thought I'd say this, but buying a new computer is... boring.
There I said it.
I remember a time when if you waited three years, and got a computer, the difference
Moore's law meets Amdahl's law (Score:2)
Current supercomputers are limited by consumer technology. Adding cores is already running out of steam on the desktop. On servers it works well because we are using them mainly for virtualization. Eight- and sixteen-core CPUs will border on useless on the desktop unless some significant change takes place in software to use them.
Re: (Score:2)
Re: (Score:2)
Actually, if you read the link, the problem is that the interconnects and the limited parallelism of the problems are the limitations.
"The good news is that over the next decade, engineers should be able to get the energy requirements of a flop down to about 5 to 10 pJ. The bad news is that even if we do that, it won't really help. The reason is that the energy to perform an arithmetic operation is trivial in comparison with the energy needed to shuffle the data around, from one chip to another, from
Rent Out My Machine (Score:2)
I would be very willing to run something akin to Folding@Home where I get paid for my idle computing power. Why build a supercomputing cluster when, for some applications, the idle CPU power of ten million consumer machines is perfectly adequate? Yes, there needs to be some way to verify the work, otherwise you could have cheating or people trolling the system, but it can't be too hard a problem to solve.
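For what it's worth, the usual answer to the verification problem in volunteer computing is redundancy: send the same work unit to several machines and accept a result only when enough of them agree. A toy sketch of that idea (the replica count, quorum and result type are illustrative assumptions, not any particular project's policy):

/* Toy majority-vote verifier: a work unit is accepted only if enough
 * independently returned results agree.  Real projects such as BOINC
 * add credit accounting and spot-checking on top of this. */
#include <stdio.h>

#define REPLICAS 5
#define QUORUM   3    /* assumed threshold: at least 3 of 5 must agree */

int verify_work_unit(const long results[REPLICAS])
{
    for (int i = 0; i < REPLICAS; ++i) {
        int agree = 0;
        for (int j = 0; j < REPLICAS; ++j)
            if (results[j] == results[i])
                ++agree;
        if (agree >= QUORUM)
            return 1;   /* accept the majority value */
    }
    return 0;           /* no quorum: reissue the work unit */
}

int main(void)
{
    long honest[REPLICAS]   = { 42, 42, 42, 42, 42 };
    long cheaters[REPLICAS] = { 42, 7, 42, 99, 7 };

    printf("all honest:    %s\n", verify_work_unit(honest)   ? "accepted" : "rejected");
    printf("some cheaters: %s\n", verify_work_unit(cheaters) ? "accepted" : "rejected");
    return 0;
}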
Re: (Score:3)
1. The value of the work your CPU can do is probably less than the cost of the extra power it'll consume. Maybe the GPU could do it, but then:
2. You are not a supercomputer. Computing power is cheap - unless you're running a cluster of GPUs, it could take a very long time for you to earn even enough to be worth the cost of the payment transaction.
What you are talking about is selling CPU time. It's only had one real application since the days of the mainframe, and that's in cloud computing as it offe
Re:Rent Out My Machine (Score:4, Insightful)
Because nobody uses a real supercomputer for that kind of work. It's much cheaper to buy some processing from Amazon or use a loosely coupled cluster, or write an @Home style app.
Supercomputers are used for tasks where fast communication between processors is important, and distributed systems don't work for these tasks.
So the answer to your question is that tasks that are appropriate for distributed computing are already done that way (and when lots of people are willing to volunteer, why would they pay you?).
Re: (Score:1)
That kind of thing (grid computing) is only good for "embarrassingly parallel" problems. You cannot solve large coupled partial differential equation problems because of the required communication. And most problems in nature are large coupled PDEs.
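To make that concrete: even the simplest explicit 1-D heat-equation update couples every grid point to its neighbors at every time step, so as soon as the grid is split across machines, every step forces an exchange of boundary values, which is exactly where network latency kills you. A minimal serial sketch (grid size and coefficient are arbitrary illustrations):

/* 1-D explicit heat equation: u_new[i] depends on u[i-1], u[i], u[i+1].
 * Split the i-range across nodes and every time step needs boundary
 * values from the neighboring node; that per-step exchange is the
 * latency-bound communication grid computing cannot hide. */
#include <stdio.h>

#define N 1000

int main(void)
{
    static double u[N], u_new[N];
    const double alpha = 0.25;   /* diffusion coefficient * dt/dx^2 (illustrative) */

    u[N / 2] = 1.0;              /* initial spike in the middle */

    for (int step = 0; step < 100; ++step) {
        for (int i = 1; i < N - 1; ++i)
            u_new[i] = u[i] + alpha * (u[i - 1] - 2.0 * u[i] + u[i + 1]);
        for (int i = 1; i < N - 1; ++i)   /* copy back for the next step */
            u[i] = u_new[i];
    }

    printf("u[N/2] after 100 steps: %f\n", u[N / 2]);
    return 0;
}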
Re:How about those limited edition Gallium chips?? (Score:4, Informative)
A little bird informs the world that the US has a supercomputer already running on them, somewhere between 100 GHz and 1 THz per processor
Unlikely. If you do the calculations, you'll find that the current 3 GHz limit is about as fast as you can get data from other chips on a circuit board. 3 GHz is a 0.33-nanosecond period, the time it takes light to travel ten centimeters in a vacuum. A faster CPU will sit idle most of the time, waiting for the data it requested from other chips to arrive at the speed of light.
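A quick check of that arithmetic, for the skeptical:

/* One cycle at 3 GHz is about 0.33 ns, in which light in a vacuum
 * covers roughly 10 cm; signals on a real board are slower still. */
#include <stdio.h>

int main(void)
{
    const double c      = 299792458.0;   /* speed of light, m/s */
    const double freq   = 3e9;           /* 3 GHz clock */
    const double period = 1.0 / freq;    /* seconds per cycle */

    printf("Clock period: %.3f ns\n", period * 1e9);
    printf("Light travels %.1f cm per cycle in a vacuum\n", c * period * 100.0);
    return 0;
}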
How Long (Score:1)
LATENCY IS FOREVER (Score:3, Insightful)
These modern machines, which consist of zillions of cores attached over very low-bandwidth, high-latency links, are really not supercomputers for a huge class of applications, unless your application exhibits extreme memory locality and needs hardly any interconnect bandwidth, or can tolerate long latencies.
The current crop of machines is driven mostly by marketing folks and not by people who really want to improve the core physics like Cray used to.
BANDWIDTH COSTS MONEY, LATENCY IS FOREVER
Take any of these zillion-dollar piles of CPUs and just try doing this:
for ( x = 0; x < bounds; ++x )
{
    humungousMemoryStructure [ x ] = humungousMemoryStructure1 [ x ] * humungousMemoryStructure2 [ randomAddress ] + humungousMemoryStructure3 [ anotherMostlyRandomAddress ] ;
}
It'll suck eggs. You'd be better off with a single liquid-nitrogen-cooled GaAs/ECL processor surrounded by the fastest memory you can get your hands on, all packed into the smallest space you can and cooled with LN or LHe.
Half the problem is that everyone measures performance for publicity with LINPACK MFLOPS. It's a horrible metric.
If you really want to build a great new supercomputer, get a (smallish) bunch of smart people together like Cray did, and focus on improving the core issues. Instead of spending all your efforts on hiding latency, tackle it head-on. Figure out how to build a fast processor and cool it. Figure out how to surround it with memory.
Yes,
Customers will still use commodity MPP machines for the stuff that parallelizes.
Customers will still hire mathematicians, and have them look at ways to map things that seem inherently non-local into spaces that are local.
Customers who have money, and whom the mathematicians couldn't help, will need your company and your GaAs/ECL or LHe-cooled fastest scalar / short-vector box in the world.
Re: (Score:2)
Well, yeah, if you deliberately design a program to not take advantage of the architecture it's running on, then it won't take advantage of the architecture it's running on. (This, btw, is one of the great things about Linux, but that's not really what we're talking about.)
One mistake you're making is in assuming only one kind of general computing improvement can be occurring at a time (and there is some good, quality irony in that *grin*). Cray (and others) can continue to experiment on the edge of the t
Re: (Score:1)
I hear you, but sure, you can harness a million chickens over slow links and reinvent the transputer or the Illiac IV; you're then constraining yourself to problems where someone can actually get a handle on the locality. If they can't, you're *screwed* if you actually want to improve your ability to answer hard problems in a fixed amount of time.
You can even just take your problem, brute-force parallelize it, say "wow, let's run it for time steps 1..1000," and farm that out to your MPP or your cluste
Re: (Score:2)
You do realize that if you go off-node on your cluster, even over InfiniBand, the 1 µs is about equal to a late-1960s core memory access time, right?
Sure, but having 1960s mag-core access times to entirely different systems is pretty good, I'd say. And it will only improve.
It's a false dichotomy. There are some problems that clusters are bad at. That is true. The balancing factor that you are missing is that there are problems that single-proc machines are bad at, also. For every highly sequential problem we know, we also know of a very highly parallel one. There are questions that cannot be efficiently answered in a many-node environment, but there are
Re: (Score:2)
a single thread of execution is GUARANTEED to be slower than even the most trivially optimized multithreaded case.
That is true if and only if the cost of multithreading doesn't include greatly increased latency or contention. Those are the real killers. Even in SMP there are cases where you get eaten alive with cache ping ponging. The degree to which the cache, memory latency, and lock contention matter is directly controlled by the locality of the data.
For an example, let's look at this very simple loop:
FOR i = 1 TO 100
    A[i] = B[i] + c[i] + A[i-1]
You might be tempted to pre-compute B[i]+c[i] in one thread and add in a[i-1] i
Re: (Score:2)
You might be tempted to pre-compute B[i]+c[i] in one thread and add in a[i-1] in another, but you then have 2 problems. First, if you aren't doing a barrier sync in the loop the second thread might pass the first and the result is junk, but if you are, you're burning more time in the sync than you saved. Next, the time spent in the second thread loading the intermediate value cold from either RAM or L1 cache into a register will exceed the time it would take to perform the addition.
Given some time, I can easily come up with far more perverse cases that come up in the real world.
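One concrete example of the "cache ping ponging" mentioned above is false sharing: two threads that never touch the same variable still fight over the same cache line. A minimal pthread sketch of the effect (the iteration count and the 64-byte line-size guess are illustrative assumptions); time it with and without the padding to see the difference:

/* build: cc -O2 -pthread false_sharing.c
 * Both counters live on one cache line, so the two threads invalidate
 * each other's cache constantly even though they never touch the same
 * variable.  Uncomment the padding to separate the lines and the same
 * code typically runs several times faster. */
#include <pthread.h>
#include <stdio.h>

#define ITERS 100000000L

struct counters {
    volatile long a;
    /* char pad[64]; */    /* assumed cache-line size; uncomment to kill the sharing */
    volatile long b;
} shared;

static void *bump_a(void *arg) { (void)arg; for (long i = 0; i < ITERS; ++i) shared.a++; return NULL; }
static void *bump_b(void *arg) { (void)arg; for (long i = 0; i < ITERS; ++i) shared.b++; return NULL; }

int main(void)
{
    pthread_t t1, t2;
    pthread_create(&t1, NULL, bump_a, NULL);
    pthread_create(&t2, NULL, bump_b, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("a=%ld b=%ld\n", shared.a, shared.b);
    return 0;
}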
...of course there's going to be some kind of synchronization. The suggestion otherwise implies a lack of experience in the field; failure to plan sync before anything else is an undergrad mistake.
I fail to see how the sync burns more time than you save by threading the computation. It seems to me that doing operation a and operation b in sequence will almost always be slower than doing them simultaneously with one joining the other at the end (or, better and a little trickier, a max-reference count for t
Re: (Score:2)
...of course there's going to be some kind of synchronization. The suggestion otherwise implies a lack of experience in the field; failure to plan sync before anything else is an undergrad mistake.
As is not realizing that synchronization costs. How fortunate that I committed none of those errors! Synchronization requires atomic operations. On theoretical (read: cannot be built) machines in CS, that may be a free operation. On real hardware, it costs extra cycles.
As for cache assumptions, I am assuming that linear access to linear memory will result in cache hits. That's hardly a stretch given the way memory and cache are laid out these days.
If you are suggesting that handing off those subt
Re: (Score:2)
I never suggested that synchronization is free. However, a CAS or other (x86-supported!) atomic instruction would suffice, so you are talking about one extra cycle and a cache read (in the worst case) for the benefit of working (at least) twice as fast; you will benefit from extra cores almost linearly until you've got the entire thing in cache.
The cache stuff is pretty straightforward. More CPUs = more cache = more cache hits. Making the assumption that a[], b[], and c[] are contiguous in memory only i
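For readers who haven't used one, here is a minimal sketch of the kind of single-CAS handoff being discussed, using C11 <stdatomic.h>; the flag protocol is an illustrative assumption, not the poster's exact scheme:

/* Minimal CAS handoff: the producer publishes a value, the consumer
 * claims it with one compare-and-swap.  Cheap, but as noted above not
 * free: it still costs a cache-line transfer between the cores. */
#include <stdatomic.h>
#include <stdio.h>

static _Atomic int slot_full = 0;   /* 0 = empty, 1 = value published */
static int slot_value;

void produce(int v)
{
    slot_value = v;
    atomic_store_explicit(&slot_full, 1, memory_order_release);
}

int try_consume(int *out)
{
    int expected = 1;
    /* CAS: succeed only if the slot is full, marking it empty atomically */
    if (atomic_compare_exchange_strong_explicit(&slot_full, &expected, 0,
            memory_order_acquire, memory_order_relaxed)) {
        *out = slot_value;
        return 1;
    }
    return 0;
}

int main(void)
{
    int v;
    produce(42);
    if (try_consume(&v))
        printf("consumed %d\n", v);
    return 0;
}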
Re: (Score:2)
This is ignoring the trivially shallow dependence of the originally proposed computation (there's a simple loop invariant) and making the assumption that a difficult computation is being done.
I put the dependence there because it reflects the real world. For example, any iterative simulation. I could prove a lot of things if I get to start with ignoring reality. You asserted that there existed no case where a single thread performs as well as multiple threads, a most extraordinary claim. It's particularly extraordinary given that it actually claims that all problems are infinitely scalable with only trivial optimization.
CAS is indeed an atomic operation that could be used (i would have used a s
Re: (Score:1)
Exactly!
It's all easy if you ignore:
Cache misses
Pipeline stalls
Dynamic clock throttling on cores
Interconnect delays
Timing skews
It's the same problems the async CPU people go through, except everyone is wearing rose-colored spectacles and acting like they're still playing with nice synchronous clocking.
The semantics become horrible once you start stringing together bazillions of commodity CPU's. Guaranteeing the dependencies are satisfied becomes non-trivial like you say even for a single multi-core x86 p
Re: (Score:2)
Agreed. I'm really glad MPP machines are out there; there is a wide class of jobs that they handle decently well for a tiny fraction of the cost. In fact, I've been specifying those for years (mostly a matter of figuring out where the budget is best spent given the expected workload and estimating the scalability), but as you say, it is also important to keep in mind that there is a significant class of problems they can't even touch. Meanwhile, the x86 line seems to have hit the wall at a bit over 3GHz c
Re: (Score:1)
Some of the new ARM cores are getting interesting. I do wonder how much market share ARM will win from x86. You're right about the DDR specs smelling like QAM. They are doing a great job of getting more bandwidth, but the latency sucks worse than ever. When it gets too much we will finally see processors distributed in memory, and Cray 3/SSS here we come...
I keep thinking more and more often that Amdahl's 'wafer scale' processor needs to be revisited. If you could build a say 3 centimeter square LN2 coole
Re: (Score:2)
The key part there is getting the memory up to the CPU speed. On-die SRAM is a good way to do that. It's way too expensive for a general purpose machine, but this is a specialized application. A few hundred MB would go a long way, particularly if either a DMA engine or separate general purpose CPU was handling transfers to a larger but higher latency memory concurrently. By making the local memory large enough and manually manageable with concurrent DMA, it could actually hide the latency of DDR SDRAM.
For a
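A rough sketch of the double-buffered scheme described above, with an ordinary thread standing in for the DMA engine / channel processor and made-up buffer sizes; a real design would use hardware, not a thread:

/* build: cc -O2 -pthread double_buffer.c
 * While the CPU works on one tile in fast local memory, a second thread
 * (the stand-in for the DMA/channel engine) copies the next tile in
 * from big, slow memory. */
#include <pthread.h>
#include <string.h>
#include <stdio.h>

#define BIG  (1L << 20)    /* pretend DDR: 1M doubles */
#define TILE (1L << 14)    /* pretend on-die SRAM tile: 16K doubles */

static double slow_mem[BIG];
static double tile[2][TILE];          /* ping-pong local buffers */

struct fill_req { int buf; long offset; };

static void *dma_fill(void *arg)      /* the "channel processor" */
{
    struct fill_req *r = arg;
    memcpy(tile[r->buf], &slow_mem[r->offset], TILE * sizeof(double));
    return NULL;
}

int main(void)
{
    for (long i = 0; i < BIG; ++i) slow_mem[i] = (double)i;

    double sum = 0.0;
    int cur = 0;
    struct fill_req first = { 0, 0 };
    dma_fill(&first);                 /* prime the first tile */

    for (long off = 0; off < BIG; off += TILE) {
        pthread_t dma;
        struct fill_req next = { !cur, (off + TILE) % BIG };
        pthread_create(&dma, NULL, dma_fill, &next);  /* start the next transfer */

        for (long i = 0; i < TILE; ++i)               /* compute on the current tile */
            sum += tile[cur][i];

        pthread_join(dma, NULL);      /* "DMA complete" before swapping buffers */
        cur = !cur;
    }
    printf("sum = %.0f\n", sum);
    return 0;
}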
Re: (Score:1)
I thought about this some more and came to the same conclusion re external memory. I was trying to weigh the relative merit of very fast, very small (say, 4K instructions) channel processors that can stream memory into the larger SRAM banks. The idea would be DMA on steroids. If you're going to build a DMA controller and have the transistor budget, then replacing a DMA unit with a simple in-order fast core might be a win, especially if it was fast enough that you could do bit vector stuff / record packing and
Re: (Score:2)
The channel controllers are a good idea. One benefit is that there need be no real distinction between accessing another CPU's memory and an external RAM other than the speed/latency. So long as all off-chip access is left to the channel controllers, with the CPU only accessing its on-die memory, variable timing off chip wouldn't be such a big problem. Only the channel controller would need to know.
The SDRAM memory controller itself and all the pins necessary to talk to SDRAM modules can be external to t
I like the new supercomputing graphic (Score:2)
That little Cray thing looks really nice. Nice work, whoever did it. Reminds me of '90s side-scrolling games for some reason.
It's Von Neumann's fault (Score:3)
I read what I thought were the relevant sections of the big PDF file that went along with the article. They know that the actual RAM cell power use would be only 200 kW for an exabyte, but the killer comes when you address it in rows, columns, etc.: then it goes to 800 kW, and when you start moving it off chip, etc., it gets to the point where it just can't scale without running a generating station to supply power.
What if instead of trying to address everything that way, they break up the computing and move it to the data... so that RAM is tied directly to the logic that would use it... it would waste some logic gates, but the power savings would be more than worth it.
Instead of having 8-Kbit rows... just a 16x4-bit lookup table would be the basic unit of computation. Globally read/writable at setup time, but otherwise only accessed via single-bit connections to neighboring cells. Each cell would be capable of computing 4 single-bit operations simultaneously on the 4 bits of input, and passing them to their neighbors.
This bit-processor grid (bitgrid) is Turing complete, and should be scalable to the exaflop scale, unless I've really missed something. I'm guessing somewhere around 20 megawatts for first-generation silicon, then more like 1 megawatt after a few generations.
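For the curious, here is one way to picture the proposed bitgrid as code: a toy simulator in which every cell is a 16-entry by 4-bit LUT wired to its four neighbors by single-bit links. This is only my reading of the description above, not the poster's actual design; the LUT contents and grid size are arbitrary:

/* Toy bitgrid: each cell's 4 input bits (one from each neighbor: N, E,
 * S, W) index a 16-entry x 4-bit lookup table, and the 4 output bits go
 * back out, one per neighbor.  LUTs are written once at "setup time". */
#include <stdio.h>
#include <stdint.h>

#define W 8
#define H 8

static uint8_t lut[H][W][16];   /* per-cell LUT, global write at setup only */
static uint8_t out[H][W];       /* low 4 bits = outputs toward N, E, S, W */

static uint8_t cell_input(int y, int x)
{
    /* gather one bit from each neighbor's output (torus wrap-around) */
    uint8_t n = (out[(y + H - 1) % H][x] >> 2) & 1;  /* north neighbor's S bit */
    uint8_t e = (out[y][(x + 1) % W] >> 3) & 1;      /* east neighbor's W bit  */
    uint8_t s = (out[(y + 1) % H][x] >> 0) & 1;      /* south neighbor's N bit */
    uint8_t w = (out[y][(x + W - 1) % W] >> 1) & 1;  /* west neighbor's E bit  */
    return (uint8_t)(n | (e << 1) | (s << 2) | (w << 3));
}

int main(void)
{
    /* setup: fill every LUT with an arbitrary function, seed one cell */
    for (int y = 0; y < H; ++y)
        for (int x = 0; x < W; ++x)
            for (int i = 0; i < 16; ++i)
                lut[y][x][i] = (uint8_t)((i ^ (i >> 1)) & 0xF);
    out[H / 2][W / 2] = 0xF;

    for (int step = 0; step < 4; ++step) {
        uint8_t next[H][W];
        for (int y = 0; y < H; ++y)
            for (int x = 0; x < W; ++x)
                next[y][x] = lut[y][x][cell_input(y, x)];
        for (int y = 0; y < H; ++y)
            for (int x = 0; x < W; ++x)
                out[y][x] = next[y][x];
    }

    printf("center cell after 4 steps: 0x%X\n", (unsigned)out[H / 2][W / 2]);
    return 0;
}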
Re: (Score:2)
Non-von-Neumann supercomputers have been built; look at this hypercube topology [wikipedia.org].
The problem is the software, we have such a big collection of traditional libraries that it becomes hard to justify starting over in an alternative way.
Re:It's Von Neumann's fault (Score:4, Interesting)
What if instead of trying to address everything that way, they break up the computing and move it to the data... so that RAM is tied directly to the logic that would use it.
It's been tried. See Thinking Machines Corporation [wikipedia.org]. Not many problems will decompose that way, and all the ones that will can be decomposed onto clusters.
The history of supercomputers is full of weird architectures intended to get around the "von Neumann bottleneck". Hypercubes, SIMD machines, dataflow machines, associative memory machines, perfect shuffle machines, partially-shared-memory machines, non-coherent cache machines - all were tried, and all went to the graveyard of bad supercomputing ideas.
The two extremes in large-scale computing are clusters of machines interconnected by networks, like server farms and cloud computing, and shared-memory multiprocessors with hardware cache consistency, like almost all current desktops and servers. Everything else, with the notable exception of GPUs, has been a failure. Even the Cell, the most widely deployed non-standard architecture ever, was only used in the PS3, and was more trouble than it was worth.
Re: (Score:2)
I think you are forgetting about the Roadrunner supercomputer [wikipedia.org] which has 12,960 PowerXCell processors. It was #1 on the supercomputer Top 500 in 2008. It's still at #7 as of November 2010.
Re: (Score:2)
All of the examples you all gave to this point are still conventional CPUs with differences in I/O routing.
I'm proposing something with no program counter, no stack, etc... just pure logic computation.
And no, it's not an FPGA because those all have lots of routing as well.
I think he's wrong. (Score:1)
Sounds familiar... (Score:2)
I liked this computer before, when it was called a Beowulf cluster.
Re: (Score:1)
Indeed. And when it was cheaper than supercomputers. And when supercomputer vendors looked down their noses at it.
67 Megawatts? (Score:3)
That doesn't seem like a show stopper. In the 1950s, the US Air Force built over 50 vacuum tube SAGE computers for air defense. Each one used up to 3 MW of power and probably wasn't much faster than an 80286. They didn't unplug the last one until the 1980s.
If they get their electricity wholesale at 5 cents/kWh, 67 MW would cost about $30,000,000 per year. That's steep, but probably less than the cost to build and staff the installation.
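The $30,000,000 figure checks out; a trivial sketch of the arithmetic, using the wholesale rate quoted above:

/* 67 MW continuous at 5 cents/kWh:
 * 67,000 kW * 8760 h/yr * $0.05/kWh is roughly $29 million per year. */
#include <stdio.h>

int main(void)
{
    const double megawatts       = 67.0;
    const double hours_per_year  = 8760.0;
    const double dollars_per_kwh = 0.05;

    double cost = megawatts * 1000.0 * hours_per_year * dollars_per_kwh;
    printf("Annual electricity cost: $%.1f million\n", cost / 1e6);
    return 0;
}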
Re: (Score:2)
As the article says, it's not the power requirements but the heat that worries them.
67 MW of heat spread out in 50 buildings is ok; 67 MW of heat in a shared-memory device that needs to be physically small and compact for latency reasons may make it impossible.
No worries... (Score:2)
Along with advancements such as multitasking in the next generation of GPUs (yes, they can't actually multitask yet, but when they do it'll be killer for a few reasons), and a shared memory with the CPU (by combining
He forgets about software developments (Score:2)
Yes, a forecast with CURRENT technology (Score:2)
Yes but what about your desktop? (Score:2)
Well said. (Score:1)
Well said, sir.
Jim
A virtual cloud-based supercomputer? (Score:3, Funny)
Re: (Score:1)
Re: (Score:1)
Mod parent up. For those whose head this has flown over, distributed supercomputing is not a new idea. It has been implemented, most famously with SETI@home, for quite a while. "Cloud" is merely a new word for an old concept.
Re: (Score:1)
Re: (Score:2)
- What about a supercomputer made out of FPGAs?
It's been done... more than once, or twice.
New FPGA-based Supercomputer in Scotland [insidehpc.com]
SGI Builds World's Largest FPGA Supercomputer [sgi.com]
Use computing nodes as electric heaters (Score:2)
I disclosed this sort-of-cogeneration idea before on the open manufacturing list so that no one could patent it, but for years I've been thinking that the electric heaters in my home should be supercomputer nodes (or doing other industrial process work), controlled by thermostats (or controlled by some algorithm related to expectations of heat needs).
When we want heat, the processors click on and do some computing and we get the waste heat to heat our home. When the house is warm enough, they shut down. The
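The control loop being described is basically a thermostat with hysteresis deciding whether a node should pull work. A minimal sketch of that policy, where the setpoints, the sensor and the job hooks are all hypothetical placeholders:

/* Thermostat-style duty control for a "compute heater" node: below the
 * low setpoint, pull work and burn cycles; above the high setpoint, go
 * idle.  read_room_temp_c() and run_work_unit() are hypothetical hooks
 * standing in for a real sensor and a real job queue. */
#include <stdio.h>

static double room_temp = 17.0;                        /* fake sensor state */

static double read_room_temp_c(void) { return room_temp; }
static void run_work_unit(void) { room_temp += 0.4; }  /* computing makes heat */
static void stay_idle(void)     { room_temp -= 0.2; }  /* house slowly cools */

int main(void)
{
    const double low_c = 19.0, high_c = 21.0;          /* hysteresis band (assumed) */
    int heating = 0;

    for (int minute = 0; minute < 60; ++minute) {
        double t = read_room_temp_c();
        if (t < low_c)  heating = 1;
        if (t > high_c) heating = 0;

        if (heating) run_work_unit(); else stay_idle();
        printf("minute %2d: %4.1f C, %s\n", minute, t,
               heating ? "computing (heating)" : "idle");
    }
    return 0;
}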