Supercomputers' Growing Resilience Problems


Posted by samzenpus on Wednesday November 21, 2012 @08:01PM from the a-thousand-potential-cuts dept.

angry tapir writes "As supercomputers grow more powerful, they'll also grow more vulnerable to failure, thanks to the increased amount of built-in componentry. Today's high-performance computing (HPC) systems can have 100,000 nodes or more, with each node built from multiple components of memory, processors, buses and other circuitry. Statistically speaking, all these components will fail at some point, and they halt operations when they do so, said David Fiala, a Ph.D. student at North Carolina State University, during a talk at SC12. Today's techniques for dealing with system failure may not scale very well, Fiala said."


Re:Hardly A New Problem...and thus has been fixed (Score:4, Insightful)

by poetmatt ( 793785 ) writes: on Wednesday November 21, 2012 @09:04PM (#42062245) Journal

The reality of homogeneous computing is that failure is of almost no concern. If 1 in 1000 nodes fails, you lose 1/1000th of your capability. Everything doesn't just instantly come crashing down. That's literally the purpose of basic cluster technology from probably 10 years ago.
How can they act like this is a new or magic issue? It doesn't exist if HPC people know what they're doing. Hell, usually they keep a known quantity of extra hardware out of use so that they can switch something on as necessary when things fail.

Not Really New (Score:4, Insightful)

by Jah-Wren Ryel ( 80510 ) writes: on Wednesday November 21, 2012 @11:07PM (#42063199)

The joke in the industry is that supercomputing is a synonym for unreliable computing. Stuff like checkpoint-restart was basically invented on super-computers because it was so easy to lose a week's worth of computations to some random bug. When you have one-off systems or even 100-off systems you just don't get the same kind of field testing that you get with regular off-the-shelf systems that sell in the millions.
Now that most "super-computers" are just clusters of off-the-shelf systems we get a different root cause, but the results are the same. The problem now seems to be that because the system is so distributed, so is the state of the system: with a thousand nodes you've got a thousand sets of processes and RAM to checkpoint, and you can't do the checkpoints local to each node because if the node dies, you can't retrieve the state of that node.
On the other hand, I am not convinced that the overhead of checkpointing to a neighboring node once every few hours is really all that big of a problem. Interconnects are not RAM speed, but with gigabit+ speeds you should be able to dump the entire process state from one node to another in a couple of minutes. Back-of-the-napkin calculations say you could dump 32GB of RAM across a gigabit ethernet link in 10 minutes with more than 50% margin for overhead. Doing that once every few hours does not seem like a terrible waste of time.
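
The back-of-the-napkin figure above can be checked with a few lines of arithmetic. This is only a rough sketch, assuming decimal units (1 GB = 10^9 bytes, 1 Gbit/s = 10^9 bits/s) and an ideal link with no protocol overhead; the function name is made up for illustration.

```python
def transfer_seconds(size_gb: float, link_gbit_per_s: float) -> float:
    """Ideal time to move size_gb gigabytes over a link, ignoring protocol overhead."""
    bits = size_gb * 8e9                      # assumed: 1 GB = 10^9 bytes
    return bits / (link_gbit_per_s * 1e9)     # assumed: 1 Gbit/s = 10^9 bits/s


ideal = transfer_seconds(32, 1.0)   # 32 GB over gigabit ethernet
budget = 10 * 60                    # the 10-minute window from the comment
margin = 1 - ideal / budget         # fraction of the window left for overhead

print(f"ideal transfer: {ideal:.0f} s, margin in a 10-minute window: {margin:.0%}")
```

Under those assumptions the ideal transfer takes a bit over four minutes, so more than half of the 10-minute window is left for overhead, which matches the estimate.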

Re:Hardly A New Problem (Score:2, Insightful)

by Anonymous Coward writes: on Thursday November 22, 2012 @02:12AM (#42064105)

I think you're missing the fact that when a node dies in the middle of a 1024-core job that has been running for 12h you normally lose all the MPI processes and, unless the job has been checkpointed, everything that has been computed so far.

It's not just about hunting and replacing the dead nodes, it's about the jobs' resilience to failure.
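
To make the point about job resilience concrete, here is a rough sketch of how the chance of a job finishing without any node failure shrinks as the job spans more nodes. It assumes independent, exponentially distributed node failures, and the per-node MTBF is a made-up number for illustration, not a measurement of any real system.

```python
import math


def job_survival_probability(nodes: int, hours: float, node_mtbf_hours: float) -> float:
    """Chance that none of `nodes` fails during `hours`, assuming independent
    exponential failures with the given per-node MTBF (a modeling assumption)."""
    return math.exp(-nodes * hours / node_mtbf_hours)


# Hypothetical per-node MTBF of 5 years, chosen only for illustration.
NODE_MTBF_HOURS = 5 * 365 * 24

for nodes in (1, 64, 1024):
    p = job_survival_probability(nodes, 12, NODE_MTBF_HOURS)
    print(f"{nodes:5d} nodes, 12 h: P(no node failure) = {p:.3f}")
```

Whatever MTBF is plugged in, the probability falls exponentially with node count, which is why an uncheckpointed MPI job that loses everything on a single failure behaves so differently from a cluster that merely loses 1/1000th of its capacity.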
