Forgot your password?
typodupeerror
Supercomputing Hardware

Supercomputers' Growing Resilience Problems 112

Posted by samzenpus
from the a-thousand-potential-cuts dept.
angry tapir writes "As supercomputers grow more powerful, they'll also grow more vulnerable to failure, thanks to the increased amount of built-in componentry. Today's high-performance computing (HPC) systems can have 100,000 nodes or more — with each node built from multiple components of memory, processors, buses and other circuitry. Statistically speaking, all these components will fail at some point, and they halt operations when they do so, said David Fiala, a Ph.D student at the North Carolina State University, during a talk at SC12. Today's techniques for dealing with system failure may not scale very well, Fiala said."
This discussion has been archived. No new comments can be posted.

Supercomputers' Growing Resilience Problems

Comments Filter:
  • by poetmatt (793785) on Wednesday November 21, 2012 @09:04PM (#42062245) Journal

    The reality of hegemonous computing is that failure is almost of no concern. If you have 1/1000 nodes fail, you lose 1/1000th of your capability. Everything doesn't just instantly crash down. That's literally the purpose of basic cluster technology from probably 10 years ago.

    How do they act like this is a new, or magic issue? It doesn't exist if HPC people know what they're doing. Hell, usually they keep a known quantity of extra hardware out of use so that they can switch something on if things fail as necessary.

  • Not Really New (Score:4, Insightful)

    by Jah-Wren Ryel (80510) on Wednesday November 21, 2012 @11:07PM (#42063199)

    The joke in the industry is that supercomputing is a synonym for unreliable computing. Stuff like checkpoint-restart was basically invented on super-computers because it was so easy to lose a week's worth of computations to some random bug. When you have one-off systems or even 100-off systems you just don't get the same kind of field testing that you get regular off-the-shelf systems that sell in the millions.

    Now that most "super-computers" are mostly just clusters of off-the-shelf systems we get a different root cause but the results are the same. The problem now seems to be that because the system is so distributed so is the state of the system - with a thousand nodes you've got a thousand sets of processes and ram to checkpoint and you can't do the checkpoints local to each node because if the node dies, you can't retrieve the state of that node.

    On the other hand, I am not convinced that the overhead of checkpointing to a neighboring-node once every few of hours is really all that big of a problem. Interconnects are not RAM speed, but with gigabit+ speeds you should be able to dump the entire process state from one node to another in a couple of minutes. Back-of-the-napkin calculations say you could dump 32GB of ram across a gigabit ethernet link in 10 minutes with more than 50% margin for overhead. Doing that once every few hours does not seem like a terrible waste of time.

  • by Anonymous Coward on Thursday November 22, 2012 @02:12AM (#42064105)
    I think you're missing the fact that when a node dies in the middle of a 1024-core job that has been running for 12h you normally lose all the MPI processes and, unless the job has been checkpointed, everything that has been computed so far.

    It's not just about hunting and replacing the dead nodes, it's about the jobs' resilience to failure.

All warranty and guarantee clauses become null and void upon payment of invoice.

Working...