Supercomputers' Growing Resilience Problems

Posted by samzenpus on Wednesday November 21, 2012 @08:01PM from the a-thousand-potential-cuts dept.

angry tapir writes "As supercomputers grow more powerful, they'll also grow more vulnerable to failure, thanks to the increased amount of built-in componentry. Today's high-performance computing (HPC) systems can have 100,000 nodes or more, with each node built from multiple components of memory, processors, buses and other circuitry. Statistically speaking, all these components will fail at some point, and they halt operations when they do so, said David Fiala, a Ph.D. student at North Carolina State University, during a talk at SC12. Today's techniques for dealing with system failure may not scale very well, Fiala said."

This discussion has been archived. No new comments can be posted.

Hardly A New Problem (Score:5, Informative)

by MightyMartian ( 840721 ) writes: on Wednesday November 21, 2012 @08:04PM (#42061787) Journal

Strikes me as a return to the olden days of vacuum tubes and early transistor computers, where component failure was frequent and brought everything to a halt while the bad component was hunted down.
In the long run if you're running tens of thousands of nodes, then you need to be able to work around failures.

"and they halt operations when they do so" (Score:5, Informative)

by brandor ( 714744 ) writes: on Wednesday November 21, 2012 @08:29PM (#42061967)

This is only true in certain types of supercomputers. The only one we have that will do this is an SGI UV-1000. It surfaces groups of blades as a single OS image. If one goes down, the kernel doesn't like it.
The rest of our supercomputers are clusters and are built so that node deaths don't affect the cluster at large. Someone may need to resubmit a job, that's all. If they are competent, they won't even lose all their progress, because they'll be using checkpointing.
Sensationalist titles are sensationalist I guess.
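
To make the checkpointing point concrete, here is a minimal single-process sketch. It is an illustration only: the function name, the pickle file, and the toy "sum the integers" workload are all assumptions, not anything a particular cluster or scheduler provides. The idea is that a resubmitted job picks up from the last saved state instead of from step zero.

```python
import os
import pickle


def run_with_checkpoints(total_steps, checkpoint_path, every=100):
    """Sum the integers 0..total_steps-1, saving progress every `every` steps.

    Hypothetical toy workload; a real job would save its own state here.
    """
    # Resume from the last checkpoint if one exists, otherwise start fresh.
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path, "rb") as f:
            state = pickle.load(f)
    else:
        state = {"step": 0, "acc": 0}

    while state["step"] < total_steps:
        state["acc"] += state["step"]  # stand-in for the real work
        state["step"] += 1
        if state["step"] % every == 0:
            # Write to a temp file and rename it into place, so a crash
            # mid-write never leaves a truncated checkpoint behind.
            tmp_path = checkpoint_path + ".tmp"
            with open(tmp_path, "wb") as f:
                pickle.dump(state, f)
            os.replace(tmp_path, checkpoint_path)
    return state["acc"]
```

If the node dies at step 250 with `every=100`, the rerun starts from step 200 rather than from scratch. The cost is the extra I/O on every checkpoint, which is the trade-off the later comments argue about.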

Re:Old problem (Score:2, Informative)

by Anonymous Coward writes: on Wednesday November 21, 2012 @09:18PM (#42062353)

you start the job over.
You make sure that a single job's run time x the number of nodes is not so large that the chance of that job running to completion is unreasonably low.
On the previous ones I worked on, the 60% job completion rate was around 100 nodes for 5 days; that comes from the chance of a single node surviving a given day being .999 (you lose 1 out of 1000 nodes each day to something). The math is rather simple: 0.999^500 ≈ 60%. And in general you don't put in dual power supplies and you don't mirror the disks; rerunning the jobs that failed is cheaper than increasing the node price to add things that only marginally improve reliability and also increase physical size.
If you have a single process bigger than that, you need to set up a checkpointing system.
If you can split big jobs into lots of smaller pieces that can be pretty quickly put together at the end you do so.
On the previous one I was on they used both tricks depending on the exact nature of what was being processed.
For the most part it is not a complicated problem unless you expect unreasonably low failure rates and don't deal with reality.
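
The arithmetic in this comment can be checked with a few lines. This sketch assumes, as the comment implicitly does, that node failures are independent and that any single node failure kills the whole job, and it reads 0.999 as the per-node probability of surviving one day. The function name is made up for illustration.

```python
def job_success_probability(node_daily_survival, nodes, days):
    """Chance that every node survives every day of the run.

    Assumes independent failures and that one failed node kills the job.
    """
    return node_daily_survival ** (nodes * days)


p = job_success_probability(0.999, nodes=100, days=5)
print(f"completion chance: {p:.3f}")  # about 0.606, i.e. roughly 60%
```

So 100 nodes for 5 days gives about a 60% chance of finishing and about a 40% chance of having to rerun. The same function shows how quickly the odds fall as node-days grow.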

Re:Hardly A New Problem...and thus has been fixed (Score:4, Informative)

by markhahn ( 122033 ) writes: on Wednesday November 21, 2012 @11:37PM (#42063371)

"hegemonous", wow.
I think you're confusing high-availability clustering with high-performance clustering. In HPC, there are some efforts at making single jobs fault-tolerant, but it's definitely not widespread. Checkpointing is the standard, and it works reasonably, though it is an IO-intensive way to mitigate failure.

Re:"and they halt operations when they do so" (Score:2, Informative)

by Anonymous Coward writes: on Thursday November 22, 2012 @12:20AM (#42063613)

Pretty much all MPI-based codes are vulnerable to single node failure. Shouldn't be that way but it is. Checkpoint-restart doesn't work when the time to write out the state is greater than MTBF. The fear is that's the path we're on, and will reach that point within a few years.
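
A rough sketch of why scale makes this worse, under stated assumptions: node failures are independent with a constant rate, so the whole-system MTBF is the per-node MTBF divided by the node count. The figures below (a per-node MTBF of 1000 days and a half-hour checkpoint write) are illustrative guesses, not measurements from any real machine, and the function names are hypothetical.

```python
def system_mtbf_hours(node_mtbf_hours, nodes):
    """Whole-system MTBF when independent node failure rates simply add up."""
    return node_mtbf_hours / nodes


def checkpointing_viable(checkpoint_write_hours, node_mtbf_hours, nodes):
    """True if a checkpoint can be written faster than failures arrive."""
    return checkpoint_write_hours < system_mtbf_hours(node_mtbf_hours, nodes)


NODE_MTBF_HOURS = 1000 * 24  # assumed: one failure per node per 1000 days

for nodes in (100, 10_000, 100_000):
    mtbf = system_mtbf_hours(NODE_MTBF_HOURS, nodes)
    ok = checkpointing_viable(0.5, NODE_MTBF_HOURS, nodes)
    print(f"{nodes:>7} nodes: system MTBF {mtbf:.2f} h, checkpointing viable: {ok}")
```

With these assumed numbers, 100 nodes see a failure about every 240 hours, while 100,000 nodes see one about every 0.24 hours (roughly 14 minutes), which is shorter than the half-hour checkpoint write. That is the crossover the comment is worried about.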
