Supercomputers' Growing Resilience Problems 112
angry tapir writes "As supercomputers grow more powerful, they'll also grow more vulnerable to failure, thanks to the increased amount of built-in componentry. Today's high-performance computing (HPC) systems can have 100,000 nodes or more — with each node built from multiple components of memory, processors, buses and other circuitry. Statistically speaking, all these components will fail at some point, and they halt operations when they do so, said David Fiala, a Ph.D student at the North Carolina State University, during a talk at SC12. Today's techniques for dealing with system failure may not scale very well, Fiala said."
Hardly A New Problem (Score:5, Informative)
Strikes me as a return to the olden days of vacuum tubes and early transistor computers, where component failure was frequent and brought everything to halt while the bad component was hunted down.
In the long run if you're running tens of thousands of nodes, then you need to be able to work around failures.
Re: (Score:1)
As always, if you condense everything, you can expect that one piece in the whole system that breaks, causes failures in the rest of the system.
Re: (Score:2)
Re: (Score:3)
Re: (Score:2)
Re:Hardly A New Problem...and thus has been fixed (Score:4, Insightful)
The reality of hegemonous computing is that failure is almost of no concern. If you have 1/1000 nodes fail, you lose 1/1000th of your capability. Everything doesn't just instantly crash down. That's literally the purpose of basic cluster technology from probably 10 years ago.
How do they act like this is a new, or magic issue? It doesn't exist if HPC people know what they're doing. Hell, usually they keep a known quantity of extra hardware out of use so that they can switch something on if things fail as necessary.
Re: (Score:3)
The reality of hegemonous computing is that failure is almost of no concern. If you have 1/1000 nodes fail, you lose 1/1000th of your capability. Everything doesn't just instantly crash down.
You might think so, but I've seen a configuration with an interconnect fabric that was extremely sensitive to the fallback of individual links to the next lower link speed cause all sorts of havoc cluster wide.
Re: (Score:2)
Is this something that can be partially avoided by using Itanium processors instead of X86? Or has all of the reliability stuff been included in the recent Xenon chips?
I understand that there's little the processor can do if the mobo dies (unless it notices an increase in failed computations right before maybe it could send a signal) but is there any advantage to the type of processor used in some cases?
Just curious...
Cheers
Re: (Score:2)
Is this something that can be partially avoided by using Itanium processors instead of X86? Or has all of the reliability stuff been included in the recent Xenon chips?
No, not related to the processor, just a specific inter-node interconnect fabric implementation. The moral of the story is: just be aware that buying stuff that has just become available and trying to deploy it at scales well beyond what others (including the vendor) have done before leads to an effectively experimental configuration, and is not one that you should expect to behave like a production environment should for a period of time until the kinks are worked out. Of course, this plays hell with the
Re:Hardly A New Problem...and thus has been fixed (Score:4, Informative)
"hegemonous", wow.
I think you're confusing high-availability clustering with high-performance clustering. in HPC, there are some efforts at making single jobs fault-tolerant, but it's definitely not widespread. checkpointing is the standard, and it works reasonably, though is an IO-intensive way to mitigate failure.
Re: (Score:2)
Checkpoint/restart is actually a rather poor workaround as the ration of IO bandwidth to compute performance is shrinking with every new generation of supercomputers. Soon enough we'll spend more time writing checkpoints than doing actual computations.
I personally believe that we'll see some sort of redundancy on the node level in the mid-term future (i.e. the road to exascale), which will sadly require source code-level adaptation.
Re: (Score:2)
The reality of hegemonous computing is that failure is almost of no concern. If you have 1/1000 nodes fail, you lose 1/1000th of your capability.
Yeah, no surprise there. Historically, kings have never cared about what happens to the peons.
Re: (Score:2)
I don't think you understand the nature of the problem.
The more nodes N you use for a computation and the longer the computation runs, the greater the odds that a node will fail before the computation can complete. Small programs ignore the problem and usually work just fine. Should a node crash, the job is re-run from scratch.
More sophisticated programs write out a checkpoint periodically so that if (when) a node fails, it is replaced and the job restarts from the checkpoint. However, that is not without c
easy solution (Score:3)
Just make checkpointing cost zero time. How, have each node, really be a dual node ( like most things in nature are ). So one half is always computing, and the other half is check pointing. Just like video games use of double buffering to do smooth fps. If checkpointing uses less cpu than comps, then swap the cores functionality every N runs to give each core a temperature rest, to cool down, to increase life.
Sure each node is bigger, but it could perhaps scale better, the overhead curve should be way bett
Re: (Score:2)
That doesn't actually make checkpointing take zero time, it allocates it 50% of the time. meanwhile, in that scheme the cluster must be small enough that most of the time, there isn't a failure in a checkpoint interval.
Re: (Score:2)
Well, it seems obvious that you need to distribute your checkpoints. Instead of saving the global state, you must save smaller subsets of it.
Now, writting a program that does that is no easy task. That's why it is a problem. But don't think it can't be solved, because it can.
Re: (Score:2)
I am already assuming that each node saves it's own state to a neighboring node. That speeds up the checkpoints and improves the scalability but doesn't fundamentally address the problem.
Re: (Score:2)
That does address the problem. Once you each node saves its state to enough other nodes, you don't need to make a checkpoint anymore.
But then, you must write code that can work with just partial checkpoints, what is quite hard.
Re: (Score:2)
Saving state IS checkpointing, you know. And you cannot continue computation with missing state other than by re-computing it (which would be considered part of loading the checkpoint). The checkpoint images must also be coherent. Otherwise, you have lost state.
.
In the end, it just means that the maximum useful size of a cluster is limited by the reliability of the nodes.
Ultimately the problem may be sidestepped by going to non-stop style nodes (expensive as hell) but then your scalability is limited by h
Re: (Score:2)
That's the restriction that must be removed. If your program can work with non-coherent subsets of the checkpoint, you won't have this scalability problem. And it is something possible, for any computation.
Re: (Score:2)
I'll be needing to see proof of that one.
Re: (Score:2)
If you nodes communicate only through message passing (and you can compute anything by communicating only through message passing), when restoring a node if you restore any state after receiving or transmitting the last message, it can't make any difference.
But then, that algorithm is not viable. It creates a perfectly descentralized checkpoint, without needing any coherence, but it needs too many checkpoints. But between enforcing coherence only of a single node, and enforcing coherence of the entire compu
Re: (Score:2)
I agree that it can always be decomposed to message passing, even shared memory is just a particular way of passing messages.
However, the state of a node necessarily includes the state of it's peers. If node A is expecting a reply from B while B is expecting a request from A, the computation fails every time.
The only way to manage that without barriers is for the checkpoint to happen upon receipt of EACH message before it is ACKed (which means state is recovered in-part by re-transmission of unACKed message
Re: (Score:2)
You are right. I was overlooking that problem with the algorithm (one must have the checkpoint before it acks the message, otherwise the node can't restore to "any state after receiving the message").
But now that I've tought about it, transactions are way more interesting than I was assuming.
Re: (Score:2)
Re: (Score:2, Insightful)
It's not just about hunting and replacing the dead nodes, it's about the jobs' resilience to failure.
Re: (Score:2)
Re: (Score:2)
ok
if motherboard #325 dies and kills 8 nodes at once, so if that requires a 12 hour job to re-run why not just have always several dozen spare idle nodes waiting to do that 12hr job in 1hr to catch up.
Or are you saying out of 1 million nodes, 1 fails every 1/2 hour, so you need 24 spare to catchup, which is feasable.
Or are you saying failure rate is 1 every minute? which would require 720 spare nodes every hour , which is a lot of new racks being installed daily.
If it scales linearly, then its ok, and wont
Re: (Score:1)
if motherboard #325 dies and kills 8 nodes at once, so if that requires a 12 hour job to re-run why not just have always several dozen spare idle nodes waiting to do that 12hr job in 1hr to catch up.
Let me rephrase, because I cannot seem to get my point across.
The typical distributed-memory machine has several tens of thousands cores, bundled on, say, 4-core CPUs, bundled on, say, 2-CPU mobos. Each of these PCs is a node. Let us assume your optimistic view that you have spare nodes available.
Now, I have a job that takes 12h to complete using 1024 cores, so 128 nodes. The job runs as 1024 processes, 8 per node, which communicate over the interconnect through a software layer known as MPI. Memory i
Re: (Score:2)
Re: (Score:2)
So you would dedicate say 1 or 2 CPUs per node or a node per rack for sniffing all of the intermediate data off of the local highest speed interconnect as it's sent between nodes, and sending it out to a fifo queue on the (or a separate) network to store the intermediate results in case of a failure?
If you had the last couple chunks of data that every node sent to every other node it would make restarting from a failed node much easier as you would just have to reload the data in, recompute it, and compare
Re: (Score:2)
Re: (Score:2)
Suppose that your job is computing along using about 3/4ths of the memory on node 146 (and every other node in your job) when that node's 4th DIMM dies and the whole node hangs or powers off. Where should the data that was in the memory of node 146 come from in order to be migrated to node 187?
There are a couple of usual options: 1) a checkpoint file written to disk earlier, 2) a hot spare node that held the same data but wasn't fully participating in the process.
Option 2 basically means that you throw away
Re: (Score:2)
Most MPP machines that I am familiar with have a system where the status and functionality of all nodes is checked as part of a supervisory routine and mapped out of the system. Bad Node? It goes on the list for somebody to go out and hot swap out the device. Processing load gets swapped to another machine.
Once the new device is in place that same routine brings that now functioning processor back into the system.
That sort of thing has existed for at least 10 years and probably longer.
"they halt operations when they [fail]" (Score:2)
Ah-ha-ha! You wish failing kit would just up and die!
But really, you can't be sure of that. So things like ECC become minimum requirements, so you might at least have an inkling something isn't working quite right. Because otherwise, your calculations may be off and you won't know about it.
And yeah, propagated error can, and on this sort of scale eventually will, suddenly strike completely out of left field and hit like a tacnuke. At which point we'll all be wondering, what else went wrong and we didn't cat
Re: (Score:1)
global warming simulations come to mind.
Componentry? (Score:3)
Re:Componentry? (Score:5, Funny)
Componentry is embiggened, leading to less cromulent supercomputers?
Re: (Score:2)
only true for "noble componentry"
Re: (Score:2)
No.
an increasing amount of componentry has increased componentry with respect to an increasing number of components.
In other words,
componentry > components
increasing amount > increasing number.
Therefore
componenty + increasing amount >> components + increasing number
however
componentry + increasing number =? increasing amount + components
unfortunately, to precisely determine the complexity of the componentry,
more components (or an increasing number of componentry) with resepct to the original sum
Re: (Score:2)
Other dictionaries may show different results.
Re: (Score:1)
Haw haw haw!
Back in the vacuum tube days, engineers wrote estimates on the maximium size of computers given the half-life of a vacuum tube. 50,000 tubes later, it's running for minutes only between breakdowns.
Is this really a problem these days? (Score:1)
Even my small business has component redundancy. It might slow things down, but surely not grind them to a halt. With all the billions spent and people WAY smarter than me working on those things, I really doubt its as bad as TFA makes it out to be.
Whilst obviously tripling cost... (Score:1)
This just says to me that they need to buy three of every component and run a voting system and throw a fail if one is way off the mark.
Re: (Score:1)
They do. The problem is that is a lot of waste, which does not scale well.
With 1000 nodes, triple redundancy means only ~333 nodes are producing results.
In a couple years, we will be up to 10000 nodes, meaning over 6000 nodes are not producing results.
In a few more years we will be up to 100,000 nodes, meaning 60,000 nodes are not producing results.
Those 60000 nodes are using a lot of resources (power, cooling, not to mention cost) and the issue is they need to develop and implement better methods to do th
Re: (Score:2)
How you respond to a failure is a big deal when you get to systems so large that they're statistically likely to have component failures frequently. It's often unacceptable to just throw out the result and start over. The malfunctioning system needs to be taken offline dynamically and the still-working systems have to compute around it without stopping the process. That's a tricky problem.
"and they halt operations when they do so" (Score:5, Informative)
The rest of our supercomputers are clusters and are built so that node deaths don't effect the cluster at large. Someone may need to resubmit a job, that's all. If they are competent, they won't even lose all their progress by using check-pointing.
Sensationalist titles are sensationalist I guess.
Re: (Score:2)
Checkpoints won't scale to future generations. But what is amusing is to see some random ph.D student being cited here instead of the people who actually came to that conclusion some time ago already :)
Re: (Score:2)
Checkpoints will probably stick around for quite some time, but the model will need to change. Rather than serializing everything all the way down to a parallel filesystem, the data could potentially be checkpointed to a burst buffer (assuming a per-node design) or a nearby node (experimental SCR design). Of course, it's correct that even this won't scale to larger systems.
I think we'll probably have problems with getting data out to the nodes of the cluster before we start running into problems with checkp
Re: (Score:2)
Re: (Score:2)
Many supercomputers that utilize specialized hardware just can't take component failure. For example, on a Cray XT5, if a single system interconnect link (SeaStar) goes dead the entire system will come to a screeching halt because with SeaStar all the interconnect routes are calculated at boot and can not update during operation. In any tightly coupled system these failures are a real challenge, not just because the entire system may crash, but if users submit jobs requesting 50,000 cores but only 49,900
Re: (Score:2)
The problem is also with the classical Von Neumann model of the state machine. You can have many nodes that do work, then sync-up at different points as dependencies on a JCL- like program. When you have common nodes running at CPU clock, then the amount of buffer and cache that gets dirty is small, and the sync-time is the largest common denominator amongst the calculating nodes. When you bust that model by having an error in one or more nodes, then the sync can't happen until the last node is caught up. O
Re: (Score:1)
This is one of the primary differences between XT (Seastar) and XE/XK (Gemini) systems. A Gemini based blade can go down and traffic will be routed around it. Blades can even be "hotswapped" and added back into the fabric as long as the replacement is identically configured.
Re: (Score:2, Informative)
Pretty much all MPI-based codes are vulnerable to single node failure. Shouldn't be that way but it is. Checkpoint-restart doesn't work when the time to write out the state is greater than MTBF. The fear is that's the path we're on, and will reach that point within a few years.
Re: (Score:2)
Re: (Score:2)
after the few 1000 fails you will get similar curves for time of fail after new.
Whoever did mod the parent up.... (Score:3)
...doesn't understand the first thing about supercomputers, or even HPC. Currently virtually every HPC application uses MPI. And MPI doesn't take well to failing nodes. The supercomputer as a whole might still work, but the job will inevitably crash and needs to be restarted. HPC apps are usually tightly coupled. That sets them apart from loosely coupled codes such as a giant website (e.g. Google and friends)
Fault tolerance is a huge problem in the community and we don't have the answers yet. Some say tha
Re:ummm, no. (Score:4)
Google is having the same problems that this article describes -- they haven't fixed it either.
If your problem domain can always be broken down into map-reduce, you can easily solve it with a hadoop-like environment to get fault tolerance. If your application falls outside of map-reduce (the applications this article is referring to), you need to start duplicating state (very expensive on systems of this scale) to recover from failures.
On that scale (Score:3)
On that scale, distributed parallelism is key, where the system takes into account downed nodes and removes them from duty until it can return to service, or can easily add a replacement node to handle the stream. That's why Google and Facebook don't go down when a node fails.
Google/FB/etc are Embarassingly Parallel (Score:1)
EP is trivial to deal with.. The problem is with supercomputing jobs that aren't EP, and rely on computations from nodes 1-100 to feed to the next computational step in node 101-200, etc.
The real answer is some form of fault tolerant computing.. think EDAC for memory (e.g. forward error correction) but on a grosser scale. Or, even, designing your algorithm for the ability to "go back N", as in some ARQ communication protocols.
The theory on this is fairly well known (viz Byzantine Generals problem)
The prob
Re: (Score:1)
When you have multiple nodes, you aren't any different than Google.
Google uses Map Reduce but it isn't the only way things get done.
You have standards of coding to deal with the issues. MapReduce is only one of those ways of dealing with the issue.
And for reference, what you describe in your first paragraph is EXACTLY a MapReduce problem. First 100 nodes Map, second hundred nodes Reduce the results. Rinse, repeat.
Re: (Score:3)
And for reference, what you describe in your first paragraph is EXACTLY a MapReduce problem. First 100 nodes Map, second hundred nodes Reduce the results. Rinse, repeat.
No it's not. The problem with your description is the "rinse, repeat" part. He's not talking about repeating with new input data. He's talking about a serialized workload where, for example, the output of the first 100 jobs is the input for the next 100 jobs, which then creates output that is the input for the next 100 jobs. It's not a case of repeating, its a case of serialization where if you have not done state check-point and things crater you have to start from the begining to get back where you we
Re: (Score:2)
Most problems can't be solved by a single map reduce. Map reduce tasks are normally written as a series of map reduce jobs; each map runs on the output of the previous reduce.
Since map reduce jobs write their output to disk at every step, it can be thought of as a form of check pointing. The difference between map reduce and mpi check pointing is that mpi needs to restart the whole job at the checkpoint, but map reduce frameworks can rerun just the work assigned to failed nodes. In the map reduce model, the
Re: (Score:1)
Uh, no..
This kind of problem isn't one where you are rendering frames (each one in parallel) or where your work quanta is large. Think of doing a finite element code on a 3d grid. The value at each cell at time t=n+1 depend on the value of all the adjacent cells at time t=n. The calculation of cell x goes bonk, and you have to stop the whole shebang.
Really? (Score:1)
Haven't these problems already been solved by large-scale cloud providers? Sure, hurricanes take out datacenters for diesel, but Google runs racks with dead nodes until it hit's a percentage where it makes sense to 'roll the truck' so-to-speak and get a tech onsite to repair the racks.
Re: (Score:1)
Pretty much also solved in all but the smallest HPC context.
They mention that overhead for fault revocery can take up 65% of the work. That's why so few large systems bother anymore. They run jobs and if a job dies, oh well the scheduler starts it over. Even if a job breaks twice in a row near the end, you still break even, and that's a relatively rare occurrence (job failing twice is not unheard of, but both times right at the end of a job is highly unlikely.
This is the same approach in 'cloud' applicat
Their methoid is nothing new. (Score:2)
This is the same method used to handle distributed computing with untrusted nodes. Simply hand off the same problem to multiple nodes and recompute if differences arise.
The real solution is going to involve hardware as well. The nodes themselves will become supernodes with built in redundancy.
Re: (Score:1)
Actually, things are much more complex, and as some other poster mentioned, these issues are the continuing subject of research, and are expected by the supercomping community since quite a few years (simply projecting current statistics, the time required to checkpoint a full-machine job is would at some point become bigger thant the MTBF...)
The PhD student mentioned seems to be just one of many working on this subject. Different research teams have different approaches, some trying to hide as much possibl
Oblig Bad Car Analogy (Score:2)
Come on now. We've known how to run V8 engines on 7 or even 6 cylinders for years now. Certainly this technology must be in the public domain by now.
I have a demonstration unit I would be happy to part with for a mere $100K.
Re: (Score:2)
Demonstration units can be had for much less :P
For example a Chrysler 300C [chrysler.com] costs $36,000, and has has a Hemi V8 with Chrysler's Multi-Displacement System [wikipedia.org]
Tandem (Score:2)
Possible Solution (Score:2)
Could they try turning it off then on again?
Not Really New (Score:4, Insightful)
The joke in the industry is that supercomputing is a synonym for unreliable computing. Stuff like checkpoint-restart was basically invented on super-computers because it was so easy to lose a week's worth of computations to some random bug. When you have one-off systems or even 100-off systems you just don't get the same kind of field testing that you get regular off-the-shelf systems that sell in the millions.
Now that most "super-computers" are mostly just clusters of off-the-shelf systems we get a different root cause but the results are the same. The problem now seems to be that because the system is so distributed so is the state of the system - with a thousand nodes you've got a thousand sets of processes and ram to checkpoint and you can't do the checkpoints local to each node because if the node dies, you can't retrieve the state of that node.
On the other hand, I am not convinced that the overhead of checkpointing to a neighboring-node once every few of hours is really all that big of a problem. Interconnects are not RAM speed, but with gigabit+ speeds you should be able to dump the entire process state from one node to another in a couple of minutes. Back-of-the-napkin calculations say you could dump 32GB of ram across a gigabit ethernet link in 10 minutes with more than 50% margin for overhead. Doing that once every few hours does not seem like a terrible waste of time.
Re: (Score:2)
yes, checkpointing is a reasonable answer. no large machines use Gb, though.
Re: (Score:1)
Yeah, I was just using a conservative interconnect that most people here could relate too.
Checkpointing overhead (Score:1)
The point of the article was that as the number of nodes involved in the calculation increases, the frequency at which at least one of them fails increases too (provided that the individual node failure rate is kept constant). Since you want at least one checkpoint between each typical failure, you would therefore have to checkpoint more and more often as the number of nodes is increased. Hence, the overhead involved with checkpointing goes up as the number of nodes involved increases, and with 100 times mo
Re: (Score:2)
The point of the article was that as the number of nodes involved in the calculation increases, the frequency at which at least one of them fails increases too (provided that the individual node failure rate is kept constant). Since you want at least one checkpoint between each typical failure, you would therefore have to checkpoint more and more often as the number of nodes is increased. Hence, the overhead involved with checkpointing goes up as the number of nodes involved increases, and with 100 times more nodes than most clusters use now, this overhead grows to overwhelm the amount of resources used for the actual calculation.
Thanks for spelling that out. I don't know why I missed that reading the article. Your post deserves to be modded +5 informative.
Kei (Score:2)
A machine like Kei in Kobe does live rerouting and migration of processes in the event of node or network failure. You don't even need to restart the affected job or anything. Once the number of nodes and cores exceed a certain level you really have to assume a small fraction are always offline for whatever reason, so you need to design for that.
DARPA's Exascale Report (Score:1)
DARPA's report ( http://users.ece.gatech.edu/mrichard/ExascaleComputingStudyReports/exascale_final_report_100208.pdf [gatech.edu] ) has a lot of interesting information for those who want to read more on exascale computing. I may be a bit biased being a grad student in HPC too, but the linked article didn't impress me.
ENIAC (Score:2)
Reminds me of ENIAC. I went to the Moore School at U of Penn and ENIAC parts were in the basement (may still be there, for all I know). The story was that since it was all vacuum tubes, at the beginning they were lucky to get it to run for more than 10 minutes before a tube blew.
That being said, I can't believe that supercomputers don't have some kind of supervisory program to detect bad nodes and schedule around them.
Phd with no clue... (Score:3)
Then there is the idea of a master dispatcher, which essentially breaks down the application into small chunks of tasks, and then sends those tasks to be calculated/performed on a node in the cluster. If it does not get a corresponding return value from the system it sent the task within a certain ammount of time, it re-sends to another node (and marking the other node as bad and not sending future tasks to it until that value is cleared).
Both of these methods fix the issue of having possible nodes which die on you during computation.
Re: (Score:1)
There is an amazing programming technique called "checkpointing", developed a while ago.
No shit, Sherlock. Now tell me what happens when a 2048-core job needs to write its 4TB of RAM to disk every hour.
Re: (Score:2)
redunancy solution (Score:2)
The guy in the article that says that running the redundant copies on the same nodes would reduce i/o traffic: I'd love to speak to him. There are two options I see:
1) Assuming that there is common source data that both instances need to churn on: so the data isn't redundant so what exactly are you proving by getting the same result? Diddo with CPU, integer unit etc same hardware is not a redundant solution.
2) No shared data but you are generating data on each node: so they still have to chat with eachother
Same song, next verse... (Score:1)
We have addressed issues lie this using various methods over the years. Super computing is just a current area where doing pre-emptive issue resolutions has come in play.
More obvious ways of HELPING (not single item 'resolving') this issue include:
-- Parallelism ... similar to RAID for data storage but in various related areas ... yep, make it out of 'better stuff'. If you have something 'good' make it better, or 'harder' (like radiation hardening or EMP hardening),
-- High Availability and ruggedization
--
Re: (Score:2)
How do you give the work to another node when the failed node contains the only copy of its state (like in an MPI job)? Duplicating the state on multiple nodes is way too expensive.
Re: (Score:2, Informative)
you start the job over.
You make sure that a single job's run time x the number of nodes is not so large that the chance of that job running to completion is not unreasonable.
On the previous ones I worked on the 60% job failure rate was around 100 nodes for 5 days, that comes down to the chance of a single node failing on a given day is .999 (you lose 1 out of 1000 nodes each day from something). The math is rather simple...0.999^500=60%. And in general you don't put dual power supplies, you don't mirr
Re: (Score:2)
You can't checkpoint jobs at this scale. It will take longer to checkpoint a job then to compute an answer. This is further compounded when the job takes several months to run. A 1000 node cluster is very tiny compared to the scale they're talking about.