Google Cloud Hardware

How Reliable Are Modern CPUs? (theregister.com) 64

Slashdot reader ochinko (user #19,311) shares The Register's report about a recent presentation by Google engineer Peter Hochschild. His team discovered machines with higher-than-expected hardware errors that "showed themselves sporadically, long after installation, and on specific, individual CPU cores rather than entire chips or a family of parts." The Google researchers examining these silent corrupt execution errors (CEEs) concluded "mercurial cores" were to blame: CPUs that miscalculated occasionally, under different circumstances, in a way that defied prediction... The errors were not the result of chip architecture design missteps, and they're not detected during manufacturing tests. Rather, Google engineers theorize, the errors have arisen because we've pushed semiconductor manufacturing to a point where failures have become more frequent and we lack the tools to identify them in advance.

In a paper titled "Cores that don't count" [PDF], Hochschild and colleagues Paul Turner, Jeffrey Mogul, Rama Govindaraju, Parthasarathy Ranganathan, David Culler, and Amin Vahdat cite several plausible reasons why the unreliability of computer cores is only now receiving attention, including larger server fleets that make rare problems more visible, increased attention to overall reliability, and software development improvements that reduce the rate of software bugs. "But we believe there is a more fundamental cause: ever-smaller feature sizes that push closer to the limits of CMOS scaling, coupled with ever-increasing complexity in architectural design," the researchers state, noting that existing verification methods are ill-suited for spotting flaws that occur sporadically or as a result of physical deterioration after deployment.

Facebook has noticed the errors, too. In February, the social ad biz published a related paper, "Silent Data Corruption at Scale," that states, "Silent data corruptions are becoming a more common phenomena in data centers than previously observed...."

The risks posed by misbehaving cores include not only crashes, which the existing fail-stop model for error handling can accommodate, but also incorrect calculations and data loss, which may go unnoticed and pose a particular risk at scale. Hochschild recounted an instance where Google's errant hardware conducted what might be described as an auto-erratic ransomware attack. "One of our mercurial cores corrupted encryption," he explained. "It did it in such a way that only it could decrypt what it had wrongly encrypted."

How common is the problem? The Register notes that Google's researchers shared a ballpark figure "on the order of a few mercurial cores per several thousand machines," similar to the rate reported by Facebook.
  • AI (Score:2, Interesting)

    by Anonymous Coward
    Training networks on faulty CPUs should be fun.
    • Re:AI (Score:5, Interesting)

      by dogsbreath ( 730413 ) on Saturday June 05, 2021 @05:12PM (#61458018)

      Might never be at all apparent. An intermittent calculation error would be lost internally and would likely just modify a neural net's accuracy or some other metric. Noise.

      Depends how hard and repeatable the error becomes.

      • For an AI that is itself built on a learning algorithm it wouldn't matter much: you can't tell whether it was the algorithm or the hardware that failed.

        For monetary handling it really matters; a cent off on a transaction can become a legal issue.

        • I've heard of some money transaction systems that are set up to have the calculations processed on three different systems, after which the results are compared. The calculations in these transactions are fairly simple, but the monetary impact of a hardware-fault miscalculation could be catastrophic.
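
          A minimal Python sketch of that idea, with a hypothetical process_transaction() standing in for the real settlement calculation; in practice each run would happen on a separate machine:

              from collections import Counter
              from decimal import Decimal

              def process_transaction(amount: str, rate: str) -> Decimal:
                  # Stand-in for the real settlement calculation.
                  return (Decimal(amount) * Decimal(rate)).quantize(Decimal("0.01"))

              def redundant_process(amount: str, rate: str, runs: int = 3) -> Decimal:
                  # Run the same calculation three times and take the majority result;
                  # if no two runs agree, escalate instead of guessing.
                  results = [process_transaction(amount, rate) for _ in range(runs)]
                  value, count = Counter(results).most_common(1)[0]
                  if count < 2:
                      raise RuntimeError("No two results agree; flag for manual review")
                  return value

              print(redundant_process("199.99", "1.0375"))  # 207.49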

    • Training networks on faulty CPUs should be fun.

      NNs are trained on GPUs, not CPUs. They often use FP16 and sometimes even FP12, because accuracy isn't important. The training data is often noisy. An occasional flipped bit isn't going to matter.

  • by 93 Escort Wagon ( 326346 ) on Saturday June 05, 2021 @04:50PM (#61457966)

    Well there's your problem - you need to switch to git.

    • Re:Mercurial cores? (Score:5, Interesting)

      by Dutch Gun ( 899105 ) on Saturday June 05, 2021 @05:07PM (#61458006)

      I really liked Mercurial. :( But git emerged as the clear winner, and so I switched over.

      Anyhow, back on topic. I've worked on MMOs and other large-scale multiplayer games before, and it's pretty fascinating to see the diagnostics come in. For instance, you may see a crash report pointing inside a function with no suspicious code that could even cause a crash in theory. At some point, you can do little but assume a certain percentage of machines are simply buggy (sometimes overclocked beyond their capacity), producing results that other machines wouldn't see. You're running identical code on many thousands or even millions of machines, so you can almost always tell the legitimate bugs and crashes from spurious results by the number of similar errors.

      When you see the same machine producing spurious results, it definitely points to hardware flaws. In some cases, when customers sent us bug reports, our support teams were able to work with them to identify their own faulty hardware.

      • Enlightening. So MMOs are highly instrumented and monitored, plus they have vocal user bases. Interesting that you are able to differentiate errors and point to hardware issues.

        • Re:Mercurial cores? (Score:5, Interesting)

          by Dutch Gun ( 899105 ) on Saturday June 05, 2021 @05:38PM (#61458096)

          There was little instrumentation on the client side; it's sort of unnecessary for an MMO, since everything the client does is, by nature, sent to the servers. There was a crash reporter, which users could opt to send. It sent us information about the hardware and the callstack, registers, some memory dumps, etc. (all stuff you can gather through existing OS APIs), and only what we needed to help diagnose and fix the crash.

          Naturally, the server-side data is tracked comprehensively, since that's how designers figure out trends in the game, what's popular or not, and so on. And obviously, they want to track what items people are spending money on. But that shouldn't be a surprise. MMOs are live commercial services, so it would be weird if it weren't "highly monitored". Note: the only data available to the entire team is crash reports. Private data is considered privileged, and only a few specialists have access to it.

          And yes, vocal user bases. Very vocal. But that sort of passion about a game is fun as well.

          • I get it, and I was thinking "server side" wrt instrumentation. Fascinating from a sysadmin POV.

            Lots of applications don't have that level of operational insight.

  • Most production server implementations would never detect mercurial CPU errors. If an error happens to cause a machine crash, the server is simply rebooted, or destroyed and re-instantiated. Quite likely the errors would simply be data corruption that may or may not be apparent to anyone. If it is visible and noticed, it is doubtful that it would be attributed to a mercurial CPU; more likely it would be called an unknown software bug because, after all, it couldn't be the hardware.

    How bad does the problem ha

    • Acceptable reliability depends on what the result is used for. Botching a few pixels in a video is one thing; botching an interest rate calculation at a central bank is a different thing.

      See the Pentium FDIV bug: this sort of thing has happened before.

      It sounds like diffusion is getting to be a problem. We knew this was going to happen eventually.

      • by rpervinking ( 1090995 ) on Saturday June 05, 2021 @05:33PM (#61458084)
        From what I've been told and read, mainframe processors have extensive self-checking circuitry. They are designed with the intention of catching almost all errors within one cycle, and the rest within two cycles. Real companies with real responsibilities, like banks, used to use only mainframes, this being one of the reasons. When companies like Google and Facebook buy vast numbers of the cheapest processors they can get, well, they get what they pay for. Google has always hyped the resilience of its approach to individual failures, but I think they always assumed failures would be hard stops, not transient miscalculations. And to reliably get hard stops when errors occur, you have to buy mainframes, not cheap chipsets.
        • Good point.

          Some exec says we can save $$$ in IT costs by using commodity computing (er, the cloud). Calculation certainty and error bounding may never enter the conversation, at least not with regard to calculation reliability.

          • by Anonymous Coward

            Lithography is not an exact science -- no two chips are ever exactly the same, atom by atom. There are always variations from chip to chip, and even more from wafer to wafer.

            If a code sequence can detect a problem with a chip early enough during the manufacturing process, the CPU manufacturer can easily disable the broken cores and sell the chip as a lesser model, or just discard that particular die entirely, before wasting time and materials on further building out the bad chip.

            Google should get in contact with

            • by dogsbreath ( 730413 ) on Saturday June 05, 2021 @10:06PM (#61458600)

              The Google research seems to say that the chips are OK when they leave the factory, and that there are no known predictors for which parts will eventually fail in deployed service. Also, the failures are not exactly reproducible once they are identified. Further, they don't yet say much about the physics behind the failures beyond some hand-waving worries about increased layout complexity and increased feature density. I am reminded of fused PROMs from the 1970s that suffered bit rot due to metal migration across the blown link.

              The Google paper more or less agrees with what you say about improving production testing, but given the above, it may not be so easy.

              The paper lacks a lot of empirical detail. Nobody wants to say how bad the situation is, and there are likely research and intellectual-property issues.

               

        • Real companies with real responsibilities

          Surely those kinds of companies aren't in business anymore. Gotta compete, you know.

      • Yeah, reading a lot about nanometer production tech and wondering why we aren't seeing more problems. I guess we are.

        Our hardware has gone from very reliable to not quite sure.

        Then again, we have been here before. Early computer systems, say up to the mid or late 70s, had lots of issues with flakey individual transistors and components. The advent of LSI, VLSI and so on really improved things.

        We have gone through a period of extremely high reliability and are pushing the envelope again.

        • by AuMatar ( 183847 )

          On this one we may be stuck for a while longer. We're coming up against harder physical limits than before. Of course, you could always do fault-tolerant computing (say, web front ends) on fast CPUs with low-nm transistors, and fault-intolerant computing (like banking) on slower but safer higher-nm transistors. It would be interesting to see how high this error rate gets and how we adapt to it.

            If you look at the ridiculous multi-patterning exposures they need in order to use 193nm light to expose modern chips, you'd wonder how they make a single functioning processor.

            As EUV comes online in more plants it'll get better, but even EUV, with its 13.5nm light, is behind the feature size. It's pretty much magic getting modern chips made.

      • botching an interest rate calculation at a central bank is a different thing.

        You don't need more reliable hardware to handle that. Just add some redundancy: run the calculation on two different computers and compare the results.

        Some mission-critical systems already use redundant calculations. The flight control computers on the Space Shuttle ran as a redundant set of four, with a fifth machine as backup.
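
        A minimal Python sketch of that dual-run idea, using two worker processes as stand-ins for two different computers; the names rate_calculation and run_twice_and_compare are hypothetical:

            from concurrent.futures import ProcessPoolExecutor

            def rate_calculation(balance_cents: int, basis_points: int) -> int:
                # Stand-in for the real interest calculation, in integer cents.
                return balance_cents * basis_points // 10_000

            def run_twice_and_compare(balance_cents: int, basis_points: int) -> int:
                # Submit the same job to two independent worker processes
                # (stand-ins for two machines) and compare the results.
                with ProcessPoolExecutor(max_workers=2) as pool:
                    first = pool.submit(rate_calculation, balance_cents, basis_points)
                    second = pool.submit(rate_calculation, balance_cents, basis_points)
                    if first.result() != second.result():
                        raise RuntimeError("Mismatch: do not trust either result")
                    return first.result()

            if __name__ == "__main__":
                print(run_twice_and_compare(10_000_000, 425))  # 425000 cents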

  • Cosmic Rays (Score:5, Interesting)

    by AlwayRetired ( 8182838 ) on Saturday June 05, 2021 @05:15PM (#61458026)
    Probably the same issue that memory has with random bit changes
    • Re:Cosmic Rays (Score:5, Informative)

      by AuMatar ( 183847 ) on Saturday June 05, 2021 @06:15PM (#61458152)

      Not quite. CPUs do have issues with cosmic rays, but shielding can prevent that (at some cost). This is about quantum tunneling causing electrons to flow through gates they shouldn't, producing incorrect calculations. Any fix for that is yet to be determined, other than walking back to larger transistors.

      • They don't know that it's quantum tunnelling. It could just be that a transistor in the die has degraded and is too close to its margins. Look up metastability in flip-flops to see how this can cause real problems.
      • CPUs do have issues with cosmic rays, but shielding can prevent that (at some cost).

        Cosmic rays are able to pass through the atmosphere, which has an areal density of about 1 kg/cm2, roughly equivalent to a 10 m water column. I don't think adding significant shielding is trivial, unless you mean moving data centers underground.

      • by AmiMoJo ( 196126 )

        We can fix a lot of this with some careful verification of results. Often you can verify if an answer is correct much faster than you can calculate it. You can also do things like range checking to make sure values are not too far from what you expect, even if you can't be sure they are exactly right.

        At the moment, most code just assumes that the calculation always produces the right result. There are specialist applications where that isn't done, e.g. some automotive stuff.
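
        A minimal Python illustration of both ideas, using sorting as a stand-in workload (checking a sort is much cheaper than producing it); the helper names verify_sorted and range_check are made up:

            import random
            from collections import Counter

            def verify_sorted(original, result):
                # Checking a sort is O(n), while producing it is O(n log n):
                # every adjacent pair must be ordered, and the result must
                # contain exactly the original elements.
                ordered = all(a <= b for a, b in zip(result, result[1:]))
                return ordered and Counter(result) == Counter(original)

            def range_check(value, low, high):
                # Sanity check: the value may still be subtly wrong, but at
                # least it is not wildly outside the expected range.
                return low <= value <= high

            data = [random.random() for _ in range(100_000)]
            result = sorted(data)
            assert verify_sorted(data, result)
            assert range_check(sum(data) / len(data), 0.4, 0.6)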

        • by AuMatar ( 183847 )

          We could; it's just a question of how much you're willing to pay. Old-school mainframes would actually run all computations on three cores and compare the results across all three. That would triple the hardware cost and the electricity cost, but it would solve the problem.

    • by ytene ( 4376651 )
      You beat me to it; I came here to make the same suggestion.

      It does seem possible that the high density of CPU cores per unit of the Earth's surface in a modern data center of the sort that Google or Facebook run [lots of blades] would be conducive to experiencing this sort of issue.

      Should be possible to rule that out with a test data center and some shielding, though, right?
    • by ljw1004 ( 764174 )

      Probably the same issue that memory has with random bit changes

      From the summary, "and on specific, individual CPU cores rather than entire chips or a family of parts".

      If it were cosmic rays then there'd be a uniform distribution of errors across all cores. (I'm assuming that chips don't inherently have some cores more shielded than others).

      • Re:Cosmic Rays (Score:5, Informative)

        by Geoffrey.landis ( 926948 ) on Saturday June 05, 2021 @07:39PM (#61458352) Homepage

        If it were cosmic rays then there'd be a uniform distribution of errors across all cores. (I'm assuming that chips don't inherently have some cores more shielded than others).

        If it were cosmic rays, you'd see the error rate depend on altitude: more errors on processors in Denver and Salt Lake, fewer on processors in Boston and New York.

        You'd also see a variation with location on Earth, with more errors near the north and south magnetic poles, but there may not be enough machines there to collect statistics.

          That cannot be assumed. A cosmic ray energetic enough to cause a sporadic bit flip could equally pass through the Earth and still have enough energy to flip a bit on the other side. It's worthy of more study.

          It can be said that it is happening, but to what extent I don't think anybody can say with any scientific certainty.

          Besides cosmic rays, do not forget the possibility of stray intense magnetic fields or RFI (radio frequency interference).

          • That cannot be assumed. The cosmic rays that would be energetic enough to cause a sporadic bit flip could equally pass through the earth and have energy to flip a bit on the other side.

            You're thinking of neutrinos, I think. Cosmic rays are energetic... but not THAT energetic.

            It can be said that it is happening, but at what extent I don't think anybody can say with any scientific certainty. Besides cosmic rays, do not forget the possibility of stray intense magnetic fields or RFI (radio frequency interference).

        • I have seen the altitude effect directly in telco equipment. The total observed number of failures was not high enough to make a statistically well-supported statement, but anecdotally the (recoverably) failed nodes were often at higher elevations. It's a standard third-semester undergrad physics problem to compute why muons from primary cosmic ray (mostly proton) collisions can reach sea level, under a given model and empirical constants. Answer: they are relativistic, and observed flux increases with
  • AMD seems to be gaining popularity while Intel has hit problems shrinking their die size. Makes we wonder if AMD pushes their sizes down faster than Intel is because they are willing to accept these kinds of errors more frequently, while Intel's problems is that they don't want to sacrifice reliability just to shrink down their die.

    Things that make you go hmm...
    • Sorry, tons of grammar errors WHICH stinks, shouldn't post when tired.
    • by ELCouz ( 1338259 )
      Meltdown was an Intel creation, not AMD's: botched security in speculative execution just to stay ahead of AMD in IPC. How low can they go before people notice?
    • by NateFromMich ( 6359610 ) on Saturday June 05, 2021 @05:32PM (#61458080)
      OTOH Intel is basically overclocking the crap out of their outdated process to try to compete. Both of those things are an issue.
    • I wonder if you have even the slightest clue what you're talking about. It's the other way round: Intel's attempted 10nm feature size was about 10% more aggressive than TSMC's 7nm. It didn't work out.

      • The problems are in existing hardware that's available now, which means 7nm might not be working out as well as you think, even though the parts are shipping.
    • There is a history of die shrinks causing errors, such as the delayed and poorly performing AMD HD 6000-series GPUs, which were originally supposed to be released on TSMC's 32nm process but ran into issues with some interconnects being too small, requiring doubling to compensate. In the end AMD opted to stay on the same 40nm process it had been using, but not before wasting considerable time and money on the more expensive and less suitable process.

      As others have said however, there's more to the proces

  • When you push high current densities through very thin wires and traces, you get electromigration of atoms, leading to even thinner traces and higher resistance, and eventually failure as those traces start to burn out.

    This is not a new problem.

  • Pair cores for critical tasks and compare results between them. Mainframe style.

    • Except with 2 sources, how do you know which is correct?

      True, you could take an exception and reprocess the data again from the start, then hope the re-run doesn't fail as well. I think mainframes allow this, though don't quote me; my skill with mainframes is limited to supporting Red Hat on a mainframe partition, and that was more than 10 years back.

      That reminds me of old-style navigation based on timekeeping. Seafarers kept multiple timepieces, but in odd numbers: 3, 5, 7, etc. That way, the majority
      • There's a large cost to having three sources instead of two: you get only 2/3 of the processing power. And since the error rate is so small, detect-and-retry is good enough for almost all cases, save for hard real-time.

        Also, per the article, the errors are confined to a small portion of cores, so if you detect and remove marginal cores from your fleet, the duplicated setup will sit nearly completely idle.

      • by Spit ( 23158 )

        As you point out, this is a solved problem; Tandem did the same thing too, IIRC. At the moment I have very little use for 8+ cores, but I do have use cases for compute assurance. The way to do it would be to compare the outputs of identical executions: if they don't match, try again, and if they mismatch one more time, mark the parts as failed.
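
        A minimal Python sketch of that compare-retry-quarantine policy, with a hypothetical workload() standing in for the real computation (a real system would pin the two executions to specific cores):

            def workload(x: int) -> int:
                # Stand-in for the computation that needs assurance.
                return x * x

            def assured(x: int, max_attempts: int = 2) -> int:
                # Run the workload twice and compare; retry once on mismatch,
                # and after repeated mismatches flag the hardware as suspect.
                for _ in range(max_attempts):
                    first, second = workload(x), workload(x)
                    if first == second:
                        return first
                raise RuntimeError("Repeated mismatch: mark these cores as failed")

            print(assured(12345))  # 152399025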

  • But.... cloud.
  • Ok, I admit it. Overgrown pre-teen 42-year-old that I am, I laughed.

  • These sorts of errors are bound to increase simply because we're doing much, much more work. My desktop has 24 CPUs running at 3 GHz; that's capable of roughly 6e15 calculations per day. The hardware should be detecting errors: parity checking should identify at least half of the errors and point to failing cores. Sensible hardware would then blow a fuse and reduce the chip to 22 cores (the CPUs are hyperthreaded, so IIUC they ought to fail in pairs).
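
    A minimal Python illustration of the parity idea being described (a sketch of the concept only; CPUs implement this in hardware, and the helper names here are made up):

        def parity_bit(word: int) -> int:
            # Even-parity bit for a 64-bit word: 1 if the number of set bits is odd.
            return bin(word & 0xFFFFFFFFFFFFFFFF).count("1") & 1

        def store(word: int) -> tuple[int, int]:
            # Keep the parity bit alongside the data word.
            return word, parity_bit(word)

        def check(word: int, parity: int) -> bool:
            # Any odd number of flipped bits changes the parity and is detected;
            # an even number of flips slips through, so parity only catches some errors.
            return parity_bit(word) == parity

        w, p = store(0xDEADBEEF)
        assert check(w, p)                 # intact word passes
        assert not check(w ^ (1 << 7), p)  # a single flipped bit is caught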

    • Desktop class computers are scary, from a single-event-upset standpoint.

      "Automotive" processors (for control, not for infotainment), which are "dinosaurs" using 30-50nm processes at blazing speeds up to 300MHz or so, and are designed for harsh environments, have SEU rates on the order of one random fault per 10^7 to 10^8 hours.

      Automotive processors also typically have ECC flash and SRAM (very rarely will they even use DRAM), they will have not just parity but also ECC on on-die buses, sometimes even have EC

  • Just a guess here, but this could be an effect of "Marketing MHz": the mode where idle chips clock at 4.1 GHz with no load, but have to slow down to 3 GHz as soon as you put a load on them.

    Some genius decided: "Just hook the CPU clock to an inverse feedback loop with a thermal sensor. What could go wrong."

    It is possible to lock-step entire chips and vote at the PCIe bus level, as long as the clock isn't hooked up to a thermal sensor. Once you do that, there is no way to prove the chips work reliably

"Everything should be made as simple as possible, but not simpler." -- Albert Einstein

Working...