
Intel Blames 13th, 14th Gen CPU Crashes on Software Bug

Intel has finally figured out why its 13th and 14th generation Core desktop CPUs are repeatedly crashing. From a report: In a forum post on Monday, Intel said it traced the problem to faulty software code, which can trigger the CPUs to run at higher voltage levels. Intel examined a number of 13th and 14th gen desktop processors that buyers had returned. "Our analysis of returned processors confirms that the elevated operating voltage is stemming from a microcode algorithm resulting in incorrect voltage requests to the processor," it says. But in some bad news, Intel still needs a few more weeks to test its fix for the problem. "Intel is currently targeting mid-August for patch release to partners following full validation," it says. The company also recently confirmed that the issue doesn't extend to its mobile processors.

  • by Luckyo ( 1726890 ) on Tuesday July 23, 2024 @11:28AM (#64649238)

    Running a CPU at unexpectedly high voltage degrades the silicon over time. Since a lot of these CPUs were likely being fed too much voltage for prolonged periods, they may be permanently degraded.

    I guess Intel is still trying to figure out everything involved in the issue, since this was clearly a difficult one to troubleshoot due to its intermittency. But it's going to be interesting to see how it chooses to make its customers whole, as its reputation has taken quite a beating because of this issue.

    And we really need a healthy Intel so that it continues to compete with AMD and both companies continue innovating.

    • by AmiMoJo ( 196126 ) on Tuesday July 23, 2024 @11:34AM (#64649258) Homepage Journal

      Extended warranty at a minimum.

      I wonder though, is this really "too much voltage", or is it "we pushed the turbo boost too high to get better benchmark scores, and are going to drop performance a bit"?

      • by Anonymous Coward
        Intel is a chip company. And in my experience dealing with chip designers:

        Software problem: Anything that can be covered up by software.

        There are a lot of hardware bugs that have to be worked around in software, but since hardware is 'hard' and software is 'easy', it becomes a software problem.
    • by Anonymous Coward

      In some cases, yes, it's permanent.

      The question is, how do we know if our CPU is damaged or not or if a problem will show up later?

      I noticed my 13th gen was unstable from the day I got it back in 2022.

      • Did you return it as defective? If not, then you are just part of the problem: customers are sheep.
      • by mysidia ( 191772 )

        The question is, how do we know if our CPU is damaged or not or if a problem will show up later

        CPU Stress testing. Preferably both before and after the update.

        The caveat is that a CPU is a complex beast with 12 billion+ transistors. It's possible for there to be damage that the chosen stress test fails to exercise thoroughly, or whose incorrect operation it fails to expose.
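        The core idea behind "stress test before and after" can be sketched as: run a deterministic workload many times and check that every run produces exactly the same result, since a healthy CPU never gives different answers to the same computation. This is a minimal, single-threaded sketch under stated assumptions; the function names `workload` and `stress_check` and the choice of workload are hypothetical, and real stress tools (Prime95's self-checking torture test, for example) run far heavier work on every core at once. A pass does not prove the chip is undamaged, for exactly the reason given above.

```python
import hashlib
import struct


def workload(seed: int, n: int = 10_000) -> str:
    """Hypothetical deterministic mixed integer/floating-point workload.

    On a healthy CPU, the same seed must always yield the same digest.
    """
    h = hashlib.sha256()
    x = seed
    acc = 0.0
    for i in range(1, n + 1):
        # 64-bit linear congruential step (integer ALU work)
        x = (x * 6364136223846793005 + 1442695040888963407) % 2**64
        # Floating-point work derived from the integer state
        acc = (acc * 0.999 + (x >> 11) / 2**53) / (1.0 + 1.0 / i)
        # Fold both states into the digest so any wrong bit changes it
        h.update(struct.pack("<Qd", x, acc))
    return h.hexdigest()


def stress_check(rounds: int = 20, seed: int = 12345) -> int:
    """Recompute the workload `rounds` times; return how many runs disagreed
    with the first (reference) run. Any nonzero result points at the hardware."""
    reference = workload(seed)
    mismatches = 0
    for _ in range(rounds):
        if workload(seed) != reference:
            mismatches += 1
    return mismatches
```

        On a stable machine, `stress_check()` should return 0 every time; a Python loop only exercises a small slice of the chip, so treat this as an illustration of the method rather than a substitute for a proper stress test.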

    • And we really need a healthy intel so that it continues to compete with AMD and both companies continue innovating.

      Intel hasn't been "healthy" since the early 2000s, when Intel's own lack of health opened the door for AMD64. They are a lazy almost-monopoly, like Boeing... and just as healthy. If you are pinning your hopes on innovation from Intel, you are barking up the wrong tree, sir. Monopoly profits are to be preserved; nobody cares about innovation.

      You are better off letting Intel finish rotting than trying to save them. They literally cannot think of anything other than profit at this point.

  • Tier 1 techs and another YouTube tech channel were accusing Intel of trying to hide a design flaw in the chips. I thought it was a bit of a brash accusation at the time, and it wasn't really backed up by any concrete evidence. Hopefully they come out with updates testing the fixes (once released) now that Intel has released this info.
  • This is only ONE of the causes. I guess a software issue can technically result in a lack of proper insulating coating at the atomic level across oxidizable copper, because that's one of the causes. My piece of shit 14700K can't even maintain a RAM speed above 6000MHz on a $300 board, so hopefully this resolves it.
  • Nothing against Intel, but I am glad I recently switched and use an AMD Ryzen 9 7950X CPU. It works great for fully modded Skyrim and Baldur's Gate 3. I am glad we now have choices.

  • by mmell ( 832646 ) on Tuesday July 23, 2024 @11:36AM (#64649268)
    Division is futile. You will be approximated.
    • If they could have changed the microcode back then, they would have made every floating-point division decode into a set of u-ops a few clocks slower (another Newton iteration or something) that fixes the output, rather than doing a recall.

      And there isn't a 'use more voltage this time' u-op. Intel's PR team is spinning a lot of yarn.
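      The "another iteration of Newton" idea refers to Newton-Raphson refinement of a reciprocal: compute a / b as a * (1/b), and improve an estimate x of 1/b with x <- x * (2 - b * x), which roughly squares the relative error each step. Below is a minimal sketch of that technique only; it is an assumption-labeled illustration, not what Intel did or could have done. The 1994 Pentium FDIV bug came from missing entries in the lookup table of its SRT divider, which does not use Newton iteration, and the function names here are hypothetical.

```python
def refine_reciprocal(b: float, x: float, iterations: int) -> float:
    """Newton-Raphson refinement of an approximate reciprocal x ~= 1/b.

    Each step applies x <- x * (2 - b * x); if x = (1/b)(1 + e), the result
    is (1/b)(1 - e**2), so the number of correct digits about doubles.
    """
    for _ in range(iterations):
        x = x * (2.0 - b * x)
    return x


def divide(a: float, b: float, x0: float, iterations: int) -> float:
    """Compute a / b as a * (1/b), starting from a (possibly poor) estimate x0."""
    return a * refine_reciprocal(b, x0, iterations)


if __name__ == "__main__":
    # Operands from the well-known FDIV test case
    a, b = 4195835.0, 3145727.0
    exact = a / b
    # Deliberately bad reciprocal estimate with a 1e-4 relative error
    bad_guess = (1.0 / b) * (1.0 + 1e-4)
    for k in range(4):
        approx = divide(a, b, bad_guess, k)
        print(k, abs(approx - exact) / exact)
```

      Starting from a 1e-4 relative error, one step brings it to roughly 1e-8 and two steps to about double-precision rounding level, which is why "one more iteration" is the usual way to buy accuracy with latency.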
  • by Jeslijar ( 1412729 ) on Tuesday July 23, 2024 @11:40AM (#64649292) Homepage

    If they admitted their designs were poorly engineered, they would be subject to at least an eight-figure class action suit in the US, against which they would have no defense. They can't admit fault and avoid liability with something like this.

    Given that you can simply swap out a used chip for a brand-new chip of the identical model and it works perfectly fine, the damage is not going to be reversible. They're just gonna nerf the power numbers and leave us with a crappier chip, because the ones which don't run at full speed (e.g. cheaper SKUs) do not degrade as quickly. This will result in fewer warranty replacements at the cost of chips that do not perform as advertised. They'll never admit that the slower chips are affected either, because then the affected class will expand to the mainstream SKUs.

    The benchmarks post-patch will let us know how bad the nerf is. Long term this problem will never go away.

  • If this is resolved with a software patch, then presumably the chips remain forever vulnerable to malicious software deliberately causing the same issue. In other words, potentially malware can now actually fry your processor.

    I am not a chip designer or anything even close, but it seems to me that it would be kind of critical to have something like thermal and power limiting handled on-chip, in hardware, in a way that absolutely cannot be worked around with software. The hardware shouldn't be capable of operating in a way that destroys itself physically.

    • by UnknowingFool ( 672806 ) on Tuesday July 23, 2024 @11:54AM (#64649322)

      If this is resolved with a software patch, then presumably the chips remain forever vulnerable to malicious software deliberately causing the same issue. In other words, potentially malware can now actually fry your processor.

      Most likely the problem is in the Intel microcode [wikipedia.org], not third-party code like Cyberpunk 2077.

      I am not a chip designer or anything even close, but it seems to me that it would be kind of critical to have something like thermal and power limiting to be handled on-chip, in hardware, in a way that absolutely cannot be worked around with software. The hardware shouldn't be capable of operating in a way that destroys itself physically.

      Again, Intel is most likely talking about microcode.

      • by Gilmoure ( 18428 )

        And it's too hot to wear trench coats.

      • by Archtech ( 159117 ) on Tuesday July 23, 2024 @12:13PM (#64649374)

        "Not a Defect: Intel Blames 13th, 14th Gen CPU Crashes on Software Bug"

        That's PC Mag's headline. Now microcode comprises a vital part of any modern processor. Presumably the relevant microcode is supplied by Intel and ships with the chip - that is, in the chip or the driver.

        Either Intel's own microcode is faulty, or it unwisely allows outsiders to reprogram the microcode of its processors.

        In both cases Intel is at fault. If the trouble lies in Intel's microcode then it is indeed a defect. Software defects are just as real as hardware defects.

        • Not all defects are equal. A bug that can be patched isn't typically considered a defect, and up until now many have thought a hardware component was at fault. Microcode is Intel's responsibility. Bugs are common in every CPU. Intel, AMD, and ARM all publish errata for each chip generation, usually listing upwards of 100 bugs in each CPU, many of them resolved silently by microcode updates shipped as part of OS updates.

          • by UnknowingFool ( 672806 ) on Tuesday July 23, 2024 @01:07PM (#64649516)
            Also, those are the bugs that are caught by the chip maker before production. If they are not caught by then, they will appear when customers get the chips. A former Intel engineer claims that Skylake's poor QA was the final straw for Apple. [pcgamer.com] At one point, Apple was finding as many bugs in Skylake as Intel was when testing the chips. That was not a good look for Intel.
          • Up until now? It's clear that hardware IS at fault, and that Intel is trying to hide the problem. They will issue new microcode which tries to do this, and which will degrade performance whether it succeeds or not. This is their fourth guess as to what the problem is!

            • It's clear that hardware IS at fault

              No, it's not clear in the slightest. If the fix is attributed to the microcode bug, then it's a firmware fault.

              This is their fourth guess as to what the problem is!

              And if it's right this time will you eat an all you can eat buffet of humble pie? Probably not; you'll just come up with conspiracy theories from the sidelines. The only thing truly clear here is that nothing can be clear to you because you lack the information others have.

              • And if it's right this time will you eat an all you can eat buffet of humble pie?

                That depends, will the performance plummet like the usual Intel fixes?

                The only thing truly clear here is that nothing can be clear to you because you lack the information others have.

                We all have the same information, except whoever inside Intel has been lying to us about whether they have a working fix. So far they've claimed four times that they knew what the problem was, that it wasn't their fault, and that they'd fixed it; we know they were lying three of those times, and we're waiting to find out for sure about the fourth. But since we know they were lying 75% of the time so far, the safest bet is that they're lying again.

    • by Anonymous Coward
      The malware would have to be able to modify the processor's microcode (which is where the software bug is). Any malware that can do that can damage any modern processor.
  • Bs (Score:2, Insightful)

    by DrMrLordX ( 559371 )

    The claims don't fit the failure patterns. We'll have to wait to see if these fixes actually solve anything. Also this bug should have been detectable with external voltage monitoring.

    • The 'fix' will let the processors operate just long enough to go out of warranty before the problem kills them.

    • Re:Bs (Score:4, Insightful)

      by fuzzyfuzzyfungus ( 1223518 ) on Tuesday July 23, 2024 @02:17PM (#64649732) Journal
      Intel has had "fully integrated voltage regulators" on-package (parts are on-die; some capacitive and inductive elements are on-package) since Haswell. If the microcode issue affects the behavior of those voltage regulators, it could be significantly subtler than just watching what the chip is pulling from the motherboard.

      Still the sort of thing that you'd think at least engineering samples would have test points for, or some sort of monitoring interface; but potentially a lot more fiddly than the original "it must be those dastardly gamers with their dodgy motherboards" theory of overvoltage.
      • Not every desktop Intel CPU has FIVR. Raptor Lake does have some variation of an IVR, but you can still monitor the VRM for power fluctuations that don't line up with the voltage/current levels reported by on-package sensors.
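        The cross-check described above can be sketched as: for each time-aligned sample, compute the power implied by the on-package voltage and current readings (V * I) and flag samples where it disagrees with what the VRM reports by more than some tolerance. This is a minimal sketch under stated assumptions; the `Sample` schema, its field names, and the default 10% tolerance are hypothetical, and actually obtaining and time-aligning the readings is out of scope. If the external reading is taken on the VRM's input side, it also includes conversion losses, so the tolerance must leave room for that.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class Sample:
    """One time-aligned reading from two independent sources (hypothetical schema)."""
    t: float               # timestamp in seconds
    vrm_power_w: float     # power measured externally at the motherboard VRM
    pkg_voltage_v: float   # core voltage reported by on-package sensors
    pkg_current_a: float   # core current reported by on-package sensors


def mismatched_samples(samples: Sequence[Sample], tolerance: float = 0.10) -> List[float]:
    """Return timestamps where VRM power and on-package V*I disagree by more
    than `tolerance`, measured relative to the VRM reading.

    Samples with non-positive VRM power are skipped to avoid dividing by zero.
    """
    flagged = []
    for s in samples:
        if s.vrm_power_w <= 0:
            continue
        reported = s.pkg_voltage_v * s.pkg_current_a
        if abs(s.vrm_power_w - reported) / s.vrm_power_w > tolerance:
            flagged.append(s.t)
    return flagged
```

        Comparing against the external reading, rather than trusting either source alone, is the point: a regulator misbehaving inside the package would show up as a gap between the two, not as an anomaly in either series by itself.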

  • We will have to wait and see.
    Or, Intel might be telling us what we want to hear to buy more time.
    • AMD is about to launch its next generation.

      Intel has been fucked for years, so this is just a symptom. Intel is the next Motorola. You just can't compete as a vertically integrated fab anymore.
  • Software "flaw" my ass. I'm sure the logic requesting extra voltage had nothing to do with Intel trying to eke out every inch of performance in the highly competitive market against AMD. They probably hoped running hot wouldn't destroy the chips, at least within the short half-life modern CPUs have in an upgrade-crazed market. Now that the chips are dying, let's call that reach for performance a software "flaw". Let's see what the performance looks like after they fix the glitch.
    • Since when can microcode "request more voltage"?

      The decoder breaks instructions up into micro-operations, u-ops. There isn't a "do this one extra hard" u-op.
      • Do we know what's in Intel's microcode? Is that public information?

        When you use vector processing, the processor underclocks to compensate for the additional power draw. Maybe that requires being able to adjust voltage?

      • Microcode is more than just rearranging instructions. It controls the entire function of the CPU which includes how power is drawn.

  • I get that Slashdot uses the headlines that come from the source of a story, but anyone who thinks about it should realize that just because you can fix a problem by changing things in software, that does NOT necessarily mean that the issue itself is in software.

    Intel is addressing the issue in microcode. That doesn't mean it's a "Software Bug". We're techies and should know better than to post this kind of lowest-common-denominator story.

  • But in some bad news, Intel still needs a few more weeks to test its fix for the problem. “Intel is currently targeting mid-August for patch release to partners following full validation,” it says.

    Maybe Intel should invite some people from ClownStrike to show them how to do QA.

  • Not software (Score:4, Insightful)

    by paradigm82 ( 959074 ) on Tuesday July 23, 2024 @04:02PM (#64650060)
    The use of 'software' suggests it is something external to the CPU (you're using it wrong). I wouldn't call microcode 'software'. It may share some characteristics of software (it's updateable, etc.), but it is still a completely internal part of the CPU, inaccessible to anyone but Intel. Ultimately most of the CPU is realized in the form of VHDL code, so ultimately everything can be described as a programming/coding error. Further, the failure mode is not clear. Is it A) "Microcode bug requests too high a voltage, which makes the CPU unstable while the high voltage level is present, but the effect is reversed by removing the high voltage", or B) "Microcode bug requests too high a voltage, which wears down the CPU faster than expected, making it permanently less stable even with the high voltage gone"?
