Intel Hardware IT

Motherboard Makers Apparently To Blame For High-end Intel Core i9 CPU Failures (arstechnica.com) 57

An anonymous reader shares a report: Earlier this month, we wrote that some of Intel's recent high-end Core i9 and Core i7 processors had been crashing and exhibiting other weird issues in some games and that Intel was investigating the cause. An Intel statement obtained by Igor's Lab suggests that Intel's investigation is wrapping up, and the company is pointing squarely in the direction of enthusiast motherboard makers that are turning up power limits and disabling safeguards to try to wring a little more performance out of the processors.

"While the root cause has not yet been identified, Intel has observed the majority of reports of this issue are from users with unlocked/overclock capable motherboards," the statement reads. "Intel has observed 600/700 Series chipset boards often set BIOS defaults to disable thermal and power delivery safeguards designed to limit processor exposure to sustained periods of high voltage and frequency."

These are the specific settings that Intel believes are causing problems:
Disabling Current Excursion Protection (CEP)
Enabling the IccMax Unlimited bit
Disabling Thermal Velocity Boost (TVB) and/or Enhanced Thermal Velocity Boost (eTVB)

Additional settings which may increase the risk of system instability:
Disabling C-states
Using Windows Ultimate Performance mode
Increasing PL1 and PL2 beyond Intel recommended limits.
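
The two lists above can be sketched as a small checker. This is only an illustration: the setting names (`cep_enabled`, `iccmax_unlimited`, and so on) are hypothetical labels, not real firmware or Intel tool identifiers, and the recommended PL1/PL2 values are supplied by the caller rather than taken from any Intel document.

```python
# Hypothetical setting names; real UEFI menus label these differently per vendor.
# Each entry maps a setting to the value that the statement flags as risky.
PRIMARY_RISKS = {
    "cep_enabled": False,        # Current Excursion Protection disabled
    "iccmax_unlimited": True,    # IccMax Unlimited bit enabled
    "tvb_enabled": False,        # Thermal Velocity Boost disabled
    "etvb_enabled": False,       # Enhanced Thermal Velocity Boost disabled
}
SECONDARY_RISKS = {
    "c_states_enabled": False,             # C-states disabled
    "windows_ultimate_performance": True,  # Ultimate Performance power plan in use
}


def flag_risky_settings(settings, recommended_pl1_w, recommended_pl2_w):
    """Return a list of (severity, setting_name) pairs for flagged settings.

    `settings` is a dict of observed values; missing keys are simply skipped.
    The recommended limits (in watts) must come from the caller.
    """
    findings = []
    for name, risky_value in PRIMARY_RISKS.items():
        if name in settings and settings[name] == risky_value:
            findings.append(("primary", name))
    for name, risky_value in SECONDARY_RISKS.items():
        if name in settings and settings[name] == risky_value:
            findings.append(("secondary", name))
    # Power limits above the recommended values count as secondary risks.
    if "pl1_w" in settings and settings["pl1_w"] > recommended_pl1_w:
        findings.append(("secondary", "pl1_w"))
    if "pl2_w" in settings and settings["pl2_w"] > recommended_pl2_w:
        findings.append(("secondary", "pl2_w"))
    return findings
```

Keeping the recommended limits as parameters avoids hard-coding numbers that differ per processor model.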

This discussion has been archived. No new comments can be posted.

  • by herberttlbd ( 1366107 ) on Monday April 29, 2024 @01:07PM (#64433334)

    They did a video on this issue before the problems started coming out and did a follow-up on it afterward.

  • I purchased an Alienware M18 with an i9-13980HX processor and an Nvidia RTX 4090.
    After 3 months, it started experiencing frequent kernel panics, reboots, and blue screens.
    The issue was eventually solved by replacing the motherboard (C9XMR) and heatsinks.

    • The article is about default BIOS configurations that cause the issues. Replacing the motherboard would give you the same default BIOS configurations, so you'd have the same crashes with the new motherboard.

      • The article is about default BIOS configurations that cause the issues. Replacing the motherboard would give you the same default BIOS configurations, so you'd have the same crashes with the new motherboard.

        Not all motherboards have the same default BIOS. Replacing a motherboard and getting the exact same defaults would mean you used the exact same model of motherboard... There are many manufacturers.

        How did that escape you?

        • Maybe you should reread the comment GrumpySteen was replying to, I don't think Alienware is going to send out random brand motherboards to fix one of their systems.

          • by Xenx ( 2211586 )
            I doubt they even could. They were talking about a laptop, which already would preclude the option. Further, Dell/Alienware usually custom design the board/case for regular computers reducing/removing any compatibility with off the shelf parts.
          • by Anonymous Coward
            You do realize that even motherboards from one specific product model can have many revisions, and especially different BIOS versions!
          • Maybe you should reread the comment GrumpySteen was replying to, I don't think Alienware is going to send out random brand motherboards to fix one of their systems.

            Even if the motherboard you receive after an RMA has the same hardware revision (likely it doesn't) it almost certainly won't have the same BIOS unless you updated the BIOS right before sending yours off.

      • The article is about default BIOS configurations that cause the issues. Replacing the motherboard would give you the same default BIOS configurations, so you'd have the same crashes with the new motherboard.

        It's entirely plausible that the replacement motherboard may have been shipped with an updated BIOS that has more sane defaults. Reading TFA, this appears to be the direction that Intel is pushing the motherboard manufacturers in.

      • by klubar ( 591384 )

        It's also possible that the replacement motherboard was somewhat luckier in the silicon/manufacturing lottery. If it's being pushed to the limit, a board that was a bit above spec rather than just at spec might work. Presumably the MB manufacturers sort of tested the bios settings -- but they may have done it on a different run/factory/phase of moon board than was actually shipped.

    • by r1348 ( 2567295 )

      Congratulations on your terrible purchase.

    • For information, how did you reach the conclusion that it was the motherboard, and not the RAM or the processor? I'm just concerned it might happen to me one day and I'll never be able to understand where the problem comes from.

      • by Targon ( 17348 )
        This issue is showing up on the high end, 13900K and KS, and the 14900K and KS. Now, when you push too much power at a chip, there is a chance the chip will "degrade", and it will not run stable without cutting back on the voltage going to the chip. The problem is that cutting back on power reduces how fast the chip can go. If the chip does not degrade, then it will run at that speed with that amount of power draw pretty much forever. Not all chips have the problem, but enough of these chips are DEGR
    • The article is about Intel's approved rules and BIOSes bricking CPUs, not boards failing due to inadequate VRMs. If you had the same problem, it would persist even after replacing the board, except that your board is likely e-waste by design with everything soldered on, so you accidentally replaced the whole system.

    • Same family of CPUs but different product with different problems. The culprit is Z790 motherboards with CPUs ranging from 13900k-14900ks. Products you don't have with wildly different power draw.

  • by Big Hairy Gorilla ( 9839972 ) on Monday April 29, 2024 @01:45PM (#64433446)
    ...of a one month purchase/return policy. Burned out an asus i9 compiling firefox or derivative... 30 continuous minutes of 100% cpu... screen goes black.. I think there was puff of smoke... all lights went off... reboot triggers 9 beeps from the motherboard... brought it back... yup, it's 100% burned out. (about 2 years ago)

    I walked out with an AMD zen 5 series as a replacement.. also asus... with a natty oled screen... love it.... Never looked back. AMD zen seemed 20% faster, no heat issues, can pin it at 100% for as long as I want, no problem... and battery life is easily 25% longer.

    Also, don't run systemd, you'll gain another 5% battery life from that alone. <whispers.... devvv....ooo...uan.....>
    There is for certain, no use case for systemd on a laptop/desktop install. NONE.
    There is also no use case for systemd on a server, you're just Red Hat's bitch.
    Sorry, I meant, IBM's bitch.
    • When you let the smoke out of the box the game is over.
    • I walked out with an AMD zen 5 series as a replacement.. also asus... with a natty oled screen... love it.... Never looked back. AMD zen seemed 20% faster, no heat issues, can pin it at 100% for as long as I want, no problem... and battery life is easily 25% longer.

      I would like to congratulate you on your sample size of one purchase yielding good results. Meanwhile people who actually keep up to date with news know that AMD is not immune to fiery CPUs. https://www.pcgamer.com/users-... [pcgamer.com]

      • That issue was solved months ago, and it was only people running excess vSoC settings mostly due to EXPO memory kits. We've already covered the matter in comments here at least once.

      • This whole discussion thread is about hot intel cpus from this month.... Your ref is from one year ago, April 2023..
        whoopsie....try to keep up with the news, eh?

        I appreciate your compliment though.
    • by AmiMoJo ( 196126 )

      Interesting. We have been using Raspberry Pis for stuff at work, and tried Devuan. We were mostly interested in power consumption, what would be battery life for a laptop. We found that Devuan was a lot worse.

      To be fair it could have been the Pi hardware, it could have been Devuan not being optimized for it, but it was considerably worse than RPi OS which uses systemd.

      • I believe we've crossed paths on this before. Basically, for most systems I remove or don't load various system services, avahi and cups come to mind. Most of our office use cases don't require printing... but once you prune out a lot of unrequired daemons, I got the entire running process count down to around 300 (on desktop machines running Devuan). It's been a while since I peered into Ubuntu but iirc, after a vanilla install you'd have over 600 running processes. Processes take energy. Less processes
        • by Reziac ( 43301 ) *

          Back a few years I was wondering why Mint, being glorified Ubuntu, ran so much better than Ubuntu. Turns out Mint was running (by actual count) 1/4th as many processes. Gee, I wonder how that could impact performance...

          I didn't much like Devuan until they borrowed the PCLOS desktop and general way of doing things... now it's a lot slicker.

  • Sure (Score:3, Interesting)

    by Penguinoflight ( 517245 ) on Monday April 29, 2024 @01:51PM (#64433474) Journal

    It's the fault of the motherboard makers for using the chipsets exclusively allowed by Intel with a bios explicitly approved by Intel while following the rules drafted by Intel.

    The common failure when all your decisions must be approved by some outsider is to stop doing your own oversight. Of course the board makers should have done better, but it's 1000% intel's fault for failing to use their position to actually protect their products and customers.

    • by Anonymous Coward

      That’s a bit like Ford being responsible for you driving on the interstate in first gear at the red line.

      • As an early beta-tester (read: customer) of Ford's attempt at making a CVT transmission, please stop giving them ideas.

      • Ah but modern engines have a rev limiter. These CPUs are 'unlocked', so no rev limiter and you get to smell the smoke.
        • by Luckyo ( 1726890 )

          You don't. These CPUs aren't outright dying by burning up. What's happening is that they're degrading too fast. So they'll still work, but be stable only at slower clock speeds and power limits.

      • That's a great car analogy because it involves a car. Here's a better one.

        Ford: Buy our 500hp engines, it will allow you to drive 200mph!

        Dodge: We plan to hit 210mph by using Ford's engine. We are going to run it at 50,000rpm and will be saving costs by using no radiator.

        Ford: Sounds great, we will tell everyone to buy a Dodge!

    • Are there use cases for which these particular BIOS settings are valid? I am trying to conjure up some but can't really think of any. Turning off something like Current Excursion Protection is pretty much guaranteeing a failure. If you need to disable CEP to get acceptable performance, you need to look for much better hardware.
      • Winning benchmarks.

        • In the case of Intel, the publicity of winning benchmarks after years of losing them to AMD. Bear in mind, they did it by removing or lowering many safety settings and allowing as much power as possible. That's like a car manufacturer being able to brag that their new car model has the fastest 0-60 time on a track if the consumer buys the Enthusiast kit that allows them to turn off or remove pesky things like governors, mufflers, catalytic converters, etc. And then that manufacturer is shocked that those cu
    • It's the fault of the motherboard makers for using the chipsets exclusively allowed by Intel with a bios explicitly approved by Intel while following the rules drafted by Intel.

      Nice story, but in this case the issue was specifically *not* following the rules drafted by Intel.

      • We know that the boards were compliant because Intel won't sell chipsets to anyone who doesn't follow their rules. Hardware Unboxed was exercising extreme diligence to confirm that Intel hasn't specified which power limits they're ok with. The whole market stack is captured, which unfortunately for intel leaves them with nobody to blame but themselves.

        The only way intel could escape liability for this problem would be if some of the board makers were falsifying data to pass the conditions (like VW with th

    • by MpVpRb ( 1423381 )

      The problem is that the default settings are outside of the max set by Intel. They are NOT following the rules

    • Intel knew what they were doing. Letting the CPU be pushed that hard made for bigger bars on benchmarks, so they had no reason not to just wink and nod while the motherboard manufacturers used those Intel approved settings. If they didn't, their own product wouldn't have the bigger bars and each company knew that every other company would do exactly the same thing so even if there were issues, it wouldn't be just them.
    • by Luckyo ( 1726890 )

      It's actually more of intel defining the spec poorly. Intel has gone on record to say that their standard settings are just a recommendation, and running the CPU in infinite length turbo frequency and blasting power at it is still "in spec" because... spec only addresses overclocking via clock speed. Not increasing power fed to CPU.

      And feeding too much power to CPU for a long period of time causes rapid and permanent degradation of silicon. Which is what is happening here.

      To make matter worse, Intel is so f

    • by evanh ( 627108 )

      And because the reviews were all done with the over powered chips, Intel has effectively been false advertising to all end users.

      Intel owes everyone a refund.

    • by AmiMoJo ( 196126 )

      The important issue here is who is liable, i.e. who is going to buy you a new CPU.

      Have any of the motherboard manufacturers said they will replace dead CPUs?

    • That is a terrible way to look at it.

      The fact that Intel left room for "tweaking" settings is no excuse for the motherboard manufacturers to set up their BIOS/EFI in such a way as to fry the components.

      Your perspective allows no end room for the users. I don't like that even though I don't overclock.

  • by JBMcB ( 73720 )

    MSI is recommending users turn off CEP as it can cause CPUs to overheat.
    https://www.msi.com/blog/lower... [msi.com]

    Intel says CEP should be turned on.

    So which is it? My i9-13900KF is plenty fast without overclocking or undervolting, I mainly want it to be as stable as possible.

    • MSI was one of the makers pushing infinite power on all their boards and this advice is not current. I'd expect their aberrant results to be an indicator of some other configurations that are quite unreasonable.

      • This post was from last month, and the latest BIOS, released last week, for my mainboard adds this setting. It's been in a beta BIOS for months.

    • Yeah I thought MSI reversed that advice but I haven't been following lately. Intel is the CPU manufacturer and I'm surprised they even allow CEP to be disabled.
    • If you are not a hardcore tuner you can back off the PL2 value and disable MCE to make sure that PL values are enforced. 180W should be safe.

      If you are a tuner, you can tweak voltages and clocks in XTU or use the UEFI directly.

      • Thanks for the tips. I disabled "Enhanced Turbo" and capped the maximum power draw from ~4000W (WTF?) to 235W, the max recommended for my CPU.

        CEP is disabled by default in my UEFI settings now, so I turned that back on, too. I'll keep an eye on temperatures to see if anything goes screwy.

        • Just an FYI, but the way Intel CPUs are supposed to work when MCE (which is probably named Enhanced Turbo in your UEFI menu but without knowing your board, I can't say for sure) is disabled, is that it attempts to boost up to the PL2 value for a specific amount of time, after which point it backs off clocks and volts until power draw reaches the PL1 value. The amount of time it can stay at PL2 is controlled by a time value known as tau. Note that the latest Intel chips (since Alder Lake if I recall correct
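
The PL1/PL2/tau behavior described in the comment above can be sketched as a simple step model. This is a deliberate simplification under stated assumptions: the function names are hypothetical, the numbers in any call are illustrative only, and real processors are generally described as tracking a running average of power rather than a hard timer.

```python
def power_limit_at(t_seconds, pl1_w, pl2_w, tau_s):
    """Allowed package power (watts) at time t under a simplified step model.

    Assumption: the load starts at t = 0, the chip may draw up to PL2 for
    tau seconds, then falls back to PL1. Actual hardware is more gradual.
    """
    if t_seconds < 0 or tau_s < 0:
        raise ValueError("time and tau must be non-negative")
    return pl2_w if t_seconds < tau_s else pl1_w


def energy_budget_joules(duration_s, pl1_w, pl2_w, tau_s):
    """Maximum energy (joules) the step model permits over a sustained load."""
    if duration_s < 0 or tau_s < 0:
        raise ValueError("duration and tau must be non-negative")
    boost_time = min(duration_s, tau_s)      # seconds spent at PL2
    steady_time = duration_s - boost_time    # seconds spent at PL1
    return pl2_w * boost_time + pl1_w * steady_time
```

In this model, setting PL1 equal to PL2 (or making tau effectively unlimited) removes the fallback entirely, which is the kind of configuration the thread is discussing.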

  • RGB > magic smoke containment > sensible pricing.
  • While it would not surprise me that it is a motherboard fault, saying that the majority of reports are from enthusiast motherboard makers seems silly. Of course they are the majority: those buying i9 and even most of the i7 chips will likely be in that category, and they are the ones that will be pushing their machines to the limits and beyond.
  • For some time now they have kept pace with AMD only by drawing far more than PL1, and sometimes well over PL2 if cooling allows.
    14900K can pull 400W(!) peak power with limits switched off, that's insane!

    All this, just to stay roughly competitive with AMD, whether on desktop or notebooks.
    Just check tests on notebookcheck.net, where they don't just run one loop of the test where Intel might look good, but loop them to see how the performance is maintained. AMD usually keep the initial scores, because

  • Disabling C-states used to keep the CPU from entering lower-power modes at idle. I always did that on all of my desktops and servers in 2010-2018, yet it did not seem to affect the maximum power consumption when the CPU was loaded at 100%. Why would they all of a sudden ask us not to disable them?
