
Intel Skylake Bug Causes PCs To Freeze During Complex Workloads (arstechnica.com) 122

chalsall writes: Intel has confirmed an in-the-wild bug that can freeze its Skylake processors. The company is pushing out a BIOS fix. Ars reports: "No reason has been given as to why the bug occurs, but it's confirmed to affect both Linux and Windows-based systems. Prime95, which has historically been used to benchmark and stress-test computers, uses Fast Fourier Transforms to multiply extremely large numbers. A particular exponent size, 14,942,209, has been found to cause the system crashes. While the bug was discovered using Prime95, it could affect other industries that rely on complex computational workloads, such as scientific and financial institutions. GIMPS noted that its Prime95 software "works perfectly normal" on all other Intel processors of past generations."
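For readers who want a feel for the workload: GIMPS hunts Mersenne primes with the Lucas-Lehmer test, which repeatedly squares numbers millions of digits long (Prime95 does those squarings with hand-tuned FFT assembly). Below is a minimal sketch of the same algorithm in Python, assuming the gmpy2 GMP bindings are installed; it shows the class of computation involved but will not reproduce the Skylake hang, which lives in Prime95's own AVX FFT code path.

    # Lucas-Lehmer test: for an odd prime p, 2**p - 1 is prime iff s == 0
    # after p-2 iterations. Each iteration squares a p-bit number -- the
    # giant-multiplication workload that Prime95 implements with FFTs.
    import gmpy2

    def lucas_lehmer(p):
        m = gmpy2.mpz(2) ** p - 1      # the Mersenne candidate
        s = gmpy2.mpz(4)
        for _ in range(p - 2):
            s = (s * s - 2) % m        # GMP switches to FFT multiplication at large sizes
        return s == 0

    # 14,942,209 is the exponent reported to hang Skylake under Prime95;
    # testing it here would take days, so try a small exponent instead.
    print(lucas_lehmer(31))            # 2**31 - 1 is prime -> True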
  • by xmas2003 ( 739875 ) * on Monday January 11, 2016 @04:04PM (#51281463) Homepage
    Old-timers will remember the Pentium 5 FDIV bug [wikipedia.org] where the chip sometimes yielded incorrect results for complex mathematical calculations.
    • by Junta ( 36770 ) on Monday January 11, 2016 @04:09PM (#51281511)

      Well, 'déjà vu', and you can leave the '5' off.

      For an analogous screw-up, you only need to look at Haswell/Broadwell and the TSX feature, which Intel retroactively disabled due to a defect.

      The FDIV bug was noteworthy because the state of things was such that Intel didn't have much recourse other than replacing the processors. We haven't seen a defect force a physical recall at that scale since; there have been a number of similarly disastrous issues that would have required one, if not for the fact that a microcode change could be pushed to disable something or work around it...

      • The real problem with the FDIV bug is in how Intel handled it - they refused to replace an admittedly defective part unless you could show that you specifically were affected. I'm betting on a repeat here.
        • by whit3 ( 318913 )

          The real problem with the FDIV bug is in how Intel handled it - they refused to replace an admittedly defective part unless ...

          Well, that was the first response. Eventually, though, they bit the bullet:

          "Monday, December 19 [1994] we changed out policy completely. We decided to replace anybody's part who wanted it replaced... replacing people's chips by the hundreds of thousands... We created a service network to handle the physical replacement for people who didn't ant to do it themselves."

          -- from

          • Sure, they eventually caved - but only after chip yields rose. The prospect of a class action forcing them to pay out the original purchase price of every chip sold during the low-yield period would have cost a LOT more (and it would have been based on chips sold, not just on people who actually filed a claim).
      • by Anonymous Coward

        Well, 'déjà vu', and you can leave the '5' off.

        For an analogous screw-up, you only need to look at Haswell/Broadwell and the TSX feature, which Intel retroactively disabled due to a defect.

        The FDIV bug was noteworthy because the state of things was such that Intel didn't have much recourse other than replacing the processors. We haven't seen a defect force a physical recall at that scale since; there have been a number of similarly disastrous issues that would have required one, if not for the fact that a microcode change could be pushed to disable something or work around it...

        That's because after FDIV, they put in a shit ton of work developing survivability features so that problems could be worked around. This is a good thing.

    • Re: (Score:3, Interesting)

      by ColdWetDog ( 752185 )

      Nah, we blame this one on the NSA, to wit:

      It only happens when running complex calculations like Mersenne primes. Who runs such calculations? It isn't the good citizens looking at their Facebook or whatever it is that they look at. It's people doing crypto, i.e., Terrorists.

      So how do we stop Terrorists? Don't let them do complex crypto calculations.

      QED.

    • by 110010001000 ( 697113 ) on Monday January 11, 2016 @04:14PM (#51281571) Homepage Journal
      All processors have bugs. Some are fixed and some are not. You can obtain errata sheets from the manufacturers. At least this one is easily fixable.
    • It's not a bug. It's a "specification update". Get it right. Clearly you were using the wrong specification.

    • by serviscope_minor ( 664417 ) on Monday January 11, 2016 @04:26PM (#51281703) Journal

      Old-timers will remember the Pentium 5 FDIV bug

      5? That was the 80 4.999999583694 86 processor was it not?

    • and run simultaneously on 7.9335 threads, too!

    • Don't divide: Intel Inside.
    • Old-timers will remember the Pentium 5 FDIV bug [wikipedia.org] where the chip sometimes yielded incorrect results for complex mathematical calculations.

      Does the following make sense?
      The engineers brought back the above code because the people who knew about it, and why it should not be used, had retired. That retirement is what allowed its re-introduction. No, Intel will not be accepting returns for Skylake. It will be a microcode patch. The microcode patch is a backdoor input to the CPU that allows fixing instructions - and breaking security.

  • Too bad AMD is out of the PC CPU race and Intel will go unpunished for such a major flaw.
    • by Moof123 ( 1292134 ) on Monday January 11, 2016 @04:15PM (#51281593)

      If you saw the actual errata list for processors on launch day, regardless of manufacturer, your jaw would drop. A lot of nasties get cleaned up in subsequent revisions (mask changes), but in the meantime patches show up in the BIOS, libraries, and compilers so that the user never sees the warts. With billions of transistors there will be design errors that even Intel will not catch during verification or characterization. The fact that a BIOS fix will take care of it is a sign that it is not that egregious.

      If you want to avoid this kind of stuff, you should wait a few months after any major shakeup to buy.

      • by Moof123 ( 1292134 ) on Monday January 11, 2016 @04:17PM (#51281611)

        Go see page 21 for example:
        http://www.intel.com/content/d... [intel.com]

        • by sinij ( 911942 )
          Surprising. I expected in-silicon code to be more robustly tested prior to release. Turns out, code is code.
      • Like software, one should wait until the product has had its first revision.

        Oddly, we think of Intel CPUs and chipsets as rock solid and operating systems as garbage based on Vista, ME, and 8.1. Perhaps doing the same and buying older hardware would be wise too.

        My Gigabyte board, for example, I am disappointed in, and the same goes for Asus when the Z97/Haswell platform was new. Both are top brands, but both boards were extremely unstable and buggy. The Asus Sabertooth is unusable, and the Gigabyte only got somewhat stable after 4 updates.

        • Re: (Score:2, Interesting)

          by Anonymous Coward

          Everything is getting faster. Development cycles are getting shorter, schedules are getting tighter, margins are being trimmed down and testing is taking some of that hit. Software is already brutally paced to the point that customers are now performing QA. We're having to train our customers how to use Bugzilla and we somehow accept this as "Ok". Eventually the pacing will become so brutal that version 2 won't even use the same codebase as version 1. Posting bugs will become useless. Software development v

        • I find neither Gigabyte nor Asus to be "top" motherboard manufacturers. At best they are premium value boards (cheap boards with some premium features enabled). I have found them consistently to be buggy and sometimes even outright useless. The last time I bought them, I actually returned an Asus board because it 'supported' ECC RAM but didn't actually implement it (simply disabled it).

          I buy SuperMicro boards, not always on the edge but consistently configurable and very good support if any bugs do arise. I

          • I've found Gigabyte to be okay, but I've never understood why people like Asus so much. Their stuff is way too flaky and unreliable to command the premium prices you'll pay for it. It's too bad that Intel stopped making motherboards (at least ones in standard form factors). They generally weren't terribly friendly to overclockers and could be a bit conservative on what settings they exposed but they tended to be pretty stable and well supported.

      • by epine ( 68316 )

        The fact that a BIOS fix will take care of it is a sign that it is not that egregious.

        For a given value of performance expectation, as purchased.

        One might be a little bit cheesed to discover that the entire hardware floating-point subsystem has been replaced with an on-chip emulator, which additionally wires down half of your L2 cache to host the microcode execution vectors and/or byte codes.

        In the spirit of good will and transparency, I hope to see Intel recirculate the original sample chips to all the hardwa

      • by antdude ( 79039 )

        That is why I never buy the (new/lat)est stuff. I'll get the older and more stable stuff.

    • you insensitive clod!

  • by RDW ( 41497 )

    It probably just means the NSA is already using your processor's compute capacity as part of their vast decryption botnet. The fix should improve resource management so you won't notice it in the future.

  • by Puff_Of_Hot_Air ( 995689 ) on Monday January 11, 2016 @04:31PM (#51281757)
    This is a really interesting talk from 32C3 detailing the challenges involved in designing and verifying something as complex as a CPU, where the full design can only be simulated at about 1 Hz and it costs around $5 million to produce silicon for testing. https://www.youtube.com/watch?v=eDmv0sDB1Ak [slashdot.org]. The level of difficulty in getting this right just blows my mind. If it weren't for economies of scale, CPUs would be completely out of reach. Also interesting in the talk is the vast number of CPU defects that are found and catalogued, which most people appear to be unaware of. Most are of little importance (and hence don't get fixed), but some are fixed via code (as in this case) - and there is no guarantee that those fixes are being shipped by OEMs.
    • by mikael ( 484 )

      I know the 720p version of this movie would send one Intel multi-core CPU into shutdown. That was with a 3D TV and an NVidia 3D Vision setup. The same graphics boards and display had no problem with another motherboard/CPU combination. Still wondering whether it was the CPU or the cooling. No problems with anything GPU related.

      http://www.3dtv.at/Movies/Skyd... [3dtv.at]

    • by Moof123 ( 1292134 ) on Monday January 11, 2016 @05:19PM (#51282173)

      I work in ASIC design, though I am on the analog side of things. There are more people doing verification than design, by roughly 2:1; I am told that at the smaller nodes, and on more complex designs, the ratio is even higher. Basically, you can slap down some RTL code (Verilog or VHDL) quickly, but torturing it through all the exception paths is very hard. Then you have to synthesize and build it, which can introduce all sorts of timing and parasitic problems that have to be double-checked. Finally, test vectors have to be created to check the functionality of every transistor in the design, to assure that what was built matches the masks.

      It is truly phenomenal that anything with billions of gates ever works at all, let alone with the high yield and relatively low error count we have come to expect.
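      A toy illustration, in Python rather than RTL, of the imbalance described above: the "design" is one line, and everything else is verification. The adder4 function and the exhaustive loop are invented for this sketch; real RTL verification is vastly harder because of state, timing, and protocols.

          # The "design": a 4-bit adder with carry-out. One line of logic.
          def adder4(a, b):
              s = a + b
              return s & 0xF, (s >> 4) & 1

          # The "verification": exhaustive test vectors over the whole input
          # space. Even for a trivial block, checking dwarfs the design code.
          for a in range(16):
              for b in range(16):
                  total, carry = adder4(a, b)
                  assert total == (a + b) % 16
                  assert carry == (a + b) // 16
          print("all 256 cases pass")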

      • by tlhIngan ( 30335 ) <slashdot@worf.ERDOSnet minus math_god> on Monday January 11, 2016 @06:06PM (#51282595)

        I've done this.

        First, billions of transistors is actually easy - most of the transistors in a modern CPU are spent on caches and other memory. Logic itself doesn't have as high a transistor density as you might think. In fact, in practically all ASIC designs there's so much extra silicon space that they put down extra gates that do nothing but sit tied to a logic value. These spare transistors provide "rework" room for the design. If you look at most steppings, you start with A0, then you have A1, A2, ... B0, B1, ... etc. Going from A0 to A1 is basically just a metal mask change - they don't touch the transistor masks (each mask costs around $100K, and 10-layer metal designs often have 30+ masks, so that's a $3M cost before the first silicon is patterned). Instead, they rewire the transistors using this spare sea of transistors to fix the issues - hopefully only needing to change 5, maybe 10 masks tops (~$1M). Going from Ax to B0 implies a complete new mask set - either there are too many fixes, or the design is being revised.

        As for simulation, it's multi-stage. First each block is individually tested and simulated, then it's all brought together and software-simulated to check for easy-to-spot faults, with full inner visibility to see why things are the way they are. The complexity of modern CPUs and SoCs means this runs at only around 1 Hz, usually less, so it's reserved for initial testing and sanity-checking test vectors.

        The next step is to put it on an accelerator - systems like Cadence's Palladium, which can get your clock speeds up into the hundreds-of-Hz range. The simulation isn't as visible and the timings can be off, but you can functionally check most of the blocks and, with careful probe design, bring error cases back to the software model to understand what's going on.

        The next stage is FPGA simulation - you're testing the logic itself on FPGAs (we're talking about the ones that easily cost $30K each, and no, you need at least 4 or 8 of them, or more - that's a quarter million dollars in FPGAs!). But the system moves into the kHz range, even up to 1 MHz. Despite its slowness, that is actually fast enough to boot an OS like Windows or Linux and run test software, so software development for drivers and such can begin. Visibility is limited to whatever probes you could install and whatever debugging tools your FPGA toolset has.

        Then it's all laid out and routed and all that, and software simulations are run to verify timings - ensuring there are no setup and hold violations in the final floorplan.

        And it's not as bad as you think - each block is quite independent, and as long as the interface contract is held (setup and hold, timings, and other things for the block), the tools will tell you how close you are to violating the specs for each block. So you can test each block in isolation and, as long as the interface contract is held, be assured it will work.

        Of course, it won't catch integration errors like ground bounce or other such things. It's more akin to building a space shuttle or an airplane - with the right design, you can get something that works.
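        A hedged back-of-envelope check of the figures above, in Python: the mask prices and simulator speeds are the comment's own numbers, while the 1e10-cycle cost of booting a small OS is a round figure assumed purely for illustration.

            # Mask-set economics, using the comment's ~$100K-per-mask figure.
            MASK_COST = 100_000
            print("full 30-mask set:  $", 30 * MASK_COST)   # ~$3M before first silicon
            print("metal-only respin: $", 10 * MASK_COST)   # ~$1M for 5-10 masks

            # Why each verification platform matters: hours to push an assumed
            # 1e10 cycles (roughly an OS boot) at each stage's clock rate.
            for name, hz in [("software simulation", 1.0),      # ~1 Hz
                             ("Palladium accelerator", 500.0),  # hundreds of Hz
                             ("FPGA farm", 1e6)]:               # kHz up to ~1 MHz
                print(f"{name:22s} {1e10 / hz / 3600:14.1f} hours")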

      • by tibit ( 1762298 )

        Do you have a Czar of Bandgaps, and do you dread temperature-dependent startup problems yet? :)

  • by Anonymous Coward

    Just got an MSI with 32GB of RAM and a Skylake processor because I need to manipulate large AutoCAD files. For no reason, my laptop would lock up, and nothing would be in the dump logs. I could not figure it out... until now.

    • by Anonymous Coward

      You were running Windows?

    • I think you might want to look elsewhere for your problems. I've got an MSI Z170A motherboard, an i7-6700K, and 32GB RAM, which I use to manipulate large AutoCAD files... and I have had absolutely no issues at all.

    • You still haven't figured it out. You are assuming this is your problem. Unless you are an AutoCAD developer and have built the source with debugging enabled, then actually used it to single-step to the offending instruction and watched the problem occur, you are still operating on a (not completely unreasonable) assumption.
  • Isn't it easier to distribute new firmware with the microcode_ctl/intel-microcode packages? MS-Windows also seems to have some such package updates.
    • Intel needs to push the microcode update through the BIOS. You can't do it via an OS update. So hopefully your motherboard manufacturer picks up on this.
      • Linux applies microcode updates at runtime... (a quick way to check which revision is loaded is sketched below, after this sub-thread).
      • by short ( 66530 )
        The motherboard manufacturer can do whatever they want, but unless I reflash my BIOS it has no effect. And I do not regularly reflash my BIOS - do you? Besides that, I still find the automatic nightly package update easier.
      Intel needs to push the microcode update through the BIOS. You can't do it via an OS update. So hopefully your motherboard manufacturer picks up on this.

        How often does Joe Sixpack update his BIOS? I mean, really? It makes more sense to patch the CPU at startup, since most of these users have updates enabled by default because their computer came that way when they turned it on.

      • by tibit ( 1762298 )

        No, they don't. Yes, you can. Maybe - meh.
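      For anyone wanting to verify the runtime-update claim on their own Linux box, here is a small sketch; it parses the standard "microcode" field of /proc/cpuinfo (nothing distribution-specific assumed) and prints the revision the kernel reports. If the value rises after installing the intel-microcode/microcode_ctl package and rebooting, the OS-level update path worked without a BIOS reflash.

          # Report the microcode revision(s) the kernel has loaded, one
          # "microcode" line per logical CPU in /proc/cpuinfo on x86 Linux.
          def microcode_revisions(path="/proc/cpuinfo"):
              revs = set()
              with open(path) as f:
                  for line in f:
                      if line.startswith("microcode"):
                          revs.add(line.split(":")[1].strip())
              return revs

          print(microcode_revisions())   # e.g. {'0x74'} (value is illustrative)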

  • The CPU makes the PC freeze? If they could just crank this bug down a bit, it could revolutionize the server cooling industry.

  • by Anonymous Coward on Monday January 11, 2016 @04:47PM (#51281861)

    Just saw this video

    https://www.youtube.com/watch?v=eDmv0sDB1Ak

    Gives some insight into the insanely complex nature of processor design and how absurdly reliable processors need to be. Modern computers pretty much expect the CPU to be flawless, and that's a daunting task considering their complexity and the staggering number of computations they perform even in ordinary day-to-day use.

    An error that occurs once in a billion operations will happen 3 times a second at 3 GHz.

    So yeah. Some bugs are gonna happen. Thankfully most can be fixed with microcode updates.
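    The arithmetic above is easy to sanity-check (both numbers are the comment's own):

        clock_hz = 3e9        # 3 GHz
        error_rate = 1e-9     # one error per billion operations
        print(clock_hz * error_rate, "errors per second")   # -> 3.0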

    • by Anonymous Coward

      Most processor bugs have nothing to do with the frequency of execution; they're caused by a unique set of circumstances. So when someone says it will happen once out of every billion operations, they're making the assumption that you will set up that unique case one out of every billion times. That depends heavily on what you're doing with the processor. For example, this bug is a math-related operation, and chances are that if you put it in one of Google's or Netflix's web servers it would never hit the bug for th

  • Well, count my lucky stars that OS X isn't affected! Mac master race wins again! I'm guessing there are no Prime95 Mac users, so therefore I must be safe, right? Right?

    On a slightly more serious note, how does one BIOS-update the CPU on a Mac? Does Apple roll it into their updates? Just curious.

    • by larkost ( 79011 )

      Apple calls these sorts of things "firmware updates" (yes, that is a generic name). Things like this are included, as are updates for Ethernet chipsets, FireWire controllers (there are 3 in the Mac Pro), and even, rarely, firmware for the GPU. Additionally, there are sometimes "SMC" updates for the part of the computer that manages power and sleep behavior.

    • Correct, but for the wrong reason:
      There are currently no Apple products that utilize a Skylake CPU.

  • While the bug was discovered using Prime95, it could affect other industries that rely on complex computational workloads, such as scientific and financial institutions.

    How about porn?

"The vast majority of successful major crimes against property are perpetrated by individuals abusing positions of trust." -- Lawrence Dalzell

Working...