Microsoft Advice Against Nehalem Xeons Snuffed Out

Posted by Soulskill on Saturday November 28, @01:33PM
from the keep-that-under-your-hat dept.
intel
microsoft
windows
hardware
Eukariote writes "In an article outlining hidden strife in the processor world, Andreas Stiller has reported the scoop that Microsoft advised against the use of Intel Nehalem Xeon (Core i7/i5) processors under Windows Server 2008 R2, but was pressured by Intel to refrain from publishing this advisory. The issue concerns a bug causing spurious interrupts that locks up the Hypervisor of Server 2008. Though there is a hotfix, it is unattractive as it disables power savings and turbo boost states. (The original German-language version of the article is also available.)"
This discussion has been archived. No new comments can be posted.
The Fine Print: The following comments are owned by whoever posted them. We are not responsible for them in any way.
  • Broken processors (Score:5, Insightful)

    by Anonymous Coward on Saturday November 28, @01:35PM (#30255780)

    The processors are clearly broken, and anyone who bought them should get a refund or an exchange. End of story.

    • Re: (Score:2, Interesting)

      by Anonymous Coward

      We use them with Oracle VM (Xen), and they work ok.

    • All they need is a sticker that says "Windows 2008 Server Ready."
    • Re:Broken processors (Score:5, Informative)

      by Waynelson (1068550) on Saturday November 28, @04:49PM (#30256936)
      I don't know if anyone actually read the KB article on the Microsoft website, but it appears that you don't lose the power-saving features and whatnot when you install the hotfix; those features are lost only when you directly modify the registry to disable some of the C-states in the ACPI system as a quick fix. Either that or I'm reading the KB article wrong.
      • by hattig (47930) on Saturday November 28, @02:33PM (#30256194) Journal

        It's pretty serious.

        Server requirements of CPUs include virtualisation and power savings (saving power in the data centre is a top priority for companies now).

        This CPU cannot do both at the same time, at least with Windows Server 2008's Hypervisor. Presumably it is being sold with both items listed as features however. I agree with the OP - the CPUs are broken as sold and advertised.

        • Re:Broken processors (Score:5, Informative)

          by Bengie (1121981) on Saturday November 28, @03:25PM (#30256490)

          so much FUD.

          #1. MS classified this interrupt as "unreliable" for all previous hypervisors and randomly decided to use it for this version of their hypervisor.

          #2. ONLY MS uses this interrupt, not VMware or anyone else.

          #3. Intel's new Xeons still use less power and outperform AMD and any previous CPUs. It's still the best CPU, even if you use the "workaround".

            • Re: (Score:3, Informative)

              by Anpheus (908711)

              The hotfix fixes the problem and allows the use of power saving states.

              Done!

  • AMD is looking better. This is the type of stuff that Intel worshipers say AMD systems do, so what will they say about Intel now?

    • Re: (Score:2, Insightful)

      by Anonymous Coward

      AMD is incapable of having bugs in the convoluted exception path?

      • FTFA:

        For the integrated hypervisor of Windows Server 2008 R2, Microsoft has bravely resorted to a timer function that they themselves had classified as unreliable for former processors: the timer of the Advanced Programmable Interrupt Controller (APIC). Unlike, for example, the CPU timer (Time Stamp Counter, TSC) - which by now is comparatively resistant to power-saving, SpeedStep and turbo-boost modes, but is also virtualised by virtual machines - the APIC timer can also trigger interrupts. Unfortunately, right now, the Nehalem has too many of those, so that the hypervisor falters and then stops, returning the message "Clock_Watchdog_Time-out".

        So yes, if you depend on something that generates an interrupt whose code path may be suspended in certain power-saving modes, don't be surprised if it doesn't get serviced promptly. It looks more like a bug in Windows Server.

        Back in the old days, when you issued a CLI instruction, you made sure your routine didn't do too much work before issuing an STI, because that code isn't re-entrant (it's directly modifiable by the hardware, which is why you have to use the "volatile" keyword to make sure that compilers don't "optimize away" any loops, etc.). Kind of hard to guarantee that if you're putting that portion of the hardware to sleep between interrupts. As the article points out, disabling those power-saving modes fixes the problem.

        • by AcidPenguin9873 (911493) on Saturday November 28, @02:30PM (#30256158)
          I don't think so. Here's the text from the Intel erratum:

          During a complex set of conditions, if the APIC timer is being used to generate interrupts, unexpected interrupts not related to the APIC timer may be signaled when a core exits the C6 power state. The APIC timer stops counting in C6 and as such isn't typically used to generate interrupts when the C6 core power state is enabled. Implication: Unexpected interrupt vectors could be sent from the APIC to a logical processor.

          Interrupts not related to the APIC timer being caused by the APIC timer is not a software problem, it's a hardware problem. I could understand your argument if the APIC timer was generating too many interrupts upon C6 exit, or something else related to messed-up APIC timekeeping near power management events, but this is unrelated interrupts being generated.

          I don't know the details, but I would assume Microsoft is using the APIC timer in its hypervisor for a reason. Maybe it's because the hypervisor is required to virtualize all the other timekeeping mechanisms for the guest.

      • No, it's more like [hardware manufacturer of your choice] AND [software manufacturer of your choice] are incapable of making products that are both complex, and bug-free.

        And for some reason, 'high performance' often equals 'complex'.

    • Re: (Score:3, Insightful)

      by CAIMLAS (41445)

      I wouldn't say "AMD is better", necessarily. I will say, however, that the Xeons seem to have been plagued from the very beginning with problems like this. They're just fringe enough to not get enough run-in testing, and the bugs don't get found as quickly as they do with the more mainstream, widely used processors.

      • Re: (Score:2, Informative)

        by lukas84 (912874)

        Xeon is just a marketing name. The Xeon 3400 are identical to the i5-7xx and i7-8xx CPUs, the Xeon 3500 are identical to the i7-9xx CPUs, and the Xeon 5500 CPUs are basically i7-9xx with two QPI links.

        For example, this issue also affects i5 and i7 CPUs.

    • Read the link. 5 pages of errata, and that's just headlines. Modern processors are very complicated, and they will have bugs.

      The major difference between Intel and AMD when it comes to errata is that Intel learned its lesson about secrecy from the Pentium FPU fiasco. Since then they have had a very open approach to processor bugs. AMD hasn't had such a PR disaster and isn't quite as open. That doesn't mean they are particularly less buggy.

    • by TopSpin (753) * on Saturday November 28, @05:14PM (#30257036) Journal

      AMD has also built parts with equally screwed up timers, particularly TSC clock skew on multi-core parts. Timers are just messed up on x86 from either company. This nonsense goes back years. There are now at least four distinct general purpose clock sources that must be present on modern systems: tsc, acpi_pm, hpet and pit (as labeled by the Linux kernel). There will probably be further proliferation in the future as ALL of the existing timers are inadequate in subtle ways. Implementations from both manufacturers have been plagued with bugs that require nasty work-arounds; google "clocksource tsc unstable", "pm-timer bug" or "athlon x2 tsc" for some examples. This nonsense that Microsoft has stumbled upon is just the latest in a long and colorful history of failure that we'll now have to add to the list.

      Computers are supposed to keep time. Today that means high resolution clocks that work correctly regardless of power saving, concurrency, etc. Using these crucial timers is not supposed to cause spurious interrupts, bus contention or other subtle problems. People that must work with this stuff are thoroughly fed up with this ever growing pile of half-baked bullshit.
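
For anyone who wants to see which of these clock sources their own Linux box exposes, here is a minimal Python sketch. It assumes the usual sysfs location (/sys/devices/system/clocksource/clocksource0/); the helper names are made up for illustration, and on a system without that directory the read simply comes back empty.

```python
from pathlib import Path

# Assumed sysfs directory for the first clocksource device on Linux.
CLOCKSOURCE_DIR = Path("/sys/devices/system/clocksource/clocksource0")


def parse_clocksources(text):
    """Split the space-separated list the kernel reports into names."""
    return text.split()


def read_clocksources(base=CLOCKSOURCE_DIR):
    """Return (current, available) clock sources, or (None, []) if unreadable."""
    try:
        current = (base / "current_clocksource").read_text().strip()
        available = parse_clocksources((base / "available_clocksource").read_text())
    except OSError:
        # Not Linux, or sysfs is not mounted where this sketch assumes.
        return None, []
    return current, available


if __name__ == "__main__":
    current, available = read_clocksources()
    print("current:", current)
    print("available:", ", ".join(available) or "(none found)")
```

Which names show up depends on the hardware and kernel, so treat the output as a snapshot of one machine rather than a statement about the platform.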

      • Re: (Score:2, Informative)

        by lukas84 (912874)

        It's a processor bug exposed by a new hypervisor technique used by MS and nobody else.

        I'm not sure why you want to blame this on MS.

        • Re: (Score:3, Insightful)

          by mysidia (191772)

          It's the equivalent to writing a program against the Windows API, not testing it, and calling the API buggy when you find that it is failing in the wild.

          The API may not match the spec perfectly, but it's your software that's buggy.

          Intel can revise the proc, or revise the spec to be in agreement.

          MS is trying to use an APIC interrupt for timing that isn't normally used for that purpose.

          It's the equivalent of attaching an alarm clock to your electric car's engine, and complaining when the idling speed

  • by chebucto (992517) * on Saturday November 28, @01:44PM (#30255824)

    Maybe Xeons are what end up being used on the UESG Marathon. I mean, half of the terminal messages on that ship are subject to the same bug. Just look at this typical example:

    http://marathon.bungie.org/story/nawmanhesclose.html#M3.13.1.1 [bungie.org]

  • by Faizdog (243703) on Saturday November 28, @01:48PM (#30255858)

    This story is interesting and timely because I plan on buying a new desktop in the next 2 weeks, just waiting for the right deal to come out, hopefully on Cyber Monday. While not getting a server, I will be getting Windows 7. I had been planning on an i7, but now am hesitant. Is there a problem with these processors for home use/gaming purposes under Windows 7? Or would I be better off going with a Quad Core?

    • by Viros (1128445) on Saturday November 28, @02:08PM (#30255988)
      I've got an i7 920 on my desktop and run Windows 7 for gaming/home use purposes and it works fine. Don't let the problems with the server software dissuade you from a very good processor for home and gaming use. The kind of stuff you're describing doing will never run into anything close to the problems from this article.
      • Second. Been running the same proc with Windows 7 since RC and RTM. No probs whatsoever. I have been running VMWare for XP and encountered no issues.

    • by the linux geek (799780) on Saturday November 28, @02:10PM (#30256002)

      No, this only applies to the Hyper-V component of Server 2008 R2. Normal people do not use Windows Server for "home use/gaming purposes" (cue a dozen replies of people talking about how cool they are because they use pirated copies for said purpose), so it's not a big deal. Also, Core i5/i7 is already a quad core; I assume you mean Core 2 Quad.

    • Looks like it's only if you're doing some virtualization. It probably wouldn't affect games.
      • by Anonymous Coward on Saturday November 28, @02:32PM (#30256188)

        No problems at all. I'm running an i7 920 with 12 GB of RAM and Windows 7 64-Bit Ultimate. I've been playing BF2, GTA4, COD:MW/MW2, Batman: AA and others without any problem. Not to mention running 2 or 3 VMWare sessions, putty sessions, winscp, IE8, pidgin and streaming TV through Windows Media Center all at the same time.

        Okay you have a big penis (not literally). We get it.

          • Re: (Score:3, Insightful)

            by cwebster (100824)

            Actually no, IE8 is the only program you mentioned that actually needs an i7 920 and 12 gigs of RAM to properly execute.

            The rest of your post is like a word problem: "Sally has 5 fish, 2 turtles and a cat. How many cats does Sally have?" That is to say, completely irrelevant to the question at hand.

            Using putty to justify a multiple core machine, quite hardcore...

      • No problems at all. I'm running an i7 920 with 12 GB of RAM and Windows 7 64-Bit Ultimate. I've been playing BF2, GTA4, COD:MW/MW2, Batman: AA and others without any problem. Not to mention running 2 or 3 VMWare sessions, putty sessions, winscp, IE8, pidgin and streaming TV through Windows Media Center all at the same time.

        But have you solved... love?

  • by bill_mcgonigle (4333) * on Saturday November 28, @01:49PM (#30255862) Homepage Journal

    Many of the benchmarking sites have also posted some poor results - I was thinking this might be a generation to skip, but now I wonder if a flaw has been discovered that could be fixed with a microcode upload. Might help the benchmarks too if it was a hidden variable.

    • Re: (Score:3, Informative)

      A generation to skip for servers (or move to AMD for a generation) but Core i7s are amazing for home/gaming use. For just about anything other than virtualization and server-specific stuff, Core i7s and CPUs with the same architecture have no comparison with what AMD has to offer.
      • I'm cautiously optimistic that Nehalem-EX will be a decent server processor, at least in the 1-2 socket segment. It seems to handle multithreading quite well, and have decent FP performance. For now, though, the 6-core Opteron is king.
  • by Anonymous Coward on Saturday November 28, @01:50PM (#30255868)

    It sounds like microsoft should retract the advice and issue a warning that no OS should be run on a processor with such spurious interrupts?

    Or is this the sort of crappy hardware kernels are supposed to put up with, in which case it should be Intel advising against running Windows on its hardware?

    Int€l bashing..check
    M$ bashing...check
    now i just sit back and watch the karma roll in

    • Re: (Score:3, Funny)

      by Anonymous Coward

      Uh, guy? That symbol you used is a "C" with two lines through it, not an "E". Get it right.

      • Actually, no, it isn't. In official bullshit-speak:

        Inspiration for the symbol itself came from the Greek epsilon (ε) - a reference to the cradle of European civilisation - and the first letter of the word Europe, crossed by two parallel lines to 'certify' the stability of the euro.

        Straight from the horse's mouth [europa.eu].

        The single-stroke $-sign OTOH might just as well be an 8:

        That the dollar sign is derived from a slash through the numeral eight, denoting pieces of eight. The Oxford English Dictionary before 1963 held that this was the most probable explanation, though later editions have placed it in doubt.

        according to wikipedia [citation needed].

        If this was true it would herald a major crisis for derogatory spelling worldwide. I propose a conference to establish new and reliable standards (.) that provide sustainable ways to express our unstillable rage (..) and call attention to the seriousness of the offenses (...) committed by mega-corps (*gasping for air*) in one

  • I've been experiencing problems with intermittent lockups under VMWare as well. DL370-G6 boxes. HP has given us BIOS fixes and is even shipping new boxes, but if there's a suspect problem with working with MS' hypervisor, I wonder if this is the same issue?

    • Re: (Score:3, Interesting)

      by Glasswire (302197)

      Is it in response to a documented problem with VMWare ESX that HP is trying to remedy with a specific BIOS change, or is HP just flailing around suggesting BIOS updates as a fix to a problem they don't yet understand? There are 100s of reasons why you could be having VMWare lockup issues - the ONLY similarity to the MSFT issue that you seem to have is that they are both hypervisors running on Nehalem procs. Pretty thin. What does VMWare think the problem is?

  • by Anonymous Coward on Saturday November 28, @02:14PM (#30256030)
    I read the article, I read the MS support report, and I read the Intel advisory. And I don't think that the summary is correct.

    The summary says that the hotfix disables power savings and turbo boost. But my reading of the MS report is that an affected system has two options, (1) a workaround, and (2) the hotfix. The difference is that the workaround disables advanced power savings and is known to be stable without side effects, but the hotfix actually fixes the problem with the vector table, presumably by following the instructions provided in the Intel advisory note.

    Said another way, the hotfix doesn't disable power savings and doesn't disable turbo boost.

    I expect that this is another fine example where Slashdot editors misunderstand a situation. Someone prove me wrong.
    • by RDaneel2 (533639) on Saturday November 28, @02:27PM (#30256134) Homepage
      I just saw your post as I was finishing researching mine... and I certainly agree with you that the summary is wrong.

      The Microsoft KB article is quite explicit that the workaround is what disables the sleep states, leading to higher power usage - the hotfix itself does not exhibit this problem.
    • by Anonymous Coward on Saturday November 28, @02:29PM (#30256156)

      Your explanation is exactly how I interpreted the KB article. I think Slashdot was going for some sensationalistic journalism. :-)

      Taken from TFA:
      You can disable the Advanced Configuration and Power Interface (ACPI) C-states by using a BIOS firmware option on the computer. If the firmware does not include this option, a software workaround is available. You can disable the ACPI C2-state and C3-state by setting a registry key. To do this, follow these steps:

            1. At a command prompt, run the following command:
                  reg add HKLM\System\CurrentControlSet\Control\Processor /v Capabilities /t REG_DWORD /d 0x0007c044
            2. Restart the computer.

      Note The computer idle power consumption will increase significantly if the deeper ACPI C-states (processor idle sleep states) are disabled. Windows Server 2008 R2 uses these deeper C-states on the Xeon 5500 series as a key energy saving feature.

      To continue to benefit from these energy saving states, remove this registry key after you install the hotfix that this article describes. To remove this registry key, follow these steps:

            1. At a command prompt, run the following command:
                  reg delete HKLM\System\CurrentControlSet\Control\Processor /v Capabilities /f
            2. Restart the computer.
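
For reference, the two quoted commands can be put side by side in a small Python sketch. The key, value name and data are copied from the KB text quoted above; the function names are hypothetical helpers, and the script only builds and prints the command lines rather than touching the registry.

```python
# Key, value name and data come from the quoted KB text;
# everything else here is an illustrative assumption.
PROCESSOR_KEY = r"HKLM\System\CurrentControlSet\Control\Processor"
VALUE_NAME = "Capabilities"
WORKAROUND_DATA = "0x0007c044"


def workaround_command():
    """reg.exe arguments for the workaround that disables the deeper C-states."""
    return ["reg", "add", PROCESSOR_KEY, "/v", VALUE_NAME,
            "/t", "REG_DWORD", "/d", WORKAROUND_DATA]


def undo_command():
    """reg.exe arguments that remove the workaround after the hotfix is installed."""
    return ["reg", "delete", PROCESSOR_KEY, "/v", VALUE_NAME, "/f"]


if __name__ == "__main__":
    # Printing only: the KB says a restart is needed after either change.
    print(" ".join(workaround_command()))
    print(" ".join(undo_command()))
```

Laid out like this, the point made in the comments is easy to see: the registry value is the workaround that costs idle power, and the hotfix is what lets you delete it again.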

  • Actual errata (Score:3, Informative)

    by crow (16139) on Saturday November 28, @02:30PM (#30256160) Homepage Journal

    From the PDF file linked from the Intel site, I think it's AAK36, as it's the only one that mentions the word "spurious." This has to do with writing to the interrupt vector table when a local interrupt is pending. That doesn't look terribly serious from my perspective. If I'm mistaken and it's a different erratum, please reply with the correction.

    • Re: (Score:3, Informative)

      by crow (16139)

      AAK36 for the Xeon version. AAN31 is the code for the i7 and i5 version. It's the same erratum, just a different code number for different chips.

    • Re: (Score:3, Informative)

      I don't think it's either of them. The top one about changing vectors would be unlikely to happen in commercial software like Windows, because they would have handlers installed for all interrupts already.

      I think the issue really is the watchdog: MS is using the APIC during the C6 state and, as per the 119 erratum, the APIC counter stops during the C6 state. So some interrupt that is supposed to fire to reset the watchdog doesn't fire, and thus the watchdog goes off (as indicated by the error code).

      So the 119 errata is rel
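
The watchdog theory above can be sketched as a toy model in Python. This is only an illustration of the poster's hypothesis, not how Hyper-V or the APIC actually behave: the function name, the abstract time units and the C6 windows are all invented for the example.

```python
def first_watchdog_expiry(tick_interval, watchdog_timeout, c6_windows, horizon):
    """Toy model: return the first time step at which the watchdog expires.

    The timer counter only advances outside the (start, end) C6 windows.
    Each time it reaches tick_interval it fires and services the watchdog.
    The watchdog expires once more than watchdog_timeout steps pass
    without being serviced. Returns None if it never expires.
    """
    counter = 0          # timer counter, frozen while in C6
    since_service = 0    # steps since the watchdog was last serviced
    for now in range(1, horizon + 1):
        in_c6 = any(start <= now < end for start, end in c6_windows)
        if not in_c6:
            counter += 1
        since_service += 1
        if counter >= tick_interval:
            counter = 0
            since_service = 0
        elif since_service > watchdog_timeout:
            return now
    return None


if __name__ == "__main__":
    # No deep sleep: the timer keeps servicing the watchdog.
    print(first_watchdog_expiry(10, 15, [], 100))
    # A long C6 residency freezes the counter and the watchdog trips.
    print(first_watchdog_expiry(10, 15, [(20, 50)], 100))
```

In this model a short sleep is harmless and only a residency longer than the watchdog's slack trips it. Whether that matches the real failure is exactly what the posts above disagree on.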

  • Looks like it's a Microsoft coding problem if there is no problem in Xen or VMWare ESX Hypervisors (post on VMware above is far from useful).
    And poster didn't read the MSFT article very closely. The hotfix doesn't preclude the energy saving sleep states, it's the workaround that inhibits their use.

  • There is no evidence Intel pressured MS into their wording of the fix/workaround. It's quite possible that after not finding a fix/workaround for it and writing an initial draft saying not to use the processors, MS developed a workaround/fix (perhaps with Intel's help) that actually does work and put that in instead of saying not to use the chips.

    To those who are suddenly concerned about Intel chips because they have errata: every chip has errata, tons of them. AMD has them too, trust me.

    I've been runnin

  • by George_Ou (849225) on Sunday November 29, @03:26AM (#30259828)
    Folks, this is a very irresponsible headline at Slashdot. The Microsoft article does NOT say the hotfix breaks power saving, and it doesn't even mention turbo; it offers an either/or solution. Microsoft always offers workarounds as an ALTERNATIVE to the hotfix for people who don't want to apply hotfixes. The Microsoft KB article even tells you that if you want to keep using those power states, you should run the hotfix and make a certain modification to the registry.

    This post makes it sound like some kind of cover-up and that the fix causes major CPU slowdowns, and that it's on the level of the AMD Barcelona TLB bug where the fix actually did cause a significant performance drop. This does not appear to be true. The real story is that all CPUs have hundreds of errata, and it's the job of the software maker to work around them, and that is what Microsoft is doing with their hotfix and registry hack. They're also telling you that if you aren't experiencing any problems, don't bother applying the hotfix.
    • Re: (Score:3, Funny)

      by tomhudson (43916)

      Nothing to see here. Move along. What? Nevermind where I work.

      Sorry, didn't get the message - running with interrupts disabled due to too many interrupts - so Im goo@#@!%!!#)(MN!NO CARRIER

      I for one welcome our non-interrupted cpu overlords, because in Soviet Russia, interrupts disable YOU!

        • Re: (Score:3, Funny)

          by daveime (1253762)

          Thousand(s) implies at least two thousand.

          Ergo, you use each program on average for 43.2 seconds. Is this because they *all* suck, or you simply have the attention span of a concussed duckling ?
