
Erratum Plagues Quad-Core Opterons, Phenoms

Posted by kdawson on Tuesday December 04 2007, @06:43PM
from the correct-or-fast-choose-at-most-one dept.
amd
bug
hardware
theraindog writes "Errata are not uncommon with new processors, but a problem with the TLB logic in AMD's quad-core Opteron and Phenom processors appears to be quite serious. The erratum is so severe that AMD has issued a 'stop ship' order on all quad-core Opterons. AMD has also blamed this bug for the delay of the 2.4GHz Phenom, despite the fact that the erratum is unrelated to clock speed. A BIOS-based workaround for the issue has been made available to motherboard makers, but it apparently carries a 10-20% performance penalty. What's more disturbing is that AMD knew of the erratum and the potential performance hit associated with fixing it before it launched the Phenom processor. Hardware provided to the press for reviews did not include the fix, conveniently overstating Phenom performance."

This discussion has been archived. No new comments can be posted.
The Fine Print: The following comments are owned by whoever posted them. We are not responsible for them in any way.
  • What??? (Score:5, Informative)

    by GregPK (991973) on Tuesday December 04 2007, @06:47PM (#21579379)
    I'm a geek an all. But, I've never heard of erratum.

    But dictionary.com is your friend.

    Design errors and mistakes in a CPU's hardwired microcode may also be referred to as an erratum. One well publicised example is Intel's "flag" erratum in early Pentium Pro processors. This made the conversion of floating point numbers to integers unreliable due to an exception not being signaled under certain conditions.
    • Re:What??? (Score:4, Insightful)

      by fitten (521191) on Tuesday December 04 2007, @07:02PM (#21579543)
      Every CPU maker publishes the errata for their CPUs because system designers/vendors/whatever need to know these things. Every CPU made for the past (insert very long time in the computer world here) has had a big list of errata publicly published. Just go to the Intel or AMD site, for example, and look up the errata on the PPro, P3, P4, Core, Core2, Athlon, Athlon XP, Athlon64, Athlon64 X2, or whatever your favorite CPU happens to be.

      The thing is, the CPU is actually broken a bit and AMD has pulled the Barcelona line but are continuing to sell the Phenom(inal Failure) line to customers and, evidently, don't plan to 'fix' the problem later (Intel offered replacements for the Pentium floating point bug after they got dinged on it, for example... I know... I had one and replaced it).

      So... if you actually get your hands on (or got your hands on) a Phenom, realize you have a broken CPU and the more you load it, the more likely you'll have stability issues.... and AMD isn't (currently) going to fix it.
        • Re:What??? (Score:5, Informative)

          by Carnildo (712617) on Tuesday December 04 2007, @08:26PM (#21580231) Homepage Journal

          Well... I can't remember any for my beloved 6502.


          They may not have been published, but there are at least three:
          1) A memory-indirect jump where the address is stored across a 256-byte boundary will read the second byte of the address from the wrong location.
          2) The arithmetic status flags are not valid when performing arithmetic in BCD mode.
          3) If a hardware interrupt occurs while the processor is fetching a BRK instruction, the BRK instruction is ignored.
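A minimal Python sketch of item 1 (the indirect jump that wraps inside a page), assuming a flat 64 KiB memory image. The function name `read_indirect_target` and the sample addresses are made up for illustration and do not come from any particular emulator:

```python
def read_indirect_target(memory, pointer, emulate_bug=True):
    """Return the 16-bit target of JMP (pointer) on a 6502-style CPU.

    With emulate_bug=True, a pointer ending in 0xFF fetches the high byte
    from the start of the same page instead of from the next page.
    """
    lo = memory[pointer]
    if emulate_bug and (pointer & 0x00FF) == 0x00FF:
        hi_addr = pointer & 0xFF00            # wraps within the page
    else:
        hi_addr = (pointer + 1) & 0xFFFF      # what the programmer expected
    hi = memory[hi_addr]
    return (hi << 8) | lo


# Hypothetical memory contents: the vector is stored at 0x02FF/0x0300.
memory = bytearray(0x10000)
memory[0x02FF] = 0x34   # low byte of the intended target
memory[0x0300] = 0x12   # intended high byte
memory[0x0200] = 0x56   # byte fetched instead when the bug is emulated

buggy_target = read_indirect_target(memory, 0x02FF, emulate_bug=True)
intended_target = read_indirect_target(memory, 0x02FF, emulate_bug=False)
print(hex(buggy_target), hex(intended_target))
```

The usual software workaround was simply to never place an indirect jump vector so that it straddles a page boundary.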
            • Re: (Score:3, Informative)

              Any Futurama fan will also know that Bender's brain is a 6502, as revealed by the 'F Ray' in 'Fry and the Slurm Factory'
    • Re:What??? (Score:5, Funny)

      by alshithead (981606) on Tuesday December 04 2007, @08:19PM (#21580185)
      "I'm a geek an all. But, I've never heard of erratum."

      Mod me down, call me troll, but please don't claim to be a geek if you can claim to never have heard of erratum or errata. That's as bad as not knowing what a bug is or calling a PC case and its contents a hard drive.

      Here's a heartfelt suggestion...read more.
  • by Anonymous Coward
    Errata are very common, but how a company handles them is a big factor in deciding things. I certainly hope all review sites will rerun benchmarks.

    Anandtech [anandtech.com] I'm looking at you.
  • NDA for patch? (Score:5, Interesting)

    by Cajun Hell (725246) on Tuesday December 04 2007, @06:50PM (#21579413) Homepage Journal
    Check this out:

    Linux users may have another option in the form of a patch for that operating system's kernel. Sources estimate this patch's performance hit at less than one percent, but it comes with several caveats. At present, the patch purportedly only applies to the 64-bit version Red Hat Enterprise Linux, Upgrade 4. Customers must sign a non-disclosure agreement in order to obtain the patch...

    Good thing it's just a patch, as opposed to a derived work of someone else's GPLed code. I wonder what the FSF guys would say about that. I also wonder: Red Hat, why?

    • Re: (Score:3, Insightful)

      I also wonder: Red Hat, why?
      I imagine that their reasoning was that it was better to offer a patch, closed or not, that benefited users who choose to use this processor. The solution isn't elegant, more like repairing an aircraft's hull with duct tape, but apparently it is better than the alternatives they tried.
    • If it's a patch for the Linux kernel, which is distributed under the GPL, I don't think they can enforce an NDA. The patch may be used to create a derived work of a GPL'd product, so the derived work must also be GPL'd: so you can distribute it, as long as you include its source. This will be available for all Linux variants soon.
      • As long as the diff doesn't contain any of the original code and the patch is distributed in isolation then there is no conflict with the GPL ... if RH distributes a binary kernel though then they are in violation of the GPL, this would make RH liable but I don't know whether your rights under the GPL or the prohibitions under the NDA take precedence for the recipient though.
      • by TheThiefMaster (992038) on Tuesday December 04 2007, @07:20PM (#21579727)
        The patch is under the NDA, the kernel is under GPL, so the resulting work (patched kernel) can't be distributed, because the licenses are incompatible.

        The GPL only applies to redistribution. Private-use changes don't have to be GPL'd.

        IANAL,TIJHIUI (I Am Not A Lawyer, This Is Just How I Understand It).
        • However, the change /is/ being distributed - from Red Hat to the customers under NDA. Even showing code to contractors can be considered distribution (this is one of the things the GPLv3 addresses, but of course Linux is under v2).
    • Re:NDA for patch? (Score:5, Insightful)

      by Crispy Critters (226798) on Tuesday December 04 2007, @07:16PM (#21579691)
      It is silly to think that RH is ignoring the GPL.

      There are other possibilities that are more likely. For example, perhaps the patched kernel is doing something like loading microcode into the processor. The kernel code would be GPLed but the microcode would not be.

      • I doubt AMD would reveal the kernel code for altering a cpu's microcode. That would just be asking for trouble. The "patch" is more likely a call into a binary-only kernel plugin.
  • by Anonymous Coward on Tuesday December 04 2007, @06:53PM (#21579447)
    AMD can turn this into a PR boon to one-up Intel at the "Green" initiatives. All they have to do is repurpose the uncut wafers of these chips as solar panels and then retile the outside of all their buildings with the panels. This will save money on their energy bills and they can even start a new Ad Campaign:

    "AMD Outside".
  • Bummer (Score:4, Insightful)

    by El Pollo Loco (562236) on Tuesday December 04 2007, @06:55PM (#21579475) Homepage
    Wow, bad times for AMD. They're losing the war against intel, and now have another set back. A 20% performance penalty is simply unacceptable for any processor. The fact that it is for brand new ones makes it an even bigger slap in the face for consumers.
    • Re:Bummer (Score:5, Funny)

      by the_humeister (922869) on Tuesday December 04 2007, @07:02PM (#21579551)
      Hmmm... I suppose that I should disconnect this Phenom-powered computer running Windows from this nuclear power station I'm working at...
    • Wow, bad times for AMD. They're losing the war against intel, and now have another set back. A 20% performance penalty is simply unacceptable for any processor. The fact that it is for brand new ones makes it an even bigger slap in the face for consumers.

      Well, AMD doesn't sell used processors, as far as I'm aware, so where else would AMD have problems than in brand new processors? I mean, seriously, if a bug was found today in 1 GHz Durons that required a slowdown to work around, the headline wouldn't be "

    • Re: (Score:3, Interesting)

      Wow, bad times for AMD. They're losing the war against intel, and now have another set back. A 20% performance penalty is simply unacceptable for any processor. The fact that it is for brand new ones makes it an even bigger slap in the face for consumers.

      Not if the processor/mobo combo is 60% of the cost of a Intel heater.

      What are we trying to do here, compute pi to 14 million decimal places in 5 minutes or less?

      Sooner or later AMD will come back. My experience with Intel is, as soon as they get the lea

      • The whole "Intel is t3h hot!!!" thing has gotten old. Yes, P4s were very inefficient chips. Not so with their modern lineup. Core processors are quite efficient power wise for their given level of performance. They also scale way down, there are Core Solos with only a 3 watt TDP spec. Shouting about the Core lineup using a lot of power when it is AMD's processors that you use as the alternative makes little sense.

        It is just silly to dredge up old crap and keep using it. It actually weakens any point you try
        • Re:Bummer (Score:5, Funny)

          by scottv67 (731709) on Tuesday December 04 2007, @08:26PM (#21580225)
          What if you were doing scientific computing? 20% drop could mean a lot of time for a calculation. I used to run calculations that would take months...

          Just thinking out loud here: Did you try pushing in the Turbo button?

  • by Anonymous Coward on Tuesday December 04 2007, @06:57PM (#21579483)
    In 3.... 2... 0.9999921341...
    • IC what you did there...
    • Some of the (obligatory) Pentium jokes were pretty funny. From a text file I've had lying around for quite a while:

      --------------

      Intel's new motto: "United We Stand, Divided We Fall"

      Q: How many Pentium designers does it take to screw in a light bulb?
      A: 1.99904274017, but that's close enough for non-technical people.

      Q: What do you get when you cross a Pentium PC with a research grant?
      A: A mad scientist.

      Q: What's another name for the "Intel Inside" sticker they put on Pentiums?
      A: The warning label.

      Q: What do you call a series of FDIV instructions on a Pentium?
      A1: Successive approximations.
      A2: A random number generator.

      Q: Complete the following word analogy: Add is to Subtract as Multiply is to:
              1) Divide
              2) Round
              3) Random
              4) All of the above

      Q: What algorithm did Intel use in the Pentium's floating point divider?
      A: "Life is like a box of chocolates." (Source: F. Gump of Intel)

      Q: Why didn't Intel call the Pentium the 586?
      A: Because they added 486 and 100 on the first Pentium and got
          585.999983605.

      Q: According to Intel, the Pentium conforms to the IEEE standards 754
          and 854 for floating point arithmetic. If you fly in aircraft
          designed using a Pentium, what is the correct pronunciation of "IEEE"?
      A: Aaaaaaaiiiiiiiiieeeeeeeeeeeee!

      Q: Did you hear about the new "morning after" pill being developed as a
          replacement for RU-486???
      A: It's called RU-Pentium. It causes the embryo to not divide correctly.

      TOP TEN NEW INTEL SLOGANS FOR THE PENTIUM

          9.9999973251 - It's a FLAW, Dammit, not a Bug
          8.9999163362 - It's Close Enough, We Say So
          7.9999414610 - Nearly 300 Correct Opcodes
          6.9999831538 - You Don't Need to Know What's Inside
          5.9999835137 - Redefining the PC -- and Mathematics As Well
          4.9999999021 - We Fixed It, Really
          3.9998245917 - Division Considered Harmful
          2.9991523619 - Why Do You Think They Call It *Floating* Point?
          1.9999103517 - We're Looking for a Few Good Flaws
          0.9999999998 - The Errata Inside



      Worth a laugh anyway :)
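
For anyone who wants to see what the jokes are about: the check usually quoted for the Pentium FDIV flaw divides 4195835 by 3145727 and multiplies back. A small Python sketch (the function name `fdiv_residual` is made up here). On sound hardware the residual is essentially zero, while flawed Pentiums were widely reported to return 256:

```python
def fdiv_residual(x=4195835.0, y=3145727.0):
    """Return x - (x / y) * y, which should be (almost exactly) zero."""
    return x - (x / y) * y


residual = fdiv_residual()
# Allow for ordinary floating-point rounding; anything large means trouble.
print("division looks sane" if abs(residual) < 1e-6 else "division looks broken")
```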
  • by statemachine (840641) on Tuesday December 04 2007, @07:02PM (#21579553)
    AMD has also blamed this bug for the delay of the 2.4GHz Phenom, despite the fact that the erratum is unrelated to clock speed. [Emphasis added.]

    Why does the summary claim this? I read through both articles, and AMD says this is a hardware issue across both chip models. Since this is a hardware issue, wouldn't it stand to reason that AMD would hold up a related chip because it's a hardware bug across both chip models and not because it's a clock speed issue? I'm not sure where the "despite" comes into play. I didn't see where the article said that AMD is not delaying a different speed Phenom.
    • indeed, I would have thought the reason why they didn't release the chip was because the bug caused it to be 10-20% slower in either case and probably affects similar chips of different clock speeds.
    • by Wavicle (181176) on Tuesday December 04 2007, @07:31PM (#21579837)
      You have to read a follow-up article to the techreport.com one here: http://techreport.com/discussions.x/13724 [techreport.com]. Which reads:

      Apparently contradicting prior AMD statements on the matter, Saucier flatly denied any relationship between the TLB erratum and chip clock frequencies. He also said there's no relationship between clock speeds and the performance degradation caused by the BIOS-based fix for the erratum.
      I imagine that is where the article got the information.
      • by mr_mischief (456295) on Tuesday December 04 2007, @08:23PM (#21580203) Journal
        IANAEE (electrical engineer) and I've never built my own CPU, even from TTLs or in a simulator. It makes sense to me, though, that while the presence of the error may not be tied to specific clock frequencies, the chances of encountering the bug still could be.

        If it's a race condition in hardware, there's a good chance it's clock-sensitive. The bug probably exists in the whole line, sure. It'll manifest more as the clock ticks are closer together, because the margin for error without triggering the reversal of steps is smaller. If it's a matter of the wrong signal sometimes being asserted because the edge of a clock line transition was missed, it's logically going to happen more when the clock cycles are shorter.

        A bug being in the whole line regardless of clock frequency and that bug becoming more of an issue at higher clock frequencies are not at all mutually exclusive conditions. The higher frequencies and higher rates of the error may not coincide, but there's nothing in the article to logically say they don't.

        The erratum probably does apply to the whole line equally but probably manifests as a percentage of the time in use as some function of the frequency.
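
The argument above can be sketched as a toy timing model in Python. Everything here is an assumption made for illustration: the function name `violation_fraction`, the spread of path delays, and the setup time are invented numbers, not AMD data, and nothing in the articles confirms that the TLB erratum is a timing problem at all. The sketch only shows why the same design flaw could bite more often at a shorter clock period:

```python
def violation_fraction(clock_ghz, path_delays_ps, setup_ps=20.0):
    """Fraction of sampled path delays that miss the clock edge.

    A sample "misses" when delay + setup time exceeds the clock period.
    """
    period_ps = 1000.0 / clock_ghz          # 1 GHz corresponds to 1000 ps
    late = [d for d in path_delays_ps if d + setup_ps > period_ps]
    return len(late) / len(path_delays_ps)


# Hypothetical spread of delays (in picoseconds) for one marginal signal path.
delays_ps = [340 + 5 * i for i in range(21)]   # 340 ps through 440 ps

for ghz in (2.2, 2.3, 2.4):
    print(f"{ghz} GHz: {violation_fraction(ghz, delays_ps):.2f} of samples late")
```

The flaw (a path with too little margin) is identical in every part; only the rate at which it shows up changes with the clock.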

        For any geek wanting a basic understanding of issues like latching times, gate propagation delays, and other analog electrical signaling issues inside a digital CPU, I recommend the first few chapters of Structured Computer Organization [isbn.nu]. The book builds upon basic designs of computers from using TTLs to designing a CPU, then up by layers through microcode, designing an assembly language, and more. I have an older edition at home which covers up through the 68030 and the 80386 as examples. The newer one covers up through the Pentium II, the UltraSparc, and the Java chips. The book won't make you an electrical engineer by any means, but the discussions of the tricky timing issues within even simple CPUs might be useful here.

        As for the clock speed not affecting the percentage loss in efficiency due to the microcode fix... well, yeah. The microcode is the same across the line regardless of the clock speed. If you insert two identical strings of instructions A1 and A2 into an identical pair of microcode stores B1 and B2, the resulting patched microcodes C1 and C2 will likewise be identical. The faster processor will decode and execute the microcode at the same clock speed as before, and so will the slower one. They'll each have the same percentage slowdown relative to their own clock speeds, because they're running the same microcode. We're not talking about two different generations of processors or even two different revisions. It's the same processor design at two clock speeds. One is going to get the same nerfs and buffs for any microcode change proportional to their clock speeds as the other.
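
The arithmetic behind that last paragraph, as a short Python sketch. It assumes the whole penalty is paid in extra core clock cycles (anything bound by memory latency in nanoseconds would not scale this way), and the cycle count and 15% overhead are made-up figures chosen from inside the reported 10-20% range:

```python
def runtime_seconds(cycles, clock_ghz):
    """Wall-clock time for a fixed number of core cycles."""
    return cycles / (clock_ghz * 1e9)


def slowdown_percent(base_cycles, extra_cycle_fraction, clock_ghz):
    """Percentage increase in runtime when the fix adds a fraction of extra cycles."""
    before = runtime_seconds(base_cycles, clock_ghz)
    after = runtime_seconds(base_cycles * (1 + extra_cycle_fraction), clock_ghz)
    return (after - before) / before * 100.0


# Hypothetical workload: 1e12 cycles, fix costs 15% more cycles.
for ghz in (2.2, 2.4):
    print(f"{ghz} GHz: {slowdown_percent(1e12, 0.15, ghz):.1f}% slower")
```

The clock frequency cancels out of the ratio, so both parts lose the same percentage even though the faster one still finishes first in absolute terms.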
  • Old issue, really (Score:4, Interesting)

    by Uzito (771420) on Tuesday December 04 2007, @07:15PM (#21579677)
    My good old Opteron 170 had the same stupid issue with unsynched core clocks. What is new here?
    • Re:Old issue, really (Score:5, Informative)

      by CajunArson (465943) on Tuesday December 04 2007, @07:45PM (#21579931) Journal
      The old opty 170 didn't have an L3 cache, which is where the bug lies. This bug is rare, but it is reproducible when the CPU is under heavy load and was one of the reasons why AMD was trying to get hardware reviewers to come to an AMD event in Tahoe to run benchmarks on AMD approved systems instead of just dropping chips into FedEx packages. Causing a full-blown system freeze is also on the serious side when it comes to bugs. There have been even more problems: techreport has a story that, unlike the hand-selected systems that ran at Tahoe, many of the actual consumer Phenoms you can buy today use slower HT speeds (1.8GHz vs. 2.0GHz in the demos). This means that the memory subsystem (AMD's one theoretical strength over Intel right now) is slowed down, so the somewhat unimpressive initial results are actually overstatements of what the consumer chips can do. (article here).

          AMD is in a world of hurt right now. The "true" quad-core line appears to be nothing more than marketing hyperbole since year-old q6600's are faster clock-for-clock than Phenom is. AMD will hopefully get these bugs ironed out... by next February. Even then though, AMD will have chips that are MASSIVELY expensive to make, but that they can't sell for the higher prices Intel is able to command. AMD would be fine if they had an expensive chip they could sell at a premium, or a very cheap to produce chip they could sell for the budget crowd, but right now they have Acura production costs coupled with Kia per-unit revenues: bad times.
      • Re: (Score:3, Interesting)

        "The "true" quad-core line appears to be nothing more than marketing hyperbole"

        No, it's not marketing.

        You're not seeing the usefulness on the desktop.

        HPC is another story - and it's also the place that the plain old Opteron has been holding its own, against the faster, clock per clock, Core 2 microarchitecture.

        Having requests go through the FSB (which is a WTF this day and age) kills cache snooping, etc, between cores.

        The "true" quad core doesn't have this problem.
      • Re: (Score:3, Insightful)

        AMD would be fine if they had an expensive chip they could sell at a premium, or a very cheap to produce chip they could sell for the budget crowd, but right now they have Acura production costs coupled with Kia per-unit revenues: bad times.

        AMD actually still rules the absolute low end of the market (and has for years). Semprons ($30+) and old X2s ($60+, new retail box) are dirt cheap, and it's simply not possible to get better performance per dollar [tomshardware.com].

        There isn't much a $60 X2 can't do in your average deskt

          • Re: (Score:3, Informative)

            An excellent post, but one of your details is wrong -- The P6 was not designed in Israel. That design was done in Hillsboro, Oregon. Most of the Pentium Pros sold into the marketplace were from the "P6s", a shrink of the original design, and that was done in Folsom, California.

            The design team in Israel added the MMX instructions into the last P5 and then worked on the ill-fated Timna design (integrated memory controller with RDRAM interface) while the P6 was ramping. After that they began the low-power d
  • Let's not forget.. (Score:5, Interesting)

    by AcidPenguin9873 (911493) on Tuesday December 04 2007, @07:52PM (#21579977)

    that Intel's Core 2 also had a problem with the TLB when first released, although that problem manifested itself as data corruption instead of a lockup. Here are the two [theinquirer.net] articles [theinquirer.net] from The Inquirer about it - the second one especially. And note that this document was released after Intel had shipped the buggy Core 2's.

    However, Intel was able to fix it without incurring a large performance loss. It's a shame for AMD that they weren't able to do the same.

  • by jonwil (467024) on Tuesday December 04 2007, @08:25PM (#21580211)
    What is so bad about a company like AMD coming right out and saying "processor model x, clock speed y, stepping z has bug abc and this is the workaround for it". Assuming BIOS vendors and others are going to be deploying the fix anyway, how does it hurt AMD if everyone knows of the fix?

  • by Chris Snook (872473) on Tuesday December 04 2007, @08:36PM (#21580305)
    At least in the graphics world, "faster and usually correct" is acceptable.
  • by Ma3oxuct (900711) on Tuesday December 04 2007, @09:27PM (#21580739) Journal
    If you look at AMD's financial statements (http://sec.gov/Archives/edgar/data/2488/000119312507238299/d10q.htm#tx48043_5 [sec.gov]) for the last quarter, it has been losing a lot of cash. This leads me to believe that they released faulty CPUs, right before the holidays, in order to get some cash in the short term.

    The idea was to gain some cash to sustain operations until a faultless (i.e. no major faults) CPU can be released. Those that bought faulty CPUs will get their CPUs replaced as soon as faultless CPUs are completed. In some sense you can look at AMD's action as taking out a long term loan.

    A counter argument to my theory can be that AMD would not risk its reputation to take out a "cash loan" in such a manner. However, the risk of losing reputation is justified if we consider another major factor at play: the holidays. It is less likely that AMD would gain the same (or even close to the same) cash flows if they had released the CPUs after the holidays.

    AMD now has some cash and is able to breathe a little bit. When it releases fixed CPUs it will be able to continue where it left off.

  • Perfect Linux CPUs (Score:4, Interesting)

    by evilviper (135110) on Tuesday December 04 2007, @10:46PM (#21581263) Journal
    Ironically, these may turn into the CPUs du jour for Linux users...

    The performance hit is probably 10% when patching the microcode, which should mean steep price mark-downs on this generation of CPUs. But it's only a 1% performance hit when patching the (Linux) kernel.

    So why doesn't every OEM that sells Linux servers and desktops just buy up all of AMD's supplies of defective chips at a big discount, and pass the savings along? I'd buy a couple.
    • No. (Score:5, Funny)

      by Anonymous Coward on Tuesday December 04 2007, @06:47PM (#21579381)
      Thus concludes another episode of Short Answers To Stupid Questions.
    • by _merlin (160982) on Tuesday December 04 2007, @07:06PM (#21579577) Homepage Journal
      It's not like there aren't problems with Intel's CPUs - just take a look at the problems with the MMU in the Core 2 - but no-one is suggesting Intel is doomed. It would just be better if AMD had admitted this when they first knew about the issue rather than sending out review units that are known to have serious issues.
      • They did (Score:5, Informative)

        by DreadSpoon (653424) on Tuesday December 04 2007, @07:52PM (#21579979) Homepage Journal
        AMD admitted there were errors in the early Phenom CPUs back before launch. They even put it in their presentations in the press conferences and such. They also said before launch that they were going to include the proper fix in the revised core used in the higher end Phenom, hence the delay.
        • Re: (Score:3, Informative)

          AMD said there was a bug that only affected the 2.4GHz Phenom. Read this [theinquirer.net] and note where they say:

          AMD already issued a fix to all of its motherboard/system partners, so if you already own a 790FX motherboard or plan to buy a Phenom system, make sure to update the BIOS. 9500 (2.2 GHz) and 9600 (2.3 GHz) parts are unaffected by the errata.

          Now we learn that the slower parts were affected as well.

        • Re: (Score:3, Funny)

          AMD admitted there were errors in the early Phenom CPUs back before launch.
          They also said that the performance of the new chips would be 'phenominimal'.
      • by ceoyoyo (59147) on Tuesday December 04 2007, @08:18PM (#21580171)
        No, but AMD seems to be in a pretty delicate state. Their stock is pretty low and they've taken a beating from a newly-competitive Intel. They don't have a big advantage in processor speed anymore, nor power, nor even price. Halting shipment on an entire line? Not good. If they eventually have to recall it... bad.

        It might not be AMD's doom, but they're really not that many big screwups away.
    • It just means they're starting to make Intel's mistakes! They're on-par now! :D
    • This isn't (known to be) a security issue. Basically when the bug gets triggered, the processor just crashes. I guess you could carefully craft input to trigger it as a denial of service attack...

    • I don't know what Theo de Raadt has to do with this, I certainly did not see his reaction about this on one of the OpenBSD mailing lists. Can you at least explain what this erratum has to do with security? Because it does look like you're trolling. I do think this is not an isolated event and we can expect more and more processor bugs in the coming years. It's time to leave the antiquated x86 design behind us and move to a cleaner architecture.