
Intel's Nehalem EX To Gain Error Correction

angry tapir writes "Intel's eight-core Nehalem EX server processor will include a technology derived from its high-end Itanium chips that helps to reduce data corruption and ensure reliable server performance. The processor will include an error correction feature called MCA Recovery, which will detect and fix errors that could otherwise cause systems to crash. It will be able to detect system errors originating in the CPU or system memory and work with the operating system to correct them." Update: 05/27 19:11 GMT by T: Dave Altavilla also suggests Hot Hardware's coverage of the new chip, which includes quite a bit more information.
  • by Anonymous Coward on Wednesday May 27, 2009 @02:39PM (#28113153)

    This will fix many errors affecting the processor itself (new manufacturing processes make transistors quite vulnerable to interference and aging). ECC will still be needed for correcting errors affecting data while it is stored in main memory.

    Parity will be needed for protecting caches (possibly ECC will be used in the future). Checksums for data on the hard drive. CRCs for packets on the network. And so on...
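
To illustrate the difference between the detection schemes listed above, here is a minimal Python sketch. It is not how any cache or NIC implements these checks; the helper names `parity_bit` and `flip_bit` are made up for the example, and `zlib.crc32` stands in for the CRC that Ethernet uses. It shows that a single parity bit catches one flipped bit but misses two, while a CRC catches both cases.

```python
import zlib

def parity_bit(data: bytes) -> int:
    """Even-parity bit: 1 if the number of set bits in data is odd."""
    return sum(bin(b).count("1") for b in data) % 2

def flip_bit(data: bytes, index: int) -> bytes:
    """Return a copy of data with the bit at the given bit index inverted."""
    out = bytearray(data)
    out[index // 8] ^= 1 << (index % 8)
    return bytes(out)

payload = b"Nehalem EX"
stored_parity = parity_bit(payload)
stored_crc = zlib.crc32(payload)

one_flip = flip_bit(payload, 3)      # single-bit error
two_flips = flip_bit(one_flip, 12)   # second error in a different byte

parity_catches_one = parity_bit(one_flip) != stored_parity
parity_catches_two = parity_bit(two_flips) != stored_parity
crc_catches_two = zlib.crc32(two_flips) != stored_crc

print(parity_catches_one, parity_catches_two, crc_catches_two)
```

Neither scheme can repair the data on its own; they only tell you something is wrong, which is why the comment above treats them as complements to ECC rather than replacements.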

  • by Anonymous Coward on Wednesday May 27, 2009 @02:43PM (#28113207)

    No. ECC only corrects certain issues in the memory. It cannot help with memory controller errors, nor with register or TLB errors.

  • by FishBike ( 1481195 ) on Wednesday May 27, 2009 @02:44PM (#28113229)
    The article seemed pretty light on details of what MCA Recovery actually does. I found this presentation in PDF format [gelato.org] that seems to go into some more useful detail about what this is. It's not just ECC to repair single-bit errors (although that is part of it, apparently). It also includes features to recover from errors that cannot simply be corrected. For example it includes a mechanism to notify the OS of the details of an uncorrectable error, so that it could presumably re-load a page full of program code from disk, or terminate an application if its data has been corrupted, instead of shutting down the whole machine.
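
The recovery choices described in the comment above can be sketched as a small decision function. This is only an illustration under stated assumptions: `ErrorReport`, `Action`, and `choose_action` are hypothetical names, not part of any real MCA interface, and the rule that an uncorrected kernel error halts the machine is an assumption added for the example rather than something the presentation states.

```python
from dataclasses import dataclass
from enum import Enum, auto

class Action(Enum):
    CONTINUE = auto()      # hardware already corrected the error
    RELOAD_PAGE = auto()   # clean page: fetch a fresh copy from disk
    KILL_PROCESS = auto()  # modified user data is lost: terminate the owner
    PANIC = auto()         # nothing safe to do: stop the whole machine

@dataclass
class ErrorReport:
    corrected: bool   # True if hardware fixed the error (e.g. single-bit ECC)
    in_kernel: bool   # True if the affected page belongs to the kernel (assumed field)
    page_dirty: bool  # True if the page was modified since it was read from disk

def choose_action(report: ErrorReport) -> Action:
    """Pick the least disruptive response to a reported hardware error."""
    if report.corrected:
        return Action.CONTINUE
    if report.in_kernel:
        # Assumption for this sketch: corrupted kernel state is not recoverable.
        return Action.PANIC
    if not report.page_dirty:
        return Action.RELOAD_PAGE
    return Action.KILL_PROCESS

# Uncorrectable error in an unmodified page of program code
print(choose_action(ErrorReport(corrected=False, in_kernel=False, page_dirty=False)))
```

The point of the ordering is that shutting down the machine becomes the last resort instead of the only response.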
  • by Jah-Wren Ryel ( 80510 ) on Wednesday May 27, 2009 @03:17PM (#28113675)

    These days, errors are very common, and I'm literally amazed that x86s don't have better-than-ECC error detection and correction. All the commercial Unix vendors have them.

    Intel's been trying to 'protect' the market for Itanium - those CPUs have had it for years, probably from day 1. HP definitely markets MCA as a big feature of their Itanium-based systems.

    If AMD were smart, they would have incorporated it into their Opteron line just like they did x64 to cut Intel off at the knees.

  • by mzs ( 595629 ) on Wednesday May 27, 2009 @03:18PM (#28113693)

    Read the fmd, fmadm, and fmstat man pages on Solaris. There is also at least one memory scrubber kthread, and you can look at memscrub_scans_done to see how far it has gone along. Lots of hardware is being checked periodically; in fact, on some hardware even the FP units of the processors are periodically checked for faults. Some SPARCs even have instruction retry in the case of a detected error. There is even memory mirroring on M4000 and above servers, which is like RAID-1 for memory: say a chip on a DIMM fails, you can still run, then use fmadm and replace the faulty DIMM. There are also the sorts of things you outlined above, where a page is reread if not modified and only causes a SIGSEGV if that page is ever used again. In ZFS there is end-to-end hashing to detect and correct errors.

    Of course all of this pales in comparison to what has been available on mainframes for a generation.
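
As a rough picture of the "RAID-1 for memory" idea plus a background scrubber, here is a toy Python model. It is a sketch only: the class `MirroredMemory` and its methods are invented for illustration, a CRC per word stands in for the ECC that real DIMMs use, and the `scans_done` counter merely echoes the idea behind the Solaris `memscrub_scans_done` statistic without modelling it.

```python
import zlib

class MirroredMemory:
    """Two copies of every word, each stored with a check value."""

    def __init__(self, words):
        self.copies = [[(w, self._check(w)) for w in words] for _ in range(2)]
        self.scans_done = 0

    @staticmethod
    def _check(word: int) -> int:
        # Stand-in for ECC: a CRC over the 64-bit word.
        return zlib.crc32(word.to_bytes(8, "little"))

    def _valid(self, copy: int, addr: int) -> bool:
        value, check = self.copies[copy][addr]
        return self._check(value) == check

    def corrupt(self, copy: int, addr: int, bit: int) -> None:
        """Simulate a fault by flipping one bit in one copy."""
        value, check = self.copies[copy][addr]
        self.copies[copy][addr] = (value ^ (1 << bit), check)

    def read(self, addr: int) -> int:
        """Return the word from the first copy that passes its check."""
        for copy in range(2):
            if self._valid(copy, addr):
                return self.copies[copy][addr][0]
        raise MemoryError(f"both copies of word {addr} are corrupt")

    def scrub(self) -> int:
        """Walk all words, repair a bad copy from its good mirror, count repairs."""
        repaired = 0
        for addr in range(len(self.copies[0])):
            good = [c for c in range(2) if self._valid(c, addr)]
            if len(good) == 1:
                self.copies[1 - good[0]][addr] = self.copies[good[0]][addr]
                repaired += 1
        self.scans_done += 1
        return repaired

mem = MirroredMemory([1, 2, 3])
mem.corrupt(copy=0, addr=1, bit=5)
value_after_fault = mem.read(1)   # served from the healthy mirror
first_scrub = mem.scrub()         # repairs the damaged copy
second_scrub = mem.scrub()        # nothing left to repair
```

The scrubber matters because it repairs the bad copy while the mirror is still healthy, before a second fault can hit the same word.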

  • by mzs ( 595629 ) on Wednesday May 27, 2009 @03:29PM (#28113831)

    State of the non-mainframe art with regard to RAS right now is ECC RAM with mirroring, parity cache, ECC e-cache, hashes that detect and fix multiple-bit errors for storage end to end, CRC (Ethernet) and cksum (TCP, UDP) (but can you trust the NIC offloading engine?), instruction retry, and FP scrubbing, in addition to what has been around for the last five years or so.

  • Re:x86 (Score:5, Informative)

    by Chris Burke ( 6130 ) on Wednesday May 27, 2009 @04:02PM (#28114145) Homepage

    x86 is a slow and underperforming architecture, and I am surprised that Intel is bolting error correction on top of it.

    Hogwash. There's nothing inherently slow about x86. The ISA is nothing but an interface. Internally, the CISC instructions are decoded into simple micro-ops, so all the predictions about how x86 would fall behind because it wouldn't be able to have out of order execution etc were proven wrong. It's not easy to make x86 chips, but the difficult performance problems have been solved.

    So don't be surprised, it's just another step in the plain obvious trend that has been going on for over a decade now. With no performance disadvantage, and a big price advantage, x86 has been moving into the server market in a big way. The only thing holding it back is the lack of RAS features, which are just as easy to "bolt on" to x86 as any other instruction set. It's just there was no reason to add these features for desktop or low-end servers.

    The Intel instruction set is so complicated that often times a single bit being flipped means it is still a very much valid opcode which when executed will do something completely different from what you expect it to do.

    The same is true of RISC, flip a bit in the opcode field and there's a good chance it's still a valid opcode. Not that it matters one whit; flipped bits in the instruction stream are detected via ECC in the instruction cache, not by praying the decoders see it as an invalid instruction.

    This seems to be nothing short of a stopgap measure for not losing more customers to the big iron manufacturers like Sun and IBM, who both have their own CPUs that were built with stability in mind.

    FUD like this is nothing but a stopgap measure for the RISC vendors to lose customers a little more slowly to x86 than they already are. Of course rather than just losing customers, Sun and IBM (and other former RISC vendors) sell solutions that use x86. It's only a matter of time before this trend hits even the "big iron". As x86 erodes their margins from beneath, for how long will it make sense to spend the money to develop the RISC chips for an ever-decreasing slice of the pie? Eventually it makes more sense to just demand that Intel add whatever RAS features it lacks compared to the RISC chip it'll be replacing, which is exactly what is happening here (only in this case it's EPIC that's on the chopping block).

    Apple being able to transition from PowerPC to x86 is quite a feat, but x86 transitioning to the next big thing is going to be impossible without at least backwards compatibility in the form of x86 emulation, and boy is the x86 instruction set fun to emulate!

    Well you certainly got that right. The only real disadvantage of x86 itself is that it is a huge pain in the ass to make work properly, and a lot of the magic isn't in the ISA docs but rather in the institutional knowledge of the two remaining firms that make the chips. x86 raises the already incredibly high barrier to entry for new chip manufacturers. That, not performance or (potential) reliability, is the reason x86 sucks.

  • by Chris Burke ( 6130 ) on Wednesday May 27, 2009 @07:04PM (#28116795) Homepage

    The original Opteron had L1 ECC, it just wasn't correctable if encountered on a read or write (there was a scrubber that would find and correct ECC errors, but if it didn't reach the line in question before the program accessed the cache line, then it would detect the error and machine check fault). The ill-fated Barcelona (Phenom) added on-the-fly correctability. Phenom 2 of course has it too.

    I was pretty sure Intel had it in their L1s too. Kinda surprised to hear SPARC doesn't.

    P.S. I know The Inquirer decided it was the K10, but it isn't. They're still all K8s.
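
The difference between "detects and machine checks" and "corrects on the fly" can be shown with a toy Hamming(7,4) code in Python. This is only a sketch: real caches use wider codes, and every function name here (`encode`, `syndrome`, `read_detect_only`, `read_correcting`) is made up for the example rather than taken from any AMD or Intel documentation.

```python
def encode(nibble):
    """Encode 4 data bits as a Hamming(7,4) codeword; index 0 is unused."""
    d = [(nibble >> i) & 1 for i in range(4)]
    code = [0] * 8
    code[3], code[5], code[6], code[7] = d
    code[1] = code[3] ^ code[5] ^ code[7]  # parity over positions 1, 3, 5, 7
    code[2] = code[3] ^ code[6] ^ code[7]  # parity over positions 2, 3, 6, 7
    code[4] = code[5] ^ code[6] ^ code[7]  # parity over positions 4, 5, 6, 7
    return code

def syndrome(code):
    """XOR of the positions of all set bits; 0 means no single-bit error."""
    s = 0
    for pos in range(1, 8):
        if code[pos]:
            s ^= pos
    return s

def decode(code):
    """Extract the 4 data bits without checking them."""
    data_bits = [code[3], code[5], code[6], code[7]]
    return sum(bit << i for i, bit in enumerate(data_bits))

def read_detect_only(code):
    """Detect-only read path: any nonzero syndrome is reported as a fault."""
    if syndrome(code):
        raise RuntimeError("machine check: ECC error on read")
    return decode(code)

def read_correcting(code):
    """On-the-fly path: flip the bit the syndrome points at, then decode."""
    fixed = list(code)
    s = syndrome(fixed)
    if s:
        fixed[s] ^= 1
    return decode(fixed)

line = encode(0b1011)
line[6] ^= 1                       # single-bit upset in the stored codeword
recovered = read_correcting(line)  # the correcting path returns the original data
```

A scrubber is just the correcting path run in the background over every line, so whether a program sees a fault on the detect-only design depends on which one reaches the damaged line first.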

  • by Anonymous Coward on Wednesday May 27, 2009 @08:53PM (#28117779)

    I did quite a bit of work on MCA for Itanium on Linux and it's a lot harder to do than you might think. The Itanium MCA event can occur at any time, no matter what the OS is currently doing. Locks, preempt disable, interrupt disable etc., none of those will stop an Itanium MCA event from occurring.

    When an MCA occurs, the OS can be in any state; it may not even have a valid stack at that point. I have seen MCAs being raised right in the middle of the code that switches the CPU from one process to another, or in the middle of saving the user process's state and before switching to kernel state. The only way to handle this was to define a special MCA stack frame to do the error checking and recovery on. For some scary code, see the Linux kernel, arch/ia64/mca.c and arch/ia64/mca_asm.S.

    Even after handling the stack switch problems, on Itanium you have no real idea what state the OS is in. The OS could have locks on critical code which prevent the MCA recovery from doing any useful work. MCA recovery is a nice idea but implementation is a bitch.
