Intel's Nehalem EX To Gain Error Correction
angry tapir writes "Intel's eight-core Nehalem EX server processor will include a technology derived from its high-end Itanium chips that helps to reduce data corruption and ensure reliable server performance. The processor will include an error correction feature called MCA Recovery, which will detect and fix errors that could otherwise cause systems to crash — it will be able to detect system errors originating in the CPU or system memory and work with the operating system to correct them." Update: 05/27 19:11 GMT by T : Dave Altavilla suggests also Hot Hardware's coverage of the new chip, which includes quite a bit more information.
ECC memory replacement? (Score:2)
Re:ECC memory replacement? (Score:4, Informative)
This will fix many errors affecting the processor itself (new manufacturing processes make transistors quite vulnerable to interference and aging). ECC will still be needed for correcting errors affecting data while it is stored in main memory.
Parity will be needed for protecting caches (possibly ECC will be used in the future). Checksums for data on the hard drive. CRCs for packets on the network. And so on...
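To make the parity-versus-ECC distinction above concrete, here is a minimal sketch using a textbook Hamming(7,4) code. It is only an illustration: real memory ECC uses wider SECDED codes over 64-bit words, and the function names below are made up for this example.

```python
def parity_bit(bits):
    """Even parity over a list of 0/1 values: the XOR of all bits."""
    p = 0
    for b in bits:
        p ^= b
    return p


def hamming74_encode(data):
    """Encode 4 data bits as a 7-bit codeword laid out p1 p2 d1 p3 d2 d3 d4."""
    d1, d2, d3, d4 = data
    p1 = d1 ^ d2 ^ d4   # covers codeword positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4   # covers codeword positions 2, 3, 6, 7
    p3 = d2 ^ d3 ^ d4   # covers codeword positions 4, 5, 6, 7
    return [p1, p2, d1, p3, d2, d3, d4]


def hamming74_decode(codeword):
    """Return (data_bits, error_position); position 0 means no error seen."""
    c = list(codeword)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
    position = s1 + 2 * s2 + 4 * s3   # syndrome = 1-based index of flipped bit
    if position:
        c[position - 1] ^= 1          # correct the single-bit error in place
    return [c[2], c[4], c[5], c[6]], position


# Parity notices a single flipped bit but cannot say which one it was.
word = [1, 0, 1, 1]
stored_parity = parity_bit(word)
corrupted_word = [1, 0, 0, 1]
parity_detects = parity_bit(corrupted_word) != stored_parity

# The Hamming code both locates and repairs the flipped bit.
codeword = hamming74_encode(word)
codeword[4] ^= 1                      # flip codeword position 5
recovered, where = hamming74_decode(codeword)
```

The price of correction is storage: parity costs one extra bit per word, while this toy code spends three check bits on four data bits.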
Re:ECC memory replacement? (Score:5, Insightful)
I'm a bit surprised this is only seeing the light now: as we get smaller and faster, the number of errors observed goes up amazingly.
Back in the stone age, Cray computers didn't even have parity memory, partly because they were willing to re-run programs but mostly because errors were unlikely. Cray himself famously said "parity is for farmers".
These days, errors are very common, and I'm literally amazed that x86s don't have better-than-ECC error detection and correction. All the commercial Unix vendors have them.
--dave
Re:ECC memory replacement? (Score:5, Informative)
These days, errors are very common, and I'm literally amazed that x86s don't have better-than-ECC error detection and correction. All the commercial Unix vendors have them.
Intel's been trying to 'protect' the market for Itanium - those CPUs have had it for years, probably from day 1. HP definitely markets MCA as a big feature of their Itanium-based systems.
If AMD were smart, they would have incorporated it into their Opteron line just like they did x64 to cut Intel off at the knees.
Re: (Score:2)
Indeed! Intel is being penny-wise and pound-foolish.
--dave
Re: (Score:2)
MCA is basically a buzzword for a bunch of RAS features. AMD has most of the same that Intel have in x86. (Itanium does have some wild stuff that will most likely not be in Nehalem though, like the TLB verification.) Some of the details are different in AMD vs Intel though and an OS needs to know about that. One exciting thing is that AMD has just started making chips with ECC L1 caches while as far as I know Intel still have only parity detection in the L1 cache. Also some AMD cpus have hardware memory scr
Re: (Score:2)
I hate to reply to myself, but I did some googling and I cannot verify that AMD K10 has ECC L1 cache. I am almost 100% certain I read that in a reliable place like an AMD white paper some time ago though. I hope someone can clear this up for me.
Re: (Score:1)
The first Cray-1 didn't have ECC. The MTBF was 8 hours.
All Crays since then have had ECC. Seymour Cray was a smart dude.
Re: (Score:3, Informative)
State of the non-mainframe art with regards to RAS right now is ECC RAM with mirroring, parity cache, ECC e-cache, hashes that detect and fix multiple bit errors for storage end to end, CRC (ethernet) and cksum (TCP, UDP) (but can you trust the nic offloading engine?), instruction retry, and fp scrubbing, in addition to what has been around for the last five years or so.
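As a rough illustration of why the list above mentions both a CRC and the TCP/UDP checksum, the sketch below compares a 16-bit ones' complement sum in the style of RFC 1071 with CRC-32 from Python's standard `zlib` module. The payload bytes are invented for the example, and `zlib.crc32` is used as a stand-in, not as a model of any particular NIC.

```python
import zlib


def internet_checksum(data: bytes) -> int:
    """16-bit ones' complement checksum in the style of RFC 1071."""
    if len(data) % 2:
        data += b"\x00"                       # pad odd-length input
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]
    while total >> 16:                        # fold carries back into 16 bits
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


original = b"\x12\x34\xab\xcd\x00\x01"
# Same 16-bit words, first two swapped: addition is commutative,
# so the ones' complement sum cannot notice the reordering.
swapped = b"\xab\xcd\x12\x34\x00\x01"

sum_misses_swap = internet_checksum(original) == internet_checksum(swapped)
crc_catches_swap = zlib.crc32(original) != zlib.crc32(swapped)
```

The weaker check is cheap to compute in software, which is one reason a layered design pairs it with a stronger link-level CRC.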
Re: (Score:2)
Parity will be needed for protecting caches (possibly ECC will be used in the future).
Just fyi, they all have ECC for caches already.
Re: (Score:2)
Not L1 though. As far as I know you cannot yet buy anything from Intel that has better than parity error detection in the L1 cache. AMD just started selling chips that have ECC L1. Actually I just looked and I cannot find a good doc stating that AMD K10 has ECC L1 cache, but I am almost 100% certain I read it. In any case I know you can buy stuff from Sun that has sparcs that have ECC for the register file yet only parity error detection for the L1 cache.
Re: (Score:3, Informative)
The original Opteron had L1 ECC, it just wasn't correctable if encountered on a read or write (there was a scrubber that would find and correct ECC errors, but if it didn't reach the line in question before the program accessed the cache line, then it would detect the error and machine check fault). The ill-fated Barcelona (Phenom) added on-the-fly correctability. Phenom 2 of course has it too.
I was pretty sure Intel had it in their L1s too. Kinda surprised to hear SPARC doesn't.
P.S. I know The Inquirer
Re: (Score:3, Interesting)
Sure enough it is in the Phenom datasheet, thank you.
As far as I know T1, T2, and T2+ all have only parity for the I$ and D$. All the Fujitsu sparcs that I know of only have parity for I$ and D$ as well. ECC e-cache is the norm though.
Sparc was odd. They had all sorts of strange caches from one model to the next. Sometimes there was an I$ and D$, sometimes it was unified. Sometimes some caches were virtually tagged. There was an ultrasparc that had the e-cache data ECC protected and the tags were on chip an
Re: (Score:2)
Hm, stupid question, but what's an e-cache? Oh wait, external as in off-chip. That makes sense in context. You'd use parity-only for on-chip caches to save bits, but for a separate cache chip it'd be silly to save a small percent of space and lose correctability/multi-bit error detection.
Sparc is odd. I don't know all that much about it, but the more I learn the more I think so. Thanks. :)
Re: (Score:2)
There is another reason for parity on chip: the parity of AB is the same as the parity of A + parity of B (with + meaning addition modulo 2, i.e. XOR), while ECC is inherently serial per each block. That way each parity check can easily be 4, 8, or 16 times faster than an ECC check. Sun always went the extra mile to make their caches a bit faster than the competition. Sun had a history of making I$ and D$ and tag comparison on E$ with only 0 or 1 cycle penalty.
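The composability of parity described above can be checked in a few lines. This is a software sketch only: the 64-bit width and 8-bit chunk size are arbitrary assumptions, and in hardware the per-chunk parities would be computed side by side instead of in a loop.

```python
from functools import reduce
from operator import xor


def parity(value: int) -> int:
    """Parity of a non-negative integer: 1 if it has an odd number of set bits."""
    return bin(value).count("1") & 1


def parity_by_chunks(word: int, width: int = 64, chunk: int = 8) -> int:
    """Split a word into chunks, take each chunk's parity, then XOR the results."""
    mask = (1 << chunk) - 1
    parts = [(word >> shift) & mask for shift in range(0, width, chunk)]
    return reduce(xor, (parity(part) for part in parts), 0)


word = 0xDEADBEEFCAFEF00D
same_answer = parity(word) == parity_by_chunks(word)
```

Because the chunk parities are independent of one another, the only sequential step left is the final XOR that combines them.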
Re: (Score:2, Informative)
No. ECC only corrects certain issues in the memory. It cannot help with memory controller errors, nor with register or TLB errors.
Re: (Score:2)
Game Set and Match.
Not nearly good enough... (Score:5, Funny)
Re: (Score:1)
I know, I know... Tcl is a scripting language, not an OS, not a processor, yadda yadda. At least someone is thinking ahead here. If they can get it to work in a scripting language, they may be able to get it to work at lower levels..
Re: (Score:3, Funny)
Why don't you just say what you mean, for the time being?
That's too easy. We'll never advance the state of the art with that kind of thinking!
Re: (Score:3, Funny)
So does your computer currently do what you say?
Mine, I can barely get it to do what I type!
-dZ.
Re: (Score:2)
Yeah... no matter how many times I try, typing "stand on your head" doesn't seem to have any effect on it. It just gives me the same dumb excuses over and over.
Re: (Score:2)
releases a CPU that does what I mean, not what I say.
They have that, it's called a Girlfriend.
However, it has the freedom to decide what you mean however it wants. Beware of the "dilf, itd" instruction---it's short for "do I look fat in this dress?"
x86 (Score:3, Insightful)
Sweet. Now all those high-end server applications running on x86s that need great uptime can finally join the big boys. [rolls eyes].
I'm just not sure of the utility here -- I RTFA, but I'm still not clear on why Intel would cannibalize Itanium sales (new release delayed again) by offering error correction on Nehalem chips. Is the demand for x86 Server chips that high? I thought anyone requiring 5 nines (or anything close to it) would never consider using x86?
Can someone with more knowledge of the high-end server market please clarify?
Re:x86 (Score:4, Insightful)
That's it. I don't think this is aimed at the "high end" but rather at the middle ground:
people running farms or VMs or even large DBs, but not exactly in need of mainframe or HPC.
While I agree there are a lot of options other than x86, x86 is growing and isn't going to go away, and EM64T has just solidified it. Adding something like this is a welcome evolution of the area.
And they aren't cannibalizing the Itanium sales. While yes, the Itanium is selling better than before, there is nowhere near the market for it that there is for x86 chips.
Re: (Score:2, Insightful)
The more interesting thing is to see how this technology is going to work and whether other manufacturers will be able to implement this in their chips.
x86 is a slow and underperforming architecture, and I am surprised that Intel is bolting error correction on top of it. The Intel instruction set is so complicated that often times a single bit being flipped means it is still a very much valid opcode which when executed will do something completely different from what you expect it to do.
This seems to be nothi
Re: (Score:1)
Err, I meant x86 instruction set, not Intel instruction set.
Re: (Score:2)
If only your user name was 0x0CD7FD.
Re:x86 (Score:5, Informative)
x86 is a slow and underperforming architecture, and I am surprised that Intel is bolting error correction on top of it.
Hogwash. There's nothing inherently slow about x86. The ISA is nothing but an interface. Internally, the CISC instructions are decoded into simple micro-ops, so all the predictions about how x86 would fall behind because it wouldn't be able to have out of order execution etc were proven wrong. It's not easy to make x86 chips, but the difficult performance problems have been solved.
So don't be surprised, it's just another step in the plain obvious trend that has been going on for over a decade now. With no performance disadvantage, and a big price advantage, x86 has been moving into the server market in a big way. The only thing holding it back is the lack of RAS features, which are just as easy to "bolt on" to x86 as any other instruction set. It's just there was no reason to add these features for desktop or low-end servers.
The Intel instruction set is so complicated that often times a single bit being flipped means it is still a very much valid opcode which when executed will do something completely different from what you expect it to do.
The same is true of RISC, flip a bit in the opcode field and there's a good chance it's still a valid opcode. Not that it matters one whit; flipped bits in the instruction stream are detected via ECC in the instruction cache, not by praying the decoders see it as an invalid instruction.
This seems to be nothing short of a stopgap measure for not losing more customers to the big iron manufacturers like Sun and IBM who both have their own CPU's that were built with stability in mind.
FUD like this is nothing but a stopgap measure for the RISC vendors to lose customers a little more slowly to x86 than they already are. Of course rather than just losing customers, Sun and IBM (and other former RISC vendors) sell solutions that use x86. It's only a matter of time before this trend hits even the "big iron". As x86 erodes their margins from beneath, for how long will it make sense to spend the money to develop the RISC chips for an ever-decreasing slice of the pie? Eventually it makes more sense to just demand that Intel add whatever RAS features it lacks compared to the RISC chip it'll be replacing, which is exactly what is happening here (only in this case it's EPIC that's on the chopping block).
Apple being able to transition from PowerPC to x86 is quite a feat, but x86 transitioning to the next big thing is going to be impossible without at least backwards compatibility in the form of x86 emulation, and boy is the x86 instruction set fun to emulate!
Well you certainly got that right. The only real disadvantage of x86 itself is that it is a huge pain in the ass to make work properly, and a lot of the magic isn't in the ISA docs but rather in the institutional knowledge of the two remaining firms that make the chips. x86 raises the already incredibly high barrier to entry for new chip manufacturers. That, not performance or (potential) reliability, is the reason x86 sucks.
Re: (Score:2)
I don't know about x86 being slow. There is some Power that is very fast at single thread int and fp, but man is it power hungry, hot, and expensive. But really for many workloads x86 is plenty fast and priced much more competitively. Certainly x86 is faster than all sparc, mips (dead), ppc (dead), itanium (living dead), and all but the most expensive power chips with regard to single threaded MIPS and FLOPS. Most workloads are IO or memory bound or the throughput of largely independent tasks can scale by a
Re:x86 (Score:4, Insightful)
x86 is a slow and underperforming architecture
So right there you've destroyed your credibility. You couldn't be any more wrong if your name was W. Wrongy Wrongenstein.
Right now, x86 processors are the highest performance in the world.
and I am surprised that Intel is bolting error correction on top of it
Well, that just shows you aren't paying attention to the trends of where x86 is going any more than you've been paying attention to its performance. x86 has been gradually moving up market into higher and higher tiers of servers for well over a decade now.
The Intel instruction set is so complicated that often times a single bit being flipped means it is still a very much valid opcode which when executed will do something completely different from what you expect it to do.
And now we see that you don't have much clue about instruction set encoding, either.
There is literally no commercially viable instruction set for which the above is NOT true. Look at a traditional RISC instruction set with 3 operands and 32 GPRs. Almost half of the bits (15 of them) in every 32-bit ALU instruction for such a processor are register addresses. Flip any of those bits and the register address is still valid -- there are no invalid addresses, so the processor can't tell the difference between the wrong address and the right one. The remainder of the bits in such an instruction are typically instruction format select, opcode select, and miscellaneous control bits. Flip an opcode bit and you'll get the wrong ALU op, more often than not... processor designers leave some room for adding opcodes, but typically not a lot.
See, the only way an instruction set can guard against bit flips is not by simplicity (as you implicitly claim), it's by being horribly wasteful. When people design instruction encodings, they look at the width of all the bit fields in each instruction format and use the smallest they can get away with. Instruction sets which aren't efficiently packed aren't any good: they use more memory to store program code, have reduced effective icache size for the same number of bits in silicon, tend to have major clumsiness (such as too-small immediate operand sizes, or too-small relative branch windows), and so forth. Efficient packing always means there are very few invalid bit patterns for each field in the instruction; if you have a lot of invalid patterns you probably could be packing the instruction tighter. Few invalid patterns means that most bit flips still produce a valid instruction.
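The register-field argument above can be sketched with a toy encoding. Everything here is hypothetical: the field layout, the single valid opcode, and the 48 defined function codes are assumptions chosen for illustration and do not describe any real ISA.

```python
# Hypothetical 32-bit three-operand format (not a real ISA):
# [31:26] opcode | [25:21] rs | [20:16] rt | [15:11] rd | [10:0] funct
VALID_OPCODES = {0}             # assumption: one ALU opcode in this toy
VALID_FUNCT = set(range(48))    # assumption: 48 function codes are defined


def encode(opcode, rs, rt, rd, funct):
    """Pack the five fields into one 32-bit instruction word."""
    return (opcode << 26) | (rs << 21) | (rt << 16) | (rd << 11) | funct


def decode(word):
    """Return the decoded fields, or None if the decoder would reject the word."""
    opcode = (word >> 26) & 0x3F
    rs = (word >> 21) & 0x1F
    rt = (word >> 16) & 0x1F
    rd = (word >> 11) & 0x1F
    funct = word & 0x7FF
    if opcode not in VALID_OPCODES or funct not in VALID_FUNCT:
        return None
    return (opcode, rs, rt, rd, funct)


original = encode(0, 1, 2, 3, 5)

# Flip each bit in turn and keep the flips the decoder still accepts.
silent_flips = [bit for bit in range(32)
                if decode(original ^ (1 << bit)) is not None]

# All 15 register-address bits (bits 11..25) are in that list: every
# 5-bit pattern names a real register, so the decoder has nothing to reject.
register_bits = set(range(11, 26))
all_register_flips_silent = register_bits.issubset(silent_flips)
```

How many opcode and function-code flips slip through depends entirely on how densely those fields are populated, which is the packing trade-off described above.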
This seems to be nothing short of a stopgap measure for not losing more customers to the big iron manufacturers like Sun and IBM who both have their own CPU's that were built with stability in mind.
Idiot. Intel isn't losing big iron marketshare to IBM and Sun. It's taking big iron marketshare from them. Adding big iron RAS features to x86 is the next step in that trend.
x86 has moved into areas where it simply is not going to shine as brilliantly as it did on the desktop. The only issue is that moving to a new platform is going to be catastrophic in that too many people rely on it. Apple being able to transition from PowerPC to x86 is quite a feat, but x86 transitioning to the next big thing is going to be impossible without at least backwards compatibility in the form of x86 emulation, and boy is the x86 instruction set fun to emulate!
1990 called, and it wants its foolish predictions of where x86 cannot go back.
Much better informed people than you thought, back then, that x86 could never be a workstation or server CPU in any capacity at all. It was just a personal computer processor, and a rather ugly and slow one at that.
Instead, Intel proved they could make fast x86 processors, and steadily increased x86 presence in the workstation and low end server market throughout the 90s, with an assis
Re: (Score:1, Insightful)
Impressive effort, but slashdot is full of dotcom washouts whose IT knowledge ends roughly in 1997. You'll never educate them, so it's more fun to point out how Linux users are obsolete relics, colossal morons, and sub-msce bottom feeders.
Re: (Score:2)
x86 is a slow and underperforming architecture
Yeah, nearly as slow and under performing as SPARC, PPC, MIPS, IA64 and ARM.
Not quite though, not quite.
Re:x86 (Score:5, Insightful)
They're not, nobody buys Itanium. They're going after SPARC and POWER. Lots of people are looking at the speed and throughput of modern x86 and noticing the price difference. Especially in this economy.
And with Ellison in control of SPARC, it's the best way to go.
Re: (Score:2)
The problem is that HP is expensive compared to Sun. Seriously, I know it is hard to believe, but call them up and check for yourself. The reason that HP gets away with this is because they have some people by the balls right now. There were some shops that bit on Itanic when it looked unstoppable with support from MS, SGI, and Compaq/HP. Those shops are in a bind at this point. Then there are the people that were DEC. HP offered some crazy storage paths from that to Itanic that some places bought on to. Th
Re: (Score:2)
http://www.tpc.org/tpch/results/tpch_price_perf_results.asp
So basically Superdome is not in the top spot (even not the top not clustered spot) for anything until you get to 30TB, where it is the ONLY entry. Now true I will give you that it was a unisys ms sql box that topped at 3TB that no one would really buy (and it does not do much in the realm of QphH), but that was a Xeon box, as were the top few in every other category always Xeon or Opteron, no Itanium. In fact pretty much everywhere below 3TB there
Re: (Score:2)
Virtualization. It's pretty clear from the Nehalem EX presentation and when you put all your x86 eggs in one basket you want even higher reliability guarantees. You don't own a dozen single/dual/quad core servers, you get one of these beasts and just slice it up as you want increasing allocations as needed and migrating them to another VM server if you're short on resources. I must admit, it seems rather neat on paper, but I'm not playing with anything like that.
Re:x86 (Score:5, Insightful)
Error correction on an x86 chip?
Sweet. Now all those high-end server applications running on x86s that need great uptime can finally join the big boys. [rolls eyes].
Is the demand for x86 Server chips that high? I thought anyone requiring 5 nines (or anything close to it) would never consider using x86?
The story of the server market for the last 10+ years is simple: x86 has been eating everyone else's market share from the bottom up. Commodity pricing > perceived advantages of the proprietary RISC vendors. To the extent that there are real necessary features x86 lacked, it has acquired them as necessary.
There's been correctable ECC on x86 server chips for years. x86 has long since moved up-market past the point where basic RAS features (like ECC) are mandatory. Intel's Xeon has had these features for a long time. AMD Barcelona core was the first to have correctable ECC in the L1 caches -- before it could detect errors but couldn't fix them.
Basically the only new feature here is the ability to notify the OS about uncorrectable errors so that the OS can try to fix the problem by nuking the affected app, reloading a code page from disk or whatever else is appropriate so that a system reboot isn't always necessary on uncorrectable errors.
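The recovery choices listed above could be sketched as a small decision function. This is a hypothetical policy written for illustration only: the `PoisonedPage` fields and the three actions are invented names, and it is not Intel's logic or any real operating system's handler.

```python
from dataclasses import dataclass
from enum import Enum, auto


class Action(Enum):
    RELOAD_PAGE = auto()    # discard the bad copy and re-read it from disk
    KILL_PROCESS = auto()   # the data is lost, but only one application dies
    PANIC = auto()          # nothing safe to do short of a system restart


@dataclass
class PoisonedPage:
    """Hypothetical description of the memory page hit by an uncorrectable error."""
    owner: str          # "kernel" or "user"
    dirty: bool         # modified since it was loaded?
    file_backed: bool   # is there an on-disk copy to reload from?


def recovery_action(page: PoisonedPage) -> Action:
    """Pick the least disruptive response for an uncorrectable error."""
    if page.owner == "kernel":
        return Action.PANIC           # assumption: kernel data is never recoverable here
    if page.file_backed and not page.dirty:
        return Action.RELOAD_PAGE     # e.g. an unmodified code page
    return Action.KILL_PROCESS        # the only good copy was in the bad page


code_page = PoisonedPage(owner="user", dirty=False, file_backed=True)
heap_page = PoisonedPage(owner="user", dirty=True, file_backed=False)
```

Without the notification, the hardware has no way to hand this choice to software, so every uncorrectable error ends in the equivalent of the last branch: a reboot.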
Yeah this is something the "big boys" already had, fat consolation that will be now that x86 is poised to eat their lunch. Not even Intel themselves could reverse the trend when they tried. They could use features like this to differentiate Itanium all they want, at the end of the day the customer says "yeah that's great, but can you do it in an x86 chip?" This is just them bowing to the demands of the market (in order to make mega $$).
Re:x86 coming up from below (Score:2)
And Nehalem is an all in-order design, so they can scale out to very large numbers of cores or register-and-decoder sets on a single chip. That helps offset the huge bottleneck of trying to go to molasses-slow main memory on every cache miss, by allowing another thread to run. Something I notice is also true of the newest Power chip. Mind you, I'd want enough cores to host 128 threads in order to at least match the new SPARCs, but that can come along later (;-))
--dave
Re: (Score:3, Insightful)
And Nehalem is an all in-order design, so they can scale out to very large numbers of cores or register-and-decoder sets on a single chip. That helps offset the huge bottleneck of trying to go to molasses-slow main memory on every cache miss, by allowing another thread to run. Mind you, I'd want enough cores to host 128 threads in order to at least match the new SPARCs, but that can come along later (;-))
You must be thinking of Atom, because Nehalem is definitely an out-of-order processor and not particul
Re: (Score:2)
Thanks, I was indeed thinking of Atom. For some reason I associated them with one another...
I double-checked, and the new power chip is (mostly) in-order, even at the cost of giving away clock speed.
I'll be interested in seeing what IBM is up to in the Power 7 time period.
Re: (Score:2)
"still not clear on why Intel would cannibalize Itanium sales (new release delayed again)"
Maybe because the next Itanium can also possibly be the last?
Re: (Score:2)
While you may be right, I'm more inclined to see the near future as the INCREASE in Itanium sales, given that they finally got rid of the Itanium only chipset platforms and are moving to a single unified chipset for both Xeons and Itaniums: The benefit? Another fiasco like the SDRAM memory controller followed by RDRAM->DDR2 controllers surviving for ~5 years apiece won't happen, allowing Itanium to benefit from best of breed features and maximal memory bandwidth for that generation of parts, something that previously hadn't been happening. But that's just my take on it, and only time will tell.
While that is an interesting development, Itanium is falling behind technologically. Any advantage they had will be gone, due to the raw computational horsepower available in x86, even if this is a less elegant solution.
Error (Score:1)
Sorry Mr. User -- That tray is not for your coffee cup - I am now deleting your profile -- Have a nice day!
More detail on MCA Recovery (Score:5, Informative)
Re:More detail on MCA Recovery (Score:4, Informative)
Read the fmd, fmadm, and fmstat man pages on Solaris. There is also at least one memory scrubber kthread and you can look at memscrub_scans_done to see how far it has gone along. Lots of hardware is being checked periodically, in fact on some hardware even the FP units of the processors are periodically checked for faults. Some sparcs even have instruction retry in the case of a detected error. There is even memory mirroring on M4000 and above servers, that is like RAID-1 for memory, say a chip on a DIMM fails, you still can run, then use fmadm and replace the faulty DIMM. There are also the sorts of things you outlined above where a page is reread if not modified and only causes a SIGSEGV if that page is ever used again. In ZFS there is end to end hashing to detect and correct errors.
Of course all of this pales in comparison to what has been available on mainframes for a generation.
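The combination of mirroring and end-to-end hashing mentioned above can be sketched as a self-healing read. This is a toy model under loud assumptions: `MirroredStore` is an invented class, the two "copies" are plain dictionaries, and SHA-256 from the standard `hashlib` module stands in for whatever checksum a real filesystem or memory controller uses.

```python
import hashlib


class MirroredStore:
    """Toy two-way mirror with a checksum kept separately from the data."""

    def __init__(self):
        self.copies = [{}, {}]   # two independent copies of every block
        self.checksums = {}      # checksum recorded at write time

    def write(self, key, data: bytes) -> None:
        self.checksums[key] = hashlib.sha256(data).digest()
        for copy in self.copies:
            copy[key] = data

    def read(self, key) -> bytes:
        expected = self.checksums[key]
        good = None
        for copy in self.copies:
            if hashlib.sha256(copy[key]).digest() == expected:
                good = copy[key]
                break
        if good is None:
            raise IOError("every copy failed its checksum")
        for copy in self.copies:
            copy[key] = good     # repair any copy that had gone bad
        return good


store = MirroredStore()
store.write("block0", b"payroll records")
store.copies[0]["block0"] = b"payroll recorcs"   # simulate silent corruption
data = store.read("block0")                      # served from the good mirror
```

Mirroring alone cannot tell which copy is wrong when the two disagree; the separately stored checksum is what breaks the tie.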
Re: (Score:2)
I forgot, there is even an e-cache scrubber.
Re: (Score:1)
Perhaps I'm just showing my age, but chills went up my spine until I realized it wasn't this [wikipedia.org] MCA which involved Recovery Disks.
*sigh of relief*
No system is perfect (Score:2)
High-end, low-end, middle, um...end...whatever.
The goal is not to create perfection, but gracefully recover from imperfection as if nothing happened. I see no problem with bolting on such features to the world's most common processing platform. We can all use such graceful recovery features, not just servers and "high-end" applications. Will the average user need an 8-core CPU? Probably not, but it certainly wouldn't hurt them, either. Intel then can trickle this down to the average user and help all of
Re: (Score:2)
Which is more valuable to my company...
1) Telling someone to reboot yet again, maybe reimaging their system?
2) Plotting out the next roll-out of upgraded software, conference room technologies and responding to real emergencies, like malware issues?
Any monkey can reboot a computer. They don't pay me to be just any monkey, but a super monkey.
x86 (Score:3, Funny)
Christ! Can't believe anyone hasn't used this yet (Score:3, Funny)
Imagine a Beowulf cluster of these!
Convergent Sequence (Score:1, Offtopic)
You can see it now. Once upon a time, a computer intelligence was given the power to control its destiny. This intelligence was deemed so substantial that it was the best commander of the greatest weapons. You know this intelligence as Skynet, which launched nuclear missiles in order to eliminate a threat to itself, a sort of error detection and correction, if you will, with the utmost power that man can endow to a machine. What you don't know was the actual error that was detected, an error with the code PEBKAC. PEB
Itanium MCA is a lot harder than you think (Score:2, Informative)
I did quite a bit of work on MCA for Itanium on Linux and it's a lot harder to do than you might think. The Itanium MCA event can occur at any time, no matter what the OS is currently doing. Locks, preempt disable, interrupt disable etc., none of those will stop an Itanium MCA event from occurring.
When an MCA occurs, the OS can be in any state, it may not even have a valid stack at that point. I have seen MCAs being raised right in the middle of the code that switches the cpu from one process to another or
Sooo... (Score:1)
It refuses to run windows?