When Mistakes Improve Performance
jd and other readers pointed out BBC coverage of research into "stochastic" CPUs that allow communication errors in order to reap benefits in performance and power usage. "Professor Rakesh Kumar at the University of Illinois has produced research showing that allowing communication errors between microprocessor components, and then making the software more robust, will actually result in chips that are faster and yet require less power. His argument is that at the current scale, errors in transmission occur anyway, and that the efforts of chip manufacturers to hide these to create the illusion of perfect reliability simply introduce a lot of unnecessary expense, demand excessive power, and deoptimise the design. He favors a new architecture, which he calls the 'stochastic processor,' designed to handle data corruption and error recovery gracefully. He believes he has shown such a design would work and that it would permit Moore's Law to continue to operate into the foreseeable future. However, this is not the first time someone has tried to fundamentally revolutionize the CPU. The Transputer, the AMULET, the FM8501, the iWARP, and the Crusoe were all supposed to be game-changers but died cold, lonely deaths instead - and those were far closer to design philosophies programmers are currently familiar with. Modern software simply isn't written with the level of reliability the stochastic processor requires (and many software packages are too big and too complex to port), and the volume of available software frequently makes or breaks new designs. Will this be 'interesting but dead-end' research, or will Professor Kumar pull off a CPU architectural revolution not really seen since the microprocessor was designed?"
Impossible design (Score:4, Interesting)
Moving, not fixing, the problem (Score:5, Interesting)
The "robustification" of software, as he calls it, involves re-writing it so an error simply causes the execution of instructions to take longer.
Ooh, this is tricky. So we can reduce CPU power consumption by a certain amount if we rewrite software in such a way that it can slowly roll over errors when they take place. There are some crude numbers in the document: a 1% error rate, whatever that means, yields a 23% drop in power consumption. What if the "robustification" of software means that it has an extra "check" instruction for every three "real" instructions? Now you're back where you started, but you had to rewrite your software to get there. I know, it's unfair to compare his measured reduction in power consumption with my imaginary ratio of "check" instructions to "real" instructions, but my point still stands: this system may very well move the burden of error correction from the hardware to the software in such a way that there is no net gain.
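To make that concrete, here's the back-of-envelope version as code. The 23% power figure is from TFA; the one-check-per-three-instructions overhead is my made-up number:

```python
def net_energy_ratio(power_savings, check_overhead):
    """Energy used relative to a baseline chip running the original code.

    power_savings:  fraction of power the error-tolerant chip saves (0.23 per TFA)
    check_overhead: extra "check" instructions per "real" instruction (made up)
    """
    # Energy = power x time; robustified code runs (1 + overhead) times longer.
    return (1.0 - power_savings) * (1.0 + check_overhead)

# TFA's figure with no software overhead: a clear win.
print(net_energy_ratio(0.23, 0.0))      # 0.77 -> 23% less energy

# One "check" per three "real" instructions eats the entire gain.
print(net_energy_ratio(0.23, 1 / 3))    # ~1.03 -> slightly worse than baseline
```

Whether the real overhead lands above or below that break-even ratio is exactly the open question.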
Re:Impossible design (Score:3, Interesting)
That's a good point. You accept mistakes in the data, but you don't want the operation to change from an add (where, when computing large averages, being off by a few hundred won't matter) to a multiply or divide.
But once you have the opcode separated from the data, you can mess with the latter. E.g., you can stop caring about the occasional race condition, because a bad value every 1000th operation doesn't matter too much.
And since this is a source of noise, you just got free random data!
Still, this looks more like something for scientific computing, and when they build the next big one it can easily be factored in. For home computing, not so much; 99% of the time those machines are waiting for user input anyhow.
Re:Moving, not fixing, the problem (Score:3, Interesting)
Re:Impossible design (Score:3, Interesting)
Actually you've got the right idea (Score:2, Interesting)
The summary talked about the communication links... I remember when we were running SLIP over two-conductor serial lines and "overclocking" the serial lines. Because the networking stack (software) was doing checksums and retries, it worked faster to run the line into its fast but noisy mode, rather than to clock it conservatively at a rate with low noise.
If the on-chip communication paths start following the trend of off-chip paths (packetized serial streams), then you could have higher-level functions of the chip do checksums and retries, with a timeout that aborts back even higher, to a software level. Your program could decide how long to wait around for correct results versus going on with partial results, depending on its real-time requirements. The memory controller could do this, treating the large, remote SRAM and DRAM spaces as assumed-noisy and overlaying its own software checksum format on top.
This is really not so different from modern filesystems which start to reduce their trust in the storage device, and overlay their own checksum, redundancy, and recovery methods. You can imagine bringing these reliability boundaries ever "closer" to the CPU. Of course, you are right that it doesn't make sense to allow noisy computed goto addresses, unless you can characterize the noise and link your code with "safe landing areas" around the target jump points. It makes even less sense to have noisy instruction pointers, e.g. where it could back up or skip steps by accident, unless you can design an entire ISA out of idempotent instructions which you can then emit with sufficient redundancy for your taste.
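Here's roughly what that reliability boundary looks like in software. A toy sketch only: the "link" is a simulated bit-flipper, and the checksum is assumed to travel out of band, the way a filesystem keeps block checksums in separately-verified metadata:

```python
import random
import zlib

def noisy_link(payload: bytes, bit_error_rate: float) -> bytes:
    """Simulate a link that flips each bit independently with some probability."""
    out = bytearray(payload)
    for i in range(len(out)):
        for bit in range(8):
            if random.random() < bit_error_rate:
                out[i] ^= 1 << bit
    return bytes(out)

def reliable_read(payload: bytes, bit_error_rate: float, max_retries: int = 16) -> bytes:
    """Checksum-and-retry: software restores the reliability the link gave up.

    The CRC is assumed to arrive intact (out of band); a real design would
    protect it separately, as ZFS does with its block-pointer checksums.
    """
    checksum = zlib.crc32(payload)
    for _ in range(max_retries):
        received = noisy_link(payload, bit_error_rate)
        if zlib.crc32(received) == checksum:
            return received  # verified clean copy
    raise IOError("link too noisy: retries exhausted")

random.seed(42)  # deterministic demo
data = b"stochastic processors tolerate noise" * 10
assert reliable_read(data, bit_error_rate=1e-4) == data
```

The max_retries cutoff is where the "abort back even higher" escape hatch would kick in.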
A brainy idea. (Score:5, Interesting)
He favors a new architecture, which he calls the 'stochastic processor,' designed to handle data corruption and error recovery gracefully.
I dub thee neuron.
Re:Moving, not fixing, the problem (Score:3, Interesting)
OpenCL (Score:3, Interesting)
AMD's and nVidia's workstation cards are the same as their gaming cards; the only difference is that the workstation ones are certified to produce 100% accurate output. If a gaming card colours a pixel wrong every now and then, it's no big deal, and the player probably won't even notice.
As OpenCL and other "abuses" of GPU power become more popular, "colors a pixel wrong" will eventually happen in the wrong place at the wrong time on someone using a "gaming" card.
How eye candy helps interoperability (Score:3, Interesting)
And yes, I realize that there are irresistible market forces at work here, but that only applies to commercial software; for the FOSS world, it's a tremendous lost opportunity that appears to have been driven by little more than a desire to emulate corporate software development, which many FOSS developers admire for reasons known only to them and God.
I think I know why. If free software lacks eye candy, free software has trouble gaining more users. If free software lacks users, hardware makers won't cooperate, leading to the spread of "paperweight" status on hardware compatibility lists. And if free software lacks users, there won't be any way to get other software publishers to document data formats or to get publishers of works to use open data formats.
Late, and inaccurate (Score:4, Interesting)
I've seen this before, except for an application that made more sense: GPUs. A GPU instruction is almost never critical. Errors writing pixel values will just result in minor static, and GPUs are actually far closer to needing this sort of thing. The highest-end GPUs draw far more power than the highest-end CPUs, and heating problems are far more common.
It may even improve results. If you lower the power by half for the least significant bit, and by a quarter for the next-least, you've cut power by about 3% for something invisible to humans. In fact, a slight variation in the rendering could make the end result look more like our flawed reality.
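Back-of-the-envelope, under the (big, hypothetical) assumption that each of a 24-bit pixel's bit lines draws equal power:

```python
# Hypothetical model: a 24-bit pixel driven by 24 bit lines of equal power.
BITS = 24

# Halve the power feeding the least significant bit line, and trim the
# next-least-significant line's power by a quarter.
savings = (0.5 + 0.25) / BITS
print(f"power saved: {savings:.1%}")  # power saved: 3.1%
```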
A GPU can mess up and not take down the whole computer. A CPU can. What happens when the error hits during a syscall? During bootup? While doing I/O?
Re:Wrong approach? (Score:4, Interesting)
What use is a blazing fast computer that is no longer reliable
Meh... I'm pretty happy to have my brain, even if it makes mistakes sometimes.
Re:Late, and inaccurate (Score:2, Interesting)
Re:Impossible design (Score:3, Interesting)
There's a damn good reason why we want our processors to be rock solid. If they don't work right, we can't trust anything they output.
Have you ever tried transferring large files over a 100 Mbps Ethernet link? That's right: billions of bytes over a noisy, unreliable wired link. And how often have you seen files corrupted? I never have. The link runs along extremely reliably (a BER of around 10^-9, I think) with as little as 12 Mbps of the 100 Mbps spent on error checking and recovery.
Same case here. I'd expect the signal-to-noise ratio on the interconnects within a CPU (when the voltage is cut by, say, 25%) to be similar to, if not better than, an Ethernet link's. So the CPU could probably get along with less error checking and recovery. Or, if you choose applications (like video decoding or graphics rendering) that have no problem with a few bad bits here and there, you could manage with almost no ECC at all.
If you were to plot error rate against CPU power, I'd say most modern CPUs lie at the far end of the region of diminishing returns. There's a gold mine waiting for anyone who moves backwards along that curve.
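Put numbers on it. The 12% overhead and 10^-9 BER are my figures above; the 1500-byte frame size and the retransmit-until-clean model are further assumptions:

```python
def expected_goodput(link_mbps, frame_bits, bit_error_rate, overhead_fraction):
    """Throughput after protocol overhead and retransmissions.

    A frame is resent until it arrives clean, so the expected number of
    transmissions per good frame is 1 / P(frame survives uncorrupted).
    """
    p_clean = (1.0 - bit_error_rate) ** frame_bits
    return link_mbps * (1.0 - overhead_fraction) * p_clean

# 1500-byte frames on a 100 Mbps link, BER 1e-9, 12% overhead:
print(expected_goodput(100, 1500 * 8, 1e-9, 0.12))  # ~88 Mbps: errors cost almost nothing
# A million times noisier (BER 1e-3) and retries dominate completely:
print(expected_goodput(100, 1500 * 8, 1e-3, 0.12))  # ~0.0005 Mbps
```

The stretch between those two endpoints is exactly the diminishing-returns region: you can give up a lot of per-bit reliability before the retry cost shows up at all.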
Re:I'd just like to point out... (Score:2, Interesting)
Yeah, uh, that's true for *any* general-purpose processor. What Crusoe originally promised was that this dynamically recompiled code might be either faster (by reordering and optimizing many instructions to fit Crusoe's very-long-instruction-word design--not unlike how the Pentium Pro and its successors do it in hardware with multiple ALU/FPU functional units) or more power-efficient (by removing the hardware reorderer/optimizer and running its software equivalent only infrequently). Of course, the former just didn't hold because Intel and AMD could simply pump out higher-clocked processors, and the latter didn't matter as much as simply underclocking the whole CPU when the system was idle (which is often enough). In short, Crusoe found two niches that Intel and AMD both cornered.
Well, that's the other part of the equation. If the Crusoe had actually provided multiple abstraction layers and not just the one (x86), perhaps they could have survived. Crusoe would have been a great platform for emulating the PSX, for example, or for providing multiple concurrent x86/Java/whatever sandboxed systems for servers--for which the power efficiency would have been important. But then, providing many well-optimized software solutions isn't an easy task, especially when balanced against the goal of running the "Code Morphing" software as little as possible to avoid the penalties associated with it.
In short, the problem wasn't the demand per se; it was a lack of supply. Pinning the company's hopes on a few niches that competitors managed to occupy relatively quickly certainly didn't help.
This branch could bear some interesting fruit... (Score:5, Interesting)
I see lots of people down on the theory - even though the original proposal was for highly error-forgiving applications - because somehow it means we can't trust the computations from the CPU anymore.
People - realize that you can't trust them NOW.
As someone who's spent way too much time in the ZFS community talking about errors, their sources and how to compensate, let me enlighten you:
modern computers are full of uncorrected errors
By that, I mean that there is a decided tradeoff between hardware support for error correction (in all the myriad places in a computer, not just RAM) and cost, and the decision has come down on the side of "screw them, they don't need to worry about errors," at least for desktops. Even in better-quality servers and workstations, there are still a large number of places where the hardware simply doesn't check for errors. And, in many cases, the hardware alone is unable to check for errors and data corruption.
So, to think that your wonderful computer today is some sort of accurate calculating machine is completely wrong! Bit rot and bit flipping happen very frequently for a simple reason: error rates per billion operations (or transmissions, or whatever) have stayed essentially the same for the past 30 years, while every other component (and bus design, etc.) has pretty much been following Moore's Law. The ZFS community is particularly worried about disks, where hard error rates are now within two orders of magnitude of the disk's capacity (e.g., for a 1TB disk, you will see a hard error for every 100TB or so of data read/written). But there are problems in on-die CPU caches, bus line transmissions, SAS and FC cable noise, DRAM failures, and a whole host of other places.
Bottom line here: the more research we can do into figuring out how to cope with the increasing frequency of errors in our hardware, the better. I'm not sure that we're going to be able to force a rewrite of applications, but certainly, this kind of research and its possible solutions can be taken care of by the OS itself.
Frankly, I liken the situation to using IEEE floating point to calculate bank balances: it looks nice, and a naive person would think it's a good solution, but let me tell you, you come up wrong by a penny more often than you would think. Much more often.
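A quick illustration of the penny problem (Python here just for concreteness; any binary floating point behaves this way):

```python
from decimal import Decimal

# One hundred ten-cent deposits in binary floating point:
balance = sum([0.10] * 100)
print(balance)            # prints something like 9.99999999999998 -- not 10.0
print(balance == 10.00)   # False

# The same ledger in decimal arithmetic, the way bank balances should be kept:
exact = sum([Decimal("0.10")] * 100, Decimal("0"))
print(exact)                       # 10.00
print(exact == Decimal("10.00"))   # True
```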
-Erik
Amen to that. (Score:1, Interesting)
As a game developer, I never used to think much about the possibility of hardware errors. Until I hit a very difficult-to-pin-down bug that turned out to be a hardware defect in a single L2 cache line on the console hardware. It worked fine for the first few years of its life, and then this one bit developed a "stickiness", so that it would sometimes return the wrong value and crash our software.
We then ran an exhaustive RAM reading and writing test on all of the devkits our team was using, and it turned up *three more* kits that couldn't read and write correctly to all of their RAM. These are $10,000 devkits, but their reliability is about the same as the consumer hardware people have at home. Is it any surprise that consoles often need to be shipped back to MS or Sony or Nintendo and replaced? It no longer surprises me in the least.
Desktop computers may be a little ahead of consoles in reliability, but when you're doing BILLIONS of calculations every second, it's inevitable that random physical quirks (a cosmic-ray strike or whatever) will mess one of them up sooner or later. Anyone who overclocks their PC knows that if you OC too much, you start failing the reliability tests because errors are creeping in.