Forgot your password?
typodupeerror
Hardware News Technology

When Mistakes Improve Performance 222

Posted by kdawson
from the let's-change-everything dept.
jd and other readers pointed out BBC coverage of research into "stochastic" CPUs that allow communication errors in order to reap benefits in performance and power usage. "Professor Rakesh Kumar at the University of Illinois has produced research showing that allowing communication errors between microprocessor components and then making the software more robust will actually result in chips that are faster and yet require less power. His argument is that at the current scale, errors in transmission occur anyway and that the efforts of chip manufacturers to hide these to create the illusion of perfect reliability simply introduces a lot of unnecessary expense, demands excessive power, and deoptimises the design. He favors a new architecture, that he calls the 'stochastic processor,' which is designed to handle data corruption and error recovery gracefully. He believes he has shown such a design would work and that it would permit Moore's Law to continue to operate into the foreseeable future. However, this is not the first time someone has tried to fundamentally revolutionize the CPU. The Transputer, the AMULET, the FM8501, the iWARP, and the Crusoe were all supposed to be game-changers but died cold, lonely deaths instead — and those were far closer to design philosophies programmers are currently familiar with. Modern software simply isn't written with the level of reliability the stochastic processor requires (and many software packages are too big and too complex to port), and the volume of available software frequently makes or breaks new designs. Will this be 'interesting but dead-end' research, or will Professor Kumar pull off a CPU architectural revolution really not seen since the microprocessor was designed?"
This discussion has been archived. No new comments can be posted.

When Mistakes Improve Performance

Comments Filter:
  • Impossible design (Score:4, Interesting)

    by ThatMegathronDude (1189203) on Saturday May 29, 2010 @06:24PM (#32392416)
    If the processor goofs up the instructions that its supposed to execute, how does it recover gracefully?
  • by Red Jesus (962106) on Saturday May 29, 2010 @06:35PM (#32392494)

    The "robustification" of software, as he calls it, involves re-writing it so an error simply causes the execution of instructions to take longer.

    Ooh, this is tricky. So we can reduce CPU power consumption by a certain amount if we rewrite software in such a way that it can slowly roll over errors when they take place. There are some crude numbers in the document: a 1% error rate, whatever that means, causes a 23% drop in power consumption. What if the `robustification' of software means that it has an extra "check" instruction for every three "real" instructions? Now you're back to where you started, but you had to rewrite your software to get here. I know, it's unfair to compare his proven reduction in power consumption with my imaginary ratio of "check" instructions to "real" instructions, but my point still stands. This system may very well move the burden of error correction from the hardware to the software in such a way that there is no net gain.

  • Re:Impossible design (Score:3, Interesting)

    by Anonymous Coward on Saturday May 29, 2010 @06:44PM (#32392594)

    Thats a good point. You accept mistakes with the data, but don't want the operation to change from add (where, when doing large averages plus/minus a few hundreds wont matter) to multiply or divide.

    But once you have the opcode separated from the data, you can mess with the former. E.g. not care when something is a race condition because that happening every 1000th operation doesn't matter too much.
    And as this is a source of noise, you just got a free random data!
    Still, this looks more like something for scientific computing, and when they build the next big one that can easily be factored in. For home computing, not so much, 99% of the time they wait for user input anyhow.

  • by Turzyx (1462339) on Saturday May 29, 2010 @06:49PM (#32392636)
    I'm making assumptions here, but if these errors are handled by software would it not be possible for a program to 'ignore' errors in certain circumstances? Perhaps this could result in improved performance/battery life for certain low priority tasks. Although an application where 1% error is acceptable doesn't spring immediately to mind, maybe supercomputing - where anomalous results are checked and verified against each other...?
  • Re:Impossible design (Score:3, Interesting)

    by Turzyx (1462339) on Saturday May 29, 2010 @06:54PM (#32392684)
    Or worse, it could jump to itself repeatedly, thereby creating a HCF situation.
  • by Anonymous Coward on Saturday May 29, 2010 @07:01PM (#32392758)

    The summary talked about the communication links... I remember when we were running SLIP over two-conductor serial lines and "overclocking" the serial lines. Because the networking stack (software) was doing checksums and retries, it worked faster to run the line into its fast but noisy mode, rather than to clock it conservatively at a rate with low noise.

    If the chip communications paths start following the trend of off-chip paths (packetized serial streams), then you could have higher level functions of the chip do checksums and retries, with a timeout that aborts back even higher to a software level. Your program could decide how much to wait around for correct results versus going on with partial results, depending on its real-time requirements. The memory controllers could do this, using the large, remote SRAM and RAM spaces as an assumed-noisy space and overlaying its own software checksum format on top.

    This is really not so different from modern filesystems which start to reduce their trust in the storage device, and overlay their own checksum, redundancy, and recovery methods. You can imagine bringing these reliability boundaries ever "closer" to the CPU. Of course, you are right that it doesn't make sense to allow noisy computed goto addresses, unless you can characterize the noise and link your code with "safe landing areas" around the target jump points. It makes even less sense to have noisy instruction pointers, e.g. where it could back up or skip steps by accident, unless you can design an entire ISA out of idempotent instructions which you can then emit with sufficient redundancy for your taste.

  • A brainy idea. (Score:5, Interesting)

    by Ostracus (1354233) on Saturday May 29, 2010 @07:28PM (#32392962) Journal

    He favors a new architecture, that he calls the 'stochastic processor,' which is designed to handle data corruption and error recovery gracefully.

    I dub thee neuron.

  • by JoeMerchant (803320) on Saturday May 29, 2010 @07:29PM (#32392974)
    Why rewrite the application software? Why not catch it in the firmware and still present a "perfect" face to the assembly level code? Net effect would be an unreliable rate of execution, but who cares about that if the net rate is faster?
  • OpenCL (Score:3, Interesting)

    by tepples (727027) <tepples@gmaiBLUEl.com minus berry> on Saturday May 29, 2010 @08:02PM (#32393192) Homepage Journal

    AMD and nVidia's workstation cards are the same as their gaming cards, the only difference being that the workstation ones are certified to produce 100% accurate output. If a gaming card colours a pixel wrong every now and then it's no big deal and the player probably won't even notice.

    As OpenCL and other "abuses" of GPU power become more popular, "colors a pixel wrong" will eventually happen in the wrong place at the wrong time on someone using a "gaming" card.

  • And yes, I realize that there are irresistible market forces at work here, but that only applies to commercial software; for the FOSS world, it's a tremendous lost opportunity that appears to have been driven by little more than a desire to emulate corporate software development, which many FOSS developers admire for reasons known only to them and God.

    I think I know why. If free software lacks eye candy, free software has trouble gaining more users. If free software lacks users, hardware makers won't cooperate, leading to the spread of "paperweight" status on hardware compatibility lists. And if free software lacks users, there won't be any way to get other software publishers to document data formats or to get publishers of works to use open data formats.

  • by gman003 (1693318) on Saturday May 29, 2010 @08:19PM (#32393324)

    I've seen this before, except for an application that made more sense: GPUs. A GPU instruction is almost never critical. Errors writing pixel values will just result in minor static, and GPUs are actually far closer to needing this sort of thing. The highest-end GPUs draw far more power than the highest-end CPUs, and heating problems are far more common.

    It may even improve results. If you lower the power by half for the least significant bit, and by a quarter for the next-least, you've cut power 3% for something invisible to humans. In fact, a slight variation in the rendering could make the end result look more like our flawed reality.

    A GPU can mess up and not take down the whole computer. A CPU can. What happens when the error hits during a syscall? During bootup? While doing I/O?

  • Re:Wrong approach? (Score:4, Interesting)

    by somersault (912633) on Saturday May 29, 2010 @08:54PM (#32393538) Homepage Journal

    What use is a blazing fast computer that is no longer reliable

    Meh.. I'm pretty happy to have my brain, even if it makes some mistakes sometimes.

  • by thegarbz (1787294) on Saturday May 29, 2010 @10:08PM (#32393922)
    I can't see this working. The premise here was that the hardware allows faults, yet I don't see how you could design hardware like this to be accurate on demand. GPUs aren't only used for games these days. Would an error still be tolerated while running Folding@Home?
  • Re:Impossible design (Score:3, Interesting)

    by demerzeleto (1283448) on Saturday May 29, 2010 @11:07PM (#32394180)

    There's a damn good reason why we want our processors to be rock solid. If they don't work right, we can't trust anything they output.

    Have you ever tried transferring large files over a 100 MBps ethernet link? Thats right, billions of bytes over a noisy, unreliable wired link. And how often have you seen files corrupted? I never have. The link runs along extremely reliably (BER of 10^-9 I think) with as little as 12MBps out of the 100MBps spent on error checking and recovery.

    Same case here. I'd expect the signal-to-noise ratio on the connects within CPUs (when the voltage is cut by say 25%) to be similar, if not better, than ethernet links. So the CPU could probably get along with lesser error checking and recovery. Or, if you choose applications (like video decoding or graphics rendering) that have no problems with a few bad bits here and there, you could manage with almost no ECC at all.

    If you were to plot Error Rates vs CPU power, I'd say most modern CPUs lie at the far end of the region of diminishing returns. Theres a gold mine to be reaped by moving backwards on the curve.

  • by 10101001 10101001 (732688) on Saturday May 29, 2010 @11:34PM (#32394296) Journal

    The whole point of the Crusoe was that it could distil down various types of instruction (e.g. x86, even Java bytecode) to native instructions it understood. It could run 'anything' so to speak, given the right abstraction layer in between

    Yea, uh, that's true for *any* general purpose processor. What Crusoe original promised was that this dynamically recompiled code might be either faster (by reordering and optimizing many instructions to fit Crusoe's Very Large Instruction Word design--not unlike how the Pentium Pro and above do it in hardware with multiple APU/FPU functional units) or more power efficiently (by removing the hardware parts of the reorderer/optimizer and having the software equivalent run unoften). Of course, the former just didn't hold because Intel/AMD could just pump out higher hertz processors and the latter didn't matter as much as simply underclocking the whole CPU when the system was idle (which is often enough). In short, Crusoe found two niches that both Intel and AMD cornered.

    Its lack of success was nothing to do with programming - just that no one needed a processor that could these things. The demand wasn't there

    Well, that's the other part of the equation. If the Crusoe had actually provided multiple abstraction layers and not just the one (the x86 one), perhaps they could have survived. Crusoe would have been a great platform for emulating the PSX, for example. Or providing multiple, concurrent x86/Java/whatever systems to sandbox for servers--and for which the power efficiency would be important. But, then, providing well-optimized and many software solutions isn't an easy task, especially when balanced against the task of avoiding running the "Code Morphing" software as much as possible to avoid all the penalties associated with it.

    In short, the problem wasn't the demand per se; it was a lack of supply. Pining the hopes of the company on a few niches of which the competitors managed to relatively quickly occupy certainly didn't help.

  • by trims (10010) on Sunday May 30, 2010 @01:26AM (#32394752) Homepage

    I see lots of people down on the theory - even though the original proposal was for highly-error forgiving applications - because somehow it means we can't trust the computations from the CPU anymore.

    People - realize that you can't trust them NOW.

    As someone who's spent way too much time in the ZFS community talking about errors, their sources and how to compensate, let me enlighten you:

    modern computers are full of uncorrected errors

    By that, I mean that there is a decided tradeoff between hardware support for error correction (in all the myriad places in a computer, not just RAM) and cost, and the decision has come down on the side of screw them, they don't need to worry about errors, at least for desktops. Even for better quality servers and workstations, there are still a large number of places where the hardware simply doesn't check for errors. And, in many cases, the hardware alone is unable to check for errors and data corruption.

    So, to think that your wonderful computer today is some sort of accurate calculating machine is completely wrong! Bit rot and bit flipping happens very frequently for a simple reason: error rates per billion operations (or transmissions, or whatever) have essentially stayed the same for the past 30 years, while every other component (and bus design, etc.) is pretty much following Moore's Law. The ZFS community is particularly worried about disks, where the hard error rates are now within two orders of magnitude of the disk's capacity (e.g. for a 1TB disk, you will have a hard error for every 100TB or so of data read/written). But, there's problems in on-die CPU caches, bus line transmissions, SAS and FC cable noise, DRAM failures, and a whole host of other places.

    Bottom line here: the more research we can do into figuring out how to cope with the increasing frequency of errors in our hardware, the better. I'm not sure that we're going to be able to force a re-write of applications, but certainly, this kind of research and possible solutions can be taken care of by the OS itself.

    Frankly, I liken the situation to that of using IEEE floating point to calculate bank balances: it looks nice and a naive person would think it's a good solution, but, let me tell you, you come up wrong by a penny more often that you would think. Much more often.

    -Erik

  • Amen to that. (Score:1, Interesting)

    by Anonymous Coward on Sunday May 30, 2010 @12:58PM (#32397942)

    As a game developer, I used to not think about the possibility of hardware errors much. Until I had a very difficult-to-pin-down bug which turned out to be a hardware defect in a single L2 cache line on the console hardware. It worked fine for the first few years of its life, and then this one bit developed a "stickyness" so that sometimes it would return the wrong value and cause our software to crash.

    We then ran an exhaustive RAM reading and writing test on all of the devkits our team was using, and it turned up *three more* kits that couldn't read and write correctly to all of their RAM. These are $10,000 devkits, but their reliability is about the same as the consumer hardware people have at home. Is it any surprise that consoles often need to be shipped back to MS or Sony or Nintendo and replaced? It no longer surprises me in the least.

    Desktop computers may be a little ahead of consoles in reliability, but when you're doing BILLIONS of calculations every second, its inevitable that random physical quirks (cosmic ray strike or whatever) will mess one of them up sooner or later. Anyone who overclocks their PC knows that if you OC too much you start to fail the reliability tests because errors are creeping in.

The number of arguments is unimportant unless some of them are correct. -- Ralph Hartley

Working...