Forgot your password?
typodupeerror
Hardware News Technology

When Mistakes Improve Performance 222

Posted by kdawson
from the let's-change-everything dept.
jd and other readers pointed out BBC coverage of research into "stochastic" CPUs that allow communication errors in order to reap benefits in performance and power usage. "Professor Rakesh Kumar at the University of Illinois has produced research showing that allowing communication errors between microprocessor components and then making the software more robust will actually result in chips that are faster and yet require less power. His argument is that at the current scale, errors in transmission occur anyway and that the efforts of chip manufacturers to hide these to create the illusion of perfect reliability simply introduces a lot of unnecessary expense, demands excessive power, and deoptimises the design. He favors a new architecture, that he calls the 'stochastic processor,' which is designed to handle data corruption and error recovery gracefully. He believes he has shown such a design would work and that it would permit Moore's Law to continue to operate into the foreseeable future. However, this is not the first time someone has tried to fundamentally revolutionize the CPU. The Transputer, the AMULET, the FM8501, the iWARP, and the Crusoe were all supposed to be game-changers but died cold, lonely deaths instead — and those were far closer to design philosophies programmers are currently familiar with. Modern software simply isn't written with the level of reliability the stochastic processor requires (and many software packages are too big and too complex to port), and the volume of available software frequently makes or breaks new designs. Will this be 'interesting but dead-end' research, or will Professor Kumar pull off a CPU architectural revolution really not seen since the microprocessor was designed?"
This discussion has been archived. No new comments can be posted.

When Mistakes Improve Performance

Comments Filter:
  • by sourcerror (1718066) on Saturday May 29, 2010 @06:41PM (#32392546)

    This system may very well move the burden of error correction from the hardware to the software in such a way that there is no net gain.

    People said the same about RISC processors.

  • by TheThiefMaster (992038) on Saturday May 29, 2010 @06:42PM (#32392562)

    Especially a JMP (GOTO) or CALL. If the instruction is JMP 0x04203733 and a transmission error makes it do JMP 0x00203733 instead, causing it to attempt to execute data or an unallocated memory page, how the hell can it recover from that? It could be even worse if the JMP instruction is changed only subtly, jumping only a few bytes too far or too close could land you the wrong side of an important instruction that throws off the entire rest of the program. All you could do is to detect the error/crash and restart from the beginning and hope. What if the error was in your error detection code? Do you have to check the result of your error detection for errors too?

  • by Anonymous Coward on Saturday May 29, 2010 @06:48PM (#32392618)

    More importantly, if the software is more robust so as to detect and correct errors, then it will require more clock cycles of the CPU and negate the performance gain.

    This CPU design sounds like the processing equivalent of a perpetual motion device. The additional software error correction is like the friction that makes the supposed gain impossible.

  • by bug1 (96678) on Saturday May 29, 2010 @06:59PM (#32392738)

    Ethernet is an improvement over than token ring, yet Ethernet has collisions and token ring doesn't.

    Token ring avoids collisions, Ethernet accepts collisions will take place but has a good error recovery system.

  • by Chowderbags (847952) on Saturday May 29, 2010 @07:19PM (#32392908)
    Moreover, if the processor goofs on the check, how will the program know? Do we run every operation 3 times and take the majority vote (then we've cut down to 1/3rd of the effective power)? Even if we were to take the 1% error rate, given that each core of CPUs right now can run billions of instructions per second, this CPU will fail to check correctly every second (even checking, rechecking, and checking again every single operation). And what about memory operations? Can we accept errors in a load or store function? If so, we can't in practice trust our software to do what we tell it. (change a bit on load and you could do damn near anything from adding the wrong number, to saying an if statement is true when it should be false, to not even running the right fricken instruction.

    There's a damn good reason why we want our processors to be rock solid. If they don't work right, we can't trust anything they output.
  • by Anonymous Coward on Saturday May 29, 2010 @07:30PM (#32392980)

    I've read something similar to this in the past and the example they used is video playback. If a few pixels in a video frame are rendered incorrectly the end user probably won't even notice. I think the likely applications if this is in video decoders and gaming graphics.

  • by Angst Badger (8636) on Saturday May 29, 2010 @07:31PM (#32392986)

    ...the problem is software. In the last twenty years, we've gone from machines running at a few MHz to multicore, multi-CPU machines with clock speeds in the GHz, with corresponding increases in memory capacity and other resources. While the hardware has improved by several orders of magnitude, the same has largely not been true of software. With the exception of games and some media software, which actually require and can use all the hardware you can throw at them, end user software that does very little more than it did twenty years ago could not even run on a machine from 1990, much less run usably fast. I'm not talking enterprise database software here, I'm talking about spreadsheets and word processors.

    All of the gains we make in hardware are eaten up as fast or faster than they are produced by two main consumers: useless eye-candy for end users, and higher and higher-level programming languages and tools that make it possible for developers to build increasingly inefficient and resource-hungry applications faster than before. And yes, I realize that there are irresistible market forces at work here, but that only applies to commercial software; for the FOSS world, it's a tremendous lost opportunity that appears to have been driven by little more than a desire to emulate corporate software development, which many FOSS developers admire for reasons known only to them and God.

    It really doesn't matter how powerful the hardware becomes. For specialist applications, it's still a help. But for the average user, an increase in processor speed and memory simply means that their 25 meg printer drivers will become 100 meg printer drivers and their operating system will demand another gig of RAM and all their new clock cycles. Anything that's left will be spent on menus that fade in and out and buttons that look like quivering drops of water -- perhaps next year, they'll have animated fish living inside them.

  • by DavidR1991 (1047748) on Saturday May 29, 2010 @08:04PM (#32393210) Homepage

    ...that the Transmeta Crusoe processor has sod-all to do with porting or different programming models. The whole point of the Crusoe was that it could distil down various types of instruction (e.g. x86, even Java bytecode) to native instructions it understood. It could run 'anything' so to speak, given the right abstraction layer in between

    Its lack of success was nothing to do with programming - just that no one needed a processor that could these things. The demand wasn't there

  • by TheGratefulNet (143330) on Saturday May 29, 2010 @08:17PM (#32393308)

    in fact, its the randomness of ethernet (back off and retry at random non-matching intervals) that gets you order. if everyone used the same backoff timers, they'd keep colliding; but add in some randomness and things work better.

    increase entropy to ensure order. ha! but its true.

  • by Interoperable (1651953) on Saturday May 29, 2010 @08:34PM (#32393432)

    I did some digging and found some material by the researcher, unfiltered by journalists. I don't have any background in processor architecture but I'll present what I understood. The original publications can be found here [illinois.edu].

    The target of the research is not general computing, but rather low-power "client-side" computing, as the author puts it. I understand this to be decoding application, such as voice or video in mobile devices. Furthermore, the entire architecture would not be stochastic, but rather it would contain some functional blocks that are stochastic. I think the idea is that certain mobile hardware devices devote much of their time to specialized applications that do not require absolute accuracy.

    A mobile phone may spend most of it's time being used encode/decode low resolution voice and video and would have significant blocks within the processor devoted to those tasks. Those tasks could be considered error tolerant. The operating system would not be exposed to error-prone hardware, only applications that use hardware acceleration for specialized, error-tolerant tasks. In fact, the researchers specifically mention encoding/decoding voice and video and have demonstrated the technique on encoding h.264 video.

  • by Anonymous Coward on Saturday May 29, 2010 @08:35PM (#32393434)

    I think Mr Kumar is confusing the performance of the designer to develop a useful power effecient product today on a modern process with the performance of the end result. There is no law or provable proposition that that a useful processor needs to be sloppy to outperform a neat competitor. This only holds true when you fail to include the cost of being sloppy and limit the intelligence and creativity of the designer. Any figures you produce to prove your point are by definition limited to a narrow limited set of defined tradeoffs.. It does not represent what is possible when someone smarter than yourself is desinging a solution.

    The space is difficult and getting more and more so. Deal with it or find another job. For quite some time now innovations in the space have always come from techniques to mitigate complexity and error. When your designing non-trivial ASICs its what you do.

    In certain areas analog computers make sense. Heck our brains are analog computers but asking a classic computing environment to check itself is a non-starter in terms of any product users will accept.

    Circut design is somewhat of an art. There are an infinite array of subtle tradeoffs and clever hacks one can use to improve performance such as use of crosstalk to bootstrap charging of neighboring caps, clock gating, distributed clocking, intentional glitching, even the use of analog circuts in certain limited roles.

    What pisses me off the most about articles like this is that designers suffer from tunnel vision and therefore act like morons. I mean look at a modern desktop PC. Intel et al tout their speedstep, bus power management, LPC..etc to save energy and they have epicly failed. Why does a computer doing absoultely nothing need to use >100 watts to sit idle? If they can't get reasonable power scaling from clock gating then why not just design an idle processor thats slow and stupid (ie ATOM) and shut the other crap down when its not needed. If people really cared about power there are a lot of realitivly low tech solutions that would work and make huge dents in world demand for energy to power electronics.

    But we still have a situation where GPU designers would rather let their processors idle at 70c to protect against temp gradients and not have to account for effects of temperature changes on their circuts.

  • by Interoperable (1651953) on Saturday May 29, 2010 @08:46PM (#32393494)

    The research [illinois.edu] is targeted specifically at dedicated audio/video encoding/decoding blocks within the processors of mobile devices and similar error-tolerant applications. The journalist just didn't mention the fact that the idea isn't to expose the entire system to fault-prone components. When considered in the light that the most power-sensitive mainstream devices (cell-phones) spend most of their time doing these error-tolerant tasks, the research becomes quite interesting. They claim to have demonstrated the effectiveness of the technique to encode an h.264 video.

  • For this, I'd point to the RISC vs. CISC debate. RISC programs took many more instructions to do the same things, but the gain in performance was so great that you ended up with greater performance. Extra steps = some amount of overhead, but so long as the net gain is greater than the net overhead, you will gain overall. The RISC chip demonstrated that such examples really do exist in the real world.

    But just because there are such situations, it does not automatically follow that more steps always equals greater performance. It may be that better error-correction techniques in the hardware would handle the transmission errors just fine without having to alter any software at all. It depends on the nature of the errors (bursty vs randomly-distributed, percent of signal corrupted, etc) as to what error-correction would be needed and whether the additional circuitry can be afforded.

    Alternatively, the problem may in fact turn out to be a solution. Once upon a time, electron leakage was a serious problem in CPU designs. Then, chip manufacturers learned that they could use electron tunneling deliberately. The cause of these errors may be further electron leakage or some other quantum effect, it really doesn't matter. If it leads to a better practical understanding of the quantum world to the point where the errors can be mitigated and the phenomenon turned to the advantage of the engineers, it could lead to all kinds of improvement.

    There again, it might prove redundant. There are good reasons for believing that "Processor In Memory" architectures are good for certain types of problem - particularly for providing a very standard library of routines, but certain opcodes can be shifted there as well. There is also the Cell approach, which is to have a collection of tightly-coupled but specialized processors of different kinds. A heterogenious cluster with shared memory, in effect. If you extend the idea to allow these cores to be physically distinct, you can offload from the original CPU that way. In both cases, you distribute the logic over a much wider area without increasing the distance signals have to travel (you may even shorten the distance). As such, you can eliminate a lot of the internal sources of errors.

    It may prove redundant in other ways, too. There are plenty of cooling kits these days that will take a CPU to much lower temperatures. Less thermal noise may well result in fewer errors, since that is a likely source of some of them. Currently, processors often run at 40'C - and I've seen laptop CPUs reach 80'C. If you can bring the cores down to nearly 0'C and keep them there, that should have a sizable impact on whether the data is being transmitted accurately. The biggest change would be to modify the CPU casing so that heat is reliably and rapidly transferred from the silicon. I would imagine that you'd want the interior of the CPU casing to be flooded with something that conducts heat but not electricity - fluorinert does a good job there - and then have the top of the case to also be an extra-good heat conductor (plastic only takes you so far).

    However, if programs were designed with fault-tolerence in mind, these extra layers might not be needed. You might well be able to get away with better software on a poorer processor. Indeed, one could argue that the human brain is an example of an extremely unreliable processor whose net processing power (even allowing for the very high error rates) is far far beyond any computer yet built. This fits perfectly with the professor's description of what he expects, so maybe this design actually will work as intended.

  • by Draek (916851) on Saturday May 29, 2010 @09:02PM (#32393604)

    for the FOSS world, it's a tremendous lost opportunity that appears to have been driven by little more than a desire to emulate corporate software development, which many FOSS developers admire for reasons known only to them and God.

    You yourself stated that high-level languages allow for a much faster rate of development, yet you dismiss the idea of using them in the F/OSS world as a mere "desire to emulate corporate software development"?

    Hell, you also forgot another big reason: high-level code is almost always *far* more readable than its equivalent set of low-level instructions, the appeal of which for F/OSS ought to be similarly obvious.

    Sorry but no, the reason practically the whole industry has been moving towards high-level languages isn't because we're all lazy, and if you worked in the field you'd probably know why.

  • by pipatron (966506) <pipatron@gmail.com> on Saturday May 29, 2010 @09:04PM (#32393618) Homepage

    Not very insightful. You seem to say that a CPU today is error-free, and if this is true, the part of the new CPU that does the checks could also be made error-free so there's no problem.

    Well, they aren't rock-solid today either, so you can not trust their output even today. It's just not very likeley that there will be a mistake. This is why mainframes execute a lot of instructions at least twice, and decides on-the-fly if something went wrong. This idea is just an extension of that.

  • by Homburg (213427) on Saturday May 29, 2010 @09:13PM (#32393676) Homepage

    So you're posting this from Mosaic, I take it? I suspect not, because, despite your "get off my lawn" posturing, you recognize in practice that modern software actually does do more than twenty-year-old software. Firefox is much faster and easier to use than Mosaic, and it also does more, dealing with significantly more complicated web pages (like this one; and terrible though Slashdot's code surely is, the ability to expand comments and comment forms in-line is a genuine improvement, leaving aside the much more significant improvements of something like gmail). Try using an early 90s version of Word, and you'll see that, in the past 20 years word processors, too, have become significantly faster, easier to use, and capable of doing more (more complicated layouts, better typography).

    Sure, the laptop I'm typing this on now is, what, 60 times faster than a computer in 1990, and the software I'm running now is neither 60 times faster nor 60 times better than the software I was running in 1990. But it is noticeably faster, at the same time that it does noticeably more and is much easier to develop for. The idea that hardware improvements haven't led to huge software improvements over the past 20 years can only be maintained if you don't remember what software was like 20 years ago.

  • by rdnetto (955205) on Sunday May 30, 2010 @12:09AM (#32394456)

    a desire to emulate corporate software development, which many FOSS developers admire for reasons known only to them and God.

    Probably because the major FOSS developers are in corporate software development.

The reason why worry kills more people than work is that more people worry than work.

Working...