Hardware

Revolutionizing x86 CPU Performance

NickSD writes "ChipGeek has an interesting article on increasing x86 CPU performance without having to redesign or throw out the x86 instruction set. Check it out at geek.com."

  • Why? (Score:4, Insightful)

    by Anonymous Coward on Friday October 11, 2002 @08:39AM (#4431239)
    Shouldn't we improve bus speed, data access speeds, etc etc first? After all, the bottleneck is not the processor anymore...
    • Cache is the key (Score:3, Insightful)

      by Anonymous Coward
      I've got three words for you: cache, cache and cache.

      Why do you think Pentium Pro was such a huge success that it's still being used in CPU intensive operations? Why do you think Sun Sparc and Digital/Samsung Alpha CPUs trash modern Pentium 4s and Athlons at 500 MHz? Yup. Loads and loads of cache.

      • Re:Cache is the key (Score:5, Informative)

        by Anonymous Coward on Friday October 11, 2002 @10:41AM (#4431970)
        Cache is a huge Intel problem. 20K [geek.com] L1 for P4, down from 32K since the Pentium MMX. Even the Itanium2 [geek.com] only has 32K.

        AMD has 128K L1 since the original Athlon [geek.com], and had 24K in the K5 [geek.com].

        The Transmeta 3200 [geek.com] and the Motorola G4 [geek.com] both have 96K, the UltraSparc-III [geek.com] has 100K, Alpha [geek.com] had 128K when it died, and HP's PA-8500 [geek.com] has a whopping 1.5MB.

        They may throw big chunks of L2 at the problem, but it seems to me that so little L1 means more time moving data and less time processing...
        • by mmol_6453 ( 231450 ) <short,circuit&mail,grnet,com> on Friday October 11, 2002 @01:22PM (#4433341) Homepage Journal
          20K L1 for P4, down from 32K since the Pentium MMX. Even the Itanium2 [geek.com] only has 32K.

          Just for people who don't know, Intel reduced the amount of cache when they moved from the P3 to the P4. And hardware junkies know the performance hit that caused.

          A seemingly unrelated sidenote: Intel wants to move to their IA-64 system, and, since it's not backwards-compatible, they're going to have to force a grass-roots popular movement to pull it off.

          Perhaps they crippled the P4 to make the IA-64 processors look even faster to the general public?

          In any case, I think the quality of the P4 is a sign that Intel wants to make its move soon. (Though losing $150 million [slashdot.org], not to mention the context in which they lost it, may set back their schedule, giving AMD's 64-bit system a chance to catch on.)
          • Re:Cache is the key (Score:4, Informative)

            by orz ( 88387 ) on Friday October 11, 2002 @05:01PM (#4434688)
            Intel's processors are not crippled by small L1 cache. Yes, the P3 and P4 L1 caches are WAY smaller than the Athlon L1 cache, but Intel doesn't NEED a large L1 cache, because their L2 cache is extremely fast. Intel tends to have small, extremely fast L1 caches, and makes up for the higher miss rate with fast L2 caches as well. For instance, the P3 L1 cache has a miss rate roughly twice as high as the Athlon's L1 cache, but the P3's L1 miss penalty is roughly 8 cycles (assuming an L2 hit...), less than half the Athlon's L1 miss penalty of 20+ cycles on an L2 hit. Also, the P4's L1 cache, which is even smaller than the P3's, allows them to decrease the L1 hit latency AND run at a substantially higher clock speed than AMD's larger cache.

            For a graphical depiction of the difference between Intel and AMD cache performances, try this link:
            http://www.tech-report.com/reviews/2002q1/northwood-vs-2000/index.x?pg=3
            It was the first thing that came up in a Google search for linpack and "cache size".
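The trade-off described above can be put into rough numbers. The sketch below multiplies miss rate by miss penalty to get the average stall cycles that L1 misses add per access. The 8-cycle and 20-cycle penalties and the "twice as high" ratio are taken from the comment; the absolute miss rates are hypothetical placeholders, and `miss_cycles_per_access` is a name invented for this illustration.

```python
# Average L1 miss cost per memory access = miss_rate * miss_penalty.
# Penalties (8 and 20 cycles) and the 2x miss-rate ratio come from the
# comment above; the absolute miss rates are hypothetical placeholders.

def miss_cycles_per_access(miss_rate: float, miss_penalty_cycles: float) -> float:
    """Expected stall cycles added to each access by L1 misses that hit in L2."""
    return miss_rate * miss_penalty_cycles

ATHLON_MISS_RATE = 0.02              # assumed, for illustration only
P3_MISS_RATE = 2 * ATHLON_MISS_RATE  # "roughly twice as high"

p3_cost = miss_cycles_per_access(P3_MISS_RATE, 8)
athlon_cost = miss_cycles_per_access(ATHLON_MISS_RATE, 20)

print(f"P3:     {p3_cost:.2f} stall cycles per access")
print(f"Athlon: {athlon_cost:.2f} stall cycles per access")
```

Under these assumptions the smaller cache comes out slightly ahead, because the penalty ratio (20/8) is larger than the miss-rate ratio (2). With different miss rates or an L2 miss in the picture, the result could differ.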
      • by Sivar ( 316343 ) <charlesnburns[.]gmail@com> on Friday October 11, 2002 @12:05PM (#4432595)
        I've got three words for you: cache, cache and cache.

        Why do you think Pentium Pro was such a huge success that it's still being used in CPU intensive operations? Why do you think Sun Sparc and Digital/Samsung Alpha CPUs trash modern Pentium 4s and Athlons at 500 MHz? Yup. Loads and loads of cache.

        No. First, Alphas and SPARCs do not trash modern x86 CPUs: the Pentium IV 2.8GHz and Athlon XP 2800+ are the fastest CPUs in the world for integer math, and the Itanium 2 is the fastest in the world for floating point math.
        Cache memory is only useful until it is large enough to contain the working set of the primary application being run. Larger cache can improve performance further, but after the cache can contain the working set, the gain is in the single digit percents. The working set of the vast, vast majority of applications is under 512K, and most are under 256K. You'll find that increasing the speed of a small cache is generally more important than increasing the size of the cache.
        Case in point: When the Pentium 3 and Athlon went from a large (512K) to a small (256K) faster cache, performance went up, for the Athlon by about 10% and for the Pentium 3...I don't recall, but around 10%.
        Some desktop apps, like SETI@Home, have a large working set (more than 512K) and DO benefit from large caches, but nothing larger than 1MB would improve performance here either.

        Most server CPUS, like Alphas and SPARCS, have fairly large caches for the following reasons:

        1) Databases love large caches. They are one of the few applications that can take advantage of a large cache, because they can store lookup tables of arbitrary size in cache. Server CPUs are often used for databases because Joe x86 CPU is just fine for webservers, FTP servers, desktop systems, etc. and is generally faster at them than server CPUs.

        2) Most server class CPUs are fully 64-bit and do NOT support register splitting. On the SPARC64, for example, if you want to store an integer containing the number "42", that integer will take up a full 64 bits regardless of the fact that the register can store numbers up to 18,446,744,073,709,551,615. This larger size increases the cache size needed to store the working set of programs, because all integers (and many other data primitives) require a full 64 bits or more. With x86 CPUs, which support register splitting and have only 32-bit registers, that number could be stored in a mere eight bits, which is the square root of the number of bits the SPARC requires.

        3) Big servers with multiple CPUs are often expected to run multiple apps, all of which are CPU intensive. If the cache can store the working set for all of them, speed is slightly improved.

        That said, who in their right mind would use an incredibly slow Pentium Pro for a CPU intensive calculation? A Pentium Pro at the highest available speed, 200MHz, with 2MB cache may be able to outperform a Celeron 266, but not by much and only for very specific cache-hungry software. Show me a person that thinks a Pentium Pro with even 200GB of cache can outperform ANY Athlon system and I will show you a person that hasn't a clue what they are talking about.

        Look at the performance difference between the Pentium IV with 256K and with 512K (a doubling) of cache. You will have to do some research to find an application that gets even a 10% performance boost.

        FYI
        If you are interested in competent, intelligent, technical reviews of hardware, you might like
        www.aceshardware.com
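The arithmetic behind point 2 above can be checked directly. This short sketch prints the largest value an unsigned 64-bit register can hold and the number of bits the value 42 needs, then compares the storage for a batch of small integers at 8, 32, and 64 bits each. It takes the comment's premise (every integer occupies a full register width) at face value; the count of 1,000 integers is a hypothetical workload chosen only for illustration.

```python
# Largest value an unsigned 64-bit register can hold, and how few bits
# the value 42 actually needs.
MAX_U64 = 2**64 - 1
print(f"{MAX_U64:,}")         # largest unsigned 64-bit value
print((42).bit_length())      # bits needed to represent 42

# Storage for 1,000 small integers at different widths (hypothetical workload).
COUNT = 1000
for bits in (8, 32, 64):
    print(f"{bits:2d}-bit: {COUNT * bits // 8} bytes")
```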
    • Re:Why? (Score:3, Interesting)

      by io333 ( 574963 )
      Because if you had read the article you'd realize that this is essentially a zero cost, backwards compatible method of dramatically increasing program execution speed several orders of magnitude -- so the question is really, "Why not?"
      • dramatically increasing program execution speed several orders of magnitude

        Where did you read this?

        Also, even with the hardware bottlenecks the Anonymous Coward mentions?

        You ask yourself "why not?"... I can only ask myself "how?" :-) Sure, I see how it's supposed to work in theory with zero bottlenecks, but how it works in practice is a completely different thing.
      • Re:Why? (Score:5, Interesting)

        by DustMagnet ( 453493 ) on Friday October 11, 2002 @09:29AM (#4431513) Journal
        I read the article. From what I can see this guy writes lots of assembly, but knows very little about how processors are designed. The huge gains you all see have already been made by register renaming and caches. There might be some gain left by giving the compiler direct control over these, but at the cost of much complexity in the register renaming hardware. The P4 has a very deep pipeline. Looking for register conflicts is hard enough without adding another layer of redirection.

        The fact that the article never mentions register renaming shows the author never did any research into this topic before writing.

        • Re:Why? (Score:3, Interesting)

          by PetiePooo ( 606423 )
          This may boil down to the generic do-it-in-hardware vs. do-it-in-software debate. Do we reorder the instructions in hardware (a la Pentium and Athlon), or make the compiler do it (a la Itanium)? Do we make the hardware predict branches or have the compiler drop hints? Register renaming as done by modern RISC-core x86 implementations likely addresses many of the issues that he proposes an extension and a smart compiler (or assembler) would solve. Now, a 386, that would benefit from his technique.

          However, if we're going to revise that architecture, I say we add MMX and call it a 486. Then, we can add SSE and call it a Pentium.. And then, ...

          Oh, wait. Never mind.
        • Re:Why? (Score:5, Informative)

          by Kz ( 4332 ) on Friday October 11, 2002 @10:33AM (#4431932) Homepage
          Damn Right!

          Register renaming already does what's being proposed here, but transparently. In fact, most of the instruction reordering done by a good optimizing compiler (and later by the out-of-order dispatching unit) aims to increase parallelism on register usage.

          Of course RISC processors are so much nicer to work with because of their large, flat register files (at least 16 or 32 registers, all of them equally usable), but that's not possible with existing x86 architecture.

          P4 processors have 128 registers available for register renaming. Using all of them is not so easy, so Hyperthreading (still only on Xeon) tries to bring two different processes into the instruction mix, keeping their renaming maps separate, so the dispatching unit has more noncolliding instructions ready for execution. This won't make one CPU as fast as 2, but it does keep that insanely deep pipeline from getting filled with bubbles (or would that be 'empty of instructions'?)
          • Re:Why? (Score:3, Informative)

            by Chris Burke ( 6130 )
            Register renaming already does what's being proposed here, but transparently.

            Well, not exactly. Renaming takes care of the case where two things write, say, EAX, by allowing both to go with different physical registers. I.E. you don't have to stall because you only have one architected "EAX" register.

            However, your program is still limited to only 8 visible values at any instant. So when you need a 9th -thing- to keep around, you have to spill some registers onto the stack. Register renaming doesn't solve this problem.

            His idea would, but it's still a stupid idea. :P
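The distinction drawn above can be made concrete with a toy model. The sketch below hands every register write a fresh physical register, which is the essence of renaming: two writes to EAX no longer collide. It is a simplified illustration, not how any particular CPU implements it; the function name `rename`, the `p0`, `p1`, ... naming, and the unlimited physical pool are all assumptions made for this example.

```python
from itertools import count

def rename(instructions):
    """Map each architected destination to a fresh physical register.

    instructions: list of (dest, sources) tuples using architected names.
    Returns the same list rewritten with physical register names.
    """
    fresh = count()                      # hypothetical unlimited physical pool
    mapping = {}                         # architected name -> current physical name
    renamed = []
    for dest, sources in instructions:
        # Sources read whatever physical register currently holds the value.
        phys_sources = tuple(mapping.get(s, s) for s in sources)
        # Every write gets a new physical register, removing false dependencies.
        mapping[dest] = f"p{next(fresh)}"
        renamed.append((mapping[dest], phys_sources))
    return renamed

program = [
    ("EAX", ("EBX",)),   # EAX <- f(EBX)
    ("ECX", ("EAX",)),   # ECX <- f(EAX)   (true dependency on the first write)
    ("EAX", ("EDX",)),   # EAX <- f(EDX)   (independent second write to EAX)
]
for line in rename(program):
    print(line)
```

Note what the model does not change: the program text can still only name the architected registers, so a ninth simultaneously live value still has to be spilled to the stack.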

          • Re:Why? (Score:5, Informative)

            by MajroMax ( 112652 ) on Friday October 11, 2002 @11:59AM (#4432523)
            Of course RISC processors are so much nicer to work with because of their large, flat register files (at least 16 or 32 registers, all of them equally usable), but that's not possible with existing x86 architecture.

            Although I would like to take this opportunity to point out that AMD's x86-64 (Opteron) architecture increases the number of GP and XMM (used for SSE instructions) registers to 16 each.

        • by Chris Burke ( 6130 ) on Friday October 11, 2002 @10:46AM (#4432012) Homepage
          Yes, he basically invented register renaming, but put it under explicit programmer control. It's a programmer's solution to what hardware has already done, and as was inevitable he doesn't see that he will do more harm than good.

          Here's why his idea sucks:

          1) Register renaming is dependent on the RMC. You can't issue any instructions if there is a POPRMC in the machine until the POPRMC finishes execution. He calls it "a few cycles", but it's much worse than that. You've prevented any new instructions from entering the window until the stack access is done, preventing any work that -could- have been done in parallel from even being seen. Function call/return overhead is a big deal, and he just kicked it up a notch.

          2) His whole problem #3 -- that you can't explicitly access the upper 16 bits of a 32-bit GPR. All I can say is -- thank God! Being a programmer, he probably doesn't realize that being able to address sub-registers is actually a big problem with x86. The whole sub-register-addressing problem causes all kinds of extra dependencies and merge operations. And he wants to make it worse? I think he should be slapped for this idea. x86-64 had the right idea -- you cannot access -just- the upper 32 bits of a GPR, and when you execute a 32-bit instruction that writes a GPR, the upper 32-bits are not preserved. Which is how the RISCy folks have been doing it all along, but hey.

          3) This idea requires an extra clock cycle in the front-end, to do the translation from architected to the expanded architected register space, prior to being able to do the architected->physical register translation.

          4) Because you still can't address more than 8 registers at a time, you'll be using lots of MOVRMC instructions in order to make the registers you need visible. Ignore how horrible this would make it for people writing assembly ("Okay, so now EAX means GPR 13?") or compilers; this is going to result in a lot of code bloat.

          5) Because of 1) and 4), modern superscalar decoders are going to be shot. If you fetch a MOVRMC, followed by POP EAX and POP EBX, you can't decode the second two until -after- you've decoded the MOVRMC and written its values into the map.

          Now all this is so that you can save on loads/stores to the stack. Which is great, but at least when those loads and stores are executing, independent instructions can still go. Every RMC-related stall is exactly that -- no following instruction can make progress.

          Not that increasing the number of registers in x86 isn't a good idea -- it's just his implementation that sucks. With him being an x86 programmer, I'm surprised he didn't think of the most obvious solution -- just add a prefix byte to extend the size of the register identifiers in the ModR/M and SIB bytes. You get access to ALL GPRs at once (rather than a 8-register window), no extra stalls are required, and your code size only goes up by one byte for instructions that use the extra registers.

          I can't help but commend him on his idea being well-thought out. To the best of his knowledge, he tried to address all issues. But that's the problem -- he's a programmer, not a computer architect.
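The prefix-byte alternative suggested above can be sketched in a few lines. A ModR/M byte has two 3-bit register fields, so it can name only 8 registers; a prefix byte can carry a fourth bit for each field, giving 16. The layout used here (a 0x40 base with one extension bit per field) is an assumption for this illustration, although it is similar in spirit to the prefix that AMD's x86-64 uses for its 16 registers. `encode_reg_reg` is a hypothetical helper, not part of any real assembler.

```python
def encode_reg_reg(reg: int, rm: int) -> bytes:
    """Encode a register-to-register operand pair for 16 registers.

    Returns a ModR/M byte, preceded by a prefix byte only when either
    register number is 8 or higher. The prefix layout is an assumption
    for this sketch.
    """
    if not (0 <= reg < 16 and 0 <= rm < 16):
        raise ValueError("register numbers must be in 0..15")
    modrm = (0b11 << 6) | ((reg & 7) << 3) | (rm & 7)   # mod=11: register direct
    ext_reg, ext_rm = reg >> 3, rm >> 3                  # the fourth bit of each
    if ext_reg or ext_rm:
        prefix = 0x40 | (ext_reg << 2) | ext_rm          # hypothetical prefix layout
        return bytes([prefix, modrm])
    return bytes([modrm])

print(encode_reg_reg(0, 3).hex())    # legacy registers: ModR/M only
print(encode_reg_reg(9, 13).hex())   # extended registers: prefix + ModR/M
```

Code that sticks to the original 8 registers is encoded exactly as before, and code that uses the new ones pays one extra byte per instruction, which matches the cost estimate in the comment.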
      • Re:Why? (Score:5, Insightful)

        by sql*kitten ( 1359 ) on Friday October 11, 2002 @09:32AM (#4431527)
        Because if you had read the article you'd realize that this is essentially a zero cost, backwards compatible method of dramatically increasing program execution speed several orders of magnitude -- so the question is really, "Why not?"

        It does not matter how fast your CPU is if it spends a significant amount of its time waiting for main memory access. All that happens is that it's doing more NOPs/sec, which isn't terribly useful. That's why industrial-grade systems have fancy buses like the GigaPlane.
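A back-of-the-envelope model shows why. If only the compute portion of a run scales with CPU speed while the time stalled on main memory stays fixed, even a large CPU speedup yields a modest overall gain. The workload split below (4 seconds computing, 6 seconds stalled) is hypothetical, and `run_time` is a name made up for this sketch.

```python
def run_time(compute_s: float, memory_stall_s: float, cpu_speedup: float) -> float:
    """Total run time when only the compute portion scales with CPU speed."""
    return compute_s / cpu_speedup + memory_stall_s

# Hypothetical workload: 4 s of computation, 6 s stalled on main memory.
baseline = run_time(4.0, 6.0, cpu_speedup=1.0)
faster = run_time(4.0, 6.0, cpu_speedup=10.0)
print(f"baseline {baseline:.1f} s, 10x CPU {faster:.1f} s, "
      f"overall speedup {baseline / faster:.2f}x")
```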
      • by purrpurrpussy ( 445892 ) on Friday October 11, 2002 @09:42AM (#4431586)
        You are VERY confused.

        1 - Zero Cost. 2 - Backwards Compatible. 3 - Orders of magnitude.

        1 - You have to buy new chips - this will improve the speed of "computing" but it will not increase the speed of THIS computer I have right HERE.

        2 - No old code has RM/RMC instructions in it and will NOT run any faster than it already does in a "standard" x86 mode. Yes it is backwards compat. but by the same token so is MMX, EMMX, 3DNOW!, SSE, SSEII, AA64 etc....

        3 - Anyone who can sell me a program to "suddenly" make all my code go 10x or 100x faster is guaranteed to give me a good chuckle!!!!!!!

        As for the article... well you've hugely increased the number of bits it takes to address a register and swapping the RM register is going to cause all sorts of new dependency chains inside the chip.

        Personally.... I'd go for a stack machine. Easily the most efficient compute engine.

        Now - if we could get back to point number 1 and point number 3. If YOU can make MY computer go 10 or 100 times FASTER with SOFTWARE I promise I WILL give YOU some MONEY.... ;-)
    • Re:Why? (Score:3, Insightful)

      by Anonymous Coward
      You don't get it...

      What Intel is currently doing is putting a turbo on an old and obsolete architecture.

      By having more GP registers, you could do the same job more easily and with better performance (and easier to read if you code in ASM).
      As it is now, you need too many memory accesses for simple operations.
      With more registers, you would need less clock speed.

      It's not all about MHz's.
      • Re:Why? (Score:5, Insightful)

        by Junks Jerzey ( 54586 ) on Friday October 11, 2002 @10:01AM (#4431717)
        With more registers, you would need less clock speed.

        Have you ever looked at the function entry and exit for processors like MIPS or PowerPC? There can easily be 20-40 instructions (at 4 bytes per instruction) to save and restore registers. Sometimes fewer registers is a win.
        • Re:Why? (Score:5, Interesting)

          by p3d0 ( 42270 ) on Friday October 11, 2002 @11:15AM (#4432207)
          Have you ever looked at the function entry and exit for processors like MIPS or PowerPC? There can easily be 20-40 instructions (at 4 bytes per instruction) to save and restore registers. Sometimes fewer registers is a win.
          Ridiculous. You're saying that architectures with lots of regs are inferior because they make you save lots of registers at certain times, but reg-starved architectures make you save them all the time, all over the place, in any code that feels the slightest register pressure.

          At best, the problem you describe indicates those architectures use too many callee-save registers in their calling conventions. Having more caller-save registers is a pure win from this perspective.

    • Re:Why? (Score:2, Informative)

      by hatchet ( 528688 )
      You do not understand how a computer really works. If you have more registers, more instructions for manipulating those registers and finally more cache, you don't need high bus speeds. The processor won't need to get much data from memory anyway, because it will have 99% of what it needs already in registers & internal cache.
      We must not forget that most operations processor does are data movements and not calculations.

      All three x86 problems which are described by article author are fixed with IA-64 architecture, but not so with AMD's x86-64.
      • Re:Why? (Score:3, Interesting)

        by fstanchina ( 564024 )

        And you don't understand how modern processors really work. First, they have several levels of cache memory exactly because you don't want to go out to main memory too often. Second, they have many more registers than the assembly instructions can see: register renaming, speculative execution and all those tricks reorganize instructions so that the CPU core doesn't really have to move data back and forth between memory and registers so often.

        IANACPUD (I'm not a CPU designer) so I'm not going to try to describe this stuff further, but articles abound. Here are some: Into the K7, Part One [arstechnica.com] and Into the K7, Part Two [arstechnica.com]

    • Re:Why? (Score:5, Insightful)

      by OttoM ( 467655 ) on Friday October 11, 2002 @09:17AM (#4431434)
      Shouldn't we improve bus speed, data access speeds, etc etc first? After all, the bottleneck is not the processor anymore...

      No. Because the whole point is preventing memory access. High bandwidth busses are very expensive. If you have a lot of registers, you can avoid memory accesses, making instructions run at full speed.

      The best way to reduce the impact of a bottleneck is not making the bottleneck wider. It is making sure your data doesn't need to travel through the bottleneck.

      After that, it doesn't hurt to make the bus bandwidth bigger.

  • by von Prufer ( 444647 ) on Friday October 11, 2002 @08:40AM (#4431244)
    That's pretty sweet how he makes the x86 processor faster by adding commands for divx! This guy knows how to improve Intel architecture for the masses!
    • by pokeyburro ( 472024 ) on Friday October 11, 2002 @09:41AM (#4431579) Homepage
      Other new commands:

      LIE Launch IE
      LMW Launch MS Word
      LME Launch MS Excel
      LMO Launch MS Outlook
      LMOV Launch MS Outlook Virus
      LCNR Launch Clippy for No Reason
      DPRN Display Pr0n
      SPOP Show IE Popup
      SPU Spam User
      SHDR Send Hard Drive Contents to Redmond
      RBT Reboot
      SBS Show Blue Screen
      • by WWWWolf ( 2428 )
        RBT Reboot
        SBS Show Blue Screen

        Argh, get this CISC rubbish out of my sight!

        Real people used stuff like jmp $fce2 for the first, but the latter was a little bit more complex because of the blue part: lda #$06 ; sta $d020 ; sta $d021 ; hlt (of course, hlt is an undocumented opcode, and since C64 boots in less than a second from ROM, it hardly is as frustrating as the bluescreen in Windows).

        =)

      • by BasharTeg ( 71923 ) on Friday October 11, 2002 @12:48PM (#4433006) Homepage
        That's rather like the PPC instruction set!

        LPS - Launch Photoshop
        DGB - Do Gaussian Blur
        ES - Encode Sorenson
        DS - Decode Sorenson
        CSAWEF - Create Switch Ad With Ellen Feiss

        And my personal favorite:

        BICPUWPBIGBASE - Beat Intel CPU With Proprietary Benchmark Involving Gaussian Blurs And Sorenson Encoding
  • It's called overclocking.
  • by Adam Rightmann ( 609216 ) on Friday October 11, 2002 @08:46AM (#4431272)
    I'm getting out the soldering iron, and hooking up some more registers to my Celeron 600! Thanks ChipGeek!
  • Simple solution (Score:5, Informative)

    by Anonymous Coward on Friday October 11, 2002 @08:47AM (#4431275)
    Buy Intel's C/C++ compiler (icc) and download the high performance, Intel CPU optimized math libraries from Intel's site.

    The compiler does for you exactly what the article says. It uses MMX and other extensions as well as vectorizes loops and does interprocedural optimization and OpenMP.

    • Interesting!

      It doesn't come as a surprise that Intel had already thought of this before going into simulating dual processors via HyperThreading, NetBursting and several other advanced techniques to improve performance.
  • Silicon complexity (Score:4, Informative)

    by sheddd ( 592499 ) <jmeadlock&perdidobeachresort,com> on Friday October 11, 2002 @08:48AM (#4431278)
    The guy seems to say:

    'Boy this instruction set would be better at the same clock-speed'

    'All they'd have to do is update their verilog code and run it thru synthesis'

    Well they don't make processors from Verilog code; they'd be huge, hot and slow. All this increased complexity he wants would dictate more transistors and a bigger die (along with lots of development time).

    With the above said, it still might be a good idea; I don't know.
    • i designed cpu architecture for an undergrad class last spring. i'm familiar with assembly as well as architecture including pipelining and all that, and i'm not convinced that this solution is all that much of an improvement.

      like the previous poster said, you have to add more hardware. if you want this register-mapping to not take a long time, you need to add the new registers, reconnect the general purpose register address lines through a translator to these registers, and then translate what you get out of there to select the register it's mapped to. this isn't really that big of a deal compared to how much is already on these chips, but somebody has to design it and make it work--this guy didn't do that and i'm pretty sure he's more of a programmer than a hardware guy.

      as for the speed improvement this would provide, i don't think it's as good as he thinks it is. while he mentions that the whole pipeline has to be paused for any instructions that change his mapping registers, he doesn't seem to realize how big of an impact it is to have only one instruction in the pipeline at a time. if you change your register map every other instruction, you've pretty much thrown out any benefit you may have had from a pipeline. this too could be worked around, but it means that every assembly programmer who wants to use these .x instructions would need to understand the effect it has on the pipeline if they actually want to get more speed out of it.

      also, since this involves changing the architecture of the chip, none of these .x instructions will work on any chips that are already out there. what happens if you try to do a couple div.x instructions and find out that it used edx:eax for both when that's not what you wanted? if it gets put in the chips it'll be a while before it's used, imo.

      besides all that, didn't anybody ever tell this guy that 8 (which is actually 6) general purpose registers ought to be enough for anybody?
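The pipeline point above can be roughed out with a simple cost model. Assume an ideal pipeline that retires one instruction per cycle, and assume every register-map change drains it for a fixed number of cycles. Both the model and the 20-cycle drain figure are assumptions for illustration, not measurements of any real chip, and `cycles_per_instruction` is a name invented here.

```python
def cycles_per_instruction(map_change_fraction: float, drain_cycles: int) -> float:
    """Average CPI for an ideal 1-CPI pipeline that drains on every map change."""
    return 1.0 + map_change_fraction * drain_cycles

DRAIN_CYCLES = 20  # assumed pipeline depth, for illustration only
for fraction in (0.0, 0.05, 0.5):
    cpi = cycles_per_instruction(fraction, DRAIN_CYCLES)
    print(f"{fraction:4.0%} map changes -> {cpi:.1f} cycles per instruction")
```

In this model, changing the map on every other instruction (the 50% row) makes each instruction cost roughly eleven cycles on average, which is the "thrown out any benefit" case the comment describes.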
  • Um... yeah.. right (Score:4, Interesting)

    by WPIDalamar ( 122110 ) on Friday October 11, 2002 @08:49AM (#4431283) Homepage
    So this guy wants to make registers virtual... won't this add a lot of silicon to each register, and make every register access slower? Every input that takes a register would need to become a switch instead of just a solid connection.

    From the software side of things, this sounds great... I just wonder how much it would slow down the hardware side.

    Of course I'm no chip designer, but neither was this guy.
    • by Isle ( 95215 ) on Friday October 11, 2002 @09:22AM (#4431462) Homepage
      Registers are already virtual. It is called register renaming and is necessary to gain good speed-ups for superscalar processing(executing more than one instruction at a time).

      The question is whether his register-mapping can play well with normal register renaming. I think it would be trouble. Currently there are only two layers: the physical registers the processor sees and the virtual ones that it exports to the programmer. If this gets added there would be the "real" physical registers the processor maps out, the virtual physical registers the programmer maps out, and the virtual virtual registers that are actually used in normal code.
  • Register Renaming (Score:5, Interesting)

    by PhoenxHwk ( 254106 ) on Friday October 11, 2002 @08:51AM (#4431296) Homepage
    This article says (in a very long-winded way) that he wants to implement something like register renaming for the x86. Register renaming is a common parallel processing technique that gets more parallelism out of code by easing the limitations imposed by the number of registers a user has. One thing that Chipgeek says is that it would require special instructions, etc. This seems sort of backwards because classic renaming techniques are handled automagically by the processor for you. In his case, he is trying to make it explicit in order to allow all registers to be used general-purpose style. I'm not so sure about how worthwhile this is because it will (obviously) require recompiled code for the new extensions - and one of the things that's holding x86 back is binary compatibility.
    • Re:Register Renaming (Score:5, Informative)

      by minektur ( 600391 ) <junk@cli[ ]org ['ft.' in gap]> on Friday October 11, 2002 @09:03AM (#4431364) Homepage Journal
      They already do this internally -- they have a very large number of registers that get aliased to the regular 10 registers - they use some for calculating branch speculations (and they swap in the registers from whichever speculative execution track actually happened). They also switch aliased register sets for every context switch.

      What this guy wants is a way to have user-level control over the register aliases, and it might not be a bad idea, but I don't think he'll see as much gain from it as he expects, since there is lots of magic going on behind the scenes already with register aliasing -- I'm guessing that if he just had separate processes both using registers intelligently, he could get as much done as he could with a single process and more registers. Since the cost of a context switch is already alleviated, there wouldn't be much overhead. The only overhead would be in the parallelizable algorithms themselves. However, we should already be wanting to do that work to take advantage of SMP...
      • by Merlin42 ( 148225 ) on Friday October 11, 2002 @09:14AM (#4431415)
        Hmm so he wants to give the compiler control of some of the automagic optimizations that 'normal' CPUs use these days ... I think I've heard this one before ... Oh yeah, it's called VLIW ... er ... EPIC and it has given us the wondrous ITANIC ... er ... Itanium.
  • Patent! (Score:2, Funny)

    Rick should have patented these ideas or sold them to either AMD or Intel!



    I also noticed the ad on geek.com was for job-geek ... maybe someone at AMD or Intel should consider giving Rick a job! (but he appears to be intelligent enough, so he probably already has one) :)

    If you have any questions, please feel free to e-mail me.

    Maybe "they" will ... there's gotta be just a couple people from Intel or AMD that read /.

    :)

  • Speed speed speed (Score:5, Insightful)

    by Anonymous Coward on Friday October 11, 2002 @08:53AM (#4431306)
    Computer speed is like money. People have got this idea that endless amounts of it will increase our contentment. The problem is there is more to it all than that... What most computer users want to use computers for (internet, chatting, email, typing, solitaire) should be able to be done quite well on a CPU from the 1980s. But instead, there are viruses, awful UIs, vastly bloated software and an end user facing a constant battle to either use the computer or pay the money to buy a Mac ;) Human society is dumping vast amounts of resources on buying new computers, upgrading, and developing ever faster CPUs without actually making damn well designed systems that do what they are meant to, don't break down, are easy to use, are cheap, and last for ages without problems.
    As is usual, our priorities are messed up.
    • I totally agree with you. There are applications where you NEED this kind of speed (research, databases, web-porn industry). As I've worked in 2 out of 3, it is nice to have a computer that can do all of that at home. And it is reasonably priced too.
  • It sounds like a pretty decent idea to me. Granted, I'm no assembly expert (I'm just now in my Microprocessors class, which is based on the Z80), but I don't see how having more registers could be a bad thing. Anything that keeps operations there inside the CPU rather than going out to memory would pretty much have to be faster. I especially like the fact that he's implemented it such that no current code would be affected. THAT is a key point right there.

    Admittedly, even if Intel and AMD decided to implement this, it'd still be a while, and then we'd have to get compilers that compile for those extra instructions, and there's our entire now-legacy code base that doesn't make use of them, and don't forget those ever-present patent issues...

    But yeah. Cool idea, well thought out. Petition for Intel, anyone?

    Mark Erikson
    • by Anonymous Coward
      This is a classic Slashdot comment.

      "Hey I'm currently in my 2nd year at college, but what the heck I think I'm qualified to comment here"

      "I think Intel need to employ this guy, I mean they must have overlooked this"

      "Cool - I wonder if I could think of something like this"

      Don't worry - you will.
  • RISC (Score:5, Interesting)

    by e8johan ( 605347 ) on Friday October 11, 2002 @08:56AM (#4431328) Homepage Journal
    Ok, he realizes that the x86 architecture is flawed. One of the most limiting problems is the lack of general purpose registers (GPR), so he adds more complexity to an already over-complex solution to solve this problem. All I have to say to this is: when will you see that the solution is as simple as switching architecture!

    As most code today is written in higher level languages (C/C++, Java, etc.) all it takes is a recompile and perhaps some patching and adaptations to small peculiarities. The Linux kernel is a proof of this concept, a highly complex piece of code portable to several platforms with a huge part of the code fully portable and shareable. This means that it is not hard to change architecture!

    If the main competition and its money would move from the x86 to a RISC architecture (why not Alpha, MIPS, SPARC or PPC) I'm sure that the gap in performance per penny would go away pretty soon. RISCs have several advantages, but the biggest (IMHO) is the simplicity: no awkward rules (non-GP registers), no special case instructions, easy to pipeline, easy to understand and easy to optimize code for (since the instruction set is smaller).

    And to return to the original article. Please do not introduce more complexity. What we need is simple, beautiful designs, those are the ones that one can make go *really* fast.
    • Re:RISC (Score:2, Insightful)

      by RegularFry ( 137639 )
      I don't think anyone would disagree with that, but that's not the issue. What he's saying is, given that we've got to stick with x86 for historical and commercial reasons, this would be a relatively quick and easy way to allow the compilers to produce *much* groovier code.
      • Re:RISC (Score:3, Insightful)

        by e8johan ( 605347 )
        If we're going to stick to the x86 we still do not want to add complexity. I also tried to point out how easy it would be to move to a new architecture.
        As you must add complexity I do not think that it would be "quick and easy". It takes huge resources in both time and equipment to verify the timing of a new chip, so these kinds of changes (fundamental changes to the way registers are accessed) are expensive and hard, since you also need to implement many new hardware solutions and verify the functionality (not only the timing!)
      • Re:RISC (Score:3, Informative)

        by Milican ( 58140 )
        RTFA or nicely put...read the article. By adding the instructions he reduced the complexity of shifts, the multiple ordered instructions it takes to do one thing, and increases the visibility of all the registers. There are added instructions, but the benefit is reduced complexity in assembly instructions due to greater direct accessibility of all the registers.

        JOhn
    • umm, an intel cpu pretty much beats the pants off anything else on the market. On the downside, it's pretty tough to stuff 134 p4's in a server the way you can with a sparc or a powerpc.
    • Amen, brother (Score:3, Insightful)

      by mekkab ( 133181 )
      It's a cute idea having a "stackspace" for your GPRs, but you could just move to an architecture with more GPRs and not have to design a brand new chip (I hate verilog).

      Now if I could only get my compiler to stop moving items from gpr to gpr with a RLINM that has a rotate of 0 and an AND mask of all 0xFFFFFFFF's!
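
      For anyone wondering why that RLINM is annoying: a rotate by 0 followed by an AND with an all-ones mask leaves the value untouched, so the instruction is just a register-to-register move in disguise. Here is a rough Python model of the idea. It is only a sketch: the real instruction encodes its mask as a pair of bit positions, while the hypothetical `rlinm` helper below takes the mask directly to keep things short.

```python
MASK32 = 0xFFFFFFFF


def rotl32(value, shift):
    """Rotate a 32-bit value left by `shift` bits."""
    shift %= 32
    value &= MASK32
    return ((value << shift) | (value >> (32 - shift))) & MASK32


def rlinm(source, shift, mask):
    """Simplified model: rotate left by an immediate, then AND with a mask.

    Assumption: the mask is passed in directly instead of being built
    from begin/end bit positions as the real encoding does.
    """
    return rotl32(source, shift) & mask


# Rotate of 0 and an all-ones mask: the result equals the input,
# so the instruction does nothing except copy one GPR to another.
value = 0xDEADBEEF
copy = rlinm(value, 0, MASK32)
```

      With a non-zero rotate or a narrower mask the same helper does real work, e.g. `rlinm(0x12345678, 8, 0xFF)` pulls out the top byte.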
    • Re:RISC (Score:2, Interesting)

      by shadow303 ( 446306 )
      It would definitely be nice to get rid of the legacy cruft and move to a different architecture, however I doubt that this will happen until Intel and AMD start hitting major stumbling blocks. The inertia just seems too great. From what I hear (sorry I don't have a source, but I think I heard it in my Computer Architecture class), the cores of the current x86 chips are essentially RISC, and have a translation layer wrapped around them (convert x86 instructions into the internal RISC instructions).
      • You are right that the modern x86 implementations are RISCs with a translation layer around them (except Crusoe which is a VLIW with software translation - much cooler 8P ). Now just imagine if we could get direct access to those highly optimized RISC cores instead of having to code in x86 machine code.
    • by killmenow ( 184444 ) on Friday October 11, 2002 @09:27AM (#4431492)
      As most code today is written in higher level languages (C/C++, Java, etc.) all it takes is a recompile and perhaps some patching...
      But a lot of the code running today wasn't "written today" if you know what I mean.
      The problem is, in order to recompile you first need: a) the original source, and b) someone capable of patching, etc.

      A lot of internal apps are in use for which the source code is lost. And a lot of code in use today (sadly) was not written in languages as portable as C, C++, and Java. A lot of apps in use today were written in Clipper and COBOL and a bunch of other languages that may not have decent compilers for other platforms. So recompiling it isn't an option. A complete re-write is necessary.

      Even for situations in which application source *does* exist, and suitable compilers exist on other architectures, it is more often than not poorly documented...and the original author(s) is/are nowhere to be found. So in order to patch/fix the source to run on the new architecture, you not only need someone well versed in both the old and the new architectures, but someone who can read through what is often spaghetti code, understand it and make appropriate changes.

      In a lot of these cases it's easier to stick with the current architecture. And that, to some degree, is why the x86 architecture has gotten as complex as it is.
    • Re:RISC (Score:5, Interesting)

      by Zathrus ( 232140 ) on Friday October 11, 2002 @09:33AM (#4431532) Homepage
      Ok, when you get to the Real World, let us know.

      Switching architectures is not that trivial. You seem to think that every company has the source code available for every piece of software they run. That isn't true. You seem to think that programs can easily be compiled between platforms if written in C/C++ - also untrue. You think that the bug fixes for compiling between platforms are "small peculiarities" -- well, they may be small, but that doesn't make them easy. In fact, it makes it fucking hard because the differences are so buried in libraries, case-specific, and undocumented that it's a nightmare to find them. Yes, I've done this kind of thing. It's godawful.

      Changing architecture is difficult. This is not a closed vendor market - anyone can put together an x86 box and you have at least 3 different CPU vendors to chose from, 3 - 5 motherboard chipsets, and a virtually infinite variety of other hardware. If Dell computer suddenly decides to move to a PPC architecture what's going to happen? They're going to lose all their customers and fast. Because the very limited benefits of a different architecture do not make up for the costs of going to one.

      Yes, I said limited benefits. Yeah, when I was in college taking CompE, EE, and CS courses on CPU and system design I also found the x86 ISA to be the most demonic thing this side of Hell. Well, I'm older and wiser now and while x86 isn't perfect, it's not that bad either. Its price/performance ratio is utterly insane and getting better yearly. Contrary to the RISC architecture doom and gloomers, x86 didn't die under its own backwards compatibility problems. It's actually grown far more than anyone expected and is now eating those same manufacturers for lunch.

      You know, back in the early 90s when RISC was first starting to make noise the jibe was that Intel's x86 architecture was so bad because it couldn't ramp up in clock speeds. Intel was sitting at 66 MHz for their fastest chip while MIPS, Sparc, etc. were doing 300 MHz. Of course, now Intel has the MHz crown, with demonstrations exceeding 4 GHz, and the RISC crowd is saying that MHz isn't everything and they do more work/cycle than Intel (which is true, but the point remains).

      All that said, go look at the SPEC CInt2000 [specbench.org] and FP2000 results [specbench.org]. Would you care to state what system has the highest integer performance? And whose chip has the highest floating point?

      Oh, and let's not forget that I can buy roughly 50 server-class x86 systems for the price of one mid-level Sun/IBM/HP/etc. server.

      Note - server performance isn't all about CPU, but since the OP wanted to make that argument, I just thought I'd point out how wrong he is. There is still quite a bit of need for high end servers with improved bus and memory architectures, but don't even try to argue that the CPU is more powerful. It isn't.
    • Re:RISC (Score:2, Insightful)

      by earthman ( 12244 )
      RISCs have several advantages, but the biggest (IMHO) is the simplicity: no awkward rules (non-GP registers), no special case instructions, easy to pipeline, easy to understand and easy to optimize code for (since the instruction set is smaller).
      Not entirely true. RISC instruction sets can be quite huge too. And the whole idea of RISC is to take the complexity out of the hardware and put it into the compiler instead. It is easier to optimize for x86 than RISC.
    • Re:RISC (Score:5, Informative)

      by snatchitup ( 466222 ) on Friday October 11, 2002 @09:47AM (#4431622) Homepage Journal
      Hell yeah!

      I myself am an old x86 Assembly hacker.

      When I started looking at the ARM chips I wondered why we ever used x86's etc.

      RISC / CISC is really a misnomer.

      RISC has plenty of instructions, and it's meant to be superscalar.

      It starts with Register Gymnastics. Basically with RISC, there's no more of it. Every register is general. It can be data, or it can be an address. All the basic math functions can operate on any register.

      With Intel x86, everything has its place.

      Extend it further out. There's something called "Conditional Instructions". Properly utilized, these make for an ultra efficient code cache. The processor is able to dump the code cache instructions ahead of time. Which also means, not as much unnecessary "pipeline preparation" to perform an instruction.

      Then there's THUMB which compresses instructions so that they take up less physical space in a 64, 128 bit world. There's lots of wasted bits in an (.exe) compiled for a 386.

      Last I checked, 32bit ARM THUMB processors are dirt freaken cheap, they're manufactured by a consortium of a multitude of vendors as opposed to AMD and INTC.

      The Internet is slowly wearing down the x86 as more and more processing is moving back on the server where big iron style RISC can churn through everything.

      The article should really just be called:

      "An Academic Exercise in Register Gymnastics"

  • by Andy Dodd ( 701 ) <`atd7' `at' `cornell.edu'> on Friday October 11, 2002 @08:56AM (#4431329) Homepage
    Linux kernel source - memcpy() anyone?

    (On MMX machines, the wider 64-bit MMX registers are used for memcpy() rather than the 32-bit standard integer registers)

    This has been in the kernel for a few years now and anything that uses memcpy() benefits from it. Move along now.
  • by seanellis ( 302682 ) on Friday October 11, 2002 @08:57AM (#4431333) Homepage Journal
    The scheme as proposed would work, but nothing will change the fact that it's another hideous hack to get around the non-orthogonal addressing modes in the original Intel 80x86 architecture.

    Even the little microcontroller chips that I can buy for $2 have 32 general purpose registers (Atmel AVRs, for anyone who cares).

    Worse, this scheme would not benefit existing code - it still requires code changes to work.

    Finally, on the gripping hand, the Pentium III and 4 have a very similar register renaming scheme going on automatically in the hardware. The 8 "logical" registers are already mapped dynamically into a much larger physical register file. (From ExtremeTech: http://www.extremetech.com/article2/0,3973,471327, 00.asp .)
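
    To make the renaming point concrete, here is a toy Python model of a rename table. It is only a sketch under stated assumptions: the class name `RenameTable`, the physical register count of 32, and the "never free anything" policy are all made up for illustration and are not how any particular Pentium does it. The point it shows is that two back-to-back writes to the same logical register land in different physical registers, so they do not have to wait on each other.

```python
LOGICAL = ["eax", "ebx", "ecx", "edx", "esi", "edi", "ebp", "esp"]


class RenameTable:
    """Toy model: 8 logical registers mapped onto a larger physical file."""

    def __init__(self, physical_count=32):  # 32 is an assumed size
        # Initially logical register i lives in physical register i.
        self.mapping = {name: i for i, name in enumerate(LOGICAL)}
        self.free = list(range(len(LOGICAL), physical_count))

    def read(self, logical):
        """Return the physical register currently holding `logical`."""
        return self.mapping[logical]

    def write(self, logical):
        """Allocate a fresh physical register for a new value of `logical`.

        Simplification: old physical registers are never reclaimed here;
        real hardware frees them once the older value is no longer needed.
        """
        if not self.free:
            raise RuntimeError("no free physical register (a real core would stall)")
        physical = self.free.pop(0)
        self.mapping[logical] = physical
        return physical


table = RenameTable()
first = table.write("eax")   # first value of eax
second = table.write("eax")  # unrelated second value of eax
```

    Here `first` and `second` are different physical registers even though the code only ever names `eax`, which is why piling on more architectural register names buys less than it looks like.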
  • by guidemaker ( 570195 ) on Friday October 11, 2002 @08:57AM (#4431336)
    I'm reminded of the days I used to code for the old Acorn Archimedes (don't look for it now, it's not there any more) and our apps were usually way faster than the competition's.

    When asked why, we were tempted to tell them that we used the undocumented 'unleash' instruction to unleash the raw power of the ARM processor.
  • by Alien54 ( 180860 ) on Friday October 11, 2002 @08:59AM (#4431348) Journal
    This is what I call the big problem. That design is utterly abominable. We live in a world where it's nothing to have 1 gigabyte of RAM in a computer. We have 80 GB hard drive platters now, allowing even greater-sized drives. And yet at the heart of every single one of your x86 computers out there, a mere 6 GP registers are doing nearly all of the processing. It's amazing. And it's something I've personally wrestled with every day of my assembly programming career.

    This sort of reminds me of what happened with IRQs. Ultimately Intel "solved this" via the PCI bus, but performance has occasionally been problematic. Of course, that problem goes back to the original IBM design for the original IBM PC. Intel is also very aware, I imagine, of what happened when IBM tried a total redesign with the EISA bus, etc. It got rejected, I think, primarily because it was proprietary. In any case, enough companies have been nailed on backward compatibility issues that Intel may be nervous about making a total break.

    The upside is being able to run old software on new hardware. You don't want to break too many things.

    • Microchannel was the bus you are thinking about. It actually was very good, but wasn't backward compatible with ISA. EISA was the "rest of the industry's" response to provide a 32-bit bus that was backwards compatible. It wasn't a very good implementation since it was still locked at 8MHz.
    • by Zathrus ( 232140 ) on Friday October 11, 2002 @09:57AM (#4431696) Homepage
      As others mentioned, MCA (MicroChannel Architecture) was IBM's abysmal attempt at recapturing the PC market. It died a horrible death, and deserved it. Frankly, the technology sucked only slightly less than the ISA/EISA bus it wanted to replace.

      Anyone else remember the horrors of all those damn control files on floppies?

      There are a lot of architectural nightmares in the PC design... and while some of them are at the CPU level (like the 6 GP registers), most of them are at the bus level. Who the hell puts the keyboard on the 2nd most important interrupt (IRQ1)? The entire bus is still borked, although PCI has mostly hidden that now. But the system and memory buses are the sole reason that IBM, HP, Sun, etc. have higher performance ratings than x86 -- the P4 and Athlon processors are faster in virtually every case on a CPU to CPU basis.

      The bus and memory architecture is also why x86 does so incredibly bad in multi-CPU boxes. It's just not designed for it, the contention issues are hideous, and while you may only get 1.9x the performance going to a 2 CPU Sun box, you'll only get 1.7x on x86. It gets worse as you scale (note - those numbers are for reference only, I don't recall the exact relationships for dual CPU x86 boxes anymore, but the RISC systems handle it better due to bus design).

      Really there's nothing wrong with the x86 processors except to the CompE/EE/CS student. I was there once and couldn't stand it. Real life has shown that it isn't that bad, and recent times have shown that it's actually really damn good. Except for the buses. They suck. And while things like PCI-X and 3GIO are on the horizon, I don't see them seriously changing the core issues without causing massive compatibility problems.
  • Full Circle ... (Score:3, Insightful)

    by tubs ( 143128 ) on Friday October 11, 2002 @09:02AM (#4431360)
    I remember the "next big thing" during the early and middle 90s was RISC - So will the next big thing be McISC (More Complex Instruction Set Chips)?

    I wonder if the core of a MCISC will be RISC, or CISC that has a RISC core.
  • by wiredog ( 43288 ) on Friday October 11, 2002 @09:08AM (#4431387) Journal
    segment:offset addressing? He's doing it with registers, but it seems the same sort of thing. One register is for segment, the other is the offset?

    Well, not quite, but it has the same flavor.

    After working in x86 assembly, I really appreciated high level and minimally complex languages like C.
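
    For readers who never had the pleasure, the segment:offset scheme referred to above works like this on the original 8086: the 16-bit segment is shifted left by four bits and added to the 16-bit offset, giving a 20-bit physical address. A minimal Python sketch (the function name `real_mode_address` is just for illustration):

```python
def real_mode_address(segment, offset):
    """8086 real-mode physical address: segment * 16 + offset.

    The result is truncated to 20 bits, since the 8086 has only
    20 address lines and addresses past 1 MB wrap around.
    """
    return ((segment << 4) + offset) & 0xFFFFF


# Two different segment:offset pairs can name the same byte.
a = real_mode_address(0x1234, 0x0010)
b = real_mode_address(0x1235, 0x0000)
```

    Both `a` and `b` come out as 0x12350, which is exactly the kind of aliasing that made the scheme so much fun to debug.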

  • by Lomby ( 147071 ) <andreaNO@SPAMlombardoni.ch> on Friday October 11, 2002 @09:09AM (#4431392) Homepage
    The guy does not realize that what he proposed is not at all simple to implement in silico.

    These two additional mapping registers would complicate the pipeline hazard detection in an exponential way.

    Another point is that I don't think that by doubling/tripling the number of registers available you will get a tenfold performance increase: a small increase could be expected, but not much.

    Another problem is the SpecialCount counter: this would complicate the compilers too much. It would also make the instruction reordering almost impossible.
  • by shimmin ( 469139 ) on Friday October 11, 2002 @09:09AM (#4431394) Journal
    While the base idea is interesting (add instructions that support using the multimedia registers as GP registers), I suspect that actually implementing the functionality of the GP registers in the multimedia ones could result in a prohibitively expensive CPU.

    Anyone who's ever tried to use the MMX or XMM registers for non-multimedia applications knows what I'm talking about. The instruction sets for them are nicely tweaked to let you do "sloppy" parallel operations on large blocks of data, and not really suited for general computing. You can't move data into them the way you would like to. You can't perform the operations you would like to. You can't extract data from them the way you would like to. They were meant to be good at one thing, and they are.

    I once tried to use the multimedia registers to speed up my implementation of a cryptographic hash function whose evaluation required more intermediate data than could nicely fit in GP registers, and had enough parallelism that I thought it might benefit from the multimedia instructions. No such luck. The effort involved in packing and unpacking the multimedia registers undid any gains in actually performing the computation faster -- and the computation itself wasn't that much faster. I was using an Athlon at the time, and AMD has so optimized the function of the GP registers and ALU that most common GP operations execute in a single clock if they don't have to access memory, while all the multimedia instructions (including the multiple move instructions to load the registers) require at least 3 clocks apiece.
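
    The arithmetic behind that disappointment can be sketched with a back-of-the-envelope cycle count. This is a deliberately crude model, not a measurement: it takes the figures quoted above (1 clock per GP operation, 3 clocks per multimedia instruction including the moves), ignores pipelining and overlap entirely, and the lane count and the number of pack/unpack moves in the example are assumptions chosen only for illustration.

```python
import math


def scalar_cycles(n_ops, gp_latency=1):
    """Cycles to do n_ops operations one at a time in GP registers."""
    return n_ops * gp_latency


def simd_cycles(n_ops, lanes, pack_moves, unpack_moves, simd_latency=3):
    """Cycles to do the same work in multimedia registers.

    Every multimedia instruction, including the moves that pack data in
    and unpack results out, is charged the same simd_latency.
    """
    simd_ops = math.ceil(n_ops / lanes)
    return (pack_moves + simd_ops + unpack_moves) * simd_latency


# Assumed example: 8 operations, 2 values per register,
# 4 moves to load the registers and 4 to get the results back out.
gp_cost = scalar_cycles(8)
mm_cost = simd_cycles(8, lanes=2, pack_moves=4, unpack_moves=4)
```

    With those assumed numbers the GP version costs 8 cycles and the multimedia version 36, so the parallelism never gets a chance to pay for the shuffling.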

    Now this leads me to suspect that the multimedia registers have limited functionality and slow response for a single reason: economics. The lack of instructions useful for non-multimedia applications could be explained via history, but what chip manufacturer wouldn't want to boast of the superior speed of their multimedia instructions? And yet they remain slower than the GP part of the chip.

    So I conclude that merely making a faster MMX/XMM processor is prohibitively expensive in today's market. And this proposal would definitely require that, even if actually adding the additional wiring to support the GP instructions for these registers was feasible. Because what would be the point of using these registers for GP instructions if they executed them slower than the same instructions actually executed on GP registers?

  • by gpinzone ( 531794 ) on Friday October 11, 2002 @09:12AM (#4431404) Homepage Journal
    The whole gist of the article has to do with the x86's lack of general purpose registers. While this is true, you're not going to solve all of the x86 shortcomings simply by figuring out a way to add more of them. There are MANY things wrong with the x86 design; GP registers are just one of them. There's an entire section in the famous Patterson [amazon.com] book that goes into all of the issues in much more detail than I care to state here.

    Besides, there are already more efficient (albeit complex) solutions to extend registers that make much more sense in the current world of pipelined processors. Register renaming [pcguide.com] is one such example.
  • Revolutionizing?? (Score:3, Interesting)

    by Jugalator ( 259273 ) on Friday October 11, 2002 @09:15AM (#4431421) Journal
    It's interesting to hear "revolutionizing performance" in the same topic as instruction level fiddling. The only way to give truly "revolutionizing" performance is to do high level optimizations.

    When you have your highly optimized C++ code or whatever, *then* you can get down to low-level and start polishing whatever routine/loop you have that's the bottleneck. The compilers of today also usually do a better job than humans at optimizing performance at this level and ordering the instructions in an optimized way. Especially if you consider the development time costs you'd need if doing it by hand. It's a myth that assembly code is generally faster if manually written -- many modern compilers are real optimizing beasts. :-)

    Anyway, I think one should always keep in mind that C++ code will only gain the greatest benefit from well optimized C++ code, not from new assembly level instructions, regardless if they unlock SSE registers for more general purpose or whatever. Oh, yeah, more registers won't revolutionize performance either. If they did, Intel and AMD would already have fixed that problem. I'm sure they're more than capable of doing it... More registers increase the amount of logic quite dramatically and I'm pretty sure it doesn't give good enough performance gains for the increased die cost, compared to increasing L2 cache size, improving branch prediction, etc.
  • From what I gathered in the article, it seems like he is proposing a scheme by which normally unused registers (MMX, etc) can be used as general purpose registers. To do this, he considers an aliasing system. My question is, why can't a x86 programmer today just use those MMX registers for more general purposes? I'm sure there's a good reason, I just can't figure it out from the article - thanks
  • And we all love it for the same reason we love mutant superhuman zombies. :o)
  • Cool idea (Score:2, Funny)

    by tomstdenis ( 446163 )
    I want to form a company that makes a cpu that translates x86 instructions on the fly to RISC instructions that operate in parallel.

    I'll call my company transmeta!

    Or in the words of that new dell commercial

    "Sure we'll call it 1-800 they already do that!".

    Tom
  • I actually understood that. And I haven't done assembly language programming since the old 8086. (Segment registers, *shudder*...)

    Jack William Bell
  • I found the article intriguing, but during the entire verbose, self-important sounding read, I was wondering how ISRs would be handled. For example, if the RMC were set to revert to the default mapping in three ops, and an ISR interrupted after the first op, would it revert to the default mapping in the middle of the ISR?

    Fortunately, that issue is addressed in his Message Parlor [geek.com]. The full text of his response to BritGeek follows:

    Presently the registers are saved automatically by the processor in something called a Task State Segment (TSS) during a task switch. There are currently unused portions of TSS which could be utilized and (sic) for RM and RMC during a task switch.


    The PUSHRMC and POPRMC instructions are available for explicit saves/restores of the RM and RMC registers in general code. I don't recommend it, however. The decoders would be physically stalled until the RM/RMC registers are re-populated. It would be better to use explicit MOVRMCs in general code.

    - Rick C. Hodgin, geek.com
    He may be onto something after all...
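
    To convince myself the save/restore answer actually covers the ISR case, here is a small Python model of the countdown. Everything in it is a guess at the semantics rather than the article's specification: the class `RemapState`, the rule that the mapping reverts once the counter hits zero, and the choice to run the handler under the default mapping are all assumptions made for illustration.

```python
class RemapState:
    """Hypothetical model of the RM (mapping) and RMC (countdown) registers."""

    def __init__(self):
        self.rm = None   # None stands for the default register mapping
        self.rmc = 0     # remaining ops before reverting to the default

    def movrmc(self, mapping, count):
        """Install an alternate mapping for the next `count` ops."""
        self.rm, self.rmc = mapping, count

    def step(self):
        """Execute one op and return the mapping that was in effect for it."""
        active = self.rm
        if self.rmc > 0:
            self.rmc -= 1
            if self.rmc == 0:
                self.rm = None  # countdown expired: back to the default
        return active

    def save(self):
        return (self.rm, self.rmc)

    def restore(self, saved):
        self.rm, self.rmc = saved


def run_with_interrupt(state, isr_ops):
    """Save RM/RMC, run the handler under the default mapping, restore."""
    saved = state.save()
    state.rm, state.rmc = None, 0
    seen = [state.step() for _ in range(isr_ops)]
    state.restore(saved)
    return seen


state = RemapState()
state.movrmc("alt", 3)                       # remap for three ops
trace = [state.step()]                       # op 1 runs remapped
isr_trace = run_with_interrupt(state, 5)     # interrupt arrives here
trace += [state.step() for _ in range(3)]    # ops 2, 3, then one more
```

    In this model the handler sees only the default mapping and the interrupted code still gets exactly three remapped ops (`trace` ends up as three "alt" entries followed by a default one), which is the behaviour the TSS explanation seems to promise.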
  • x86 Emulator? (Score:2, Interesting)

    by Shadow2097 ( 561710 )
    From the sounds of the article, he wants to make register mappings more logical than virtual. My knowledge of assembly level programming is pretty basic, but I do agree that adding more GP registers would probably increase performance measurably.

    His second proposal, the RegisterMap field, strikes me as the incredibly complex part of this idea. He sounds like he's suggesting an idea that will turn x86 architecture into a simplified emulator by allowing you to logically map any register address to any physical address you choose. While there are probably some benefits to this, it sounds like the complexity of programming an already exceptionally complex chipset could go through the roof!

    I read somewhere in a previous article (last year sometime, can't find a link) that the way most compilers treated x86 was already done with so many pseudo instructions as to basically be an emulator. Now this was before I had any knowledge of assembly level programming, so maybe someone with more knowledge could clarify this?

    -Shadow

  • ...And the best part is that I believe this is something that could be implemented in hardware in a manner which could be resolved and entirely applied during the instruction decode phase, thereby never passing the added assembly instructions any further down the instruction pipeline, and thereby not increasing the number of clock cycles required to process any instruction. I can provide technical details on how that would work to anyone interested. Please e-mail me if you are....

    If this is really accomplishable without wasting *any* extra CPU time (that waste would apply to *all* instructions the CPU goes through!) this is indeed a good stunt that could work out to add a substantial ooomph to x86 performance with the code we have today.
    Thank god, 'cuz' my Athlon is too hot already and I'm kinda sceptical about watercooling. :-)
    Then again, that's a big "if".
  • Great... (Score:2, Interesting)

    A segment architecture for memory wasn't nasty enough, now we want to have a segment register for the registers?

    Thanks, no.

  • by mick29 ( 615466 ) on Friday October 11, 2002 @10:27AM (#4431892)
    I do not like the changes proposed although x86 is awfully flawed (not enough GP registers, terribly overloaded instruction set {anyone ever used BCD commands? -- Yes, I hear the loud "We do" from the COBOL corner.}, you name it... ).
    But this change would:

    Make an internal interface explicitly controlled by the programmer/compiler, loading an enormous amount of work on the compiler creators. (Just have a look at IA64 - is there any good compiler out there already? I haven't had a look for a while.)

    Destroy (or at least reduce the efficiency of) the internal register renaming unit, thus slowing down the out-of-order execution core and such (the entire core, actually...) Sorry, but this man may have been busy programming x86 assembly his entire life (and for this he deserves my respect), but he is not up to date on how a modern x86 cpu works in its heart. When I heard the lectures in my university about how this stuff works, I gave up learning assembly -- one just doesn't need it anymore with the compilers around.
    Reading the books by Hennesy/Patterson (don't know if I spelled them correctly) may help a lot.

  • by zaqattack911 ( 532040 ) on Friday October 11, 2002 @10:32AM (#4431926) Journal
    I hate to say it, but lately it's becoming more and more obvious that Intel is no longer really interested in performance. They'll squeeze a bit more out of an ancient architecture and add a few buzzwords like "SSE2", so they can slap on a hefty price-tag.

    Look at the pentium4 design! Intel would much rather use a dated cpu, with a nice pretty GHZ rating than keep the same MHZ and improve the architecture design.

    Do you really think investors give a shit about registers?

    --Marketing 101
  • by gillbates ( 106458 ) on Friday October 11, 2002 @10:39AM (#4431963) Homepage Journal
    The only potential downfall I see in this design is the possible pipeline stall seen when RM/RMC have to be populated from stack data. When that happens, no assembly instructions can be decoded until the POPRMC instruction completes and RM/RMC are loaded with the values from the stack.

    Actually, this is just one of many potential downfalls. He forgot interrupts, mode switching (going from protected to real mode, as some OS's still do), and IO would all require that the proposed RM/RMC register be loaded from the stack. The net effect would be that if his scheme is implemented, existing programs would run slower, not faster. Furthermore, placing the RM/RMC register on the stack is an impossibility without breaking backward compatibility; many assembly language coders depend on a set number of bytes being added to the stack when they perform a call or interrupt.

    Why not just add 24 GP registers to the existing processor? Honestly, it would be a lot simpler, and would not complicate the whole x86 mess, nor break backward compatibility.

    I don't mean to flame, but this guy is way off base. The biggest problem with the x86 instruction set is lack of registers, and the second biggest problem is that its complexity is rapidly becoming unmanageable. Not even Intel recommends using assembly anymore - their recommendation is to write in C and let their compiler perform the optimizations. Adding more instructions like this would further diminish the viability of coding in assembly.

    A far better solution would be to simply keep the existing instruction set intact, and add more GP registers. IBM got it right the first time - their mainframe processors have 16 general purpose registers which can be used for any operation - addressing, indexing, integer, and floating point calculations. If anything, Intel should stop adding instructions and start adding registers.

  • by Christopher Thomas ( 11717 ) on Friday October 11, 2002 @11:19AM (#4432234)
    The part that confuses me is that, since code would need to be recompiled to make use of this, you might as well just compile for x86-64 and make use of a larger flat register space. While the idea is interesting, there doesn't seem to be any advantage to using it (and a few disadvantages, pointed out by other posters).
  • by Cerlyn ( 202990 ) on Friday October 11, 2002 @12:24PM (#4432781)

    I can speak on some authority on this subject since I am presently taking a course on code optimization. What it looks like Mr. Hodgin is trying to do is work around the issue where people do not compile programs with processor specific optimizations. He seems to be proposing doing so by allowing "paging" per se of registers amongst themselves, although in a bit of an odd fashion.

    Personally, I am not too fond of this approach. First of all, operating systems will need to be written to support this paging. Secondly, running a single MMX and/or SSE enabled application (which would use most if not all of the mapped registers) would cause all the other applications on the system to suddenly lose any benefit that paging would provide.

    The approach I would take (which may or may not be better) would be to change the software. Compilers like gcc 3.2 already know how to generate code with MMX and SSE instructions. Patches are available for Linux 2.4 that add in gcc 3.2's new targets (-march=athlon-xp, etc.) to the Linux kernel configuration system. Libraries for *any* operating system compiled towards a processor or family of processors likely would fare better than generics.

    And yes, gcc 3.2 can do register mapping in a similar fashion (to ensure that all registers are used) on its own. If you read gcc's manual page, you will note that this makes debugging harder though. Gcc even has an *experimental* mode where it will use the x87 and SSE floating point registers simultaneously.

    Mr. Hodgin's approach might be a bit better for inter-process paging by a task scheduler for low numbers of tasks. But as a beginner in this field, I'm not sure what else it would be good for.

    Please pardon the omissions; I am not presently using a gcc 3.2 machine :)
