Posted by CowboyNeal from the more-power-now dept.
NickSD writes "ChipGeek has an interesting article on increasing x86
CPU performance without having to redesign or throw out the x86 instruction set. Check it out at
geek.com."
I've got three words for you: cache, cache and cache.
Why do you think the Pentium Pro was such a huge success that it's still being used for CPU-intensive work? Why do you think Sun SPARC and Digital/Samsung Alpha CPUs at 500 MHz trash modern Pentium 4s and Athlons? Yup. Loads and loads of cache.
by Anonymous Coward on Friday October 11, 2002 @09:41AM (#4431970)
Cache is a huge Intel problem. 20K [geek.com] L1 for P4, down from 32K since the Pentium MMX. Even the Itanium2 [geek.com] only has 32K.
AMD has had 128K of L1 since the original Athlon [geek.com], and had 24K in the K5 [geek.com].
The Transmeta 3200 [geek.com] and the Motorola G4 [geek.com] both have 96K, the UltraSparc-III [geek.com] has 100K, Alpha [geek.com] had 128K when it died, and HP's PA-8500 [geek.com] has a whopping 1.5MB.
They may throw big chunks of L2 at the problem, but it seems to me that so little L1 means more time moving data and less time processing...
20K L1 for P4, down from 32K since the Pentium MMX. Even the Itanium2 [geek.com] only has 32K.
Just for people who don't know, Intel reduced the amount of cache when they moved from the P3 to the P4. And hardware junkies know the performance hit that caused.
A seemingly unrelated side note: Intel wants to move to its IA-64 system, and, since it's not backwards-compatible, they're going to have to drum up a grass-roots popular movement to pull it off.
Perhaps they crippled the P4 to make the IA-64 processors look even faster to the general public?
In any case, I think the quality of the P4 is a sign that Intel wants to make its move soon. (Though losing $150 million [slashdot.org], not to mention the context in which they lost it, may set back their schedule, giving AMD's 64-bit system a chance to catch on.)
Intel's processors are not crippled by small L1 caches. Yes, the P3's and P4's L1 caches are WAY smaller than the Athlon's, but Intel doesn't NEED a large L1 cache, because their L2 cache is extremely fast. Intel tends to build small, extremely fast L1 caches and make up for the higher miss rate with fast L2 caches. For instance, the P3's L1 cache has a miss rate roughly twice as high as the Athlon's, but the P3's L1 miss penalty is roughly 8 cycles (assuming an L2 hit...), less than half the Athlon's L1 miss penalty of 20+ cycles on an L2 hit. Also, the P4's L1 cache, which is even smaller than the P3's, lets Intel decrease the L1 hit latency AND run at a substantially higher clock speed than AMD's larger cache.
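To see how that trade-off can win, plug the numbers into the standard average-memory-access-time formula. A hedged worked example: the 2:1 miss-rate ratio and the 8- vs. 20-cycle penalties are from the post above, but the absolute miss rates (6% and 3%) are illustrative assumptions, not measured values.

```latex
\mathrm{AMAT} = t_{\mathrm{hit}} + m_{L1} \cdot t_{\mathrm{penalty}}
% P3-style  (small fast L1, fast L2):  0.06 \times 8  = 0.48 extra cycles per access
% Athlon-style (big L1, slower L2):    0.03 \times 20 = 0.60 extra cycles per access
```

So the chip with twice the miss rate can still come out ahead, provided its L2 is fast enough.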
For a graphical depiction of the difference between Intel and AMD cache performance, try this link: http://www.tech-report.com/reviews/2002q1/northwood-vs-2000/index.x?pg=3 It was the first thing that came up in a Google search for linpack and "cache size".
I've got three words for you: cache, cache and cache.
Why do you think the Pentium Pro was such a huge success that it's still being used for CPU-intensive work? Why do you think Sun SPARC and Digital/Samsung Alpha CPUs at 500 MHz trash modern Pentium 4s and Athlons? Yup. Loads and loads of cache.

No. First, Alphas and SPARCs do not trash modern x86 CPUs; the Pentium IV 2.8GHz and Athlon XP 2800+ are the fastest CPUs in the world for integer math, and the Itanium 2 is the fastest in the world for floating point math. Second, cache only helps up to the point where it is large enough to contain the working set of the primary application being run. A larger cache can improve performance further, but once the cache holds the working set, the gain is in the single-digit percents. The working set of the vast, vast majority of applications is under 512K, and most are under 256K. You'll find that increasing the speed of a small cache is generally more important than increasing its size. Case in point: when the Pentium 3 and Athlon went from a large (512K) cache to a smaller (256K) but faster one, performance went up, for the Athlon by about 10% and for the Pentium 3 by (I don't recall exactly) around 10%. Some desktop apps, like SETI@home, have a large working set (more than 512K) and DO benefit from large caches, but even there nothing larger than 1MB would improve performance.
Most server CPUs, like Alphas and SPARCs, have fairly large caches for the following reasons:
1) Databases love large caches. They are one of the few applications that can take advantage of a large cache, because they can store lookup tables of arbitrary size in it. Server CPUs are often used for databases because your average x86 CPU is just fine for web servers, FTP servers, desktop systems, etc., and is generally faster at them than server CPUs.
2) Most server-class CPUs are fully 64-bit and do NOT support register splitting. On the SPARC64, for example, if you want to store an integer containing the number 42, that integer will take up a full 64 bits, regardless of the fact that the register can store numbers up to 18,446,744,073,709,551,615. This larger size increases the cache needed to hold a program's working set, because all integers (and many other data primitives) require a full 64 bits or more. With x86 CPUs, which support register splitting and have only 32-bit registers, that number could be stored in a mere eight bits: the square root of the number of bits the SPARC requires. (See the C sketch after this list.)
3) Big servers with multiple CPUs are often expected to run multiple apps, all of which are CPU intensive. If the cache can hold the working set for all of them, speed improves somewhat.
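Here is the C sketch promised in point 2. It only illustrates the footprint argument: the array length is arbitrary, and whether a compiler actually emits 8-bit stores is up to it, but the 8x ratio between the two working sets is the point.

```c
#include <stdint.h>
#include <stdio.h>

#define N 4096

int main(void)
{
    int8_t  narrow[N];  /* 42 fits in one byte per element...          */
    int64_t wide[N];    /* ...or costs a full 8 bytes per element on a
                           64-bit machine with no register splitting   */

    for (int i = 0; i < N; i++) {
        narrow[i] = 42;
        wide[i]   = 42;
    }

    printf("narrow working set: %zu bytes\n", sizeof narrow);  /*  4096 */
    printf("wide working set:   %zu bytes\n", sizeof wide);    /* 32768 */
    printf("same value either way: %d == %lld\n",
           narrow[0], (long long)wide[0]);
    return 0;
}
```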
That said, who in their right mind would use an incredibly slow Pentium Pro for a CPU intensive calculation? A Pentium Pro at the highest available speed, 200MHz, with 2MB cache may be able to outperform a Celeron 266, but not by much and only for very specific cache-hungry software. Show me a person that thinks a Pentium Pro with even 200GB of cache can outperform ANY Athlon system and I will show you a person that hasn't a clue what they are talking about.
Look at the performance difference between the Pentium IV with 256K and with 512K (a doubling) of cache. You will have to do some research to find an application that gets even a 10% performance boost.
FYI: if you are interested in competent, intelligent, technical reviews of hardware, you might like www.aceshardware.com
Because if you had read the article you'd realize that this is essentially a zero-cost, backward-compatible method of dramatically increasing program execution speed by several orders of magnitude -- so the question is really, "Why not?"
dramatically increasing program execution speed by several orders of magnitude
Where did you read this?
Also, even with the hardware bottlenecks the Anonymous Coward mentions?
You ask yourself "why not?"... I can only ask myself "how?" :-) Sure, I see how it's supposed to work in theory with zero bottlenecks, but how it works in practice is a completely different thing.
Intel is constantly adding new commands and registers to the CPU -- that is the whole point of the article -- so it could easily do the same to greatly increase the execution speed of ALL programs, not just a few!
I read the article. From what I can see this guy writes lots of assembly, but knows very little about how processors are designed. The huge gains you all see have already been made by register renaming and caches. There might be some gain left by giving the compiler direct control over these, but at the cost of much complexity in the register renaming hardware.
The P4 has a very deep pipeline. Looking for register conflicts is hard enough without adding another layer of redirection.
The fact that the article never mentions register renaming shows the author never did any research into this topic before writing.
This may boil down to the generic do-it-in-hardware vs. do-it-in-software debate. Do we reorder the instructions in hardware (a la Pentium and Athlon), or make the compiler do it (a la Itanium)? Do we make the hardware predict branches or have the compiler drop hints? Register renaming as done by modern RISC-core x86 implementations likely addresses many of the issues he proposes to solve with an extension and a smart compiler (or assembler).
Now, a 386, that would benefit from his technique.
However, if we're going to revise that architecture, I say we add MMX and call it a 486. Then, we can add SSE and call it a Pentium.. And then,...
Register renaming already does what's being proposed here, but transparently. In fact, most of the instruction reordering done by a good optimizing compiler (and later by the out-of-order dispatching unit) aims to increase parallelism in register usage.
Of course RISC processors are so much nicer to work with because of their large, flat register files (at least 16 or 32 registers, all of them equally usable), but that's not possible with existing x86 architecture.
P4 processors have 128 registers available for register renaming. Using all of them is not so easy, so Hyperthreading (still only on the Xeon) tries to bring two different processes into the instruction mix, keeping their renaming maps separate, so the dispatching unit has more non-colliding instructions ready for execution. This won't make one CPU as fast as two, but it does keep that insanely deep pipeline from filling with bubbles (or would that be 'emptying of instructions'?)
Register renaming already does what's being proposed here, but transparently.
Well, not exactly. Renaming takes care of the case where two things write, say, EAX, by letting each use a different physical register; i.e., you don't have to stall just because there is only one architected "EAX" register.
However, your program is still limited to only 8 visible values at any instant. So when you need a 9th -thing- to keep around, you have to spill some registers onto the stack. Register renaming doesn't solve this problem.
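A minimal C sketch of that limitation (a hypothetical function; exactly which values get spilled is up to the compiler):

```c
#include <stdio.h>

/* Nine values with overlapping lifetimes. Targeting 32-bit x86, the
   compiler has roughly seven usable architected names (ESP is the
   stack pointer), so some of these values MUST be spilled to the
   stack and reloaded. The core's large physical register file can't
   help: the extra names simply don't exist in the instruction set. */
long nine_live(long a, long b, long c, long d, long e,
               long f, long g, long h, long i)
{
    long t1 = a + b, t2 = c + d, t3 = e + f, t4 = g + h;
    /* t1..t4, i, and a, c, e, g are all still needed below */
    return t1 * t2 + t3 * t4 + i * (a + c + e + g);
}

int main(void)
{
    printf("%ld\n", nine_live(1, 2, 3, 4, 5, 6, 7, 8, 9));
    return 0;
}
```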
Of course RISC processors are so much nicer to work with because of their large, flat register files (at least 16 or 32 registers, all of them equally usable), but that's not possible with existing x86 architecture.
Although I would like to take this opportunity to point out that AMD's x86-64 (Opteron) architecture increases the number of GP and XMM (used for SSE instructions) registers to 16 each.
Yes, he basically reinvented register renaming, but put it under explicit programmer control. It's a programmer's solution to a problem hardware has already solved, and, inevitably, he doesn't see that it would do more harm than good.
Here's why his idea sucks:
1) Register renaming becomes dependent on the RMC. You can't issue any instructions while there is a POPRMC in the machine until the POPRMC finishes execution. He calls it "a few cycles", but it's much worse than that. You've prevented any new instructions from entering the window until the stack access is done, preventing any work that -could- have been done in parallel from even being seen. Function call/return overhead is a big deal, and he just kicked it up a notch.
2) His whole problem #3 -- that you can't explicitly access the upper 16 bits of a 32-bit GPR. All I can say is -- thank God! Being a programmer, he probably doesn't realize that being able to address sub-registers is actually a big problem with x86. The whole sub-register-addressing problem causes all kinds of extra dependencies and merge operations (a concrete sketch follows this list). And he wants to make it worse? I think he should be slapped for this idea. x86-64 had the right idea -- you cannot access -just- the upper 32 bits of a GPR, and when you execute a 32-bit instruction that writes a GPR, the upper 32 bits are not preserved. Which is how the RISCy folks have been doing it all along, but hey.
3) This idea requires an extra clock cycle in the front-end, to do the translation from architected to the expanded architected register space, prior to being able to do the architected->physical register translation.
4) Because you still can't address more than 8 registers at a time, you'll be using lots of MOVRMC instructions in order to make the registers you need visible. Ignoring how horrible this would make things for people writing assembly ("Okay, so now EAX means GPR 13?") or for compilers, this is going to result in a lot of code bloat.
5) Because of 1) and 4), modern superscalar decoders are going to be shot. If you fetch a MOVRMC, followed by POP EAX and POP EBX, you can't decode the second two until -after- you've decoded the MOVRMC and written its values into the map.
Now all this is so that you can save on loads/stores to the stack. Which is great, but at least when those loads and stores are executing, independent instructions can still go. Every RMC-related stall is exactly that -- no following instruction can make progress.
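To make point 2 concrete (the sketch promised above), here is a C-level analogue of a partial-register write. The store through the byte pointer is what "mov al, 0x2A" does to a live EAX, and the full-width read afterwards is the merge the hardware has to pay for; little-endian layout is assumed.

```c
#include <stdint.h>
#include <stdio.h>

int main(void)
{
    uint32_t r = 0xDEADBEEF;   /* think: EAX, already holding a value */

    *(uint8_t *)&r = 0x2A;     /* partial write: only the low byte
                                  changes, like "mov al, 0x2A"        */

    printf("0x%08X\n", r);     /* full-width read: the old upper 24
                                  bits must be merged with the new low
                                  byte -- prints 0xDEADBE2A           */
    return 0;
}
```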
Not that increasing the number of registers in x86 isn't a good idea -- it's just his implementation that sucks. Since he is an x86 programmer, I'm surprised he didn't think of the most obvious solution: just add a prefix byte to extend the size of the register identifiers in the ModR/M and SIB bytes. You get access to ALL GPRs at once (rather than an 8-register window), no extra stalls are required, and your code size only goes up by one byte for instructions that use the extra registers.
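For reference, that is essentially what AMD's x86-64 REX prefix does. A sketch of the bit packing (the field layout matches the published REX encoding; the helper function itself is just for illustration):

```c
#include <stdint.h>

/* REX prefix: fixed high nibble 0100, then four one-bit fields that
   extend existing encodings:
     W = 1 selects 64-bit operand size
     R = 4th bit of the ModR/M "reg" field  (8 registers -> 16)
     X = 4th bit of the SIB index field
     B = 4th bit of the ModR/M r/m (or SIB base) field */
static uint8_t rex(int w, int r, int x, int b)
{
    return (uint8_t)(0x40 | (w << 3) | (r << 2) | (x << 1) | b);
}

int main(void)
{
    /* "add r9, r10" begins with REX.WRB = 0x4D; the opcode and ModR/M
       bytes that follow are laid out exactly as on a 386. */
    return rex(1, 1, 0, 1) == 0x4D ? 0 : 1;
}
```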
I can't help but commend him on his idea being well thought out. To the best of his knowledge, he tried to address all the issues. But that's the problem -- he's a programmer, not a computer architect.
Because if you had read the article you'd realize that this is essentially a zero-cost, backward-compatible method of dramatically increasing program execution speed by several orders of magnitude -- so the question is really, "Why not?"
It does not matter how fast your CPU is if it spends a significant amount of its time waiting for main memory access. All that happens is that it's doing more NOPs/sec, which isn't terribly useful. That's why industrial-grade systems have fancy buses like the GigaPlane.
1 - Zero Cost. 2 - Backwards Compatible. 3 - Orders of magnitude.
1 - You have to buy new chips. This will improve the speed of "computing", but it will not increase the speed of THIS computer I have right HERE.
2 - No old code has RM/RMC instructions in it and will NOT run any faster than it already does in "standard" x86 mode. Yes, it is backward compatible, but by the same token so are MMX, EMMX, 3DNow!, SSE, SSE2, AA64, etc.
3 - Anyone who can sell me a program to "suddenly" make all my code go 10x or 100x faster is guaranteed to give me a good chuckle!
As for the article... well, it hugely increases the number of bits it takes to address a register, and swapping the RM register is going to cause all sorts of new dependency chains inside the chip.
Personally.... I'd go for a stack machine. Easily the most efficient compute engine.
Now, to get back to points 1 and 3: if YOU can make MY computer go 10 or 100 times FASTER with SOFTWARE, I promise I WILL give YOU some MONEY... ;-)
What Intel is currently doing is putting a turbo on an old and obsolete architecture.
By having more GP registers, you could do the same job more easily and with better performance (and code that is easier to read, if you write in ASM). As it is now, you need too many memory accesses for simple operations. With more registers, you would need less clock speed.
With more registers, you would need less clock speed.
Have you ever looked at the function entry and exit code for processors like MIPS or PowerPC? There can easily be 20-40 instructions (at 4 bytes per instruction) to save and restore registers. Sometimes fewer registers is a win.
Have you ever looked at the function entry and exit code for processors like MIPS or PowerPC? There can easily be 20-40 instructions (at 4 bytes per instruction) to save and restore registers. Sometimes fewer registers is a win.
Ridiculous. You're saying that architectures with lots of regs are inferior because they make you save lots of registers at certain times, but reg-starved architectures make you save them all the time, all over the place, in any code that feels the slightest register pressure.
At best, the problem you describe indicates those architectures use too many callee-save registers in their calling conventions. Having more caller-save registers is a pure win from this perspective.
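A small C illustration of where those prologue/epilogue saves come from (hypothetical functions; which registers an ABI marks callee-save varies):

```c
#include <stdio.h>

long g(long x) { return x ^ 1; }   /* stand-in for an opaque callee
                                      (imagine it lives in another file) */

/* t1..t4 stay live across the call to g(), so they must sit in
   callee-saved registers -- which f() then has to save in its prologue
   and restore in its epilogue. Values that are NOT live across a call
   can use caller-saved registers for free, which is why a convention
   with more caller-save registers dodges this cost. */
long f(long a, long b, long c, long d)
{
    long t1 = a * b, t2 = c * d, t3 = a + d, t4 = b + c;
    long k = g(t1);
    return k + t1 + t2 + t3 + t4;
}

int main(void)
{
    printf("%ld\n", f(1, 2, 3, 4));
    return 0;
}
```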
You do not understand how computers really work. If you have more registers, more instructions for manipulating those registers, and finally more cache, you don't need high bus speeds. The processor won't need to fetch much data from memory anyway, because it will have 99% of what it needs already in registers and internal cache.
We must not forget that most operations a processor does are data movements, not calculations.
All three x86 problems described by the article's author are fixed in the IA-64 architecture, but not in AMD's x86-64.
And you don't understand how modern processors really work. First, they have several levels of cache memory exactly because you don't want to go out to main memory too often. Second, they have many more registers than the assembly instructions can see: register renaming, speculative execution, and all those tricks reorganize instructions so that the CPU core doesn't really have to move data back and forth between memory and registers so often.
IANACPUD (I'm not a CPU designer), so I'm not going to try to describe this stuff further, but articles abound. Here are some:
Into the K7, Part One [arstechnica.com] and Into the K7, Part Two [arstechnica.com]
Shouldn't we improve bus speed, data access speeds, etc etc first? After all, the bottleneck is not the processor anymore...
No.
Because the whole point is preventing memory access. High bandwidth busses are very expensive. If you have a lot of registers, you can avoid memory accesses, making instructions run at full speed.
The best way to reduce the impact of a bottleneck is not making the bottleneck wider. It is making sure your data doesn't need to travel through the bottleneck.
After that, it doesn't hurt to make the bus bandwidth bigger.
That's pretty sweet how he makes the x86 processor faster by adding commands for divx! This guy knows how to improve Intel architecture for the masses!
Other new commands:
LIE - Launch IE
LMW - Launch MS Word
LME - Launch MS Excel
LMO - Launch MS Outlook
LMOV - Launch MS Outlook Virus
LCNR - Launch Clippy for No Reason
DPRN - Display Pr0n
SPOP - Show IE Popup
SPU - Spam User
SHDR - Send Hard Drive Contents to Redmond
RBT - Reboot
SBS - Show Blue Screen
Real people used stuff like jmp $fce2 for the first, but the latter was a little bit more complex because of the blue part: lda #$06 ; sta $d020 ; sta $d021 ; hlt (of course, hlt is an undocumented opcode, and since C64 boots in less than a second from ROM, it hardly is as frustrating as the bluescreen in Windows).
yeah, registers are expensive dude. More registers equals more money. Could it be... Faster chips more expensive?? If I want faster for more money there is already a product for me. It's called a Xeon.
by Anonymous Coward on Friday October 11, 2002 @07:47AM (#4431275)
Buy Intel's C/C++ compiler (icc) and download the high performance, Intel CPU optimized math libraries from Intel's site.
The compiler does for you exactly what the article says. It uses MMX and other extensions as well as vectorizes loops and does interprocedural optimization and OpenMP.
It doesn't come as a surprise that Intel already thought of this before going into simulating dual processors via HyperThreading, NetBurst, and several other advanced techniques to improve performance.
'Boy this instruction set would be better at the same clock-speed'
'All they'd have to do is update their verilog code and run it thru synthesis'
Well, they don't make processors straight from Verilog code; they'd be huge, hot, and slow. All this increased complexity he wants would dictate more transistors and a bigger die (along with lots of development time).
With the above said, it still might be a good idea; I don't know.
I designed a CPU architecture for an undergrad class last spring. I'm familiar with assembly as well as architecture, including pipelining and all that, and I'm not convinced that this solution is all that much of an improvement.

Like the previous poster said, you have to add more hardware. If you want this register mapping to not take a long time, you need to add the new registers, reconnect the general-purpose register address lines through a translator to these registers, and then translate what you get out of there to select the register it's mapped to. This isn't really that big of a deal compared to how much is already on these chips, but somebody has to design it and make it work -- this guy didn't do that, and I'm pretty sure he's more of a programmer than a hardware guy.

As for the speed improvement this would provide, I don't think it's as good as he thinks it is. While he mentions that the whole pipeline has to be paused for any instructions that change his mapping registers, he doesn't seem to realize how big an impact it is to have only one instruction in the pipeline at a time. If you change your register map every other instruction, you've pretty much thrown out any benefit you may have had from a pipeline. This too could be worked around, but it means that every assembly programmer who wants to use these .x instructions would need to understand the effect they have on the pipeline to actually get more speed out of them.

Also, since this involves changing the architecture of the chip, none of these .x instructions will work on any chips that are already out there. What happens if you try to do a couple of div.x instructions and find out that it used edx:eax for both when that's not what you wanted? If it gets put in the chips, it'll be a while before it's used, IMO.
besides all that, didn't anybody ever tell this guy that 8 (which is actually 6) general purpose registers ought to be enough for anybody?
So this guy wants to make registers virtual... won't this add a lot of silicon to each register, and make every register access slower? Every input that takes a register would need to become a switch instead of just a solid connection.
From the software side of things, this sounds great... I just wonder how much it would slow down the hardware side.
Of course I'm no chip designer, but neither was this guy.
Registers are already virtual. It is called register renaming, and it is necessary to get good speed-ups from superscalar processing (executing more than one instruction at a time).

The question is whether his register mapping can play well with normal register renaming. I think it would be trouble: currently there are only two layers, the physical registers inside the processor and the virtual ones it exports to the programmer. If this gets added, there would be the "real" physical registers the processor maps onto, the virtual physical registers the programmer maps, and the virtual virtual registers that are actually used in normal code.
This article says (in a very long-winded way) that he wants to implement something like register renaming for the x86. Register renaming is a common parallel-processing technique that gets more parallelism out of code by easing the limitations imposed by the number of registers a user has. One thing Chipgeek says is that it would require special instructions, etc. This seems sort of backwards, because classic renaming is handled automagically by the processor for you. In his case, he is trying to make it explicit in order to allow all registers to be used general-purpose style. I'm not so sure how worthwhile this is, because it will (obviously) require recompiled code for the new extensions -- and one of the things holding x86 back is binary compatibility.
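For readers who haven't met renaming before, a toy sketch of the bookkeeping (real hardware does this in parallel, with a free list and checkpoints for misspeculation; this sequential C version only shows the idea):

```c
#include <stdio.h>

#define ARCH_REGS 8     /* what the x86 programmer can name  */
#define PHYS_REGS 128   /* what the core actually has        */

static int map[ARCH_REGS];          /* architected -> physical */
static int next_free = ARCH_REGS;

/* Every instruction that WRITES an architected register gets a fresh
   physical register for its result, so two writes to "EAX" no longer
   collide -- they land in different physical registers. */
static int rename_dest(int arch)
{
    map[arch] = next_free++ % PHYS_REGS;   /* real HW uses a free list */
    return map[arch];
}

/* Reads just look up the current mapping. */
static int rename_src(int arch) { return map[arch]; }

int main(void)
{
    for (int i = 0; i < ARCH_REGS; i++) map[i] = i;

    int eax = 0;   /* architected register number */
    printf("mov eax, 1   -> writes p%d\n", rename_dest(eax));
    printf("mov eax, 2   -> writes p%d (independent of the first!)\n",
           rename_dest(eax));
    printf("add ebx, eax -> reads  p%d\n", rename_src(eax));
    return 0;
}
```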
They already do this internally -- they have a very large number of registers that get aliased to the regular architected registers. They use some for calculating branch speculation (and they swap in the registers from whichever speculative execution track actually happened). They also switch aliased register sets on every context switch.

What this guy wants is a way to have user-level control over the register aliases, and it might not be a bad idea, but I don't think he'll see as much gain from it as he expects, since there is already lots of magic going on behind the scenes with register aliasing. I'm guessing that if he just had separate processes both using registers intelligently, he could get as much done as he could with a single process and more registers. Since the cost of a context switch is already alleviated, there wouldn't be much overhead. The only overhead would be in the parallelizable algorithms themselves. However, we should already want to do that work to take advantage of SMP...
Hmm, so he wants to give the compiler control of some of the automagic optimizations that 'normal' CPUs use these days... I think I've heard this one before... Oh yeah, it's called VLIW... er... EPIC, and it has given us the wondrous ITANIC... er... Itanium.
Rick should have patented these ideas or sold them to AMD or Intel!

I also noticed the ad on geek.com was for job-geek... maybe someone at AMD or Intel should consider giving Rick a job! (But he appears to be intelligent enough, so he probably already has one. :)
If you have any questions, please feel free to e-mail me.
Maybe "they" will... there's gotta be just a couple people from Intel or AMD that read/.
by Anonymous Coward on Friday October 11, 2002 @07:53AM (#4431306)
Computer speed is like money. People have got this idea that endless amounts of it will increase our contentment. The problem is there is more to it than that... What most computer users want to use computers for (internet, chatting, email, typing, solitaire) should be doable quite well on a CPU from the 1980s. But instead, there are viruses, awful UIs, vastly bloated software, and an end user facing a constant battle to either fight with the computer or pay the money to buy a Mac. ;) Human society is dumping vast amounts of resources into buying new computers, upgrading, and developing ever-faster CPUs without actually making damn well-designed systems that do what they are meant to, don't break down, are easy to use, are cheap, and last for ages without problems. As usual, our priorities are messed up.
I totally agree with you.
There are applications where you NEED this kind of speed (research, databases, the web-porn industry). As I've worked in two out of three, it is nice to have a computer that can do all of that at home. And it is reasonably priced, too.
It sounds like a pretty decent idea to me. Granted, I'm no assembly expert (I'm just now in my Microprocessors class, which is based on the Z80), but I don't see how having more registers could be a bad thing. Anything that keeps operations there inside the CPU rather than going out to memory would pretty much have to be faster. I especially like the fact that he's implemented it such that no current code would be affected. THAT is a key point right there.
Admittedly, even if Intel and AMD decided to implement this, it'd still be a while, and then we'd have to get compilers that compile for those extra instructions, and there's our entire now-legacy code base that doesn't make use of them, and don't forget those ever-present patent issues...
But yeah. Cool idea, well thought out. Petition for Intel, anyone?
Ok, he realizes that the x86 architecture is flawed. One of its most limiting problems is the lack of general-purpose registers (GPRs), so he adds more complexity to an already over-complex design to solve it. All I have to say to this is: when will you see that the solution is as simple as switching architectures!
As most code today is written in higher-level languages (C/C++, Java, etc.), all it takes is a recompile and perhaps some patching and adaptations for small peculiarities. The Linux kernel is proof of this concept: a highly complex piece of code, portable to several platforms, with a huge part of the code fully portable and shared. This means that it is not hard to change architecture!
If the main competition and its money moved from x86 to a RISC architecture (why not Alpha, MIPS, SPARC, or PPC?), I'm sure the gap in performance per penny would go away pretty soon. RISCs have several advantages, but the biggest (IMHO) is simplicity: no awkward rules (non-GP registers), no special-case instructions, easy to pipeline, easy to understand, and easy to optimize code for (since the instruction set is smaller).
And to return to the original article. Please do not introduce more complexity. What we need is simple, beautiful designs, those are the ones that one can make go *really* fast.
I don't think anyone would disagree with that, but that's not the issue. What he's saying is, given that we've got to stick with x86 for historical and commercial reasons, this would be a relatively quick and easy way to allow the compilers to produce *much* groovier code.
If we're going to stick to the x86 we still do not want to add complexity. I also tried to point out how easy it would be to move to a new architecture.
Since you must add complexity, I do not think it would be "quick and easy". It takes huge resources in both time and equipment to verify the timing of a new chip, so these kinds of changes (fundamental changes to the way registers are accessed) are expensive and hard, since you also need to implement many new hardware pieces and verify their functionality (not only the timing!)
RTFA, or, nicely put: read the article. By adding these instructions he reduces the complexity of shifts and of the multi-instruction sequences it takes to do one thing, and he increases the visibility of all the registers. There are added instructions, but the benefit is reduced complexity in the assembly due to greater direct accessibility of all the registers.
Umm, an Intel CPU pretty much beats the pants off anything else on the market. On the downside, it's pretty tough to stuff 134 P4s in a server the way you can with a SPARC or a PowerPC.
I was talking about the absolute performance of a single Intel chip versus anything else on the market, not performance per penny. Perhaps you should have read my post more closely; nowhere did I mention cost.
No, RISC isn't inherently faster than CISC (and no, the P4 isn't a VLIW/RISC hybrid, it's a CISC processor with micro-code).
And both Intel and AMD spend much more on (x86-) processor development than IBM and Motorola and Sun and all others on their chips.
And no, x86 is not much faster. Not even at SPEC, which does not tell the whole picture.
As for AMD being faster, they basically had a stroke of luck with the Athlon design. Before that, AMD wasn't known for speedy processors (cheap, yes). And if it hadn't been for the Athlon, Intel's x86 also wouldn't be that far ahead (or ahead at all, actually); the Itanium II would be the contender to the big RISCs, and the fastest Pentium 4 would be at 2 GHz (if that much) and would cost $1000.
It's a cute idea having a "stackspace" for your GPRs, but you could just move to an architecture with more GPRs and not have to design a brand new chip (I hate verilog).
Now if I could only get my compiler to stop moving items from gpr to gpr with a RLINM that has a rotate of 0 and an AND mask of all 0xFFFFFFFF's!
It would definitely be nice to get rid of the legacy cruft and move to a different architecture; however, I doubt that this will happen until Intel and AMD start hitting major stumbling blocks. The inertia just seems too great. From what I hear (sorry, I don't have a source, but I think I heard it in my Computer Architecture class), the cores of current x86 chips are essentially RISC, with a translation layer wrapped around them (converting x86 instructions into the internal RISC instructions).
You are right that modern x86 implementations are RISCs with a translation layer around them (except Crusoe, which is a VLIW with software translation -- much cooler 8P ). Now just imagine if we could get direct access to those highly optimized RISC cores instead of having to code in x86 machine code.
As most code today is written in higher level languages (C/C++, Java, etc.) all it takes is a recompile and perhaps some patching...
But a lot of the code running today wasn't "written today" if you know what I mean. The problem is, in order to recompile you first need: a) the original source, and b) someone capable of patching, etc.
A lot of internal apps are in use for which the source code is lost. And a lot of code in use today (sadly) was not written in languages as portable as C, C++, and Java. A lot of apps in use today were written in Clipper and COBOL and a bunch of other languages that may not have decent compilers for other platforms. So recompiling it isn't an option. A complete re-write is necessary.
Even for situations in which application source *does* exist, and suitable compilers exist on other architectures, it is more often than not poorly documented...and the original author(s) is/are nowhere to be found. So in order to patch/fix the source to run on the new architecture, you not only need someone well versed in both the old and the new architectures, but someone who can read through what is often spaghetti code, understand it and make appropriate changes.
In a lot of these cases it's easier to stick with the current architecture. And that, to some degree, is why the x86 architecture has gotten as complex as it is.
Switching architectures is not that trivial. You seem to think that every company has the source code available for every piece of software they run. That isn't true. You seem to think that programs written in C/C++ can easily be recompiled across platforms -- also untrue. You think that the bug fixes for porting between platforms are "small peculiarities" -- well, they may be small, but that doesn't make them easy. In fact, it makes it fucking hard, because the differences are so buried in libraries, case-specific, and undocumented that it's a nightmare to find them. Yes, I've done this kind of thing. It's godawful.
Changing architecture is difficult. This is not a closed vendor market: anyone can put together an x86 box, and you have at least 3 different CPU vendors to choose from, 3-5 motherboard chipsets, and a virtually infinite variety of other hardware. If Dell suddenly decides to move to a PPC architecture, what's going to happen? They're going to lose all their customers, and fast. Because the very limited benefits of a different architecture do not make up for the costs of moving to one.
Yes, I said limited benefits. Yeah, when I was in college taking CompE, EE, and CS courses on CPU and system design, I also found the x86 ISA to be the most demonic thing this side of Hell. Well, I'm older and wiser now, and while x86 isn't perfect, it's not that bad either. Its price/performance ratio is utterly insane and getting better yearly. Contrary to the RISC doom-and-gloomers, x86 didn't die under its own backward-compatibility problems. It's actually grown far more than anyone expected and is now eating those same manufacturers for lunch.
You know, back in the early 90s when RISC was first starting to make noise the jibe was that Intel's x86 architecture was so bad because it couldn't ramp up in clock speeds. Intel was sitting at 66 MHz for their fastest chip while MIPS, Sparc, etc. were doing 300 MHz. Of course, now Intel has the MHz crown, with demonstrations exceeding 4 GHz, and the RISC crowd is saying that MHz isn't everything and they do more work/cycle than Intel (which is true, but the point remains).
All that said, go look at the SPEC CInt2000 [specbench.org] and FP2000 results [specbench.org]. Would you care to state what system has the highest integer performance? And whose chip has the highest floating point?
Oh, and let's not forget that I can buy roughly 50 server-class x86 systems for the price of one mid-level Sun/IBM/HP/etc. server.
Note - server performance isn't all about CPU, but since the OP wanted to make that argument, I just thought I'd point out how wrong he is. There is still quite a bit of need for high end servers with improved bus and memory architectures, but don't even try to argue that the CPU is more powerful. It isn't.
Anytime you modularize, you have to design interfaces. Interfaces are inherently slow: there's a physical disconnect, which simply can't have as good an electrical connection; they're bulky (consider that while a Pentium IV chip package is 35 mm on a side (1225 mm^2), the actual die is only 131 mm^2 -- the extra size is needed primarily for all the pinouts from the chip); and they're noisy.
Consider that while you can buy a P4 that runs at 2.8 GHz internally (and its fast ALUs run at 5.6 GHz, although they're only 16 bits wide), the memory bus is a lackluster 133 MHz (from which you get an effective 533 MHz because it's quad pumped: you transfer 4 values every clock instead of just 1). The I/O bus also runs at 133 MHz. These are the only two external buses the CPU deals with.
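The arithmetic behind the quad-pumped figure, assuming the P4's 64-bit (8-byte) data bus:

```latex
133\;\mathrm{MHz} \times 4\;\tfrac{\mathrm{transfers}}{\mathrm{cycle}} \times 8\;\tfrac{\mathrm{bytes}}{\mathrm{transfer}} \approx 4.3\;\mathrm{GB/s}
```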
If you were to try and segment the CPU similarly you'd quickly hit limitations. You simply can't run a multi-GHz electrical signal over a physical disconnect, at least not with current technology.
All of that said, if you look at how CPU cores are laid out the cache is distinctly segmented from the ALU, the ALU is segmented from the FPU, and so forth. It makes chip design easier since if you want to make a change to one part of the chip you minimize effects on other parts. It also helps for signal routing and noise prevention.
Also you can do more or less what you're asking - just not at high speeds. Modern chips are often preliminarily tested using gate arrays that can be reprogrammed quickly and easily... but instead of running at 3 GHz this test chip runs at 2 MHz. Maybe.
Oh... a final bit... back in the days of the 386 and 486, the 2nd-level cache was actually on the motherboard, and different MB vendors would put on different amounts of cache. Some even had it socketed or solderable so you could add more if you wanted! But by the time the P2 came out, clock speeds were too high for this; the connection latency and distance were simply too great. So we wound up with slot processors, where a CPU slot card carried the CPU core and 1-4 second-level caches. Pretty soon both Intel and AMD integrated the 2nd-level cache onto the CPU itself (which wasn't previously possible, because it would have made the chips far too big), which further improved speed. The next generation of CPUs requires 3rd-level cache on the motherboard. How long before that gets integrated onto the CPU?
RISCs have several advantages, but the biggest (IMHO) is simplicity: no awkward rules (non-GP registers), no special-case instructions, easy to pipeline, easy to understand, and easy to optimize code for (since the instruction set is smaller).
Not entirely true. RISC instruction sets can be quite huge too. And the whole idea of RISC is to take the complexity out of the hardware and put it into the compiler instead. It is easier to optimize for x86 than RISC.
When I started looking at the ARM chips I wondered why we ever used x86's etc.
RISC / CISC is really a misnomer.
RISC has plenty of instructions, and it's meant to be superscalar.
It starts with register gymnastics. Basically, with RISC there's no more of it. Every register is general: it can hold data, or it can hold an address. All the basic math functions can operate on any register.

With Intel x86, everything has its place.

Extend it further out. There's something called "conditional instructions". Properly utilized, these make for ultra-efficient use of the code cache: the processor can discard condition-failing instructions ahead of time, which also means not as much unnecessary "pipeline preparation" to perform an instruction.
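A hedged example of what conditional execution buys. For a branchless max like the C below, a classic 32-bit ARM compiler can emit a compare followed by a conditionally executed move; the exact instructions are up to the compiler, but the shape is the point:

```c
#include <stdio.h>

/* On classic 32-bit ARM this can compile to just:
       CMP   r0, r1      ; set flags from a - b
       MOVLT r0, r1      ; executes only if a < b
   The MOVLT is squashed rather than branched around, so the pipeline
   never has to guess a branch direction. */
int max(int a, int b)
{
    return (a < b) ? b : a;
}

int main(void)
{
    printf("%d\n", max(3, 7));   /* prints 7 */
    return 0;
}
```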
Then there's THUMB, which compresses instructions so that they take up less physical space in a 64- and 128-bit world. There are lots of wasted bits in an .exe compiled for a 386.

Last I checked, 32-bit ARM/THUMB processors are dirt freaking cheap, and they're manufactured by a consortium of a multitude of vendors, as opposed to just AMD and INTC.

The Internet is slowly wearing down the x86 as more and more processing moves back onto the server, where big-iron-style RISC can churn through everything.
Both the Intel Pentium III and IV and the AMD K6-2 and K7 (Athlon) are essentially RISC processors at the core. There's an outer layer that translates from the x86 ISA to their internal microarchitecture -- except for a few outdated commands that are virtually never used, which are implemented in microcode (and are thus slow as hell, comparatively).
There is no way to directly access the core ISA, nor do I know of it being documented anywhere. Intel planned to move the industry off the x86 ISA to Itanium, but so far that's utterly failed and with the Intergraph lawsuit it may be dead in the water now.
AMD's x86-64 still uses the x86 ISA, but extends it. Additionally, if you talk to the chip in 64-bit mode, then 8 (I think) additional GP registers are available in silicon -- not just register renaming, which already occurs in every major CPU on the market today. The additional registers (all 64 bits wide) pretty much eliminate the need for an architecture move, at least as far as registers are concerned. Intel hasn't yet adopted x86-64, though (although they can, since AMD must license it to them because of IP agreements).
Still, what's funny is this desire for a performance increase... the x86 chips are the fastest CPUs on the market for integer performance and in the top 5 for floating point - although Alpha still reigns supreme for FP I believe. But compare the price of an x86 chip to pretty much anyone else and you start wondering exactly what the performance issue is.
The performance problems are not with the CPU anymore. The bus and memory interfaces are slow. They've been getting faster over the years, but closed vendor boxes like Sun, HP, IBM, etc. will always do better because they don't have to deal with getting a half dozen different major OEMs on board, along with countless peripheral manufacturers. Nor do they have to concern themselves overly with backwards compatibility.
If you want lots of general purpose registers, take a look at Knuth's MMIX [stanford.edu] system. Unfortunately, it's not in silicon, but it's there, and it/could/ be done, if someone wanted to . ..
The scheme as proposed would work, but nothing will change the fact that it's another hideous hack to get around the non-orthogonal addressing modes in the original Intel 80x86 architecture.
Even the little microcontroller chips that I can buy for $2 have 32 general purpose registers (Atmel AVRs, for anyone who cares).
Worse, this scheme would not benefit existing code - it still requires code changes to work.
Finally, on the gripping hand, the Pentium III and 4 have a very similar register renaming scheme going on automatically in the hardware. The 8 "logical" registers are already mapped dynamically into a much larger physical register file. (From ExtremeTech: http://www.extremetech.com/article2/0,3973,471327,00.asp.)
I'm reminded of the days I used to code for the old Acorn Archimedes (don't look for it now, it's not there any more) and our apps were usually way faster than the competition's.
When asked why, we were tempted to tell them that we used the undocumented 'unleash' instruction to unleash the raw power of the ARM processor.
This is what I call the big problem. That design is utterly abominable. We live in a world where it's nothing to have 1 gigabyte of RAM in a computer. We have 80 GB hard drive platters now, allowing even greater-sized drives. And yet at the heart of every single one of your x86 computers out there, a mere 6 GP registers are doing nearly all of the processing. It's amazing. And it's something I've personally wrestled with every day of my assembly programming career.
This sort of reminds me of what happened with IRQs. Ultimately Intel "solved" this via the PCI bus, but performance has occasionally been problematic. Of course, that problem goes back to the original IBM design for the original IBM PC. Intel is also very aware, I imagine, of what happened when IBM tried a total redesign with the EISA bus, etc. It got rejected, I think, primarily because it was proprietary. In any case, enough companies have been nailed on backward-compatibility issues that Intel may be nervous about making a total break.
The upside is being able to run old software on new hardware. You don't want to break too many things.
MicroChannel was the bus you are thinking of. It actually was very good, but wasn't backward compatible with ISA. EISA was the rest of the industry's response, providing a 32-bit bus that was backwards compatible. It wasn't a very good implementation, since it was still locked at 8 MHz.
As others mentioned, MCA (MicroChannel Architecture) was IBM's abysmal attempt at recapturing the PC market. It died a horrible death, and deserved it. Frankly, the technology sucked only slightly less than the ISA/EISA bus it wanted to replace.
Anyone else remember the horrors of all those damn control files on floppies?
There are a lot of architectural nightmares in the PC design... and while some of them are at the CPU level (like the 6 GP registers), most of them are at the bus level. Who the hell puts the keyboard on the 2nd most important interrupt (IRQ1)? The entire bus is still borked, although PCI has mostly hidden that now. But the system and memory buses are the sole reason that IBM, HP, Sun, etc. have higher performance ratings than x86 -- the P4 and Athlon processors are faster in virtually every case on a CPU to CPU basis.
The bus and memory architecture is also why x86 does so incredibly bad in multi-CPU boxes. It's just not designed for it, the contention issues are hideous, and while you may only get 1.9x the performance going to a 2 CPU Sun box, you'll only get 1.7x on x86. It gets worse as you scale (note - those numbers are for reference only, I don't recall the exact relationships for dual CPU x86 boxes anymore, but the RISC systems handle it better due to bus design).
Really there's nothing wrong with the x86 processors except to the CompE/EE/CS student. I was there once and couldn't stand it. Real life has shown that it isn't that bad, and recent times have shown that it's actually really damn good. Except for the buses. They suck. And while things like PCI-X and 3GIO are on the horizon, I don't see them seriously changing the core issues without causing massive compatibility problems.
I remember the "next big thing" during the early and middle 90s was RISC. So will the next big thing be MCISC (More Complex Instruction Set Chips)?
I wonder if the core of an MCISC will be RISC, or CISC that in turn has a RISC core.
The guy does not realize that what he proposed is not at all simple to implement in silicon.
These two additional mapping registers would dramatically complicate pipeline hazard detection.
Another point is that I don't think doubling or tripling the number of available registers will get you a tenfold performance increase: a small increase could be expected, but not much.
Another problem is the SpecialCount counter: it would complicate compilers too much. It would also make instruction reordering almost impossible.
While the base idea is interesting (add instructions that support using the multimedia registers as GP registers), I suspect that actually implementing the functionality of the GP registers in the multimedia ones could result in a prohibitively expensive CPU.
Anyone who's ever tried to use the MMX or XMMX registers for non-multimedia applications knows what I'm talking about. Their instruction sets are nicely tweaked to let you do "sloppy" parallel operations on large blocks of data, and not really suited for general computing. You can't move data into them the way you would like to. You can't perform the operations you would like to. You can't extract data from them the way you would like to. They were meant to be good at one thing, and they are.
I once tried to use the multimedia registers to speed up my implementation of a cryptographic hash function whose evaluation required more intermediate data than could nicely fit in GP registers, and had enough parallelism that I thought it might benefit from the multimedia instructions. No such luck. The effort involved in packing and unpacking the multimedia registers undid any gains in actually performing the computation faster -- and the computation itself wasn't that much faster. I was using an Athlon at the time, and AMD has so optimized the function of the GP registers and ALU that most common GP operations execute in a single clock if they don't have to access memory, while all the multimedia instructions (including the multiple move instructions to load the registers) require at least 3 clocks apiece.
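The shape of that pack/compute/unpack overhead, sketched with SSE2 intrinsics rather than the MMX forms the paragraph above describes (current compilers barely support the MMX intrinsics, but the complaint is the same):

```c
#include <emmintrin.h>   /* SSE2 */
#include <stdio.h>

int main(void)
{
    int a0 = 1, a1 = 2, a2 = 3, a3 = 4;
    int b0 = 10, b1 = 20, b2 = 30, b3 = 40;

    /* Packing: several scalar moves just to build each vector register. */
    __m128i a = _mm_set_epi32(a3, a2, a1, a0);
    __m128i b = _mm_set_epi32(b3, b2, b1, b0);

    /* The actual work: one instruction performs four adds at once... */
    __m128i s = _mm_add_epi32(a, b);

    /* ...then unpacking costs more moves and shuffles to get the
       scalars back out, which is where the gains went. */
    int r0 = _mm_cvtsi128_si32(s);
    int r1 = _mm_cvtsi128_si32(_mm_srli_si128(s, 4));
    printf("%d %d\n", r0, r1);   /* prints 11 22 */
    return 0;
}
```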
Now this leads me to suspect that the multimedia registers have limited functionality and slow response for a single reason: economics. The lack of instructions useful for non-multimedia applications could be explained via history, but what chip manufacturer wouldn't want to boast of the superior speed of their multimedia instructions? And yet they remain slower than the GP part of the chip.
So I conclude that merely making a faster MMX X/MMX processor is prohibitively expensive in today's market. And this proposal would definitely require that, even if actually adding the additional wiring to support the GP instructions for these registers was feasible. Because what would be the point of using these registers for GP instructions if they executed them slower than the same instructions actually executed on GP registers?
The whole gist of the article has to do with the x86's lack of general-purpose registers. While this is true, you're not going to solve all of the x86's shortcomings simply by figuring out a way to add more of them. There are MANY things wrong with the x86 design; GP registers are just one of them. There's an entire section in the famous Patterson [amazon.com] book that goes into all of the issues in much more detail than I care to state here.

Besides, there are already more efficient (albeit complex) solutions to extend registers that make much more sense in the current world of pipelined processors. Register renaming [pcguide.com] is one such example.
It's interesting to hear "revolutionizing performance" in the same topic as instruction level fiddling. The only way to give truly "revolutionizing" performance is to do high level optimizations.
When you have your highly optimized C++ code or whatever, *then* you can get down to low level and start polishing whatever routine/loop is the bottleneck. The compilers of today also usually do a better job than humans at optimizing performance at this level and ordering the instructions efficiently, especially if you consider the development-time costs of doing it by hand. It's a myth that manually written assembly code is generally faster -- many modern compilers are real optimizing beasts. :-)
Anyway, I think one should always keep in mind that C++ code will gain far more from well-optimized C++ than from new assembly-level instructions, regardless of whether they unlock SSE registers for more general-purpose use or whatever. Oh, and more registers won't revolutionize performance either. If they did, Intel and AMD would already have fixed that problem; I'm sure they're more than capable of doing it. More registers increase the amount of logic quite dramatically, and I'm pretty sure the performance gain isn't good enough for the increased die cost, compared to increasing L2 cache size, improving branch prediction, etc.
From what I gathered in the article, it seems like he is proposing a scheme by which normally unused registers (MMX, etc.) can be used as general-purpose registers. To do this, he considers an aliasing system. My question is: why can't an x86 programmer today just use those MMX registers for more general purposes? I'm sure there's a good reason, I just can't figure it out from the article. Thanks.
I found the article intriguing, but during the entire verbose, self-important sounding read, I was wondering how ISRs would be handled. For example, if the RMC were set to revert to the default mapping in three ops, and an ISR interrupted after the first op, would it revert to the default mapping in the middle of the ISR?
Fortunately, that issue is addressed in his Message Parlor [geek.com]. The full text of his response to BritGeek follows:
Presently the registers are saved automatically by the processor in something called a Task State Segment (TSS) during a task switch. There are currently unused portions of TSS which could be utilized and (sic) for RM and RMC during a task switch.
The PUSHRMC and POPRMC instructions are available for explicit saves/restores of the RM and RMC registers in general code. I don't recommend it, however. The decoders would be physically stalled until the RM/RMC registers are re-populated. It would be better to use explicit MOVRMCs in general code.
From the sounds of the article, he wants to make register mappings more logical than virtual. My knowledge of assembly-level programming is pretty basic, but I do agree that adding more GP registers would probably increase performance measurably.

His second proposal, the RegisterMap field, strikes me as the incredibly complex part of this idea. He sounds like he's suggesting an idea that will turn the x86 architecture into a simplified emulator by allowing you to logically map any register address to any physical address you choose. While there are probably some benefits to this, it sounds like the complexity of programming an already exceptionally complex chip could go through the roof!

I read somewhere in a previous article (last year sometime, can't find a link) that the way most compilers treated x86 already involved so many pseudo-instructions as to basically be an emulator. Now, this was before I had any knowledge of assembly-level programming, so maybe someone with more knowledge could clarify?
...And the best part is that I believe this is something that could be implemented in hardware in a manner which could be resolved and entirely applied during the instruction decode phase, thereby never passing the added assembly instructions any further down the instruction pipeline, and thereby not increasing the number of clock cycles required to process any instruction. I can provide technical details on how that would work to anyone interested. Please e-mail me if you are....
If this is really achievable without wasting *any* extra CPU time (that waste would apply to *all* instructions the CPU goes through!), this is indeed a good stunt that could add substantial oomph to x86 performance with the code we have today. Thank god, 'cuz my Athlon is too hot already and I'm kinda skeptical about watercooling. :-) Then again, that's a big "if".
I do not like the changes proposed although x86 is awfully flawed (not enough GP registers, terribly overloaded instruction set {anyone ever used BCD commands? -- Yes, I hear the loud "We do" from the COBOL corner.}, you name it... ).
But this change would:
Make an internal interface explicitly controlled by the programmer/compiler, loading an enormous amount of work on the compiler creators. (Just have a look at IA64 - is there any good compiler out there already? I haven't had a look for a while.)
Destroy (or at least reduce) the efficiency of the internal register renaming unit, thus slowing down the out-of-order execution core and such (the entire core, actually...)
Sorry, but this man may have been busy programming x86 assembly his entire life (and for this he deserves my respect), but he is not up to date on how a modern x86 cpu works in its heart. When I heard the lectures in my university about how this stuff works, I gave up learning assembly -- one just doesn't need it anymore with the compilers around.
Reading the books by Hennessy and Patterson may help a lot.
I hate to say it, but lately it's becoming more and more obvious that Intel is no longer really interested in performance. They'll squeeze a bit more out of an ancient architecture and add a few buz words like "SSE2", so they can slap on a hefty price-tag.
Look at the Pentium 4 design! Intel would much rather use a dated CPU with a nice pretty GHz rating than keep the same MHz and improve the architecture.
Do you really think investors give a shit about registers?
The only potential downfall I see in this design is the possible pipeline stall seen when RM/RMC have to be populated from stack data. When that happens, no assembly instructions can be decoded until the POPRMC instruction completes and RM/RMC are loaded with the values from the stack.
Actually, this is just one of many potential downfalls. He forgot interrupts, mode switching (going from protected to real mode, as some OSes still do), and I/O, all of which would require that the proposed RM/RMC registers be loaded from the stack. The net effect would be that if his scheme were implemented, existing programs would run slower, not faster. Furthermore, placing the RM/RMC register on the stack is an impossibility without breaking backward compatibility; many assembly language coders depend on a set number of bytes being added to the stack when they perform a call or interrupt.
Why not just add 24 GP registers to the existing processor? Honestly, it would be a lot simpler, and would not complicate the whole x86 mess, nor break backward compatibility.
I don't mean to flame, but this guy is way off base. The biggest problem with the x86 instruction set is lack of registers, and the second biggest problem is that its complexity is rapidly becoming unmanageable. Not even Intel recommends using assembly anymore - their recommendation is to write in C and let their compiler perform the optimizations. Adding more instructions like this would further diminish the viability of coding in assembly.
A far better solution would be to simply keep the existing instruction set intact and add more GP registers. IBM got it right the first time - their mainframe processors have 16 general purpose registers which can be used for any operation - addressing, indexing, integer, and floating point calculations. If anything, Intel should stop adding instructions and start adding registers.
Oops. Forgot about PUSHA/POPA. Kind of strange, too, because I use these a lot.
Also, about the opcode problem - adding registers doesn't necessarily mean adding opcodes. For example, IBM mainframes have one opcode for a load register instruction, and the registers are specified in the instruction. Were IBM to double the number of registers, the opcode would not have to change (granted, the instruction would get longer, because they only allocated enough space in the source and destination fields to specify one of 16 registers). The problem is with the way x86 opcodes work - they aren't as universal; that is, the opcode's first byte is a function of both the operation and the register used. So expansion would be pretty difficult, unless they expanded the instruction set to include two-byte opcodes (which they've already done, iirc) and used general-purpose opcodes for common operations such as loading and storing.
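To make the "universal opcode" point concrete, here's a minimal decode sketch using the real S/360 RR format (LR really is opcode 0x18; the C harness around it is mine, for illustration only). Both register numbers live in the second byte, so the opcode never changes with the register:

    #include <stdio.h>
    #include <stdint.h>

    /* S/360 RR format: one opcode byte, then one byte whose high and
       low nibbles name the destination and source registers (0..15). */
    static void decode_rr(const uint8_t insn[2]) {
        int r1 = insn[1] >> 4;    /* destination register */
        int r2 = insn[1] & 0x0F;  /* source register */
        printf("opcode %02X: R%d <- R%d\n", insn[0], r1, r2);
    }

    int main(void) {
        uint8_t lr_5_10[2] = { 0x18, 0x5A };  /* LR 5,10 */
        decode_rr(lr_5_10);                   /* opcode 18: R5 <- R10 */
        return 0;
    }

Doubling the register count would just widen those fields, as noted above, without minting a single new opcode.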
It's unfortunate, but true.
The real, and only solution, is that these companies get their acts together, quit issuing refreshes of old hardware, and finally give us their next gen chips to play with. Proposing anything else is just pointless. (Unless, of course, the new CPUs completely flop..)
Couldn't agree with you more. What I would really like to see is an x86 processor that could handle IBM mainframe instructions. The IBM mainframe instruction set makes a lot more sense than Intel's - unlike Intel, IBM realized that someday they might be doing 64-bit and 128-bit computing, and designed the instruction set to be expandable. Also, they don't have a lot of "garbage" instructions - no MMX, no SSE, no SIMD junk to clutter up a good design. To be honest, benchmarks that I've run on real-world software indicate that today's x86 processors complete 4 instructions for every 5 clock cycles, which suggests that branch prediction and deep pipelines aren't the performance enhancers that Intel and AMD seem to believe them to be. While they might work well in theory, real-world performance speaks otherwise. Given this, I don't see any practical reason for keeping a kludgy instruction set around, because the complexity of the instruction set has been a great hindrance to the actual, rather than the theoretical, optimization of x86 processors.
The part that confuses me is that, since code would need to be recompiled to make use of this, you might as well just compile for x86-64 and make use of a larger flat register space. While the idea is interesting, there doesn't seem to be any advantage to using it (and a few disadvantages, pointed out by other posters).
I can speak with some authority on this subject, since I am presently taking a course on code optimization. What it looks like Mr. Hogdin is trying to do is work around the issue where people do not compile programs with processor-specific optimizations. He seems to be proposing doing so by allowing "paging", per se, of registers amongst themselves, although in a bit of an odd fashion.
Personally, I am not too fond of this approach. First of all, operating systems would need to be written to support this paging. Secondly, running a single MMX- and/or SSE-enabled application (which would use most if not all of the mapped registers) would cause all the other applications on the system to suddenly lose any benefit that paging would provide.
The approach I would take (which may or may not be better) would be to change the software. Compilers like gcc 3.2 already know how to generate code with MMX and SSE instructions. Patches are available for Linux 2.4 that add gcc 3.2's new targets (-march=athlon-xp, etc.) to the Linux kernel configuration system. Libraries for *any* operating system compiled for a processor or family of processors would likely fare better than generics.
And yes, gcc 3.2 can do register mapping in a similar fashion (to ensure that all registers get used) on its own. If you read gcc's manual page, you will note that this makes debugging harder, though. Gcc even has an *experimental* mode where it will use the x87 and SSE floating point registers simultaneously.
Mr. Hogdin's approach might be a bit better for inter-process paging by a task scheduler for low numbers of tasks. But as a beginner in this field, I'm not sure what else it would be good for.
Please pardon the omissions; I am not presently using a gcc 3.2 machine:)
I thought of a context switch (or possibly a function call) too. Correct me if I am wrong, but what you are trying to do is to create a bunch of registers (my understanding being they will just be the existing x86+MMX+SSE registers, unnamed), and "map" them via another register that certain software knows how to access, correct? That way, when an application knows about these, it can "squirrel" data away in "hidden" registers for fast access later?
The primary problem I have with this "switching" of registers is that registers are supposed to be the fastest, most reliable memory components in a computer. By forcing a lookup table and its associated logic into the mix, you are potentially reducing a processor's speed and/or scalability significantly. Furthermore, the amount of data that can be hidden away inside a processor is limited. While hiding registers is nice, perhaps it would be better to have the ability to "latch" a row of data so it won't be cleared out of the L1 cache (no processor can do this at the moment?). I would think that this would be much easier to implement without speed degradation, as it would only require a few additional gates used during lookup/overwriting of the L1 cache (which ideally, for this case, is at least semi-associative, i.e. any memory "block" can map to at least two locations in the cache).
Secondly, your proposal (as I understand it) would require all the registers to share the same area on a chip. Nowadays, the MMU, arithmetic/logic unit, etc., each have their own area on the chip. Shared/swapped registers would have to be in the center of the chip, with longer lines to each partial unit (yielding delays and capacitance). I believe you proposed doing this by subunits, though; this would reduce delays somewhat, but you are still requiring some centralization and adding significant delay.
My personal position on this still kind of stands; if a program's compiler knows how to make use of the MMX & SSE functions of a computer, it should be set up to do so. That way, after an initial context switch for the entire program, the program (being correctly configured for a processor) flies.
A compiler with register-renaming functionality ("gcc3.2 -frename-registers", for example) can help do this for apps where the programmer does not know assembler. And if your "minimum requirements" mention a Pentium II 500, don't compile for a 486!
In short, I fail to see how your proposal will speed up most applications significantly. Context switches are always expensive, but the ability to change contexts in 10 clocks versus 30 really isn't significant when your backside bus runs at less than 50% of the processor's speed.
Obviously, being a minor player, I have my views, and I have to respect yours (especially since I only had about 5-10 minutes to read your piece), but personally, I really do not see why program-accessible context switching inside a processor is needed.
What a load of drivel... Do I really need to point out everything that's wrong with this "idea"?
Actually, would you point some things out? I like Slashdot because it can act as a bullshit filter. So when I read an article about x86 assembler technique at 8 frickin 30 in the morning, maybe someone's post will help me understand whether to bother to try to understand the article. Or my previous sentence.
He tries to improve the horrendously flawed x86 ISA by adding an additional layer of complexity. Now, not only do we have the inane segmented memory space, but we have a segmented register set as well.
He claims that all this can be done in the decode stage without any impact on clock cycle time. This simply isn't possible: either the clock cycle has to get longer or we need to add a new pipeline stage; either way, the processor slows down.
The real problem with x86 isn't the lack of registers (it's a problem, but not the biggest one) but, rather, the abominable decode rules and the fact that every instruction can incur multiple fault conditions. The only way to address these issues is to get rid of the x86 ISA altogether, not to add some bizarre register encoding scheme to the pot.
If he really wanted to improve the x86, he could have added either a new processor mode, in which a rational ISA was used, or an instruction prefix that changed the interpretation of the mod reg r/m bits to address more registers. Either would be a better solution than the crazy tripe he has dreamed up, though not nearly as good as simply dumping the entire stinking pile that is the x86 ISA and starting over with a nice, clean RISC design.
Of course, my real complaint wasn't simply that his proposed "enhancement" to x86 was unsound, but that Slashdot actually took this seriously. The article is rambling, disjointed and generally incoherent. It doesn't deserve any serious consideration. The fact that Slashdot gave it the slightest credence simply implies that Slashdot doesn't deserve to be taken seriously either.
The fact that Slashdot gave it the slightest credence simply implies that Slashdot doesn't deserve to be taken seriously either.
You have a low UID and you still don't get Slashdot? These kinds of articles are posted all the time, but then we get good comments like yours pointing out why they are bunk.
Consider it like geek peer review. And thanks for your comments.
>"MMX/3d stuff for CPUs are lame, we have 3d cards for that."
good luck doing scientific calculations on a Geforce:)
OTOH
>"add a FPGA matrix of 4096x4096 transistors or >something on the side of the cpu for custom UBER fast routines"
^^^^ That idea has me intrigued. Anyone who actually knows more about FPGAs than me (which isn't difficult) want to go into the pluses/negatives of that concept?
http://www.ee.ualberta.ca/~elliott/cram/ is your ultimate parallel compute machine. It turns your entire memory (all the CRAM, anyway) into a register set. It is based on the concept that, rather than bringing the data to the CPU for the computation, the CPU is brought to the memory.
Small computational units (AND/OR/adder) are included on the bit access lines for all the memory cells.
good luck doing scientific calculations on a Geforce
Wha? Someone didn't tell you that 3D accelerators do lots of math requiring very intensive scientific calculation, even if their implementations don't produce the most accurate results? In fact, much of the math they use is used by physicists, engineers, and mathematicians every day. Unfortunately, getting the information out in a way that lets you permanently store it, or know the exact results, could be quite difficult, other than seeing it as graphics on your screen. BTW, I do think that CPUs still need great ability to do computationally expensive instructions. There is enough math in the form of collision detection and game physics, among other things, to still need lots of processing power on the CPU.
Well, that and the fact that the guys name has (almost) the same initials [as the mnemonic he chose]
In the early 1980s, William D. Mensch designed the 65c816 (and its little brother 65c802) microprocessor to be compatible with software written for 65c02 processors. He added a couple new addressing modes for SP-relative addressing and a whole bunch of new instructions to account for the 16-bit registers in both processors and the 24-bit address bus in the 65c816. He also reserved instructions 'COP' (COProcessor access) and 'WDM' (WiDe Math) for use on a future 32-bit 65c832 processor. Of course, you can see an alternate expansion for WDM...
If there were any sense to this comment, the x86 would have proved such a disaster that it would have been abandoned ten years ago. Many people think it should have been, and that its continued existence is some bizarre aberration of rational forces.
In actual fact, the ugliness of the duckling was less of an impediment than advertised.
There are several consequences of large, flat register sets. First of all, if your register set greatly exceeds the number of in-flight instructions, you have a lot of extra transistors in your register set sitting there, on average, doing nothing. Well, not nothing. They are sitting there adding extra capacitance and leakage to your register file, increasing path length, cycle times, power dissipation, and routing complexity.
Second effect: large register sets increase average instruction length. Larger average instruction length translates into a larger L1 instruction cache to achieve the same hit ratio. PPC requires a 40% larger I-cache to achieve the same effectiveness as the x86 I-cache.
Third effect: context switches take longer. If you want to actually use all those registers, your process has to save and restore them on every context switch.
Finally, there is the register set mirage. Modern implementations of x86 have approximately 40 general purpose registers. Only you can't see most of them. Six of these can be named in the instruction set at any given time. The others are in-flight copies of values previously named with the same names. This all happens transparently within an OOO processor model.
If x86 only had six GP registers in practice, it really would have died ten years ago. What it actually has is six GP registers you can name at any one time, which means only six GP registers you have to load and store on context switches, etc.
What did die ten years ago was the notion that convenience to the human assembly language programmer was worth a hill of beans. Good architectures are convenient to the silicon and the compiler.
Other aspects of x86 have proved more serious than the shortage of namable GP registers. Too many instructions change the flag register, affecting too many patterns of flag bits. That's hell for an OOO processor to patch back together. The floating-point register stack was an abomination. Lack of a three-operand instruction format is another significant liability.
On the other hand, the ill-reputed RMW (read/modify/write) instruction mode is 90% of the reason the Athlon performs as well as it does. You get two memory transactions for the price of one address generation, translation, and cache contention analysis. It amounts to having the entire L1 cache available as a register set extension every other clock cycle (leaving half of your L1 cache cycles for other forms of work).
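A minimal illustration of that RMW point (my example, not the parent's): the histogram update below compiles on x86 to a single read-modify-write add against memory, where a load/store RISC must spell out the load, the add, and the store separately, each needing the address again.

    #include <stdio.h>

    int counts[256];

    /* Each counts[buf[i]]++ becomes one x86 RMW instruction, roughly
       add DWORD PTR [counts+eax*4], 1, so one address generation covers
       both the read and the write. A load/store machine needs something
       like lw t0,0(a0) / addi t0,t0,1 / sw t0,0(a0) (MIPS-style, shown
       for contrast). */
    void tally(const unsigned char *buf, int n) {
        for (int i = 0; i < n; i++)
            counts[buf[i]]++;
    }

    int main(void) {
        const unsigned char data[] = "hello";
        tally(data, 5);
        printf("'l' seen %d times\n", counts['l']);  /* prints 2 */
        return 0;
    }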
Having someone comment on the x86 is an excellent litmus test of their capacity to dig deeper than shallow preconceptions of elegance. If it were anything other than the despised x86, its ability to scale from 4.77MHz toward 10GHz would have been considered a marvel of engineering soundness. Sometimes ugliness has lessons to teach us. Who among us is prepared to listen?
Why? (Score:4, Insightful)
Cache is the key (Score:3, Insightful)
Re:Cache is the key (Score:5, Informative)
Re:Cache is the key (Score:5, Insightful)
Re:Cache is the key (Score:4, Informative)
Re:Cache is the key (Score:5, Insightful)
Most server CPUs, like Alphas and SPARCs, have fairly large caches for the following reasons:
1) Databases love large caches. They are one of the few applications that can take advantage of a large cache, because they can store lookup tables of arbitrary size in cache. Server CPUs are often used for databases because Joe x86 CPU is just fine for web servers, FTP servers, desktop systems, etc., and is generally faster at them than server CPUs.
2) Most server-class CPUs are fully 64-bit and do NOT support register splitting. On the SPARC64, for example, if you want to store an integer containing the number "42", that integer will take up a full 64 bits, regardless of the fact that the register can store numbers up to 18,446,744,073,709,551,615. This larger size increases the cache size needed to store the working set of programs, because all integers (and many other data primitives) require a full 64 bits or more. With x86 CPUs, which support register splitting and have only 32-bit registers, that number could be stored in a mere eight bits -- the square root of the number of bits the SPARC requires.
3) Big servers with multiple CPUs are often expected to run multiple apps, all of which are CPU intensive. If the cache can store the working set for all of them, speed is slightly improved.
That said, who in their right mind would use an incredibly slow Pentium Pro for a CPU intensive calculation? A Pentium Pro at the highest available speed, 200MHz, with 2MB cache may be able to outperform a Celeron 266, but not by much and only for very specific cache-hungry software. Show me a person that thinks a Pentium Pro with even 200GB of cache can outperform ANY Athlon system and I will show you a person that hasn't a clue what they are talking about.
Look at the performance difference between the Pentium IV with 256K and with 512K (a doubling) of cache. You will have to do some research to find an application that gets even a 10% performance boost.
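If you want to see that working-set cliff for yourself, a crude timing loop like the sketch below (mine, and deliberately simplistic: a serious test would pointer-chase to defeat prefetching) shows throughput stepping down as the array outgrows each cache level, and barely improving once a level already holds the whole working set.

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    int main(void) {
        const long accesses = 1L << 25;          /* same work at every size */
        for (long kb = 16; kb <= 4096; kb *= 2) {
            long n = kb * 1024L / sizeof(int);   /* n is a power of two */
            int *a = calloc(n, sizeof(int));
            if (!a) return 1;
            volatile long sink = 0;
            clock_t t0 = clock();
            for (long i = 0; i < accesses; i++)
                sink += a[(i * 16) & (n - 1)];   /* one cache line per step */
            printf("%5ld KB array: %.2f s\n", kb,
                   (double)(clock() - t0) / CLOCKS_PER_SEC);
            free(a);
            (void)sink;
        }
        return 0;
    }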
FYI
If you are interested in competent, intelligent, technical reviews of hardware, you might like
www.aceshardware.com
Re:Why? (Score:3, Interesting)
Re:Why? (Score:2)
Where did you read this?
Also, even with the hardware bottlenecks the Anonymous Coward mentions?
You ask yourself "why not?"... I can only ask myself "how?"
Re:Why? (Score:4, Informative)
Re:Why? (Score:5, Interesting)
The fact that the article never mentions register renaming shows the author never did any research into this topic before writing.
Re:Why? (Score:3, Interesting)
However, if we're going to revise that architecture, I say we add MMX and call it a 486. Then, we can add SSE and call it a Pentium.. And then,
Oh, wait. nevermind.
Re:Why? (Score:5, Informative)
Register renaming already does what's being proposed here, but transparently. In fact, most of the instruction reordering done by a good optimizing compiler (and later by the out-of-order dispatch unit) aims to increase parallelism in register usage.
Of course, RISC processors are so much nicer to work with because of their large, flat register files (at least 16 or 32 registers, all of them equally usable), but that's not possible with the existing x86 architecture.
P4 processors have 128 registers available for register renaming. Using all of them is not so easy, so Hyperthreading (still only on the Xeon) tries to bring two different processes into the instruction mix, keeping their renaming maps separate, so the dispatch unit has more non-colliding instructions ready for execution. This won't make one CPU as fast as two, but it does keep that insanely deep pipeline from getting filled with bubbles (or would that be 'empty of instructions'?)
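As a toy model of what that renamer does (a sketch under simplified assumptions; real hardware frees old mappings at retirement, tracks flags, etc.), each architectural write just grabs a fresh physical register, so two back-to-back writers of "EAX" stop serializing on one location:

    #include <stdio.h>

    #define NARCH 8     /* the eight nameable x86 GPRs */
    #define NPHYS 128   /* roughly the P4's rename pool, per the parent */

    static int map[NARCH];        /* architectural -> physical */
    static int free_list[NPHYS];  /* stack of unused physical registers */
    static int top;

    static void init(void) {
        for (int a = 0; a < NARCH; a++) map[a] = a;
        for (int p = NPHYS - 1; p >= NARCH; p--) free_list[top++] = p;
    }

    /* Rename a write to architectural register a: allocate a new
       physical register so older in-flight readers keep the old one. */
    static int rename_write(int a) {
        map[a] = free_list[--top];
        return map[a];
    }

    int main(void) {
        init();
        printf("mov eax, 1 -> p%d\n", rename_write(0));
        printf("mov eax, 2 -> p%d\n", rename_write(0)); /* independent! */
        return 0;
    }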
Re:Why? (Score:3, Informative)
Well, not exactly. Renaming takes care of the case where two things write, say, EAX, by allowing both to proceed with different physical registers. I.e., you don't have to stall just because you only have one architected "EAX" register.
However, your program is still limited to only 8 visible values at any instant. So when you need a 9th -thing- to keep around, you have to spill some registers onto the stack. Register renaming doesn't solve this problem.
His idea would, but it's still a stupid idea.
Re:Why? (Score:5, Informative)
Although I would like to take this opportunity to point out that AMD's x86-64 (Opteron) architecture increases the number of GP and XMM (used for SSE instructions) registers to 16 each.
When programmers try to be architects... (Score:5, Informative)
Here's why his idea sucks:
1) Register renaming dependent on the RMC. You can't issue any instructions if there is a POPRMC in the machine until the POPRMC finishes execution. He calls it "a few cycles", but it's much worse than that. You've prevented any new instructions from entering the window until the stack access is done, preventing any work that -could- have been done in parallel from even being seen. Function call/return overhead is a big deal, and he just kicked it up a notch.
2) His whole problem #3 -- that you can't explicitly access the upper 16 bits of a 32-bit GPR. All I can say is -- thank God! Being a programmer, he probably doesn't realize that being able to address sub-registers is actually a big problem with x86. The whole sub-register-addressing problem causes all kinds of extra dependencies and merge operations. And he wants to make it worse? I think he should be slapped for this idea. x86-64 had the right idea -- you cannot access -just- the upper 32 bits of a GPR, and when you execute a 32-bit instruction that writes a GPR, the upper 32-bits are not preserved. Which is how the RISCy folks have been doing it all along, but hey.
3) This idea requires an extra clock cycle in the front-end, to do the translation from architected to the expanded architected register space, prior to being able to do the architected->physical register translation.
4) Because you still can't address more than 8 registers at a time, you'll be using lots of MOVRMC instructions in order to make the registers you need visible. Ignore how horrible this would make things for people writing assembly ("Okay, so now EAX means GPR 13?") or compilers; this is going to result in a lot of code bloat.
5) Because of 1) and 4), modern superscalar decoders are going to be shot. If you fetch a MOVRMC, followed by POP EAX and POP EBX, you can't decode the second two until -after- you've decoded the MOVRMC and written its values into the map.
Now all this is so that you can save on loads/stores to the stack. Which is great, but at least when those loads and stores are executing, independent instructions can still go. Every RMC-related stall is exactly that -- no following instruction can make progress.
Not that increasing the number of registers in x86 isn't a good idea -- it's just his implementation that sucks. With him being an x86 programmer, I'm surprised he didn't think of the most obvious solution -- just add a prefix byte to extend the size of the register identifiers in the ModR/M and SIB bytes. You get access to ALL GPRs at once (rather than a 8-register window), no extra stalls are required, and your code size only goes up by one byte for instructions that use the extra registers.
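For flavor, here is a sketch of what such a prefix could look like at decode time. The layout below is hypothetical, though it is modeled on the REX prefix AMD actually specified for x86-64: one spare bit each for the ModR/M reg and r/m fields, doubling the nameable registers to 16.

    #include <stdio.h>
    #include <stdint.h>

    typedef struct { int reg, rm; } Operands;

    /* Decode "[prefix] opcode modrm" for a register-to-register form.
       A 0x4X prefix byte donates one extra high bit to each register
       field; without it, decoding is exactly classic 8-register x86. */
    static Operands decode(const uint8_t *p) {
        int ext_reg = 0, ext_rm = 0;
        if ((*p & 0xF0) == 0x40) {    /* optional prefix present? */
            ext_reg = (*p >> 2) & 1;
            ext_rm  = *p & 1;
            p++;
        }
        p++;                          /* skip the opcode byte */
        Operands o;
        o.reg = (ext_reg << 3) | ((*p >> 3) & 7);  /* now 0..15 */
        o.rm  = (ext_rm  << 3) | (*p & 7);
        return o;
    }

    int main(void) {
        const uint8_t insn[] = { 0x45, 0x8B, 0xC8 };  /* prefixed mov reg,reg */
        Operands o = decode(insn);
        printf("reg=%d rm=%d\n", o.reg, o.rm);        /* reg=9 rm=8 */
        return 0;
    }

Old binaries never carry the prefix, so they decode exactly as before; only recompiled code pays the one extra byte.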
I can't help but commend him on his idea being well thought out. To the best of his knowledge, he tried to address all the issues. But that's the problem -- he's a programmer, not a computer architect.
Re:Why? (Score:5, Insightful)
It does not matter how fast your CPU is if it spends a significant amount of its time waiting for main memory access. All that happens is that it's doing more NOPs/sec, which isn't terribly useful. That's why industrial-grade systems have fancy buses like the GigaPlane.
More than 3 answers !FREE! (Score:4, Insightful)
1 - Zero Cost. 2 - Backwards Compatible. 3 - Orders of magnitude.
1 - You have to buy new chips - this will improve the speed of "computing" but it will not increase the speed of THIS computer I have right HERE.
2 - No old code has RM/RMC instructions in it and will NOT run any faster than it already does in a "standard" x86 mode. Yes it is backwards compat. but by the same token so is MMX, EMMX, 3DNOW!, SSE, SSEII, AA64 etc....
3 - Anyone who can sell me a program to "suddenly" make all my code go 10x or 100x faster is guaranteed to give me a good chuckle!!!!!!!
As for the article... well, you've hugely increased the number of bits it takes to address a register, and swapping the RM register is going to cause all sorts of new dependency chains inside the chip.
Personally.... I'd go for a stack machine. Easily the most efficient compute engine.
Now - if we could get back to point number 1 and point number 3: if YOU can make MY computer go 10 or 100 times FASTER with SOFTWARE, I promise I WILL give YOU some MONEY....
Re:Why? (Score:3, Insightful)
What Intel is currently doing is putting a turbo on an old and obsolete architecture.
By having more GP registers, you could do the same job more easily and with better performance (and code that's easier to read, if you write ASM).
As it is now, you need too many memory accesses for simple operations.
With more registers, you would need less clock speed.
It's not all about MHz's.
Re:Why? (Score:5, Insightful)
Have you ever looked at the function entry and exit code for processors like MIPS or PowerPC? There can easily be 20-40 instructions (at 4 bytes per instruction) to save and restore registers. Sometimes fewer registers is a win.
Re:Why? (Score:5, Interesting)
At best, the problem you describe indicates those architectures use too many callee-save registers in their calling conventions. Having more caller-save registers is a pure win from this perspective.
Re:Why? (Score:2, Informative)
We must not forget that most operations a processor does are data movements, not calculations.
All three x86 problems described by the article's author are fixed in the IA-64 architecture, but not in AMD's x86-64.
Re:Why? (Score:3, Interesting)
And you don't understand how modern processors really work. First, they have several levels of cache memory exactly because you don't want to go out to main memory too often. Second, they have many more registers than the assembly instructions can see: register renaming, speculative execution, and all those tricks reorganize instructions so that the CPU core doesn't really have to move data back and forth between memory and registers so often.
IANACPUD (I am not a CPU designer), so I'm not going to try to describe this stuff further, but articles abound. Here are some: Into the K7, Part One [arstechnica.com] and Into the K7, Part Two [arstechnica.com]
Re:Why? (Score:5, Insightful)
No. Because the whole point is preventing memory access. High bandwidth busses are very expensive. If you have a lot of registers, you can avoid memory accesses, making instructions run at full speed.
The best way to reduce the impact of a bottleneck is not making the bottleneck wider. It is making sure your data doesn't need to travel through the bottleneck.
After that, it doesn't hurt to make the bus bandwidth bigger.
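A small, self-contained illustration of that point (mine, not the parent's): keeping the running value in a local variable, which the compiler can hold in a register, removes the store per iteration that the pointer version is forced to keep, since *acc might alias the array.

    #include <stdio.h>

    /* The accumulator lives in memory; because *acc could alias v[],
       a conforming compiler must re-load and re-store it every pass. */
    int sum_via_memory(int *v, int n, int *acc) {
        *acc = 0;
        for (int i = 0; i < n; i++)
            *acc += v[i];
        return *acc;
    }

    /* The accumulator is a local, so it can live in a register for the
       whole loop; memory is touched only to read v[]. */
    int sum_via_register(int *v, int n) {
        int acc = 0;
        for (int i = 0; i < n; i++)
            acc += v[i];
        return acc;
    }

    int main(void) {
        int v[] = {1, 2, 3, 4}, m;
        printf("%d %d\n", sum_via_memory(v, 4, &m), sum_via_register(v, 4));
        return 0;
    }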
DivX! Sweet! (Score:4, Funny)
Re:DivX! Sweet! (Score:5, Funny)
LIE Launch IE
LMW Launch MS Word
LME Launch MS Excel
LMO Launch MS Outlook
LMOV Launch MS Outlook Virus
LCNR Launch Clippy for No Reason
DPRN Display Pr0n
SPOP Show IE Popup
SPU Spam User
SHDR Send Hard Drive Contents to Redmond
RBT Reboot
SBS Show Blue Screen
Re:DivX! Sweet! (Score:3, Funny)
Argh, get this CISC rubbish out of my sight!
Real people used stuff like jmp $fce2 for the first, but the latter was a little bit more complex because of the blue part: lda #$06 ; sta $d020 ; sta $d021 ; hlt (of course, hlt is an undocumented opcode, and since C64 boots in less than a second from ROM, it hardly is as frustrating as the bluescreen in Windows).
=)
Re:DivX! Sweet! (Score:5, Funny)
LPS - Launch Photoshop
DGB - Do Gaussian Blur
ES - Encode Sorenson
DS - Decode Sorenson
CSAWEF - Create Switch Ad With Ellen Feiss
And my personal favorite:
BICPUWPBIGBASE - Beat Intel CPU With Proprietary Benchmark Involving Gaussian Blurs And Sorenson Encoding
[insert witty comment here] (Score:2, Funny)
Re:An attempt to insert the witty comment! (Score:2)
It's called overclocking.
Woo-hoo, I know what I'm doing Saturday! (Score:3, Funny)
Re:Woo-hoo, I know what I'm doing Saturday! (Score:2, Insightful)
Simple solution (Score:5, Informative)
The compiler does for you exactly what the article says: it uses MMX and other extensions, vectorizes loops, and does interprocedural optimization and OpenMP.
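For reference, the kind of loop such a compiler can turn into packed SSE code on its own looks like the sketch below; the exact command-line switches vary by compiler and version, so I won't guess at them here.

    #include <stdio.h>

    /* A textbook vectorization candidate: independent iterations, unit
       stride, no aliasing surprises if x and y are distinct arrays. An
       SSE-capable compiler can process four floats per instruction. */
    void saxpy(float a, const float *x, float *y, int n) {
        for (int i = 0; i < n; i++)
            y[i] = a * x[i] + y[i];
    }

    int main(void) {
        float x[8], y[8];
        for (int i = 0; i < 8; i++) { x[i] = (float)i; y[i] = 1.0f; }
        saxpy(2.0f, x, y, 8);
        printf("y[7] = %g\n", y[7]);  /* 2*7 + 1 = 15 */
        return 0;
    }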
Re:Simple solution (Score:2)
It doesn't come as a surprise; Intel had already thought of this before going into simulating dual processors via HyperThreading, NetBurst, and several other advanced techniques to improve performance.
Silicon complexity (Score:4, Informative)
'Boy this instruction set would be better at the same clock-speed'
'All they'd have to do is update their verilog code and run it thru synthesis'
Well, they don't make processors straight from Verilog code; they'd be huge, hot, and slow. All this increased complexity he wants would dictate more transistors and a bigger die (along with lots of development time).
With the above said, it still might be a good idea; I don't know.
Re:Silicon complexity (Score:3, Interesting)
Um... yeah.. right (Score:4, Interesting)
From the software side of things, this sounds great... I just wonder how much it would slow down the hardware side.
Of course I'm no chip designer, but neither was this guy.
Re:Um... yeah.. right (Score:4, Informative)
The question is whether his register mapping can play well with normal register renaming. I think it would be trouble. Currently there are only two layers: the physical registers the processor sees, and the virtual ones that it exports to the programmer. If this got added, there would be the "real" physical registers the processor maps out, the virtual physical registers the programmer maps out, and the virtual virtual registers that are actually used in normal code.
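In other words, every operand name would take two table hops before reaching silicon. A toy sketch of that layering (all values below are hypothetical):

    #include <stdio.h>

    static int rm_map[8];       /* the proposed RM/RMC window: name -> slot */
    static int rename_map[32];  /* today's renamer: slot -> physical reg   */

    int main(void) {
        for (int i = 0; i < 8; i++)  rm_map[i] = i;
        for (int i = 0; i < 32; i++) rename_map[i] = 100 + i; /* fake ids */
        rm_map[0] = 12;              /* program remaps "EAX" to slot 12 */

        int arch = 0;                        /* "EAX" as written in code */
        int slot = rm_map[arch];             /* hop 1: proposed mapping  */
        int phys = rename_map[slot];         /* hop 2: existing renaming */
        printf("EAX -> slot %d -> p%d\n", slot, phys);
        return 0;
    }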
Register Renaming (Score:5, Interesting)
Re:Register Renaming (Score:5, Informative)
What this guy wants is a way to have user-level control over the register aliases, and it might not be a bad idea, but I don't think he'll see as much gain from it as he expects, since there is lots of magic going on behind the scenes already with register aliasing -- I'm guessing that if he just had separate processes both using registers intelligently, he could get as much done as he could with a single process and more registers. Since the cost of a context switch is already alleviated, there wouldn't be much overhead. The only overhead would be in the parallelizable algorithms themselves. However, we should already be wanting to do that work to take advantage of SMP...
Re:Register Renaming (Score:4, Funny)
Patent! (Score:2, Funny)
I also noticed the ad on geek.com was for job-geek
Maybe "they" will
Speed speed speed (Score:5, Insightful)
As is usual, our priorities are messed up.
Re:Speed speed speed (Score:2, Insightful)
Sounds like an interesting idea... (Score:2, Interesting)
Admittedly, even if Intel and AMD decided to implement this, it'd still be a while, and then we'd have to get compilers that compile for those extra instructions, and there's our entire now-legacy code base that doesn't make use of them, and don't forget those ever-present patent issues...
But yeah. Cool idea, well thought out. Petition for Intel, anyone?
Mark Erikson
Re:Sounds like an interesting idea... (Score:2, Funny)
"Hey I'm currently in my 2nd year at college, but what the heck I think I'm qualified to commment here"
"I think Intel need to employ this guy, I mean they must have overlooked this"
"Cool - I wonder if I could think of something like this"
Don't worry - you will.
RISC (Score:5, Interesting)
As most code today is written in higher-level languages (C/C++, Java, etc.), all it takes is a recompile and perhaps some patching and adaptation to small peculiarities. The Linux kernel is proof of this concept: a highly complex piece of code, portable to several platforms, with a huge part of the code fully portable and shareable. This means that it is not hard to change architecture!
If the main competition and its money would move from x86 to a RISC architecture (why not Alpha, MIPS, SPARC, or PPC?), I'm sure the gap in performance per penny would go away pretty soon. RISCs have several advantages, but the biggest (IMHO) is simplicity: no awkward rules (non-GP registers), no special-case instructions, easy to pipeline, easy to understand, and easy to optimize code for (since the instruction set is smaller).
And to return to the original article. Please do not introduce more complexity. What we need is simple, beautiful designs, those are the ones that one can make go *really* fast.
Re:RISC (Score:2, Insightful)
Re:RISC (Score:3, Insightful)
As you must add complexity, I do not think it would be "quick and easy". It takes huge resources in both time and equipment to verify the timing of a new chip, so these kinds of changes (fundamental changes to the way registers are accessed) are expensive and hard, since you also need to implement many new hardware solutions and verify the functionality (not only the timing!)
Re:RISC (Score:3, Informative)
JOhn
Re:RISC (Score:2)
Re:RISC (Score:2)
Re:RISC (Score:3, Interesting)
And both Intel and AMD spend much more on (x86-) processor development than IBM and Motorola and Sun and all others on their chips.
And no, x86 is not much faster. Not even at SPEC, which does not tell the whole picture.
As for AMD being faster, they basically had a stroke of luck with the Athlon design. Before that, AMD wasn't known for speedy processors (cheap, yes). And if it hadn't been for the Athlon, Intel's x86 also wouldn't be that far ahead (or ahead at all, actually); the Itanium II would be the contender to the big RISCs, and the fastest Pentium 4 would be at 2 GHz (if that much) and would cost $1000.
Amen, brother (Score:3, Insightful)
Now if I could only get my compiler to stop moving items from gpr to gpr with a RLINM that has a rotate of 0 and an AND mask of all 0xFFFFFFFF's!
Re:RISC (Score:2, Interesting)
Re:RISC (Score:2)
Re:Switching Architectures (Score:5, Insightful)
The problem is, in order to recompile you first need: a) the original source, and b) someone capable of patching, etc.
A lot of internal apps are in use for which the source code is lost. And a lot of code in use today (sadly) was not written in languages as portable as C, C++, and Java. A lot of apps in use today were written in Clipper and COBOL and a bunch of other languages that may not have decent compilers for other platforms. So recompiling it isn't an option. A complete re-write is necessary.
Even for situations in which application source *does* exist, and suitable compilers exist on other architectures, it is more often than not poorly documented...and the original author(s) is/are nowhere to be found. So in order to patch/fix the source to run on the new architecture, you not only need someone well versed in both the old and the new architectures, but someone who can read through what is often spaghetti code, understand it and make appropriate changes.
In a lot of these cases it's easier to stick with the current architecture. And that, to some degree, is why the x86 architecture has gotten as complex as it is.
Re:RISC (Score:5, Interesting)
Switching architectures is not that trivial. You seem to think that every company has the source code available for every piece of software they run. That isn't true. You seem to think that programs can easily be compiled across platforms if written in C/C++ - also untrue. You think that the bug fixes for compiling between platforms are "small peculiarities" -- well, they may be small, but that doesn't make them easy. In fact, it makes it fucking hard, because the differences are so buried in libraries, case-specific, and undocumented that it's a nightmare to find them. Yes, I've done this kind of thing. It's godawful.
Changing architecture is difficult. This is not a closed vendor market - anyone can put together an x86 box, and you have at least 3 different CPU vendors to choose from, 3-5 motherboard chipsets, and a virtually infinite variety of other hardware. If Dell suddenly decides to move to a PPC architecture, what's going to happen? They're going to lose all their customers, and fast. Because the very limited benefits of a different architecture do not make up for the costs of moving to one.
Yes, I said limited benefits. Yeah, when I was in college taking CompE, EE, and CS courses on CPU and system design, I also found the x86 ISA to be the most demonic thing this side of Hell. Well, I'm older and wiser now, and while x86 isn't perfect, it's not that bad either. Its price/performance ratio is utterly insane and getting better yearly. Contrary to the RISC doom-and-gloomers, x86 didn't die under its own backward compatibility problems. It's actually grown far more than anyone expected and is now eating those same manufacturers for lunch.
You know, back in the early 90s when RISC was first starting to make noise the jibe was that Intel's x86 architecture was so bad because it couldn't ramp up in clock speeds. Intel was sitting at 66 MHz for their fastest chip while MIPS, Sparc, etc. were doing 300 MHz. Of course, now Intel has the MHz crown, with demonstrations exceeding 4 GHz, and the RISC crowd is saying that MHz isn't everything and they do more work/cycle than Intel (which is true, but the point remains).
All that said, go look at the SPEC CInt2000 [specbench.org] and FP2000 results [specbench.org]. Would you care to state what system has the highest integer performance? And whose chip has the highest floating point?
Oh, and let's not forget that I can buy roughly 50 server-class x86 systems for the price of one mid-level Sun/IBM/HP/etc. server.
Note - server performance isn't all about CPU, but since the OP wanted to make that argument, I just thought I'd point out how wrong he is. There is still quite a bit of need for high end servers with improved bus and memory architectures, but don't even try to argue that the CPU is more powerful. It isn't.
Re:modular chips (Score:5, Informative)
Consider that while you can buy a P4 that runs at 2.8 GHz internally (and the fast ALUs run at 5.6 GHz, although they're only 16 bits wide), the memory bus is a lackluster 133 MHz (from which you get an effective 533 MHz, because it's quad-pumped - you read 4 values every clock instead of just 1). The I/O bus also runs at 133 MHz. These are the only two external buses the CPU deals with.
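(Worked out, assuming the P4's 64-bit front-side bus: 133 MHz x 4 transfers per clock x 8 bytes per transfer comes to roughly 4.3 GB/s of peak memory bandwidth, feeding a core clocked about 21 times faster than the bus.)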
If you were to try and segment the CPU similarly you'd quickly hit limitations. You simply can't run a multi-GHz electrical signal over a physical disconnect, at least not with current technology.
All of that said, if you look at how CPU cores are laid out the cache is distinctly segmented from the ALU, the ALU is segmented from the FPU, and so forth. It makes chip design easier since if you want to make a change to one part of the chip you minimize effects on other parts. It also helps for signal routing and noise prevention.
Also you can do more or less what you're asking - just not at high speeds. Modern chips are often preliminarily tested using gate arrays that can be reprogrammed quickly and easily... but instead of running at 3 GHz this test chip runs at 2 MHz. Maybe.
Oh... a final bit... back in the days of the 386 and 486, the 2nd-level cache was actually on the motherboard, and different MB vendors would put on different amounts of cache. Some even had it socketed or solderable so you could add more if you wanted! But by the time the P2 came out, clock speeds were too high for this; the connection latency and distance were simply too great. So we wound up with slot processors, where a CPU slot card had the CPU core and 1-4 second-level caches on it. Pretty soon both Intel and AMD integrated the 2nd-level cache onto the CPU itself (which wasn't previously possible because it would have made the chips far too big), which further improved speed. The next generation of CPUs is requiring 3rd-level cache on the motherboard. How long before that gets integrated onto the CPU?
Re:RISC (Score:2, Insightful)
Re:RISC (Score:5, Informative)
I myself am an old x86 Assembly hacker.
When I started looking at the ARM chips I wondered why we ever used x86's etc.
RISC / CISC is really a misnomer.
RISC has plenty of instructions, and it's meant to be super-scalar.
It starts with register gymnastics. Basically, with RISC, there's no more of it. Every register is general. It can be data, or it can be an address. All the basic math functions can operate on any register.
With Intel x86, everything has its place.
Extend it further out. There's something called "conditional instructions". Properly utilized, these make for an ultra-efficient code cache. The processor is able to dump instructions from the code cache ahead of time, which also means not as much unnecessary "pipeline preparation" to perform an instruction.
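The textbook example of those conditional instructions is Euclid's GCD (my sketch below, with the classic ARM loop in the comment): the whole if/else body becomes two predicated subtractions, and the only branch left is the loop itself.

    #include <stdio.h>

    /* On ARM the loop body compiles to predicated instructions:
           gcd: CMP   r0, r1
                SUBGT r0, r0, r1   ; a -= b, only if a > b
                SUBLT r1, r1, r0   ; b -= a, only if a < b
                BNE   gcd
       No taken/not-taken branch inside the body to mispredict. */
    int gcd(int a, int b) {
        while (a != b) {
            if (a > b) a -= b;
            else       b -= a;
        }
        return a;
    }

    int main(void) {
        printf("gcd(48, 36) = %d\n", gcd(48, 36));  /* prints 12 */
        return 0;
    }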
Then there's THUMB, which compresses instructions so that they take up less physical space in a 64- or 128-bit world. There are lots of wasted bits in an (.exe) compiled for a 386.
Last I checked, 32-bit ARM THUMB processors are dirt freaken cheap, and they're manufactured by a consortium of a multitude of vendors, as opposed to just AMD and INTC.
The Internet is slowly wearing down the x86 as more and more processing moves back onto the server, where big-iron-style RISC can churn through everything.
The article should really just be called:
"An Acedemic Exercise in Register Gymnastics"
Re:RISC (Score:5, Informative)
There is no way to directly access the core ISA, nor do I know of it being documented anywhere. Intel planned to move the industry off the x86 ISA to Itanium, but so far that's utterly failed and with the Intergraph lawsuit it may be dead in the water now.
AMD's x86-64 still uses the x86 ISA, but extends it. Additionally if you talk to the chip in 64 bit mode then 8 (I think) additional GP registers are available in silicon - not just register renaming, which occurs already in every major CPU on the market today. The additional registers (all 64-bit wide) pretty much eliminate the need for an architecture move, at least as it relates to registers. Intel hasn't yet adopted x86-64 though (although they can since AMD must license to them because of IP agreements).
Still, what's funny is this desire for a performance increase... the x86 chips are the fastest CPUs on the market for integer performance and in the top 5 for floating point - although Alpha still reigns supreme for FP I believe. But compare the price of an x86 chip to pretty much anyone else and you start wondering exactly what the performance issue is.
The performance problems are not with the CPU anymore. The bus and memory interfaces are slow. They've been getting faster over the years, but closed vendor boxes like Sun, HP, IBM, etc. will always do better because they don't have to deal with getting a half dozen different major OEMs on board, along with countless peripheral manufacturers. Nor do they have to concern themselves overly with backwards compatibility.
Re:RISC (Score:3, Interesting)
himi
Um, how is this anything new? (Score:4, Informative)
(On MMX machines, the wider 64-bit MMX registers are used for memcpy() rather than the 32-bit standard integer registers)
This has been in the kernel for a few years now and anything that uses memcpy() benefits from it. Move along now.
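The same idea in portable C, for anyone curious (a sketch only; the kernel's real version uses movq, deals with alignment, and saves/restores FPU state): move eight bytes per iteration through a wider temporary instead of one.

    #include <stdio.h>
    #include <stdint.h>
    #include <string.h>

    /* Copy 8 bytes at a time via a 64-bit temporary, then the tail.
       The fixed-size memcpy calls compile down to single wide moves. */
    static void copy64(void *dst, const void *src, size_t n) {
        char *d = dst;
        const char *s = src;
        size_t words = n / 8;
        for (size_t i = 0; i < words; i++) {
            uint64_t w;
            memcpy(&w, s + i * 8, 8);   /* one wide load  */
            memcpy(d + i * 8, &w, 8);   /* one wide store */
        }
        memcpy(d + words * 8, s + words * 8, n % 8);  /* leftover bytes */
    }

    int main(void) {
        char out[64];
        const char msg[] = "wider registers move more per instruction";
        copy64(out, msg, sizeof msg);
        puts(out);
        return 0;
    }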
Another Hideous Hack for IA32 (Score:5, Informative)
Even the little microcontroller chips that I can buy for $2 have 32 general purpose registers (Atmel AVRs, for anyone who cares).
Worse, this scheme would not benefit existing code - it still requires code changes to work.
Finally, on the gripping hand, the Pentium III and 4 have a very similar register renaming scheme going on automatically in the hardware. The 8 "logical" registers are already mapped dynamically into a much larger physical register file. (From ExtremeTech: http://www.extremetech.com/article2/0,3973,471327
Mmmm, Assembler... (Score:5, Funny)
When asked why, we were tempted to tell them that we used the undocumented 'unleash' instruction to unleash the raw power of the ARM processor.
The Problems of Obsolete design (Score:5, Interesting)
This sort of reminds me of what happened with IRQs. Ultimately Intel "solved" this via the PCI bus, but performance has occasionally been problematic. Of course, that problem goes back to the original IBM design for the original IBM PC. Intel is also very aware, I imagine, of what happened when IBM tried a total redesign with the Micro Channel bus: it got rejected, I think, primarily because it was proprietary. In any case, enough companies have been nailed on backward compatibility issues that Intel may be nervous about making a total break.
The upside is being able to run old software on new hardware. You don't want to break too many things.
Re:The Problems of Obsolete design (Score:3, Insightful)
Re:The Problems of Obsolete design (Score:5, Informative)
Anyone else remember the horrors of all those damn control files on floppies?
There are a lot of architectural nightmares in the PC design... and while some of them are at the CPU level (like the 6 GP registers), most of them are at the bus level. Who the hell puts the keyboard on the 2nd most important interrupt (IRQ1)? The entire bus is still borked, although PCI has mostly hidden that now. But the system and memory buses are the sole reason that IBM, HP, Sun, etc. have higher performance ratings than x86 -- the P4 and Athlon processors are faster in virtually every case on a CPU to CPU basis.
The bus and memory architecture is also why x86 does so incredibly bad in multi-CPU boxes. It's just not designed for it, the contention issues are hideous, and while you may only get 1.9x the performance going to a 2 CPU Sun box, you'll only get 1.7x on x86. It gets worse as you scale (note - those numbers are for reference only, I don't recall the exact relationships for dual CPU x86 boxes anymore, but the RISC systems handle it better due to bus design).
Really there's nothing wrong with the x86 processors except to the CompE/EE/CS student. I was there once and couldn't stand it. Real life has shown that it isn't that bad, and recent times have shown that it's actually really damn good. Except for the buses. They suck. And while things like PCI-X and 3GIO are on the horizon, I don't see them seriously changing the core issues without causing massive compatibility problems.
Full Circle ... (Score:3, Insightful)
I wonder if the core of an MCISC will be RISC, or CISC that itself has a RISC core.
Does anyone else have flashbacks to (Score:4, Interesting)
Well, not quite, but it has the same flavor.
After working in x86 assembly, I really appreciated high level and minimally complex languages like C.
Technical point of view (Score:4, Interesting)
These two additional mapping registers would complicate pipeline hazard detection exponentially.
Another point is that I don't think that by doubling or tripling the number of registers available you will get a tenfold performance increase: a small increase could be expected, but not much.
Another problem is the SpecialCount counter: it would complicate compilers too much. It would also make instruction reordering almost impossible.
I suspect this would be a rather expensive chip (Score:5, Interesting)
Anyone who's ever tried to use the MMX or XMM registers for non-multimedia applications knows what I'm talking about. The instruction sets for them are nicely tweaked to let you do "sloppy" parallel operations on large blocks of data, and not really suited to general computing. You can't move data into them the way you would like to. You can't perform the operations you would like to. You can't extract data from them the way you would like to. They were meant to be good at one thing, and they are.
I once tried to use the multimedia registers to speed up my implementation of a cryptographic hash function whose evaluation required more intermediate data than could nicely fit in GP registers, and had enough parallelism that I thought it might benefit from the multimedia instructions. No such luck. The effort involved in packing and unpacking the multimedia registers undid any gains in actually performing the computation faster -- and the computation itself wasn't that much faster. I was using an Athlon at the time, and AMD has so optimized the function of the GP registers and ALU that most common GP operations execute in a single clock if they don't have to access memory, while all the multimedia instructions (including the multiple move instructions to load the registers) require at least 3 clocks apiece.
Now this leads me to suspect that the multimedia registers have limited functionality and slow response for a single reason: economics. The lack of instructions useful for non-multimedia applications could be explained via history, but what chip manufacturer wouldn't want to boast of the superior speed of their multimedia instructions? And yet they remain slower than the GP part of the chip.
So I conclude that merely making the MMX/XMM units faster is prohibitively expensive in today's market. And this proposal would definitely require that, even if actually adding the additional wiring to support GP instructions for these registers were feasible. Because what would be the point of using these registers for GP instructions if they executed more slowly than the same instructions on GP registers?
More registers are not enough. (Score:4, Informative)
Besides, there are already more efficient (albeit complex) solutions for extending registers that make much more sense in the current world of pipelined processors. Register renaming [pcguide.com] is one such example.
Revolutionizing?? (Score:3, Interesting)
When you have your highly optimized C++ code or whatever, *then* you can get down to low level and start polishing whatever routine/loop is the bottleneck. The compilers of today also usually do a better job than humans at optimizing performance at this level and ordering the instructions in an optimal way, especially if you consider the development-time costs of doing it by hand. It's a myth that manually written assembly code is generally faster -- many modern compilers are real optimizing beasts.
Anyway, I think one should always keep in mind that C++ code will gain the greatest benefit from being well optimized at the C++ level, not from new assembly-level instructions, regardless of whether they unlock SSE registers for more general purposes or whatever. Oh, yeah, more registers won't revolutionize performance either. If they did, Intel and AMD would already have fixed that problem; I'm sure they're more than capable of doing it. More registers increase the amount of logic quite dramatically, and I'm pretty sure it doesn't give good enough performance gains for the increased die cost, compared to increasing L2 cache size, improving branch prediction, etc.
Question about register aliasing (Score:2)
It's the Chipset That Wouldn't Die! (Score:2)
Cool idea (Score:2, Funny)
I'll call my company transmeta!
Or in the words of that new dell commercial
"Sure we'll call it 1-800 they already do that!".
Tom
Wow... Maybe I am more L33T than I thought I was? (Score:2)
Jack William Bell
What about Interrupt Handlers? (Score:2, Interesting)
Fortunately, that issue is addressed in his Message Parlor [geek.com]. The full text of his response to BritGeek follows:
He may be onto something after all...
x86 Emulator? (Score:2, Interesting)
-Shadow
The tricky part: (Score:2)
If this is really achievable without wasting *any* extra CPU time (that waste would apply to *all* instructions the CPU goes through!), this is indeed a good stunt that could add a substantial oomph to x86 performance with the code we have today.
Thank god, 'cuz my Athlon is too hot already and I'm kinda skeptical about watercooling.
Then again, that's a big "if".
Great... (Score:2, Interesting)
A segment architecture for memory wasn't nasty enough, now we want to have a segment register for the registers?
Thanks, no.
Comment removed (Score:3, Insightful)
Why should one do that? (Score:4, Informative)
But this change would:
Make an internal interface explicitly controlled by the programmer/compiler, loading an enormous amount of work onto the compiler writers. (Just have a look at IA-64 - is there a good compiler out there yet? I haven't checked for a while.)
Destroy (or at least reduce) the efficiency of the internal register renaming unit, thus slowing down the out-of-order execution core and such (the entire core, actually...). Sorry, but this man may have been programming x86 assembly his entire life (and for that he deserves my respect), but he is not up to date on how a modern x86 CPU works at its heart. When I heard the lectures at my university about how this stuff works, I gave up learning assembly -- one just doesn't need it anymore with the compilers around.
Reading the books by Hennessy and Patterson may help a lot.
Intel isn't interested in performance (Score:3, Insightful)
Look at the Pentium 4 design! Intel would much rather ship a dated CPU with a nice pretty GHz rating than keep the same MHz and improve the architecture.
Do you really think investors give a shit about registers?
--Marketing 101
More trouble than it's worth... (Score:4, Insightful)
Actually, this is just one of many potential downfalls. He forgot interrupts, mode switching (going from protected to real mode, as some OSes still do), and I/O, all of which would require that the proposed RM/RMC register be loaded from the stack. The net effect would be that if his scheme were implemented, existing programs would run slower, not faster. Furthermore, placing the RM/RMC register on the stack is an impossibility without breaking backward compatibility; many assembly language coders depend on a set number of bytes being added to the stack when they perform a call or interrupt.
Why not just add 24 GP registers to the existing processor? Honestly, it would be a lot simpler, and would not complicate the whole x86 mess, nor break backward compatibility.
I don't mean to flame, but this guy is way off base. The biggest problem with the x86 instruction set is its lack of registers, and the second biggest is that its complexity is rapidly becoming unmanageable. Not even Intel recommends using assembly anymore - their recommendation is to write in C and let their compiler perform the optimizations. Adding more instructions like this would further diminish the viability of coding in assembly.
A far better solution would be to simply keep the existing instruction set intact, and add more GP registers. IBM got it right the first time - their mainframe processors have 16 general purpose registers which can be used for any operation - addressing, indexing, integer, and floating point calculations. If anything, Intel should stop adding instructions and start adding registers.
Re:More trouble than it's worth... (Score:3, Interesting)
Oops. Forgot about PUSHA/POPA. Kind of strange, too, because I use these a lot.
Also, about the opcode problem - adding registers doesn't necessarily mean adding opcodes. For example, IBM mainframes have one opcode for a load-register instruction, and the registers are specified in the instruction. Were IBM to double the number of registers, the opcode would not have to change (granted, the instruction would get longer, because they only allocated enough space in the source and destination fields to specify one of 16 registers). The problem is the way x86 opcodes work - they aren't as universal; that is, the opcode's first byte is a function of both the operation and the register used. So expansion would be pretty difficult, unless they expanded the instruction set to include two-byte opcodes (which they've already done, iirc) and used general-purpose opcodes for common operations such as loading and storing.
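To illustrate the contrast, here are a couple of toy encoders in C. The S/360 LR opcode of 0x18 and the x86 0x40+reg short form for INC are real; everything else is just scaffolding:

#include <stdint.h>
#include <stdio.h>

/* IBM S/360 "LR R1,R2" (load register): fixed opcode 0x18, then the
 * two register numbers packed into the next byte.  Doubling the
 * register count only needs wider register fields, not new opcodes. */
static void encode_s360_lr(unsigned r1, unsigned r2, uint8_t out[2])
{
    out[0] = 0x18;
    out[1] = (uint8_t)((r1 << 4) | (r2 & 0x0f));
}

/* x86 "INC r32" short form: the register is baked into the opcode
 * byte itself (0x40 + reg), so there are only eight such opcodes
 * and no spare encoding space left for more registers. */
static uint8_t encode_x86_inc(unsigned reg)
{
    return (uint8_t)(0x40 + (reg & 7));
}

int main(void)
{
    uint8_t lr[2];
    encode_s360_lr(3, 5, lr);
    printf("LR 3,5  -> %02x %02x\n", lr[0], lr[1]);   /* 18 35 */
    printf("INC ECX -> %02x\n", encode_x86_inc(1));   /* 41    */
    return 0;
}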
It's unfortunate, but true.
The real, and only, solution is for these companies to get their acts together, quit issuing refreshes of old hardware, and finally give us their next-gen chips to play with. Proposing anything else is just pointless. (Unless, of course, the new CPUs completely flop...)
Couldn't agree with you more. What I would really like to see is an x86 processor that could handle IBM mainframe instructions. The IBM mainframe instruction set makes a lot more sense than Intel's - unlike Intel, IBM realized that someday they might be doing 64-bit and 128-bit computing, and designed the instruction set to be expandable. Also, they don't have a lot of "garbage" instructions - no MMX, no SSE, no SIMD junk to clutter up a good design. To be honest, benchmarks that I've run on real-world software indicate that today's x86 processors complete 4 instructions for every 5 clock cycles, which suggests that branch prediction and deep pipelines aren't the performance enhancers that Intel and AMD seem to believe them to be. While they might work well in theory, real-world performance speaks otherwise. Given this, I don't see any practical reason for keeping a kludgy instruction set around, because the complexity of the instruction set has been a great hindrance to the actual, rather than the theoretical, optimization of x86 processors.
If it needs a recompile, what's the point? (Score:3, Interesting)
An intelligent comment on the subject (Score:4, Interesting)
I can speak with some authority on this subject, since I am presently taking a course on code optimization. What it looks like Mr. Hogdin is trying to do is work around the fact that people do not compile programs with processor-specific optimizations. He seems to be proposing to do so by allowing "paging" of registers amongst themselves, albeit in a bit of an odd fashion.
Personally, I am not too fond of this approach. First of all, operating systems would need to be written to support this paging. Secondly, running a single MMX- and/or SSE-enabled application (which would use most if not all of the mapped registers) would cause all the other applications on the system to suddenly lose any benefit that paging would provide.
The approach I would take (which may or may not be better) would be to change the software. Compilers like gcc 3.2 already know how to generate code with MMX and SSE instructions. Patches are available for Linux 2.4 that add gcc 3.2's new targets (-march=athlon-xp, etc.) to the Linux kernel configuration system. Libraries for *any* operating system, compiled for a specific processor or family of processors, would likely fare better than generic builds.
And yes, gcc 3.2 can do register remapping in a similar fashion (to make sure all the registers get used) on its own. If you read gcc's manual page, you will note that this makes debugging harder, though. Gcc even has an *experimental* mode where it will use the x87 and SSE floating point registers simultaneously.
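To make that concrete, here's the sort of thing I mean, with the relevant gcc 3.x options in the comments (I believe the experimental mixed-FP mode is -mfpmath=sse,387 and the remapping option is -frename-registers, but double-check your gcc's manual page):

/* Build lines for the options discussed above (gcc 3.x era):
 *
 *   gcc -O2 -march=athlon-xp dot.c     # tune for a specific CPU
 *   gcc -O2 -frename-registers dot.c   # use otherwise-idle registers
 *   gcc -O2 -mfpmath=sse,387 dot.c     # the experimental mode mixing
 *                                      # x87 and SSE floating point
 *
 * A loop like this is the kind of code that benefits: with a real
 * -march target, the compiler is free to pick SSE forms on its own. */
float dot(const float *a, const float *b, int n)
{
    float sum = 0.0f;
    int i;
    for (i = 0; i < n; i++)
        sum += a[i] * b[i];
    return sum;
}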
Mr. Hogdin's approach might be a bit better for inter-process paging by a task scheduler with low numbers of tasks. But as a beginner in this field, I'm not sure what else it would be good for.
Please pardon the omissions; I am not presently using a gcc 3.2 machine :)
Re:An intelligent comment on the subject (Score:3, Interesting)
I thought of a context switch (or possibly a function call) too. Correct me if I am wrong, but what you are trying to do is create a bunch of registers (my understanding being they would just be the existing x86+MMX+SSE registers, unnamed) and "map" them via another register that certain software knows how to access, correct? That way, when an application knows about these, it can "squirrel" data away in "hidden" registers for fast access later?
The primary problem I have with this "switching" of registers is that registers are supposed to be the fastest, most reliable memory components in a computer. By forcing a lookup table and its associated logic into the mix, you are potentially reducing a processor's speed and/or scalability significantly. Furthermore, the amount of data that can be hidden away inside a processor is limited. While hiding registers is nice, perhaps it would be better to have the ability to "latch" a row of data so it won't be cleared out of the L1 cache (no processor can do this at the moment?). I would think that this would be much easier to implement without speed degradation, as it would only require a few additional gates used during lookup/overwriting of the L1 cache (which ideally, for this case, is at least semi-associative, i.e. any memory "block" can map to at least two locations in the cache). A rough sketch of what I mean follows.
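This is purely hypothetical hardware, modeled in C just to show how little state the latch needs (all names and sizes invented):

#include <stdint.h>

/* A 2-way set-associative L1, with one extra "locked" bit per line. */
struct cache_line {
    uint32_t tag;
    int      valid;
    int      locked;   /* the proposed extra gate's worth of state */
};

#define SETS 64
#define WAYS 2
static struct cache_line l1[SETS][WAYS];

/* Pick a victim way on a miss: simply never evict a locked line.
 * (If every way in a set were locked, the hardware would have to
 * refuse the lock request up front, or bypass the cache here.) */
static int pick_victim(int set)
{
    int w;
    for (w = 0; w < WAYS; w++)
        if (!l1[set][w].locked)
            return w;
    return -1;   /* all ways locked: service the access uncached */
}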
Secondly, your proposal (as I understand it) would require all the registers to share the same area on a chip. Nowadays, the MMU, arithmetic/logic unit, etc., each have their own area on the chip. Shared/swapped registers would have to be in the center of the chip, with longer lines to each partial unit (yielding delays and capacitance). I believe you proposed doing this by subunits, though; this would reduce delays somewhat, but you are still requiring some centralization, and adding a significant delay.
My personal position on this still more or less stands: if a program's compiler knows how to make use of the MMX & SSE functions of a computer, it should be set up to do so. That way, after an initial context switch for the entire program, the program (being correctly configured for a processor) flies. A compiler with register renaming functionality ("gcc3.2 -frename-registers", for example) can help do this for apps where the programmer does not know assembler. And if your "minimum requirements" mention a Pentium II 500, don't compile for a 486!
In short, I fail to see how your proposal would speed up most applications significantly. Context switches are always expensive, but the ability to change contexts in 10 clocks versus 30 really isn't significant when your backside bus runs at less than 50% of the processor's speed.
Obviously, being a minor player, I have my views, and I have to respect yours (especially since I only had about 5-10 minutes to read your piece), but personally, I really do not see why program-accessible context switching inside a processor is needed.
Surely you jest! (Score:3, Funny)
Re:hooey (Score:2, Insightful)
Actually, would you point some things out? I like Slashdot because it can act as a bullshit filter. So when I read an article about x86 assembler technique at 8 frickin 30 in the morning, maybe someone's post will help me understand whether to bother to try to understand the article. Or my previous sentence.
Re:hooey (Score:5, Interesting)
Of course, my real complaint wasn't simply that his proposed "enhancement" to x86 was unsound, but that Slashdot actually took this seriously. The article is rambling, disjointed and generally incoherent. It doesn't deserve any serious consideration. The fact that Slashdot gave it the slightest credence simply implies that Slashdot doesn't deserve to be taken seriously either.
Re:hooey (Score:3, Insightful)
You have a low UID and you still don't get Slashdot? These kinds of articles are posted all the time, but then we get good comments like yours pointing out why they are bunk.
Consider it like geek peer review. And thanks for your comments.
Re:add core funcs libc/stdc++ to the CPU (Score:2)
good luck doing scientific calculations on a GeForce
OTOH
>"add a FPGA matrix of 4096x4096 transistors or >something on the side of the cpu for custom UBER fast routines"
^^^^ that idea has me intrigued, anyone who actually knows more about FPGA's than me (which isn't difficult) want to go into pluses/negs with that concept?
CRAM: advances in microprocessor arch (Score:3, Interesting)
http://www.ee.ualberta.ca/~elliott/cram/
is your ultimate parallel compute machine. It turns your entire memory (all the CRAM, anyway) into a register set. It is based on the concept that, rather than bringing the data to the CPU for computation, the CPU is brought to the memory.
Small computational units (AND/OR/adder) are included on the bit access lines for all the memory cells.
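A software caricature of the idea, for anyone who wants a mental model (the real thing puts the logic in the sense amps; the word width and operation here are just for illustration):

#include <stdint.h>
#include <stddef.h>

/* In CRAM, a tiny ALU sits on every memory column, and one "row
 * instruction" fires across all columns at once.  Here each uint64_t
 * stands in for 64 memory columns operated on in parallel. */
static void row_and(const uint64_t *a, const uint64_t *b,
                    uint64_t *out, size_t nwords)
{
    size_t i;
    for (i = 0; i < nwords; i++)
        out[i] = a[i] & b[i];   /* 64 "column ALUs" per word */
}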
Re:add core funcs libc/stdc++ to the CPU (Score:2, Interesting)
Re:add core funcs libc/stdc++ to the CPU (Score:2)
And if it were for something like a 20-hour 3D render, would it matter if the initial setup took a while?
COP and WDM on the 65c816 (Score:5, Interesting)
Well, that and the fact that the guy's name has (almost) the same initials [as the mnemonic he chose].
In the early 1980s, William D. Mensch designed the 65c816 (and its little brother 65c802) microprocessor to be compatible with software written for 65c02 processors. He added a couple new addressing modes for SP-relative addressing and a whole bunch of new instructions to account for the 16-bit registers in both processors and the 24-bit address bus in the 65c816. He also reserved instructions 'COP' (COProcessor access) and 'WDM' (WiDe Math) for use on a future 32-bit 65c832 processor. Of course, you can see an alternate expansion for WDM...
too much of a good thing = pie wagon (Score:3, Insightful)
In actual fact, the ugliness of the duckling was less of an impediment than advertised.
There are several consequences of large, flat register sets. First of all, if your register set greatly exceeds the number of in-flight instructions, you have a lot of extra transistors in your register set sitting there, on average, doing nothing. Well, not nothing: they are sitting there adding extra capacitance and leakage to your register file, increasing path length, cycle times, power dissipation, and routing complexity.
Second effect: large register sets increase average instruction length. Larger average instruction lengths translate into a larger L1 instruction cache to achieve the same hit ratio. PPC requires a 40% larger I-cache to achieve the same effectiveness as the x86 I-cache.
Third effect: context switches take longer. If you want to actually use all those registers, your process has to save and restore them on every context switch.
Finally, there is the register set mirage. Modern implementations of x86 have approximately 40 general purpose registers. Only you can't see most of them. Six of these can be named through the instruction set at any given time. The others are in-flight copies of values previously assigned the same names. This all happens transparently within an OOO processor model.
If x86 only had six GP registers in practice, it really would have died ten years ago. What it actually has is six GP registers you can name at any one time, which means only six GP registers you have to load and store on context switches, etc.
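In C terms, the only thing that has to cross a context switch is the architectural state; the physical pool behind it stays put. A rough sketch (layouts invented, sizes illustrative):

#include <stdint.h>

struct x86_context {            /* what the OS must save/restore */
    uint32_t eax, ebx, ecx, edx;
    uint32_t esi, edi, ebp, esp;
    uint32_t eip, eflags;
};                              /* 10 words, no matter how many
                                   physical registers sit behind them */

struct flat_riscy_context {     /* a 32-register flat file, by contrast,
                                   is all architectural: every register
                                   gets saved on every switch */
    uint32_t r[32];
};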
What did die ten years ago was the notion that convenience to the human assembly language programmer was worth a hill of beans. Good architectures are convenient to the silicon and the compiler.
Other aspects of x86 have proved more serious than the shortage of namable GP registers. Too many instructions change the flags register, affecting too many patterns of flag bits. That's hell for an OOO processor to patch back together. The floating-point stack was an abomination. Lack of a three-operand instruction format is another significant liability.
On the other hand, the ill-reputed RMW (read/modify/write) instruction mode is 90% of the reason the Athlon performs as well as it does. You get two memory transactions for the price of one address generation, translation, and cache contention analysis. It amounts to having the entire L1 cache available as a register set extension every other clock cycle (leaving half of your L1 cache cycles for other forms of work).
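The win in miniature, in C (what a compiler actually emits varies, but an increment of memory is the canonical case where x86 gets one RMW instruction and a load/store machine gets three):

/* On x86 the loop body can become a single "inc dword ptr [mem]"-style
 * instruction: one address generation and translation covering both
 * the read and the write.  A load/store architecture needs a separate
 * load, add, and store, each taking its turn at the cache ports. */
void histogram(unsigned *counts, const unsigned char *data, int n)
{
    int i;
    for (i = 0; i < n; i++)
        counts[data[i]]++;   /* read-modify-write of memory */
}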
Having someone comment on the x86 is an excellent litmus test of their capacity to dig deeper than shallow preconceptions of elegance. If it were anything other than the despised x86, its ability to scale from 4.77MHz toward 10GHz would have been considered a marvel of engineering soundness. Sometimes ugliness has lessons to teach us. Who among us is prepared to listen?