Hardware

Understanding Pipelining and Superscalar Execution (87 comments)

Zebulon Prime writes "Hannibal over at Ars has just posted a new article on processor technology. The article uses loads of analogies and diagrams to explain the basics behind pipelining and superscalar execution, and it's actually kind of funny (for a tech article). It's billed as a basic introduction to the concepts, but as a CS student and programmer I found it really helpful. I think this article is a sequel to a previous one that was linked here a while ago."
This discussion has been archived. No new comments can be posted.

  • If these two articles, along with the promised third one, had come along a few months ago, I could have skipped even more architecture classes and still passed. Let's hope they keep popping these out.
  • But do we really need to be notified [slashdot.org] every time a part of this story comes out?
  • P&H - Pipelining (Score:5, Interesting)

    by minesweeper ( 580162 ) on Thursday December 19, 2002 @08:24PM (#4927242) Homepage

    I just finished a CS course [berkeley.edu] co-taught by Professor Patterson, and our primary text this semester was Patterson and Hennessy's Computer Organization and Design [berkeley.edu].

    When we discussed pipelining this semester, the analogy used was the four stages of doing laundry: washing, drying, folding, and stashing. Here are the lecture [berkeley.edu] notes [berkeley.edu] (both PDF). The notes spend a good deal of time going over the hazards of pipelines and how to avoid them.

    • When we discussed pipelining this semester, the analogy used was the four stages of doing laundry: washing, drying, folding, and stashing.

      How clever. It's not really a difficult concept to understand, but that analogy makes it even easier.
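      As a rough illustration of why the analogy works, here is a minimal sketch (Python, with made-up stage timings; the numbers are assumptions, not anything from the lecture notes) comparing back-to-back loads with overlapped, pipeline-style loads:

          # Toy model of the laundry pipeline: four stages per load.
          # Stage durations (minutes) are made-up numbers for illustration.
          STAGES = {"wash": 30, "dry": 40, "fold": 15, "stash": 5}

          def sequential_time(loads):
              # One load finishes all four stages before the next one starts.
              return loads * sum(STAGES.values())

          def pipelined_time(loads):
              # Idealized pipeline: every stage is paced by the slowest stage
              # (the dryer), and a new load enters the washer every "cycle".
              cycle = max(STAGES.values())
              return len(STAGES) * cycle + (loads - 1) * cycle

          for n in (1, 4, 10):
              print(n, "loads:", sequential_time(n), "min sequential vs",
                    pipelined_time(n), "min pipelined")
          # Note that a single load actually takes longer pipelined (160 vs 90
          # minutes): pipelining improves throughput, not the latency of one
          # item -- which is also where the hazards in the notes come from.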

    • Aahhhh!! It's that damn book! I've had nightmares about that book! They used it here to teach MIPS asm, and then the next semester it's used to teach ("bum bum bum!") CPU organization and design.
    • Re:P&H - Pipelining (Score:5, Interesting)

      by Jucius Maximus ( 229128 ) on Thursday December 19, 2002 @08:56PM (#4927394) Journal
      "I just finished a CS course [berkeley.edu] co-taught by Professor Patterson, and our primary text this semester was Patterson and Hennessy's Computer Organization and Design [berkeley.edu]. When we discussed pipelining this semester, the analogy used was the four stages of doing laundry: washing, drying, folding, and stashing. Here are the lecture [berkeley.edu] notes [berkeley.edu] (both PDF). The notes spend a good deal of time going over the hazards of pipelines and how to avoid them."

      I just finished an engineering course with the exact same book, and we did that exact same pipelining example. I must say that the book is really very good at explaining the workings of the CPU.

      The best teacher, though, is design. As part of the course, we all formed groups and actually designed and implemented basic non-pipelined CPUs in VHDL and, if they fit, we implemented them on FPGA (field-programmable gate array) boards. And by 'basic' CPUs, I mean that the CPUs had like 4K of RAM, maybe 8 registers, ran at ~3 cycles per instruction, and had only about 12 instructions total. But it was REALLY informative because it forced us to learn the exact purpose of everything in the CPU.
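      For the curious, here is a rough Python sketch (not VHDL, and everything in it -- the opcodes, the encoding, the register count -- is invented for illustration) of a toy non-pipelined CPU in that spirit: each instruction goes through a full fetch/decode/execute pass before the next one starts.

          # Toy non-pipelined CPU.  Instruction set and encoding are made up.
          class ToyCPU:
              def __init__(self, program):
                  self.regs = [0] * 8          # 8 general-purpose registers
                  self.mem = [0] * 4096        # "4K" of data memory
                  self.pc = 0
                  self.program = program       # list of (opcode, operands...)

              def step(self):
                  op, *args = self.program[self.pc]     # fetch + decode
                  self.pc += 1
                  if op == "li":                        # li rd, imm
                      self.regs[args[0]] = args[1]
                  elif op == "add":                     # add rd, rs, rt
                      self.regs[args[0]] = self.regs[args[1]] + self.regs[args[2]]
                  elif op == "sw":                      # sw rs, addr
                      self.mem[args[1]] = self.regs[args[0]]
                  elif op == "beqz":                    # beqz rs, target
                      if self.regs[args[0]] == 0:
                          self.pc = args[1]
                  elif op == "halt":
                      return False
                  return True

              def run(self):
                  while self.step():
                      pass

          cpu = ToyCPU([("li", 1, 5), ("li", 2, 7), ("add", 3, 1, 2),
                        ("sw", 3, 0), ("halt",)])
          cpu.run()
          print(cpu.mem[0])   # -> 12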

      • Heh...oddly enough, I ALSO just finished a course using that book (just got my grade a few minutes ago, actually...couldn't believe I got an A for that course). Word of advice...if you tend to procrastinate like I do, never, never, NEVER take a design class online..
        • "Heh...oddly enough, I ALSO just finished a course using that book (just got my grade a few minutes ago, actually...couldn't believe I got an A for that course). Word of advice...if you tend to procrastinate like I do, never, never, NEVER take a design class online.."

          What the heck? Were we in the same class? My grades were just released a few minutes ago as well! Were your prof's initials M.M.?

          • My prof was G.A., Computer Architecture, University of Colorado.

            (I don't know when my grades were actually released, but I just got back from Vegas an hour ago and found they were online when I checked :-))
            • "My prof was G.A., Computer Architecture, University of Colorado."

              OK, it was just a coincidence then. I was taking an embedded systems architecture course at a university in Canada (not Waterloo).

    • Didn't Patterson invent the SPARC RISC architecture? I believe Hennessy invented MIPS. A very good book. Hennessy is president of Stanford now too.
    • Now everyone will know why the P4 sucks :) Honestly, in order to double the pipeline's length, you really need to be twice as accurate with branch prediction, or else it will be slower per clock. It does make it easier for a chip to go faster, to a certain point, but the P4 is a bit excessive :) I just wonder sometimes if the Intel designers have ADD (think Rambus (bad idea... memory width is a good thing!), P4 (don't increase the pipeline's length unless you have a good reason), Itanium (scrap an entire architecture for one that allows you to disable instructions, so that it is guaranteed that part of the processor won't be used at that point)).
      • Performance per clock doesn't really matter, because lengthening the pipeline allows higher frequencies. I think a better metric is maximum performance in the same process, and the Pentium 4 wins there (unless you care about cost; then it gets tricky...).
      • Re:P&H - Pipelining (Score:4, Informative)

        by cheezedawg ( 413482 ) on Friday December 20, 2002 @12:14AM (#4927964) Journal
        to a certain point, but the P4 is a bit excessive

        Actually, there is a lot of research about pipeline depths, and here [colorado.edu] is a paper that calculates the optimal pipeline for x86 to be around 50 stages. In fact, they theorize that you could see up to a 90% increase in performance in the P4 by making the pipeline even deeper. So not everybody thinks that the P4 pipeline is "a bit excessive."

        think Rambus (bad idea... memory width is a good thing!)

        I'm a little confused here: until the past few months, Rambus still offered superior memory bandwidth. It wasn't until DDR333 and higher that SDRAM started to catch up. Rambus didn't lose in the market because of performance.

        Itanium (scrap an entire architecture for one that allows you to disable instructions, so that it is guaranteed that part of the processor won't be used at that point)

        That is a pretty strange complaint about Itanium. In fact, I think that it is weird that you even think that is a problem.
        • Re:P&H - Pipelining (Score:3, Informative)

          by j3110 ( 193209 )
          A 50-stage pipeline wouldn't be bad if it were done differently. I still prefer the old "have one instruction after the branch that is always executed" (delay slot) method; only with a longer pipeline you would need more such instructions. A good optimizing compiler could handle this MUCH better than a processor could. Longer pipelines need better branch prediction. I don't really care about your research, because it's all theory. It's pretty apparent from comparing the P3 to the P4 that longer pipelines are only better if you can manage to crank out a factor of speed greater than the factor of increase in pipeline length. The whole "software isn't compiled for P4 optimizations" argument is really dumb when you think about it. If it can't run x86 faster, then it must really suck. In order to compensate for this, they have to run the P4 at much faster clock rates, and the compiled binaries have to be larger so as to place the jump points in strategic positions. Deeper pipelining will also cause you to run into the limits of silicon transistor switching speed much earlier than you otherwise would.

          Rambus is a bad idea. It had much worse latency, and the latency increased with each memory module you added. The processor waits every time you ask for RAM; that's wasted cycles. There is a reason the P4 has an internal bus that is VERY wide compared to other processors. DDR's memory speed is in no way related to the bus's inferiority. The faster you make the chips, the faster the processor gets the data. With Rambus, you have additional latency for the same speed of RAM. They are all made from the same DRAM design, and with the same speed chips, a wider bus will be faster.

          I'm not upset that the Itanium changes instruction sets. I just don't think that any processor that disables part of itself will ever be optimal. I don't think lugging around old instruction sets is a good idea either. It's a waste of space that could have been used for a few more full multipliers.

          Don't take my word for it... go clock a P4 against another CPU and see how well it performs at sorting with Rambus memory. The bulk of the P4's gains over any other CPU come from SSE2, the 400 MHz FSB, and CPU clock speed. Take those away, and it will be slower per clock cycle than any other CPU (including the P3), especially if it has Rambus memory.
          • I think you are missing the point: you increase the pipeline depth so you can increase the clock speed. Of course it will do less per cycle; that's the nature of a deeper pipeline. But it gives you a much more scalable design, and the end result is better performance. The paper says that given the current timings and branch prediction accuracies, you can increase the performance a lot by playing with the pipeline depth and cache sizes.

            It's pretty apparent from comparing the P3 to the P4 that longer pipelines are only better if you can manage to crank out a factor of speed greater than the factor of increase in pipeline length.

            Look at the clock speeds of the P3 (architecture maxed out around 1.5 GHz) and P4 (3 GHz with lots of room to grow). They have "cranked up the speed" by more than enough to compensate for the lower IPC. If you are comparing a 1.5 GHz P3 to a 1.5 GHz P4 then you have missed the whole point entirely.

            Longer pipelines need better branch prediction.

            Branch prediction is already very good. Are you suggesting that you should slow down the rest of the chip so that the less than 5% of branches that are mispredicted won't be as expensive? That's a pretty dumb trade-off.

            Don't take my word for it... go clock a P4 against another CPU and see how well it performs at sorting with Rambus memory

            Ok- how about this: [tomshardware.com]

            Still, the actual Rambus technology leaves no room for complaint: RDRAM offers a large bandwidth of up to 4.2 GB/s and offers the best performance, particularly when used together with the Intel Pentium 4. ...
            Here, you should keep in mind that the Intel Pentium 4 has a maximum bandwidth of 4.2 GB/s. In the near future, this will reach well over the 3 GHz limit. Only Rambus memory in the form of PC4200 (533 MHz) is capable of taking full advantage of this bandwidth. By using DDR SDRAM, such as DDR266 or DDR333, the bandwidth remains restricted to 2.1 GB/s and 2.7 GB/s, respectively.
            With higher latencies and all (which are becoming less significant), the P4 always has performed, and still does perform, best with Rambus.

            The bulk of the P4's gains over any other CPU come from SSE2, the 400 MHz FSB, and CPU clock speed. Take those away, and it will be slower per clock cycle than any other CPU (including the P3)

            Once again, nobody disputes that the P4 is "slower" per clock cycle than other CPUs. That is why it is clocked 1 GHz higher than its competition, and it still has room to grow (current Athlon designs can't go much faster). Your statement is analogous to saying "Take away these extra french fries and the larger drink from my super-sized value meal, and it's just the same as a regular value meal".
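            To put rough numbers on that point (the IPC figures below are made-up illustrations, not measurements of any real chip): performance is roughly IPC times clock frequency, so a deeper pipeline can come out ahead even with lower IPC, as long as the clock gains outpace the IPC loss.

                # Performance ~ IPC x clock.  IPC numbers are invented for
                # illustration, not measured values for the P3 or P4.
                def relative_perf(ipc, ghz):
                    return ipc * ghz

                shallow = relative_perf(ipc=0.9, ghz=1.5)   # shallow pipe, lower clock
                deep    = relative_perf(ipc=0.6, ghz=3.0)   # deep pipe, higher clock

                print("shallow:", shallow)   # 1.35
                print("deep:   ", deep)      # 1.8
                # Compared at the same 1.5 GHz the deep design loses, but at the
                # clock it actually reaches, it wins -- which is the comparison
                # that matters.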
            • OK, if you don't say that the pipeline allowed them to get 2x the clock speed (the fastest Athlon is 2.17 GHz ATM), then I won't say that it hasn't let them crank out an extra 38% clock speed.

              If they catch branches early on (by the 5th stage), then they will not lose much performance. But because there are 10 instructions still not complete, it's unlikely they could catch the branch until all the instructions are complete and ready to write back to RAM. Consider also the superscalar nature of the CPU. Now we are sitting around the 13th stage of the pipeline by the time we know which way to branch. If we were wrong, we lose 12*2=24 clock cycles. If branch prediction is 90% accurate (a stretch, I would say) and code is about 10% branch statements, then 1 in every 100 instructions will cause you to lose 24 clock cycles (a rough calculation along these lines is sketched below). That's only branching!

              Now let's say that 20% of the time, the next instruction references the result of the current instruction. That usually causes the processor to insert NOPs into the pipe just after the decode stage until there is enough spacing between them that the data can be forwarded to this instruction when it is ready for its operands. In a normal pipe, that's only about 2 clock cycles; in a P4 pipe, that has to be at least 6. This is a factor of 3 slower about 20% of the time, and it's about the same for branches. A good 1/3 of the CPU time is 3x slower on a P4. Sure, you can compile around some of it, but that is just silly. That's why the P4 introduced hyperthreading, which nothing is using: instead of inserting NOPs, it can insert instructions from other programs that can't possibly reference the same data. That would be great, but it can be applied to shorter pipes just as well.

              The majority of the performance increase of the P4 has nothing to do with the pipeline. If the P4 were built on a normal pipeline, you could have expected to see a 2.2 GHz machine right now that blew the AMD away.

              Intel had a good chance with the Itanium to make a good RISC processor. A clean load-store design, 4- or 8-way superscalar, with a short pipeline, could still do 3 GHz easily. Instead, they decided to graft on the kitchen sink and a million other transistors that will lie dormant half the time. Most of the CPU will be sleeping even when you need it. I'm more excited about the new IBM processors for the Mac.
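              (To make the back-of-the-envelope numbers from a couple of paragraphs up concrete -- and they are only assumptions: 10% branch density, 90% prediction accuracy, a 24-cycle flush -- here is the arithmetic as a small sketch.)

                  # Rough misprediction cost with the assumptions stated above.
                  branch_fraction = 0.10    # fraction of instructions that branch
                  accuracy = 0.90           # fraction of branches predicted right
                  flush_penalty = 24        # cycles lost per mispredict (assumed)
                  base_cpi = 1.0            # ideal cycles per instruction

                  mispredicts_per_instr = branch_fraction * (1 - accuracy)  # ~0.01
                  penalty_cpi = mispredicts_per_instr * flush_penalty       # ~0.24

                  print("extra cycles per instruction:", penalty_cpi)
                  print("effective CPI:", base_cpi + penalty_cpi)
                  # About 24 cycles lost per 100 instructions from branches alone:
                  # CPI goes from 1.0 to roughly 1.24 under these assumptions.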
    • That seems to be a common analogy for pipelining, and is actually quite useful, since it extends to cases where you have two washers and a big dryer, or one washer and two slow dryers, or you're in a dorm and have a dozen of each, but you have to run the dryers several times each to get stuff dry (and then you go to work at Intel).

      Am I the only person who therefore uses the word "clock" for what you do when a step has finished?
  • Branch Prediction (Score:5, Informative)

    by hng_rval ( 631871 ) on Thursday December 19, 2002 @08:27PM (#4927261)
    One thing that his excellent analogy leaves out is the concept of branch prediction.

    For those of you who didn't major in CS...

    Imagine that we finish the first stage of building our SUV (building the engine) and commence with stage 2 (putting the engine in the chassis). While we are doing that, we are building another engine for SUV #2. However, what if the next customer didn't want an SUV, but instead wanted a compact car? We have to throw away our engine for SUV #2 and start over. We wasted an entire stage!!

    This analogy doesn't work so well, it seems, so we'll stick with computers. Say you have 5 instructions in your pipeline and one of them is a conditional branch (think: if the user hit ENTER, print a message to the screen; if they hit Escape, BSOD).

    If the conditional instruction is high up in the pipeline, then every instruction behind it could be wasted. Obviously, if the processor could predict which path the branch would follow, it would waste fewer instructions.

    Branch prediction algorithms are extremely interesting. The early ones were very simple:

    Prediction: Never take the branch
    OR
    Prediction: Always take the branch

    People soon realized that most branches were in loops, so they came up with a new algorithm:

    Prediction: If the last time we were here we took the branch, take it again, otherwise don't take it. Basically, repeat what we did the last time we ran this instruction.

    IIRC there are lots of branch prediction algorithms, some of which are eerily accurate (above 90%). Unfortunately, branch prediction requires cache space, which takes away from the cache your programs need.
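    For what it's worth, here is a minimal sketch of the "repeat what we did last time" predictor described above, next to the common 2-bit saturating-counter refinement. The table size and the indexing by low PC bits are arbitrary choices for illustration, not how any particular chip does it.

        # Two toy branch predictors, indexed by the low bits of the branch PC.
        TABLE_SIZE = 1024

        class OneBitPredictor:
            """Predict whatever this branch did the last time it executed."""
            def __init__(self):
                self.taken_last = [False] * TABLE_SIZE
            def predict(self, pc):
                return self.taken_last[pc % TABLE_SIZE]
            def update(self, pc, taken):
                self.taken_last[pc % TABLE_SIZE] = taken

        class TwoBitPredictor:
            """Saturating counter per branch: it takes two wrong outcomes in a
            row to flip the prediction, which helps loop branches."""
            def __init__(self):
                self.counter = [1] * TABLE_SIZE    # 0..3; >= 2 means "taken"
            def predict(self, pc):
                return self.counter[pc % TABLE_SIZE] >= 2
            def update(self, pc, taken):
                i = pc % TABLE_SIZE
                self.counter[i] = min(3, self.counter[i] + 1) if taken \
                    else max(0, self.counter[i] - 1)

        # A loop branch that is taken 9 times and then falls through, repeated:
        outcomes = ([True] * 9 + [False]) * 10
        for predictor in (OneBitPredictor(), TwoBitPredictor()):
            hits = 0
            for taken in outcomes:
                hits += (predictor.predict(0x400) == taken)
                predictor.update(0x400, taken)
            print(type(predictor).__name__, "accuracy:", hits / len(outcomes))
        # The 1-bit scheme mispredicts twice per loop trip; the 2-bit scheme
        # only once, which is why real tables usually store 2-bit counters.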
    • Re:Branch Prediction (Score:5, Informative)

      by jbrandon ( 603700 ) on Thursday December 19, 2002 @09:01PM (#4927409)

      Unfortunately, branch prediction requires cache space, which takes away from the cache your programs need.

      This [arstechnica.com] notes that the branch prediction unit has some cache that is separate from the other cache. It also notes that the PIII BPU has the "eerily accurate" prediction success you describe.

      • Re:Branch Prediction (Score:4, Informative)

        by ottffssent ( 18387 ) on Friday December 20, 2002 @01:05AM (#4928085)
        All modern CPUs have highly accurate branch predictors. Just as caches need to catch the vast majority of memory accesses (about 98% I believe is a common number for the L1+L2 caches) for the CPU to work even remotely close to its theoretical maximum, branch predictors need to avoid pipeline flushes whenever possible. Consider the P4 with 20-odd stages, more in the FP pipeline. There could be dozens of instructions in flight by the time the branch test completes and says it took the wrong branch: the 20-stage pipeline needs to be emptied of invalid instructions, all of which get thrown away, a new memory location needs to be loaded, and execution needs to resume at that point. Extremely inefficient, so the occurrence of this sort of thing has to be kept to an absolute minimum. Since the P4 suffers so horribly in the case of mispredicted branches, it tends to do poorly in branch-heavy code such as chess benchmarks. In order to keep such wasteful operations from happening, the CPU keeps track of thousands of previous branches in an attempt to guess correctly which way a new branch will go. There is a data structure in the CPU called the BHT (Branch History Table) that holds all this information, sometimes with many bits of info per branch.

        I won't take the time to dig up references to see if any CPU architecture currently does all of these tricks, but consider all the things that can be done to minimize the impact of branches:
        When a branch occurs, fetch instructions from *both* possible branch locations. Begin executing both sets of instructions in parallel, keeping the CPU's back-end busy. Flag these instructions with a "left branch" or "right branch" tag. When the branch test completes, toss out the wrong instructions and keep the good ones. Both branches will execute more slowly than a correctly-guessed branch that executes only one, but in the case of a mispredict, there is no pipeline flush, and no delay waiting for the PC to update and new instructions to flow in. Also, it's hard on the caches and RAM subsystem, since two sets of instructions rather than 1 need to be fetched.
        Build a better predictor. By analyzing the type of branch and surrounding code, the branch predictor can get eerily accurate. Way better than 90%. A K6-2 could get 90% accuracy with no sweat, and that's a pretty old chip.
        Prefetch. Grab the branch test and run it way ahead of the branch itself (when possible). If the outcome of the branch can be determined before the branch is reached (using instruction reordering trickery), there is effectively no branch at all. This can be "stacked" with other techniques as well.
        I'm sure you can come up with several more. It's an interesting problem to think about, with most techniques having a good mix of benefits and drawbacks.
        • The forthcoming IBM PowerPC 970 CPU is supposed to have a very sophisticated branch prediction unit. (I'm not sure how it compares to that of the POWER4, from which the PPC 970 was derived, or how it compares to other CPUs, though.)

          (Disclaimer: I'm recalling all of this from memory, based on a paper I wrote a few weeks ago on the PPC 970. Forgive me if I over-simplify or misstate something.)

          The PPC 970 has three branch history tables (BHTs), each with 16K (2^14) entries of one bit each. One BHT follows the more or less traditional method of tracking whether or not the branch prediction from a previous execution of the instruction was successful. One BHT has its entries associated with an 11-bit vector that tracks the last 11 instructions executed by the CPU, and uses this to determine if the branch prediction was successful. The third and final BHT is used to determine which BHT has been more successful for the corresponding instructions: for each individual branch instruction, the third BHT is consulted to see which method has had better success in the past, and that BHT is then used as the branch prediction method for this execution of the instruction.

          CyberDave
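          A hedged sketch of that "chooser" idea (this is not the actual PPC 970 logic; the sizes, the XOR indexing, and the single-bit entries are simplifications): two predictors run side by side, and a third table remembers which of them has been right for each branch.

              # Toy tournament predictor: local table, global-history table, and
              # a selector table that learns which of the two to trust per branch.
              SIZE = 1 << 14          # 16K one-bit entries, as described above
              HISTORY_BITS = 11       # global history width, as described above

              class Tournament:
                  def __init__(self):
                      self.local = [False] * SIZE     # last outcome per branch
                      self.global_ = [False] * SIZE   # outcome per (PC, history)
                      self.selector = [False] * SIZE  # True -> trust global table
                      self.history = 0                # last HISTORY_BITS outcomes

                  def _gidx(self, pc):
                      return (pc ^ self.history) % SIZE

                  def predict(self, pc):
                      if self.selector[pc % SIZE]:
                          return self.global_[self._gidx(pc)]
                      return self.local[pc % SIZE]

                  def update(self, pc, taken):
                      local_ok = self.local[pc % SIZE] == taken
                      global_ok = self.global_[self._gidx(pc)] == taken
                      if local_ok != global_ok:                 # exactly one was right:
                          self.selector[pc % SIZE] = global_ok  # remember the winner
                      self.local[pc % SIZE] = taken
                      self.global_[self._gidx(pc)] = taken
                      self.history = ((self.history << 1) | int(taken)) & \
                                     ((1 << HISTORY_BITS) - 1)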
    • Just a little bit of information. I can't remember what chip it was implemented on, but I think it was the Alpha 21264 (correct me if I'm wrong) that claimed something like 99.99% correct prediction, using some crazy state machines and keeping track of the choices made in the past at the same juncture to help influence the choice. Pretty cool stuff... too bad I can't find a link to sneak in here with some cool information.

  • Holy Crap! (Score:5, Funny)

    by poshgoat ( 543379 ) <rcdean@NosPaM.shaw.ca> on Thursday December 19, 2002 @08:30PM (#4927272)
    I have a final exam on this stuff tomorrow morning... It would seem there is a God...!!
    • Re:Holy Crap! (Score:2, Informative)

      Heh, I was reading Slashdot the night before I had my exam on this stuff as well (which was 2 weeks ago).

      Make sure you can identify RAW, WAW, and WAR hazards. (R == read, A == after, W == write)
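      In case it helps, here is a tiny sketch that classifies the hazard between two instructions from their register read/write sets (the register names and the three example instructions are made up):

          # Classify the hazard between two instructions, each given as a pair
          # (registers_written, registers_read).
          def hazard(first, second):
              w1, r1 = first
              w2, r2 = second
              kinds = []
              if w1 & r2: kinds.append("RAW")   # second reads what first writes
              if r1 & w2: kinds.append("WAR")   # second writes what first reads
              if w1 & w2: kinds.append("WAW")   # both write the same register
              return kinds or ["none"]

          add_ = ({"r1"}, {"r2", "r3"})   # add r1, r2, r3
          sub_ = ({"r4"}, {"r1", "r5"})   # sub r4, r1, r5
          mul_ = ({"r1"}, {"r6", "r7"})   # mul r1, r6, r7

          print(hazard(add_, sub_))   # ['RAW']  true dependence
          print(hazard(sub_, add_))   # ['WAR']  anti-dependence ("name" hazard)
          print(hazard(add_, mul_))   # ['WAW']  output dependence (also a name hazard)
          # RAW is real data flow and can only be hidden by forwarding or
          # stalling; WAR and WAW disappear if the hardware renames registers.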

  • Our SUV, the Ars Extinction LE (if you put "LE" on the end of the name you can charge more), is selling like... well, it's selling like an SUV, which means it's doing pretty well. In fact, it was awarded Car and Driver's prestigious "Ridiculously Aggressive-looking Design of the Year" award, as well as the Global Climate Coalition's "Excellence in Environmental Innovation" award for its stunningly low 0.5 mpg rating. (In case it's not obvious, GCC is an "environmental" front group for the oil industry.)

    -Cyc

  • Well I'm glad somebody understands it.
  • Great (Score:3, Funny)

    by Jenova ( 27902 ) on Thursday December 19, 2002 @08:47PM (#4927352)
    Thanks for the links, people. I really needed this refresher course. Pipelining and superscalar execution are stuff I haven't touched for quite a while :)
  • by sparkz ( 146432 ) on Thursday December 19, 2002 @08:50PM (#4927365) Homepage
    I knew this before I did a BSc in CS; what I learned later was the real stuff: how memory speed affects pipelining. Memory has not gotten significantly faster in the past decade, whereas CPUs have gone from 30 MHz to 3 GHz. Therefore even more pipelining is required now, as we sit around, cycle upon cycle, waiting for memory to feed us some data.

    In fact, Alan Cox gave a talk on this recently: UMeet2002 [uninet.edu].

    • DEC designed the Alpha CPU to interface to the bus and RAM over multiple pipelines. Their EV6 is the prime example of that effort; the EV7 is even better. Due to DEC's inability to obtain loans and offset the high consumer cost to make Alpha more affordable, they were in a position where they had to sell themselves to Compaq, and their financial team followed. Compaq, in the same tradition, sold out to Hewlett-Packard, and we all remember when HP's president declared HP would close its doors if the merger didn't complete. And now HP is *trying* to kill the Alpha at the EV7. Alpha is the fastest in the market, and HP is also canceling its own in-house PA-RISC in favor of the much slower Intel Itanium 2. The only company left with a license to produce Alpha hardware is Samsung, and they've already dumped the technical details of their Alpha products from their website. This world stinks... Star Trek pulled off the air and now Alpha. Ode to bankers for conspiracy and marketers for pushing inferior non-Alpha hardware.
  • Now, since we Ars guys are computer types and not industrial engineers, we're not too bright when it comes to making efficient use of factory resources. Also, because Ars was started back in the dot-com boom days, we're still kind of stuck in that mindset so we run a pretty chill shop with lots of free snacks, foosball tables, arcade games and other such employee perks.
    If only I could find a boss with that mindset who's working for a company that can still afford that mindset...
  • "Understanding why black text on a white background is easier to read than black text on a white background."
    • Oops, that should, of course be: "Understanding why black text on a white background is easier to read than *white* text on a *black* background."

      That'll teach me for trying to make a snide comment :-)
      • > "Understanding why black text on a white background is easier to read than *white* text on a *black* background."

        I don't understand this. Black on white is a historical remnant of ink-on-paper technology, and it has no basis in CRTs or LCDs. In fact I find black on white quite annoying, and irritating to my eyes, particularly on a CRT.

        In a reflective LCD (no backlight), black on white might make more sense. In general I think the background (dominant color) should be the "neutral state", which is black on CRTs and white/gray on reflective LCDs.

        • In general I think the background (dominant color) should be the "neutral state", which is black on CRTs and white/gray on reflective LCDs.
          I don't think the neutral state is all that important. More important is that you aren't staring at a light all day trying to pick out dark bits. Light grey on black with muted colour highlights.

          Not that this has anything to do with the Ars site. They use white on black, which is just as bad.

      • Oh good grief yeah... Whenever I tab back over to a white webpage after reading an Ars novella I suffer from heinous retinal burn. If it wasn't for the neat-o diagrams I'd just cut+paste :P
  • by Wesley Felter ( 138342 ) <wesley@felter.org> on Thursday December 19, 2002 @09:22PM (#4927489) Homepage
    Am I the only one who finds this stuff easier to understand when the author just explains what actually happens instead of using analogies? I thought the Hennessy & Patterson version of this was better, but then it wasn't free...
  • that this article isn't a hoax as well. (-:
  • At last! (Score:4, Funny)

    by mao che minh ( 611166 ) on Thursday December 19, 2002 @10:21PM (#4927653) Journal
    Thank you! After suffering many long and terrible months under an oath of involuntary celibacy, this newfound knowledge of superscalar execution is sure to win me a date with one of the many "cam girl amateurs" that have been offering me free services via email for months. Thank you for restoring my confidence. Now I must learn how to convey my thoughts without using run-on sentences.
  • I'm kinda surprised that this wasn't touched on in the discussion of pipelining. One of the major problems with pipelines is the hazards that can occur; a Read-After-Write (RAW) hazard could really screw shit up. Anyway, to make a long story short, Tomasulo's algorithm can take care of some RAW hazards, as well as the WAW and WAR hazards that could stall a processor. If you're really interested in this stuff, it's a worthwhile topic to read up on.

    Here [ed.ac.uk] is a great link if you want to visualize how this works.
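    To show just the renaming part of the idea (this is a heavily simplified sketch, not Tomasulo's full algorithm: there are no reservation stations, no timing, and no common data bus modeled, and all the names are invented): destination registers get fresh tags, so the WAR and WAW "name" hazards vanish and only the true RAW dependences remain.

        # Minimal register-renaming sketch in the spirit of Tomasulo's algorithm.
        from itertools import count

        def rename(instructions):
            """instructions: list of (dest, src1, src2) architectural registers.
            Destinations are replaced by fresh tags; sources are replaced by
            whichever tag currently produces that register."""
            fresh = count()              # fresh physical tags t0, t1, ...
            alias = {}                   # architectural reg -> producing tag
            renamed = []
            for dest, src1, src2 in instructions:
                s1 = alias.get(src1, src1)   # keep the true (RAW) dependence
                s2 = alias.get(src2, src2)
                d = "t%d" % next(fresh)      # fresh name: WAR/WAW disappear
                alias[dest] = d
                renamed.append((d, s1, s2))
            return renamed

        prog = [("r1", "r2", "r3"),   # add r1, r2, r3
                ("r4", "r1", "r5"),   # sub r4, r1, r5  (RAW on r1 -- kept)
                ("r1", "r6", "r7")]   # mul r1, r6, r7  (WAW/WAR on r1 -- renamed away)

        for before, after in zip(prog, rename(prog)):
            print(before, "->", after)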

  • This article would have been a lot more interesting if he'd trimmed 50% of the distracting BS, stories of Caesar hiring his relatives to play foosball, etc. I'm not saying make it boring; I'm saying that there are so many indirect references to things having nothing to do with pipelining that someone truly new to this material is going to have a hard time teasing it apart. Just my opinion.
  • gorgo: *lol*
    joey: what's so funny? :)
    shh, joey is losing all sanity from lack of sleep
    'yes joey, very funny'
    Humor him :>
    -- Seen on #Debian

    - this post brought to you by the Automated Last Post Generator...
