Forgot your password?
Intel Hardware

Cliff Click's Crash Course In Modern Hardware 249

Posted by timothy
from the first-there-were-the-dinosaurs dept.
Lord Straxus writes "In this presentation (video) from the JVM Languages Summit 2009, Cliff Click talks about why it's almost impossible to tell what an x86 chip is really doing to your code due to all of the crazy kung-fu and ninjitsu it does to your code while it's running. This talk is an excellent drill-down into the internals of the x86 chip, and it's a great way to get an understanding of what really goes on down at the hardware and why certain types of applications run so much faster than other types of applications. Dr. Cliff really knows his stuff!"
This discussion has been archived. No new comments can be posted.

Cliff Click's Crash Course In Modern Hardware

Comments Filter:
  • Fast forward... (Score:5, Informative)

    by LostCluster (625375) * on Thursday January 14, 2010 @07:20PM (#30772618)

    I can't say I've WTFV like I usually RTFA before you get to see it... but I can tell you this: The first four minutes of the video are spent asking which topic the room wants to see. No need to watch that part. Then it gets more interesting.

  • That's the main reason why I want to shoot people who write "clever" code on the first pass. Always make the rough draft of a program clean and readable. If (and only if!) you need to optimize it, use a profiler to see what actually needs work. If you do things like manually unroll loops where the body is only executed 23 times during the program's whole lifetime, or use shift to multiply because you read somewhere that it's fast, then don't be surprised when your coworkers revoke your oxygen bit.

    • by RightSaidFred99 (874576) on Thursday January 14, 2010 @07:44PM (#30772906)
      And messy and embarrassing. Oh, wait...
    • If (and only if!)

      Compiler Error: Numerous Syntax Errors.
      Line 1, 4; Object Expected
      Line 1, 15; '(' Expected
      Line 1, 16; Condition Expected
      Line 1, 17; 'Then' Expected

    • by marcansoft (727665) <> on Thursday January 14, 2010 @08:00PM (#30773114) Homepage

      Using shift to multiply is often a great idea on most CPUs. On the other hand, just about every compiler will do that for you (even with optimization turned off I bet), so there's no reason to explicitly use shift in code (unless you're doing bit manipulation, or multiplying by 2^n where n is more convenient to use than 2^n). However, a much more important thing is to correctly specify signed/unsigned where needed. Signed arithmetic can make certain optimizations harder and in general it's harder to think about. One of my gripes about C is defaulting to signed for integer types, when most integers out there are only ever used to hold positive values.

      • by Rockoon (1252108) on Thursday January 14, 2010 @08:11PM (#30773220)

        Using shift to multiply is often a great idea on most CPUs.

        Which CPU's are those? The fastest way to multiply today on AMD/Intel is to use the multiply instructions.

        Didn't know that? yeah... it seems like only assembly language programs know this.

        • by marcansoft (727665) <> on Thursday January 14, 2010 @09:00PM (#30773736) Homepage

          Which CPU's are those?

          Those with a barrel shifter.

          The fastest way to multiply today on AMD/Intel is to use the multiply instructions.

          Then someone needs to beat the GCC developers with a cluestick.
          $ cat test.c
          int main(int argc, char **argv) {
                          return 4*(unsigned int)argc;
          $ gcc -march=core2 test.c -o test
          $ objdump -d test ...
          00000000004004ec <main>:
              4004ec: 55 push %rbp
              4004ed: 48 89 e5 mov %rsp,%rbp
              4004f0: 89 7d fc mov %edi,-0x4(%rbp)
              4004f3: 48 89 75 f0 mov %rsi,-0x10(%rbp)
              4004f7: 8b 45 fc mov -0x4(%rbp),%eax
              4004fa: c1 e0 02 shl $0x2,%eax
              4004fd: c9 leaveq
              4004fe: c3 retq
              4004ff: 90 nop

          yeah... it seems like only assembly language programs know this.

          I program in assembly language, but not for x86. I usually program in ARM, which always has a barrel shifter. I guarantee shifts are faster than multiplies there.

          • Re: (Score:3, Insightful)

            by AuMatar (183847)

            It depends on where they spend their hardware, and what you're multiplying by. You can make a multiplier faster than shifting, it just requires a lot of hardware to do so. If you're multiplying by a constant power of 2, shifting will always be as fast or faster. If you're multiplying by a non power of 2 constant, shifting and adding may be faster, and probably is if there's fairly few 1s in the binary representation. But if they have a good multiplier then mult may be faster than shift/add for a random

            • I was talking of multiplying by a power of two constant, of course. You're quite correct in saying that shift+add combinations may or may not be faster than multiplying by more complex constants, depending on the particular implementation. Usually, two shifts and one add is a fairly safe bet for simpler CPUs, but it can actually slow things down on modern superscalar CPUs where it creates undesirable dependencies in the pipeline.

              • Re: (Score:3, Informative)

                by TheRaven64 (641858)
                I actually did a benchmark of this a few months ago. For a single shift, there wasn't much in it (on a Core 2); both were decoded into the same micro-ops. For more than one shift and add, the multiply was faster because the micro-op fusion engine wasn't clever enough to reassemble the multiply (and even if it were, you're still burning i-cache for no reason). GCC used to emit shift-and-add sequences for all constant multiplies until someone benchmarked it on an Athlon (which had two multiply units and on
          • by smash (1351)
            Does that code change if you use the arch flags for GCC to generate AMD64 or at least i686 code?

            Not taking the piss... i have no idea - i just noticed you didn't use any architecture specific flags so its no doubt defaulted to dumb but compatible code?

          • by Rockoon (1252108) on Friday January 15, 2010 @09:02AM (#30777752)
            GCC is a big offender, thats true.

            This is one of the reasons that GCC sucks compared to ICC and VC++.

            Let me give you the facts as they are today. In isolation, both the shift instructions and the multiply instructions have the same latency and throughput, and are also performed on the same execution units.

            If this was the entire story, then they would be equal. Buts its not the entire story.

            The shift instructions only modify some of the flags in the flags register. Essentially, the shift instructions must do a read/modify/write on the flags. The multiplication instructions, however, alter the entire flags register, so only perform a write.

            "But Rockoon.. they are the same latency anyways, right?" .. yes, in isolation. But that read/modify/write cycle on the flags register prevents a hell of a lot of out-of-order execution.

            Essentially, one of the inputs to the shift instruction is the flags register so all prior operations that modify the flags register must be completed first, and no instruction following the shift that also partially modify the flags register can be completed until that shift is completed.

            In some code, it wont make any discernible difference, but in other code it will make a big difference.

            As far as that GCC compiler output.. thats code is horrible, and not just because its AT&T syntax.

            There are two alternatives here for multiplying by 4 that should be in competition here, and neither uses a shift.

            One is a straight multiplication (MASM syntax, CDECL):

            mov edx, [esp + 4] ; 32-bit version, so +4 skips the return address
            imul eax, edx, 4

            The other is leveraging the LEA instruction (MASM syntax, CDECL):

            mov eax, [esp + 4] ; 32-bit version, so +4 skips the return address
            lea eax, [eax * 4]

            The alternative LEA version on some processors (P4..), in isolation, is slower .. but it has the advantage that it uses different execution units on those very same processors, so might pair better with other stuff in the pipeline, and it doesnt touch the flags register at all.

            GCC is great at folding constants and such, even calculates constant loops at compile time.. but its big-time-fail at code generation. GCC is one of the processors that one optimization expert struggled with because he was trying to turn a series of shifts and adds into a single far more efficient multiplication.. the compiler converted it back into a series of shifts and adds on him. Fucking fail.
        • People learn a trick way back when, or hear about the trick years later, and assume it is still valid. Not the case. Architectures change a lot and what used to be the best way might not be anymore.

          Michael Abrash, one of the all time greats of optimization, talks about this in relation to some of the old tricks he used to use. One was to use XOR to clear a register on x86. XORing a register with itself gives 0, of course, and turned out to be faster than writing an immediate value of zero in to the register

          • Ok well that's changed now. Our more complex modern CPUs have special logic for clears, and doing a move to the register with 0 is faster. So it was a time limited trick, useful back when he started doing it, but no longer something worth trying.

            I'm definitely no expert on x86, but my impression was that precisely because of this trick that everyone does, modern CPUs still do xor reg,reg at least as fast as moving 0. Because they want existing code to run as fast as possible, and in x86 compatibility-is-kin

            • by Cassini2 (956052) on Thursday January 14, 2010 @09:37PM (#30774052)

              I'm definitely no expert on x86, but my impression was that precisely because of this trick that everyone does, modern CPUs still do xor reg,reg at least as fast as moving 0.

              You are correct. XOR reg,reg was such a common instruction on the x86, that essentially it became the special case CLR instruction. Essentially, if you see a CLR instruction on an x86 assembly printout, it is the XOR instruction in disguise. The x86 has no CLR instruction.

              Ok well that's changed now. Our more complex modern CPUs have special logic for clears, and doing a move to the register with 0 is faster. So it was a time limited trick, useful back when he started doing it, but no longer something worth trying.

              Essentially, all current "simple" CPU instructions execute with the same speed. However, the XOR instruction is still faster than the MOV instruction because of instruction bandwidth and cache effects. Most code today is limited by cache and bandwidth limits, like the need to load instructions into the instruction decode pipeline immediately after a jump instruction. The MOV reg, 0 immediate move instruction is a two-byte instruction, and the XOR reg, reg instruction is a one-byte instruction. As such, in real code, the XOR instruction is usually slightly faster, because it results in smaller code.

              Additionally, all of the modern x86 CPU implementations special case the XOR reg,reg instruction into a MOV reg, 0 immediate move instruction inside the instruction decode stage anyway. As such, no significant functional difference exists. The only case where a move instruction is quicker is when the condition codes are propagating a side-effect via the condition code registers. Thus, in theory:
              ADD AL, AH
              MOV CL, 0
              JC somewhere

              should execute quicker with a MOV instruction as opposed to a XOR instruction. However, in practice, this piece of code:
              XOR CL, 0
              ADD AL, AH
              JC somewhere

              executes with exactly the same speed, because the out-of-order execution units inside the x86 automatically optimize the code and make it equivalent. As such, you are best with the "short small" code, which means that the XOR reg, reg instruction is still the fastest way to do a register clear.

          • Re: (Score:3, Informative)

            by SpinyNorman (33776)

            Actually the reason us old fogies normally used XOR A, A rather than LD A, 0 wasn't because it was faster but rather because it was smaller - 1 byte rather than two bytes (instruction + immediate operand). On the old memory constrained 8-bitters, these assembly "tricks" were all about saving a byte here, another byte there...

            • by BZ (40346) on Friday January 15, 2010 @12:06AM (#30775154)

              The smaller instructions are still worth it, not so much because of main RAM size constraints but because of cache size constraints. Staying in L1 is great if you can swing it; falling out of L2 blows your performance out of the water.

              Most recently, just iterating over an array and doing a simple op on each entry became about 2x faster on my machine by going from an array of ints to an array of unsigned chars (all the entries are guaranteed in unsigned char range). Reason was, the array of ints was just about the total size of my L2... and the new array is 1/4 the size, which means there's space for other things too (like the code).

      • so there's no reason to explicitly use shift in code (unless you're doing bit manipulation

        Well, right. The general advice is to always write what you actually want the compiler to do and not how to do it, unless you have specific proof that the compiler's not optimizing it well.

      • Using shift to multiply is often a great idea on most CPUs.

        In C/C++ shift is not the same as multiply/divide by 2. Multiplication and division operators have a different precedence level than shift operators. Not only is there the possibility of poor optimization but such a substitution may lead to a computational error. For example mul/div has a higher precedence than add/sub, but shift has a lower precedence:

        printf(" 3 * 2 + 1 = %d\n", 3 * 2 + 1);
        printf(" 3 << 1

    • by AuMatar (183847)

      The opposite problem also exists though- by not thinking about performance you can make it expensive or impossible to improve things later without a substantial rewrite. Saying optimize at the end is just as stupid and just as costly. Learning when to care about what level is part of the art of programming. (Although on your specific examples I'll agree with you- especially since I would expect anything but a really old compiler to do mult->shift conversions for you, so you may as well use the more

      • Saying optimize at the end is just as stupid and just as costly.

        There is an enormous difference between optimization and choosing appropriate algorithms. If you write a program well, it's almost always easy to optimize it later. If you write it poorly, it'll almost always be impossible to optimize at any point of its development. For example, I'd rather sort a big array with an unoptimized (but correct) quicksort than with an extremely clever (but insane) bogosort.

      • Re: (Score:3, Interesting)

        by smash (1351)

        by not thinking about performance you can make it expensive or impossible to improve things later without a substantial rewrite.

        "Not thinking about performance" is different from writing in high level first.

        Get the algorithm right first, THEN optimise hot spots.

        Starting out with ASM makes it a lot more time consuming/difficult to get many different algorithms written, debugged and tested. The time you spend doing that is time better spent testing/developing a better algorithm. Only once you get the

    • Re: (Score:3, Informative)

      by tomtefar (935007)
      I have the following sticker on top of my display: "Make it work before you make it fast!" Saved me many hours of work.
    • Re: (Score:2, Interesting)

      by Anonymous Coward

      I think that the premature optimization claims are way overdone. In the cases where performance does not matter, then sure, make the code as readable as possible and just accept the performance.

      However, sometimes it is known from the beginning of a project that performance is critical and that achieving that performance will be a challenge. In such cases, I think that it makes sense to design for performance. That rarely means using shifts to multiply -- it may, however, mean that you design your data st

      • Interesting anecdote that has nothing to do with optimization and everything to do with bad design. Optimization is great for making your program run n% faster. Design is great for making your program run in O(log n) time instead of O(n^2) time. The important part is to come up with a good design, implement it, and address the specific problem areas. I can't think of a single justification for doing it any other way.

      • by smash (1351)
        On the contrary, i'd be more concerned that the medical software is CORRECT. You can throw more hardware at the problem to make it faster. You can't throw more hardware at the problem to correct bugs.
      • by kc8apf (89233) <kc8apf.kc8apf@net> on Friday January 15, 2010 @03:12AM (#30776028) Homepage

        Having spent 4 years being one of the primary developers of Apple's main performance analysis tools (CHUD, not Instruments) and having helped developers from nearly every field imaginable tune their applications for performance, I can honestly say that regardless of your performance criteria, you shouldn't be doing anything special for optimization when you first write a program. Some thought should be given to the architecture and overall data flow of the program and how that design might have some high-level performance limits, but certainly no code should be written using explicit vector operations and all loops should be written for clarity. Scalability by partitioning the work is one of those items that can generally be incorporated into the program's architecture if the program lends itself to it, but most other performance-related changes depend on specific usage cases. Trying to guess those while writing the application logic relies solely on intuition which is usually wrong.

        After you've written and debugged the application, profiling and tracing is the prime way for finding _where_ to do optimization. Your experiences have been tainted by the poor quality of tools known by the larger OSS community, but many good tools are free (as in beer) for many OSes (Shark for OS X as an example) while others cost a bit (VTune for Linux or Windows). Even large, complex multi-threaded programs can be profiled and tuned with decent profilers. I know for a fact that Shark is used to tune large applications such as Photoshop, Final Cut Pro, Mathematica, and basically every application, daemon, and framework included in OS X.

        What do you do if there really isn't much of a hotspot? Quake 3 was an example where the time was spread out over many C++ methods so no one hotspot really showed up. Using features available in the better profiling tools, the collected samples could be attributed up the stack to the actual algorithms instead of things like simple accessors. Once you do that, the problems become much more obvious.

        What do you do after the application has been written and a major performance problem is found that would require an architectural change? Well, you change the architecture. The reason for not doing it during the initial design is that predicting performance issues is near impossible even for those of us who have spent years doing it as a full time job. Sure, you have to throw away some code or revisit the design to fix the performance issues, but that's a normal part of software design. You try an approach, find out why it won't work, and use that knowledge to come up with a new approach.

        That largest failing I see from my experiences have been the lack of understanding by management and engineers that performance is a very iterative part of software design and that it happens late in the game. Frequently, schedules get set without consideration for the amount of time required to do performance analysis, let alone optimization. Then you have all the engineers who either try to optimize everything they encounter and end up wasting lots of time, or they do the initial implementation and never do any profiling.

        Ultimately, if you try to build performance into a design very early, you end up with a big, messy, unmaintainable code base that isn't actually all that fast. If you build the design cleanly and then optimize the sections that actually need it, you have a most maintainable code base that meets the requirements. Be the latter.

    • by epine (68316) on Friday January 15, 2010 @08:15AM (#30777476)

      That's the main reason why I want to shoot people who write "clever" code on the first pass.

      Over the years, I've grown to hate this meme. Not because it isn't right, but because it stops ten floors below the penthouse of human potential.

      First of all, it's an incredible instance of cultural drift. In the mid 1980s, when this meme was halfway current, I worked on adding support for Asian characters to an Asian-made PC. On the "make it right" pass it took 15s to update the screen after pressing the page down key, and this from assembly language. Slower than YouTube over 300 baud. It was doing a lot of pixel swizzling it shouldn't have been, because the fonts were supplied in a format better suited to printing. This was an order of magnitude below an invitation to whiffle-ball training camp. This was Lance Armstrong during his chemotherapy years towing a baby trailer. Today you get 60fps with a 100 thousand or a 100 million polygons, I've sort of lost track.

      Let's not shunt performance onto the side track of irrelevancy. While there's no good excuse, ever, for writing faulty code, an enlightened balance between starting out with an approach you can live with, and exploiting necessary cleverness *within your ability* goes a long way.

      How about we update Knuth's arthritic maxim? Don't tweak what you don't grok. If you grok, use your judgement. Exploit your human potential. Live a little.

      The books I've been reading lately about the evolution of skills in the work place suggest that painstaking reductive work processes are on their way to India. Job security in home world is greatly enhanced if you can navigate multiple agendas in tandem, exploiting more of that judgement thing.

      One of the reasons Carmack became so successful is that he didn't waste his effort looking for excuses to deprive his co-workers of their oxygen bits. Instead he conducted shrewd excursions along the edge of the envelope in pursuit of the sweet spot between cleverness too oppressive to live with, and no performance at all.

      In my day of deprecating my elders, I always knew where the pea was hidden under the mattress. These days, there are so many squishy mattresses stacked one upon the other, I have to plan my work day with a step ladder. Which I think is what this unwatchable cult-encoded video is on about: the ankle level view most of us never see any more.

      Here's another thing. I've you're going to be clever about how you code something, also be clever about how you do it. In other words, be equally clever all levels of the solution process simultaneously: algorithm selection, implementation, commenting, software engineering, documentation, and unit test. Knuth got away with TeX, barely, for precisely this reason. Because of his cleverness, the extension to handle Asian languages was far from elegant. Because of his cleverness (in making everything else run extremely well), people actually wanted to extend TeX to handle Asian languages. So who's to say he was wrong? Despite his cleverness, he managed to keep his booboo score in single or low double digits. His bug tracking database fit nicely on an index card.

      In the modern era, people quote the old "make it right before you make it faster" as the cure for the halitosis of ineptitude: you're feeble and irritating, so practice your social graces. Don't make me come over there and choke off your oxygen bit. It's a long ways from saying "you have a lot of human potential, and not much experience, so let me help you confront the challenges in a meaningful way". These sayings leak a lot of sentiment about social engagement.

      Every so often I have to pull up a chair beside a junior resource and go "Dude, you're jousting at windmills here, let's roll that change back and try again. I know you can do better." Five minutes of war stories about how to shoot yourself in the foot six ways from Sunday is usually enough to rebalance the flywheel of self preservation.

  • It's not just x86 (Score:4, Informative)

    by RzUpAnmsCwrds (262647) on Thursday January 14, 2010 @08:41PM (#30773542)

    Features like out of order execution, caches, and branch prediction/speculation are commonplace on many architectures, including the next generation ARM Cortex A9 and many POWER, SPARC, and other RISC architectures. Even in-order designs like Atom, Coretex A8, or POWER6 have branch prediction and multi-level caches.

    The most important thing for performance is to understand the memory hierarchy. Out-of-order execution lets you get away with a lot of stupid things, since many of the pipeline stalls you would otherwise create can be re-ordered around. In contrast, the memory subsystem can do relatively little for you if your working set is too large and you don't access memory in an efficient pattern.

  • I wish they'd all just use HTML5 or put it on YouTube so I can use youtube-dl or something. Otherwise it either doesn't work at all (my amd64 Linux boxes) or is slow and jerky (my Mac OSX box). It's really frustrating.

  • This just in...Apparently Bruce Lee and Lee Van Cleef are alive and well and working for Intel, which likely accounts for all the "crazy kung-fu and ninjitsu" going on there...

  • rule of the code (Score:3, Informative)

    by Bork (115412) on Thursday January 14, 2010 @09:38PM (#30774060) Homepage

    Just write good clean code that works properly first. The only time you optimize is after it has been profiled to see if there are troublesome spots. The way CPUs run and how compilers are designed, there is very little need to do optimization. Unless you have taken some serious courses of how the current CPU’s work, you efforts will mostly result in bad code that gains you nothing in respect in speed. Your time is better spent on writing CORRECT code.

    The compilers are very intelligent in proper loop unrolling, rearranging branches, and moving instruction code around to keep the CPU pipeline full. They will also look for unnecessary/redundant instruction within a loop and move them to a better spot.

    One of the courses I took was programming for parallelism. For extra credit, the instructor assigned a 27K x 27K matrix multiply; the person with the best time got a few extra points. A lot of the class worked hard in trying to optimize their code to get better times, I got the best time by playing with the compiler flags.

    • Re: (Score:3, Informative)

      by XMunkki (533952)
      I agree that many low-level programming methods aren't that necessary anyhow, but there is one big point where the compiler cannot help much, and that is data layout. Big hits come from all levels of cache misses, and it's good for the programmer to be aware of this and benchmark the memory access patterns and try to make them good (predictable, linear, clumping frequently used data, etc). Also on some hardwares, the Load-Hit-Stores are something to be aware as well. A reasonable thing to do, when optimizin
    • Re: (Score:3, Informative)

      One of the courses I took was programming for parallelism. For extra credit, the instructor assigned a 27K x 27K matrix multiply; the person with the best time got a few extra points. A lot of the class worked hard in trying to optimize their code to get better times, I got the best time by playing with the compiler flags.

      Really? Because I had a similar assignment (make Strassen's algorithm as fast as possible, in the 5-10k range) in my algorithms class a while back. I found that the key to a blazing fast program was careful memory layout: divide the matrix into tiles that fit into L1, transpose the matrix to avoid striding problems. Vectorizing the inner loops got another large factor. Compiling with -msse3 -march=native -O3 helped, but the other two were critical and took a fair amount of effort.

  • by podom (139468) on Friday January 15, 2010 @01:26AM (#30775570) Homepage

    I watched about half of his presentation. I was amused because on a lot of the slides he says something like "except on really low end embedded CPUs." I spend a lot of my time programming (frequently in assembly) for these exact very low end CPUs. I haven't had to do much with 8-bit cores, fortunately, but I've been doing a lot of programming on a 16-bit microcontroller lately (EMC eSL).

    I suspect the way I'm programming these chips is a lot like how you would have programmed a desktop CPU in about 1980, except that I get to run all the tools on a computer with a clock speed 100x the chip I'm programming (and at least 1000x the performance). I am constantly amazed by how little we pay for these devices: ~10 Mips, 32k RAM, 128k Program memory, 1MB data memory and they're $1.

    But they do have a 3-stage pipeline, so I guess some of what Dr. Cliff says still applies.

  • by Mr Z (6791) on Friday January 15, 2010 @11:09AM (#30779052) Homepage Journal

    That was a fabulous presentation, and one that I'll likely hold onto a copy of, since it describes the issue of SMP memory ordering with a great example. I'll have to write "presenter notes" for those slides, since I can't get the video to come up, but that's OK. I understand what's going on there.

    One thing I thought was notably absent was any discussion of data prefetch. With all of the emphasis on how performance is dominated by cache misses, you'd think he'd give at least a nod to both automatic hardware and compiler directed software prefetch. After all, he mentions CMT, which is a more exotic way to hide memory latency, IMHO.

    On a different note: In the example on slides 23 - 30, he shows an example where speculation allowed two cache misses to pipeline, bringing the cost-per-miss down to about half. Dunno if he highlighted the synergy here in the talk, because it wasn't highlighted in the presentation. It is useful to note, though, how overlapping cache misses reduces their cost. There can be even more synergy here than is otherwise obvious: In HPCA-14, there was a fascinating paper [] (slides []) about how incorrect speculation can still speed up programs due to misses on the incorrectly-speculated path still bringing in relevant cache lines.

There are worse things in life than death. Have you ever spent an evening with an insurance salesman? -- Woody Allen