Forgot your password?
typodupeerror
Intel Hardware

Cliff Click's Crash Course In Modern Hardware 249

Posted by timothy
from the first-there-were-the-dinosaurs dept.
Lord Straxus writes "In this presentation (video) from the JVM Languages Summit 2009, Cliff Click talks about why it's almost impossible to tell what an x86 chip is really doing to your code due to all of the crazy kung-fu and ninjitsu it does to your code while it's running. This talk is an excellent drill-down into the internals of the x86 chip, and it's a great way to get an understanding of what really goes on down at the hardware and why certain types of applications run so much faster than other types of applications. Dr. Cliff really knows his stuff!"
This discussion has been archived. No new comments can be posted.

Cliff Click's Crash Course In Modern Hardware

Comments Filter:
  • Fast forward... (Score:5, Informative)

    by LostCluster (625375) * on Thursday January 14, 2010 @07:20PM (#30772618)

    I can't say I've WTFV like I usually RTFA before you get to see it... but I can tell you this: The first four minutes of the video are spent asking which topic the room wants to see. No need to watch that part. Then it gets more interesting.

  • by Sycraft-fu (314770) on Thursday January 14, 2010 @07:50PM (#30772966)

    Also either start with the assembly the compiler generates, or at the very least make sure to bench your own against what it makes. The Intel Compiler in particular is extremely good at what it does. As such, it is worth your while to see what its solution to your problem is, and then see if you can improve, rather than assuming you are smarter and can do everything better on your own.

    Of course all that is predicated on using a profiler first to find out where the actual problem is. Abrash accurately pointed out years ago that programmers suck at that. They'll spend hours making a nice optimized function that ends up making no noticeable difference in execution time.

  • by marcansoft (727665) <hector.marcansoft@com> on Thursday January 14, 2010 @07:56PM (#30773066) Homepage

    Coding in x86 ASM is never fun. Weird and odd and masochistically pleasurable for some, maybe, but not fun. Other architectures, on the other hand (like ARM), can be fun. x86-64 manages to increase the "funness" value somewhat, but I still wouldn't quite qualify it as "fun".

    On the other hand, it's very true that knowing some ASM can help you write code that the compiler will translate into better assembly code, without going through all of the trouble yourself.

  • by Rockoon (1252108) on Thursday January 14, 2010 @08:11PM (#30773220)

    Using shift to multiply is often a great idea on most CPUs.

    Which CPU's are those? The fastest way to multiply today on AMD/Intel is to use the multiply instructions.

    Didn't know that? yeah... it seems like only assembly language programs know this.

  • by RzUpAnmsCwrds (262647) on Thursday January 14, 2010 @08:25PM (#30773352)

    It also depends on the compiler. GCC, for example, sucks at auto-vectorization, so it's easy to get 30% or more on loopy scientific code just by using SSE instructions properly.

    In contrast, PGI or ICC is much harder to beat using assembly.

  • It's not just x86 (Score:4, Informative)

    by RzUpAnmsCwrds (262647) on Thursday January 14, 2010 @08:41PM (#30773542)

    Features like out of order execution, caches, and branch prediction/speculation are commonplace on many architectures, including the next generation ARM Cortex A9 and many POWER, SPARC, and other RISC architectures. Even in-order designs like Atom, Coretex A8, or POWER6 have branch prediction and multi-level caches.

    The most important thing for performance is to understand the memory hierarchy. Out-of-order execution lets you get away with a lot of stupid things, since many of the pipeline stalls you would otherwise create can be re-ordered around. In contrast, the memory subsystem can do relatively little for you if your working set is too large and you don't access memory in an efficient pattern.

  • by Anonymous Coward on Thursday January 14, 2010 @08:50PM (#30773656)

    Or you could get NASM, which is open source :)

  • by marcansoft (727665) <hector.marcansoft@com> on Thursday January 14, 2010 @09:00PM (#30773736) Homepage

    Which CPU's are those?

    Those with a barrel shifter.

    The fastest way to multiply today on AMD/Intel is to use the multiply instructions.

    Then someone needs to beat the GCC developers with a cluestick.
    $ cat test.c
    int main(int argc, char **argv) {
                    return 4*(unsigned int)argc;
    }
    $ gcc -march=core2 test.c -o test
    $ objdump -d test ...
    00000000004004ec <main>:
        4004ec: 55 push %rbp
        4004ed: 48 89 e5 mov %rsp,%rbp
        4004f0: 89 7d fc mov %edi,-0x4(%rbp)
        4004f3: 48 89 75 f0 mov %rsi,-0x10(%rbp)
        4004f7: 8b 45 fc mov -0x4(%rbp),%eax
        4004fa: c1 e0 02 shl $0x2,%eax
        4004fd: c9 leaveq
        4004fe: c3 retq
        4004ff: 90 nop

    yeah... it seems like only assembly language programs know this.

    I program in assembly language, but not for x86. I usually program in ARM, which always has a barrel shifter. I guarantee shifts are faster than multiplies there.

  • Re:Fast forward... (Score:5, Informative)

    by Brian Gordon (987471) on Thursday January 14, 2010 @09:06PM (#30773782)

    A little javascript-fu reveals that the video player points to a file (at http://flv.thruhere.net/presentations/09-sep-JVMperformance.flv [thruhere.net]) on some poor guy's machine through a dynamic DNS service! I hope somebody grabbed a copy before he (or slashdot) took his server down.

  • by Sycraft-fu (314770) on Thursday January 14, 2010 @09:06PM (#30773788)

    People learn a trick way back when, or hear about the trick years later, and assume it is still valid. Not the case. Architectures change a lot and what used to be the best way might not be anymore.

    Michael Abrash, one of the all time greats of optimization, talks about this in relation to some of the old tricks he used to use. One was to use XOR to clear a register on x86. XORing a register with itself gives 0, of course, and turned out to be faster than writing an immediate value of zero in to the register. Reason is that loading a value was slower than the XOR op, and the old CPUs had no special clear logic, zero was just another number.

    Ok well that's changed now. Our more complex modern CPUs have special logic for clears, and doing a move to the register with 0 is faster. So it was a time limited trick, useful back when he started doing it, but no longer something worth trying.

    However, you'll still hear people say it is a great trick because they haven't updated their knowledge.

  • by tomtefar (935007) on Thursday January 14, 2010 @09:26PM (#30773974)
    I have the following sticker on top of my display: "Make it work before you make it fast!" Saved me many hours of work.
  • by Cassini2 (956052) on Thursday January 14, 2010 @09:37PM (#30774052)

    I'm definitely no expert on x86, but my impression was that precisely because of this trick that everyone does, modern CPUs still do xor reg,reg at least as fast as moving 0.

    You are correct. XOR reg,reg was such a common instruction on the x86, that essentially it became the special case CLR instruction. Essentially, if you see a CLR instruction on an x86 assembly printout, it is the XOR instruction in disguise. The x86 has no CLR instruction.

    Ok well that's changed now. Our more complex modern CPUs have special logic for clears, and doing a move to the register with 0 is faster. So it was a time limited trick, useful back when he started doing it, but no longer something worth trying.

    Essentially, all current "simple" CPU instructions execute with the same speed. However, the XOR instruction is still faster than the MOV instruction because of instruction bandwidth and cache effects. Most code today is limited by cache and bandwidth limits, like the need to load instructions into the instruction decode pipeline immediately after a jump instruction. The MOV reg, 0 immediate move instruction is a two-byte instruction, and the XOR reg, reg instruction is a one-byte instruction. As such, in real code, the XOR instruction is usually slightly faster, because it results in smaller code.

    Additionally, all of the modern x86 CPU implementations special case the XOR reg,reg instruction into a MOV reg, 0 immediate move instruction inside the instruction decode stage anyway. As such, no significant functional difference exists. The only case where a move instruction is quicker is when the condition codes are propagating a side-effect via the condition code registers. Thus, in theory:
    ADD AL, AH
    MOV CL, 0
    JC somewhere

    should execute quicker with a MOV instruction as opposed to a XOR instruction. However, in practice, this piece of code:
    XOR CL, 0
    ADD AL, AH
    JC somewhere

    executes with exactly the same speed, because the out-of-order execution units inside the x86 automatically optimize the code and make it equivalent. As such, you are best with the "short small" code, which means that the XOR reg, reg instruction is still the fastest way to do a register clear.

  • rule of the code (Score:3, Informative)

    by Bork (115412) on Thursday January 14, 2010 @09:38PM (#30774060) Homepage

    Just write good clean code that works properly first. The only time you optimize is after it has been profiled to see if there are troublesome spots. The way CPUs run and how compilers are designed, there is very little need to do optimization. Unless you have taken some serious courses of how the current CPU’s work, you efforts will mostly result in bad code that gains you nothing in respect in speed. Your time is better spent on writing CORRECT code.

    The compilers are very intelligent in proper loop unrolling, rearranging branches, and moving instruction code around to keep the CPU pipeline full. They will also look for unnecessary/redundant instruction within a loop and move them to a better spot.

    One of the courses I took was programming for parallelism. For extra credit, the instructor assigned a 27K x 27K matrix multiply; the person with the best time got a few extra points. A lot of the class worked hard in trying to optimize their code to get better times, I got the best time by playing with the compiler flags.

  • by TheRaven64 (641858) on Thursday January 14, 2010 @09:44PM (#30774106) Journal

    One of the biggest drawbacks of a language like C (and even more C++, and even more Java), is that they don't give you a whole lot of control of how stuff is arranged in memory

    I'd say this is more of a C/C++ problem than a Java problem. Or, rather, they are different problems. The problem with C and C++ is that they do give the programmer a whole lot of control about how things are arranged in memory. They don't, on the other hand, give the compiler a lot of freedom to rearrange things.

    Java, on the other hand, uses the Smalltalk memory model and so the compiler (and/or JVM) is free to rearrange things in memory as much as it wants to (whether it does, of course, is a matter for the compiler writer). For example, a Java compiler that notices that you are doing the same operation on three instance variables is free to put them next to each other aligned on a 128-bit boundary with some padding at the end so that you can easily use vector instructions on them, even if they were originally declared in different classes. A C compiler can not do this with structure fields.

    If you really care about alignment in C, you are free to use valloc() to align on a page boundary and then subdivide the memory yourself. Most of the time, however, it's not worth the effort.

  • by TheRaven64 (641858) on Thursday January 14, 2010 @09:49PM (#30774146) Journal
    The calling convention is complicated, but it's nowhere near as different as IA32 calling conventions between platforms. Linux and FreeBSD, for example, use different rules for when to return a structure on the stack and when to return it in registers on IA32, but they use exactly the same conventions (the SysV ABI) on x86-64.
  • by TheRaven64 (641858) on Thursday January 14, 2010 @09:59PM (#30774238) Journal
    I actually did a benchmark of this a few months ago. For a single shift, there wasn't much in it (on a Core 2); both were decoded into the same micro-ops. For more than one shift and add, the multiply was faster because the micro-op fusion engine wasn't clever enough to reassemble the multiply (and even if it were, you're still burning i-cache for no reason). GCC used to emit shift-and-add sequences for all constant multiplies until someone benchmarked it on an Athlon (which had two multiply units and one shift unit) and found that it was much faster to just emit a multiply.
  • by SpinyNorman (33776) on Thursday January 14, 2010 @10:46PM (#30774606)

    Actually the reason us old fogies normally used XOR A, A rather than LD A, 0 wasn't because it was faster but rather because it was smaller - 1 byte rather than two bytes (instruction + immediate operand). On the old memory constrained 8-bitters, these assembly "tricks" were all about saving a byte here, another byte there...

  • by Eudial (590661) on Thursday January 14, 2010 @11:07PM (#30774750)

    I hear they have nice 0xC0FFEE

  • Re:Fast forward... (Score:4, Informative)

    by pyrrhonist (701154) on Thursday January 14, 2010 @11:42PM (#30775012)

    some poor guy's machine through a dynamic DNS service!

    Some poor guy? It's on an Amazon EC2 server!

    $ host flv.thruhere.net
    flv.thruhere.net has address 67.202.36.223
    $ host 67.202.36.223
    223.36.202.67.in-addr.arpa domain name pointer ec2-67-202-36-223.compute-1.amazonaws.com.

  • Re:Fast forward... (Score:5, Informative)

    by Brian Gordon (987471) on Thursday January 14, 2010 @11:47PM (#30775044)

    You've done it! Interested slashdotters can download the video file at this link:

    http://67.202.36.223/presentations/09-sep-JVMperformance.flv [67.202.36.223].

    Good detective work, partner!

  • by Anonymous Coward on Friday January 15, 2010 @12:23AM (#30775242)

    One of the biggest *advantages* of C/C++ as a systems language is that it gives you lots of control on how you arrange memory. You can write your allocators. You get a guaranteed layout of your structs and using the extensions the C/C++ compilers implement you *can* force alignments. You go a bit outside of the standard by using the "extensions", but you can encapsulate its use by using the preprocessor and porting is less a hassle than if you write your stuff in straight assembly.

    What the compiler cant do rearrange the data automagically so that it runs faster, so it is the programmer who has to think about that... just as in assembly (but in a more comfortable way)

  • Re:Fast forward... (Score:4, Informative)

    by iammani (1392285) on Friday January 15, 2010 @12:30AM (#30775284)
    Mirror available at http://www.mediafire.com/?j21t2ynnnzn [mediafire.com]

    And please stop hitting the server liked in GP's post. The poor server hardly sustains 30 KB/s.
  • Re:Fast forward... (Score:4, Informative)

    by iammani (1392285) on Friday January 15, 2010 @12:39AM (#30775328)
    Or type http://www.infoq.com/resource/presentations/click-crash-course-modern-hardware/en/slides/1.swf [infoq.com]

    And change the the number at the end to change slides

    PS: I am no good in javascript, but the above, in FF 3.5/Linux, just displayed a page saying "true"
  • Re:rule of the code (Score:3, Informative)

    by XMunkki (533952) on Friday January 15, 2010 @03:52AM (#30776204) Homepage
    I agree that many low-level programming methods aren't that necessary anyhow, but there is one big point where the compiler cannot help much, and that is data layout. Big hits come from all levels of cache misses, and it's good for the programmer to be aware of this and benchmark the memory access patterns and try to make them good (predictable, linear, clumping frequently used data, etc). Also on some hardwares, the Load-Hit-Stores are something to be aware as well. A reasonable thing to do, when optimizing something, is to fiddle with the code a bit and see what generates the best assembly. This usually is a good compromise (you still stay at a higher level and got portable code with some gained performance on at least one platform). Now still compilers aren't a magic wand everywhere, especially when going to deeply embededded or specialized hardware. One example is SPU programming. Since SPUs read&write everything from/to 16-aligned addresses, current GCC compiler lots of "align ptr, load, rotate, calculate, rotate, combine, store" sequences. If you want good SPU performance, going into ASM is indeed viable something. Though most of the times, staying at intrinsic functions gives you an adequate compromise. But since SPUs are basically fast DSPs, many of the tasks that are ran by them are in nature quite repetitive with short amount of work per item and millions of items (like doing vertex transforms, simulating some post processing effect, mixing audio etc). But a good programmer always benchmarks first, checks the compiler output etc before hitting the deck with raw assembly.
  • Re:rule of the code (Score:2, Informative)

    by SorcererX (818515) on Friday January 15, 2010 @05:23AM (#30776624) Homepage
    You got the fastest time simply by playing with the compiler flags? We had a similar problem where we had to do a matrix multiplication on symmetric matrices for C = AB^T+BA^T (rank2k update with alpha=1.0, beta=0.0) and there was nothing the compiler could do for us to get even remotely near good scores. Doing the simplest implementation we got about 5 FLOPS/cycle on an 8 core system, optimizing just with SSE etc, I got it up to about 13 FLOPS/cycle, and by splitting up the matrix in tiny parts to avoid cache trashing etc I was able to get it up to 47 FLOPS/cycle. For comparison Intel's MKL library managed about 85 FLOPS/cycle on the same hardware. I believe the best in my class was about 50 FLOPS/cycle, and it took an insane amount of fiddling for any of us to get above 25-30 FLOPS/cycle or so. That said, most things done on a computer is rarely that limited by memory access, and then the compiler does an awesome job :)
  • Re:rule of the code (Score:3, Informative)

    by wirelessbuzzers (552513) on Friday January 15, 2010 @05:41AM (#30776696)

    One of the courses I took was programming for parallelism. For extra credit, the instructor assigned a 27K x 27K matrix multiply; the person with the best time got a few extra points. A lot of the class worked hard in trying to optimize their code to get better times, I got the best time by playing with the compiler flags.

    Really? Because I had a similar assignment (make Strassen's algorithm as fast as possible, in the 5-10k range) in my algorithms class a while back. I found that the key to a blazing fast program was careful memory layout: divide the matrix into tiles that fit into L1, transpose the matrix to avoid striding problems. Vectorizing the inner loops got another large factor. Compiling with -msse3 -march=native -O3 helped, but the other two were critical and took a fair amount of effort.

  • by cheekyboy (598084) on Friday January 15, 2010 @05:50AM (#30776760) Homepage Journal

    intel compilers have options to optimize to more than one target, and its runtime engine uses code that was made for X cpu. Sure your binary is larger, but everyone is happy.

  • by TheRaven64 (641858) on Friday January 15, 2010 @08:52AM (#30777694) Journal
    The GCC manual tells you everything you need to know. First you declare a vector type, so if you want four shorts representing an RGBA colour value , you declare a type like this:

    typedef short colour_t __attribute__ ((vector_size (4 * sizeof(short))));

    This will give you a 64-bit vector type, so you can fit one in an MMX register, or two in an SSE or AltiVec register. You can then create these and do simple operations on them. For example, if you wanted to add two together, you could do this:

    colour_t a = {1,2,3,4};
    colour_t b = {1,2,3,4};
    colour_t c = a + b;

    In this case, the add is constant so it will be evaluated at compile time, but in the case where a and b have unknown values GCC will emit either four scalar add operations or one 64-bit vector add.

    You can also pass them as arguments to vector intrinsics, which are listed in the manual under target-specific builtins. These correspond directly to a single underlying vector instruction, so if you look in the assembly language reference for the target CPU then you will find a detailed explanation of what each one does.

    Rather than declare vector types directly, it's often a good idea to declare unions of vector and array types. This lets you use the same value as both an array and a vector.

    I wrote a longer explanation a while ago [informit.com].

  • by Rockoon (1252108) on Friday January 15, 2010 @09:02AM (#30777752)
    GCC is a big offender, thats true.

    This is one of the reasons that GCC sucks compared to ICC and VC++.

    Let me give you the facts as they are today. In isolation, both the shift instructions and the multiply instructions have the same latency and throughput, and are also performed on the same execution units.

    If this was the entire story, then they would be equal. Buts its not the entire story.

    The shift instructions only modify some of the flags in the flags register. Essentially, the shift instructions must do a read/modify/write on the flags. The multiplication instructions, however, alter the entire flags register, so only perform a write.

    "But Rockoon.. they are the same latency anyways, right?" .. yes, in isolation. But that read/modify/write cycle on the flags register prevents a hell of a lot of out-of-order execution.

    Essentially, one of the inputs to the shift instruction is the flags register so all prior operations that modify the flags register must be completed first, and no instruction following the shift that also partially modify the flags register can be completed until that shift is completed.

    In some code, it wont make any discernible difference, but in other code it will make a big difference.

    As far as that GCC compiler output.. thats code is horrible, and not just because its AT&T syntax.

    There are two alternatives here for multiplying by 4 that should be in competition here, and neither uses a shift.

    One is a straight multiplication (MASM syntax, CDECL):

    main:
    mov edx, [esp + 4] ; 32-bit version, so +4 skips the return address
    imul eax, edx, 4
    ret

    The other is leveraging the LEA instruction (MASM syntax, CDECL):

    main:
    mov eax, [esp + 4] ; 32-bit version, so +4 skips the return address
    lea eax, [eax * 4]
    ret

    The alternative LEA version on some processors (P4..), in isolation, is slower .. but it has the advantage that it uses different execution units on those very same processors, so might pair better with other stuff in the pipeline, and it doesnt touch the flags register at all.

    GCC is great at folding constants and such, even calculates constant loops at compile time.. but its big-time-fail at code generation. GCC is one of the processors that one optimization expert struggled with because he was trying to turn a series of shifts and adds into a single far more efficient multiplication.. the compiler converted it back into a series of shifts and adds on him. Fucking fail.
  • by chelberg (1712998) on Friday January 15, 2010 @12:02PM (#30779620)

    Two books to look at Cormen et. al Intro to Algorithms, and Bentley's Programming Pearls. The second is more practical, the former is used in many CS Algorithms courses.

Often statistics are used as a drunken man uses lampposts -- for support rather than illumination.

Working...