Forgot your password?
typodupeerror
Intel Hardware

Cliff Click's Crash Course In Modern Hardware 249

Posted by timothy
from the first-there-were-the-dinosaurs dept.
Lord Straxus writes "In this presentation (video) from the JVM Languages Summit 2009, Cliff Click talks about why it's almost impossible to tell what an x86 chip is really doing to your code due to all of the crazy kung-fu and ninjitsu it does to your code while it's running. This talk is an excellent drill-down into the internals of the x86 chip, and it's a great way to get an understanding of what really goes on down at the hardware and why certain types of applications run so much faster than other types of applications. Dr. Cliff really knows his stuff!"
This discussion has been archived. No new comments can be posted.

Cliff Click's Crash Course In Modern Hardware

Comments Filter:
  • by creimer (824291) on Thursday January 14, 2010 @06:34PM (#30772784) Homepage
    I wanted to take ASM in college. I was the only student who showed up for the class and the class was canceled. Since most of the programming classes was Java-centric, no one wanted to get their hands dirty under the hood.
  • by Com2Kid (142006) <com2kidSPAMLESS@gmail.com> on Thursday January 14, 2010 @06:44PM (#30772902) Homepage Journal

    Also, the compiler doesn't always take advantage of instructions that it could use.

    Yah sorry about that. :)

    Part of the problem is that compilers have to support a variety of instruction sets, and if the majority of the customers are using an 8 year old revision of an instruction set, even if the newest revision offers Super Awesome Cool features that make code run a lot faster, well you end up with a chicken and egg problem where it makes sense for the compiler team to focus on the old architecture since that is what everyone is using, and no one wants to move to the new architecture since the compiler doesn't take full advantage of it.

  • by Chris Burke (6130) on Thursday January 14, 2010 @06:46PM (#30772928) Homepage

    That's not entirely true. In performance-sensitive tight loops, it can still make sense to code in ASM to avoid pipeline bubbles and stalls in some very limited situations. Also, the compiler doesn't always take advantage of instructions that it could use.

    Yeah and the chip makers release software optimization guides regarding how to avoid such stalls or take advantage of other features, and it's really hard to do that at the C level, and it can be hard for the compiler to know that a certain situation calls for one of these optimizations.

    However, determining that takes a lot of effort and a lot of instrumentation, and so you'd better really need that last bit of performance before you go after it.

    Agreed, it's basically something you're going to do for the most performance critical part, like the kernel of an HPC algorithm for example.

  • by dave562 (969951) on Thursday January 14, 2010 @06:58PM (#30773092) Journal

    I think it depends on what kind of code you're trying to write. If a person desires to write applications then you are right, they might as well write it in a high level language and let the compiler do the work. On the other hand if the person is interested in vulnerability research or security work, then learning ASM might as well be considered a requisite. An understanding of low level programming and code execution provides a programmer with a solid foundation. It gives the potential insights into what might be going wrong when their code isn't compiling or executing the way they want it to. It also gives them the tools to make their code better, as opposed to simply shrugging and saying, "I sure hope they fix this damn compiler..."

  • Using shift to multiply is often a great idea on most CPUs. On the other hand, just about every compiler will do that for you (even with optimization turned off I bet), so there's no reason to explicitly use shift in code (unless you're doing bit manipulation, or multiplying by 2^n where n is more convenient to use than 2^n). However, a much more important thing is to correctly specify signed/unsigned where needed. Signed arithmetic can make certain optimizations harder and in general it's harder to think about. One of my gripes about C is defaulting to signed for integer types, when most integers out there are only ever used to hold positive values.

  • by KC1P (907742) on Thursday January 14, 2010 @07:17PM (#30773272) Homepage

    That's a real shame! But my impression is that for a long time now, college-level assembly instruction has consisted almost entirely of indoctrinating the students to believe that assembly language programming is difficult and unpleasant and must be avoided at all costs. Which couldn't be more wrong -- it's AWESOME!

    Even on the x86 with all its flaws, being able to have that kind of control makes everything more fun. The fact that your code runs like a bat out of hell (unless you're a BAD assembly programmer, which a lot of people are but they don't realize it so they bad-mouth the language) is just icing on the cake. You should definitely teach yourself assembly, if you can find the time.

  • by dr2chase (653338) on Thursday January 14, 2010 @07:21PM (#30773324) Homepage

    Dealing with alignment is not that much of an assembler issue, if you are using C. Address arithmetic gets the job done. If you even want your globals aligned (and not just heap-allocated stuff) you *might* need some ASM, but just for the declarations of stuff that would be "extern struct whatever stuff" in C (and in a pinch, you write a bit of C code to suck in the headers defining "stuff", figure out the sizes, and emit the appropriate declarations in asm).

    Writing memmove/memcpy in assembler is a mixed bag. If you write it in C, you can preserve a some tiny fraction of your sanity dealing with all the different alignment combinations before you get to full-word loads and stores. HOWEVER, on the x86, all bets are off, the only way to tell for sure what is fastest, is to write it, and benchmark it.

  • by Anonymous Coward on Thursday January 14, 2010 @08:45PM (#30774118)

    I think that the premature optimization claims are way overdone. In the cases where performance does not matter, then sure, make the code as readable as possible and just accept the performance.

    However, sometimes it is known from the beginning of a project that performance is critical and that achieving that performance will be a challenge. In such cases, I think that it makes sense to design for performance. That rarely means using shifts to multiply -- it may, however, mean that you design your data structures so that you can pass the data directly into some FFT functions without packing/unpacking the data to some other format that the rest of the functions were written to expect. It may also mean that your design scale to many cores and that inner loops be heavily optimized and vectorized. Of course, all of that code should be performance tested during development against the simpler versions.

    Profiling after the fact sounds like a good idea, but what if the code has no real "hotspot"? What if you find out that you need to redesign the entire software framework to support zero-copy processing of the data? Also, profiling tools in general are really not that good. Running oprofile on a large-scale application with dozens of threads and data source dependencies on other processes can be less than enlightening. gprof is entirely useless for non-trivial applications. cachegrind is sometimes helpful, but most people working on performance optimization seem to simply build their own timers based on the rdtsc instruction and manually time sections of the code.

    I work on software for processing medical device data and performance is often critical. You probably want an image display to update very quickly when it is providing feedback to the doctor guiding a catheter toward your heart, for example. We had one project where the team decided to start over with a clean framework without concern for performance -- they would profile and optimize once everything was working. They followed the advice of many a software engineer: their framework was very nice, replete with design patterns and applications of generic programming, and entirely unscalable beyond a single processor core. There were no performance tests done during development, and of course the timeline was such that there would only be minimal time for optimization once the functionality was complete. The software that it was replacing was ugly, but also scaled nicely to many cores. The software shipped on a system with two quad-core processors, just as it had before.

    Let's just say that customers were unimpressed with the new software framework.

  • by TheRaven64 (641858) on Thursday January 14, 2010 @08:47PM (#30774128) Journal
    Note that even with GCC, the choices aren't just autovectorisation and assembly. GCC provides (portable) vector types, and if you declare your variables as these then it just has to try to use SSE / AltiVec / Whatever instructions for the operations, and it can easily because your variables are aligned. Primitive operations (i.e. the ones you get on scalars in C) are defined on vectors and so you can do 2^n of them in parallel and GCC will emit the relevant instructions depending on your target CPU. Going a step further, there are intrinsic functions that are specific to a particular vector ISA and can be used with these. Then you get to tell GCC exactly which instruction to use, but it still does all of the register allocation for you.
  • by SETIGuy (33768) on Thursday January 14, 2010 @09:34PM (#30774502) Homepage

    Coding in x86 ASM is never fun. Weird and odd and masochistically pleasurable for some, maybe, but not fun. Other architectures, on the other hand (like ARM), can be fun.

    Coding assembly on RISC architectures is dead boring because all the instructions do what you expect them to and can be used on any general purpose register.

    In the good old days, when x86 was 8086 there were no general purpose registers. The BX register could be used for indexing, but AX, CX and DX couldn't. CX could be used for counts (bit shifts, loops, string moves), but AX, BX, and DX couldn't. SI and DI were index registers that you could add to BX when dereferncing or could be used with CX for string moves. AX and DX could be used in a pair for a 32 bit value. If you wanted to multiply, you needed to use AX. If you wanted to divide, you needed to divide DX:AX by a 16 bit value and your result would end up in AX and the remainder in DX. Compared to the Z80 assembly language, we thought this was easy.

    Being able to use %r2 for the same stuff you use %r1 for is just boring.

  • by perpenso (1613749) on Thursday January 14, 2010 @09:52PM (#30774630)

    Using shift to multiply is often a great idea on most CPUs.

    In C/C++ shift is not the same as multiply/divide by 2. Multiplication and division operators have a different precedence level than shift operators. Not only is there the possibility of poor optimization but such a substitution may lead to a computational error. For example mul/div has a higher precedence than add/sub, but shift has a lower precedence:

    printf(" 3 * 2 + 1 = %d\n", 3 * 2 + 1);
    printf(" 3 << 1 + 1 = %d\n", 3 << 1 + 1);
    printf("(3 << 1) + 1 = %d\n", (3 << 1) + 1);

    3 * 2 + 1 = 7
    3 << 1 + 1 = 12
    (3 << 1) + 1 = 7

    --
    Perpenso Calc [perpenso.com] for iPhone and iPod touch, scientific and bill/tip calculator, fractions, complex numbers, RPN

  • by smash (1351) on Thursday January 14, 2010 @10:01PM (#30774698) Homepage Journal

    by not thinking about performance you can make it expensive or impossible to improve things later without a substantial rewrite.

    "Not thinking about performance" is different from writing in high level first.

    Get the algorithm right first, THEN optimise hot spots.

    Starting out with ASM makes it a lot more time consuming/difficult to get many different algorithms written, debugged and tested. The time you spend doing that is time better spent testing/developing a better algorithm. Only once you get the algorithm correct should you break out the assembler for the hotspots WITHIN that algorithm.

    If you're writing such shitty code that its "impossible to optimize later" then I don't think starting out in ASM will help you. You'll just have slightly faster shitty code.

  • by BZ (40346) on Thursday January 14, 2010 @11:06PM (#30775154)

    The smaller instructions are still worth it, not so much because of main RAM size constraints but because of cache size constraints. Staying in L1 is great if you can swing it; falling out of L2 blows your performance out of the water.

    Most recently, just iterating over an array and doing a simple op on each entry became about 2x faster on my machine by going from an array of ints to an array of unsigned chars (all the entries are guaranteed in unsigned char range). Reason was, the array of ints was just about the total size of my L2... and the new array is 1/4 the size, which means there's space for other things too (like the code).

  • by podom (139468) on Friday January 15, 2010 @12:26AM (#30775570) Homepage

    I watched about half of his presentation. I was amused because on a lot of the slides he says something like "except on really low end embedded CPUs." I spend a lot of my time programming (frequently in assembly) for these exact very low end CPUs. I haven't had to do much with 8-bit cores, fortunately, but I've been doing a lot of programming on a 16-bit microcontroller lately (EMC eSL).

    I suspect the way I'm programming these chips is a lot like how you would have programmed a desktop CPU in about 1980, except that I get to run all the tools on a computer with a clock speed 100x the chip I'm programming (and at least 1000x the performance). I am constantly amazed by how little we pay for these devices: ~10 Mips, 32k RAM, 128k Program memory, 1MB data memory and they're $1.

    But they do have a 3-stage pipeline, so I guess some of what Dr. Cliff says still applies.

  • by Anonymous Coward on Friday January 15, 2010 @04:07AM (#30776562)

    " As such, it is worth your while to see what its solution to your problem is, and then see if you can improve, rather than assuming you are smarter and can do everything better on your own. Of course all that is predicated on using a profiler first to find out where the actual problem is." - by Sycraft-fu (314770) on Thursday January 14, @06:50PM (#30772966)

    Exactly, & I tend to use a very "old-school/primitive" method of "hand-rolled profiling" (in using hi-res multimedia timers registered with the OS to do so, in order to find the 'slow spots' in my methods/subroutines/procedures/functions), &, it works (to @ least spot those areas, just as you noted).

    HOWEVER: I only took my program, noted below earlier here today also, down from roughly a 4.5 hour runtime, down to a 4 hour runtime using programmatic optimizations of varying kinds! Not a bad gain, especially by optimizing (but that took a lot of time to determine mind you, & that matters in the workplace of course).

    The program's one for my personal use here though, &, it's for:

    ----

    A.) Sorting data alphabetically in datasets/records in HOSTS files

    B.) Removing duplicated entries

    C.) Pings+ resolved DOMAIN/HOST Names to IP Addresses of fav. sites I go to to add into the HOSTS file as to their correct IP-to-DOMAIN/HOST name resolution (faster doing it from a local HOSTS file than calling out to a potentially compromized & SLOWER external DNS server by far)

    D.) Changing the preceeding blocking address used on domain/host names in a HOSTS file from the larger & slower 127.0.0.1 or 0.0.0.0 blocking addresses to the smaller/faster 0 one (& vice-a-versa IF NEED BE too)

    ----

    Anyhow/anyways, now that the background of what my app does is covered?

    Well - I did so via BOTH compiler level switchwork (Borland Delphi 7.1) & also hand-level ones done in Assembly language!

    (Plus, using sorted lists & a better algorithm for sorting than QuickSort variants on small/medium/large lists (small = insertion sort variation, medium & large = quick sort variation (I changed it on the fly depending on the size of the data being sent into the lists I used (dynamically resizing types of course))).

    I also chose Borland Delphi because it was shown, as far back as 1997 in fact & in a COMPETING TRADE JOURNAL in computing called "Visual Basic Programmer's Journal" Sept./Oct. 1997 issue entitled "INSIDE THE VB5 COMPILER" where Delphi SOUNDLY "KNOCKED THE CHOCOLATE" out of BOTH MSVC++ &/or VB5 by DOUBLE (or, better) in BOTH Math & String processing related tasks work...

    That all "said & aside", well...?

    I only got SO MUCH out of using programmatic optimizations by hand, a 1/2 hour decrease in work time, going from 4.5 hr. runtimes down to 4 hrs. time for all tasks A-D above completing.

    I further used x86 Assembly code, inlined via the asm directive mostly here, or using shifts vs. multiplies etc. et al & more, such as FOR loops vs. WHILE loops etc. too & better algorithms after profiling showed me where I was "slowing up" the most, via hi-res multimedia timers timing all my procedures)

    I also used lastly used compiler switches work (& also, removing ones I did not need for safety once the code proved safe & accurate enough, via using Try-Catch-Except/Finally errtrap methods in Delphi, doing my OWN exception handling & err trapping vs. the built-in "structured exception handlers" in the compiler itself only))...

    Sure, 1/2 hr. less of 4.5 hours, down to 4 hours only, is a decent increase... but? Compared to what you get from BETTER HARDWARE??

    It pales by comparison!

    In fact, I noted this very example in another thread here today ->

    Forrester Says Tech Downturn Is "Unofficially Over":

    http://news.slashdot.org/comments.pl?sid=1508482&cid=30776266 [slashdot.org]

    Where others there ar

  • Re:rule of the code (Score:1, Interesting)

    by Anonymous Coward on Friday January 15, 2010 @05:54AM (#30777016)

    Thanks for rolling out the received wisdom. Like most received wisdom, it is of course wrong, but applies in enough cases and for enough people to be useful.

    What you must bear is mind though is that something like Amdahl's law applies to hotspot optimization. Even if you make the hotspot take zero time, the speed of your code is still limited by the performance of the non-hotspots. This leads to a phenomenon I name "uniformly slow code".

    If your mission is actually to make fast code you need to start from scratch, and create a perfect algorithm, given the hardware constraints. But this is too hard, and too much work in most cases to be worthwhile, hence the hotspot advice.

    As for "your time is better off writing correct code" ... well, if you can't write correct code, go become an actor or something. This is a pre-requisite, not a target.

    You have a lot of confidence in the ability of compilers to work magic. The average compiler is actually pretty bad at all this stuff, and you can forget it on any slightly novel architecture (e.g. PPC vs x86). Knowing how to do it yourself is still useful - or essential, depending on what you're trying to do. Every compiler I've used recently has bugs in it regarding floating-point re-ordering anyway (and I mean *every* compiler - gcc 4 is even worse than MSVC at this) so don't lean on this too heavily. Remember what you said about CORRECT code? It's a lot easier to tweak a flag than it is to robustly test that the algorithm is still working properly.

  • by Mr Z (6791) on Friday January 15, 2010 @10:09AM (#30779052) Homepage Journal

    That was a fabulous presentation, and one that I'll likely hold onto a copy of, since it describes the issue of SMP memory ordering with a great example. I'll have to write "presenter notes" for those slides, since I can't get the video to come up, but that's OK. I understand what's going on there.

    One thing I thought was notably absent was any discussion of data prefetch. With all of the emphasis on how performance is dominated by cache misses, you'd think he'd give at least a nod to both automatic hardware and compiler directed software prefetch. After all, he mentions CMT, which is a more exotic way to hide memory latency, IMHO.

    On a different note: In the example on slides 23 - 30, he shows an example where speculation allowed two cache misses to pipeline, bringing the cost-per-miss down to about half. Dunno if he highlighted the synergy here in the talk, because it wasn't highlighted in the presentation. It is useful to note, though, how overlapping cache misses reduces their cost. There can be even more synergy here than is otherwise obvious: In HPCA-14, there was a fascinating paper [utah.edu] (slides [utah.edu]) about how incorrect speculation can still speed up programs due to misses on the incorrectly-speculated path still bringing in relevant cache lines.

God doesn't play dice. -- Albert Einstein

Working...