Cliff Click's Crash Course In Modern Hardware
Lord Straxus writes "In this presentation (video) from the JVM Languages Summit 2009, Cliff Click talks about why it's almost impossible to tell what an x86 chip is really doing to your code, thanks to all of the crazy kung-fu and ninjitsu it performs while the code is running. This talk is an excellent drill-down into the internals of the x86 chip, and a great way to get an understanding of what really goes on down at the hardware level and why certain types of applications run so much faster than others. Dr. Cliff really knows his stuff!"
Fast forward... (Score:5, Informative)
I can't say I've WTFV like I usually RTFA before you get to see it... but I can tell you this: The first four minutes of the video are spent asking which topic the room wants to see. No need to watch that part. Then it gets more interesting.
Re:Fast forward... (Score:5, Funny)
The first four minutes of the video are spent asking which topic the room wants to see. No need to watch that part. Then it gets more interesting.
That's just the branch predictor pre-loading the cache for each possible conditional result.
Re:Fast forward... (Score:5, Informative)
A little javascript-fu reveals that the video player points to a file (at http://flv.thruhere.net/presentations/09-sep-JVMperformance.flv [thruhere.net]) on some poor guy's machine through a dynamic DNS service! I hope somebody grabbed a copy before he (or slashdot) took his server down.
Re:Fast forward... (Score:4, Informative)
some poor guy's machine through a dynamic DNS service!
Some poor guy? It's on an Amazon EC2 server!
Re:Fast forward... (Score:5, Informative)
You've done it! Interested slashdotters can download the video file at this link:
http://67.202.36.223/presentations/09-sep-JVMperformance.flv [67.202.36.223].
Good detective work, partner!
Re:Fast forward... (Score:4, Informative)
And change the number at the end to change slides.
PS: I am no good at JavaScript, but the above, in FF 3.5/Linux, just displayed a page saying "true"
Re:Fast forward... (Score:4, Informative)
And please stop hitting the server linked in GP's post. The poor server can hardly sustain 30 KB/s.
Re: (Score:2)
You have to admit it's pretty nice to have the presentation slides automatically display and advance below the video as you watch.
Premature optimization is evil... and stupid (Score:2, Insightful)
That's the main reason why I want to shoot people who write "clever" code on the first pass. Always make the rough draft of a program clean and readable. If (and only if!) you need to optimize it, use a profiler to see what actually needs work. If you do things like manually unroll loops where the body is only executed 23 times during the program's whole lifetime, or use shift to multiply because you read somewhere that it's fast, then don't be surprised when your coworkers revoke your oxygen bit.
Re:Premature optimization is evil... and stupid (Score:5, Funny)
Re: (Score:2)
If (and only if!)
Compiler Error: Numerous Syntax Errors.
Line 1, 4; Object Expected
Line 1, 15; '(' Expected
Line 1, 16; Condition Expected
Line 1, 17; 'Then' Expected
Re: (Score:2)
That was Lisp. You should parse it as If(only && !if).
Re:Premature optimization is evil... and stupid (Score:4, Interesting)
Using shift to multiply is often a great idea on most CPUs. On the other hand, just about every compiler will do that for you (even with optimization turned off I bet), so there's no reason to explicitly use shift in code (unless you're doing bit manipulation, or multiplying by 2^n where n is more convenient to use than 2^n). However, a much more important thing is to correctly specify signed/unsigned where needed. Signed arithmetic can make certain optimizations harder and in general it's harder to think about. One of my gripes about C is defaulting to signed for integer types, when most integers out there are only ever used to hold positive values.
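A quick sketch of that last point (my own example, not from the post above): dividing by a power of two is where signedness really bites, because unsigned division is a plain shift while signed division has to truncate toward zero.
#include <stdio.h>
/* With optimization on, most compilers turn udiv8 into a single right
   shift; sdiv8 needs extra fix-up code so negative values round toward
   zero instead of toward minus infinity. */
unsigned int udiv8(unsigned int x) { return x / 8; }
int sdiv8(int x) { return x / 8; }
int main(void)
{
    printf("%u %d\n", udiv8(40u), sdiv8(-40)); /* prints: 5 -5 */
    return 0;
}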
Re:Premature optimization is evil... and stupid (Score:4, Informative)
Using shift to multiply is often a great idea on most CPUs.
Which CPUs are those? The fastest way to multiply today on AMD/Intel is to use the multiply instructions.
Didn't know that? Yeah... it seems like only assembly language programmers know this.
Re:Premature optimization is evil... and stupid (Score:5, Informative)
Those with a barrel shifter.
Then someone needs to beat the GCC developers with a cluestick. ...
$ cat test.c
int main(int argc, char **argv) {
return 4*(unsigned int)argc;
}
$ gcc -march=core2 test.c -o test
$ objdump -d test
00000000004004ec <main>:
4004ec: 55 push %rbp
4004ed: 48 89 e5 mov %rsp,%rbp
4004f0: 89 7d fc mov %edi,-0x4(%rbp)
4004f3: 48 89 75 f0 mov %rsi,-0x10(%rbp)
4004f7: 8b 45 fc mov -0x4(%rbp),%eax
4004fa: c1 e0 02 shl $0x2,%eax
4004fd: c9 leaveq
4004fe: c3 retq
4004ff: 90 nop
I program in assembly language, but not for x86. I usually program in ARM, which always has a barrel shifter. I guarantee shifts are faster than multiplies there.
Re: (Score:3, Insightful)
It depends on where they spend their hardware, and what you're multiplying by. You can make a multiplier faster than shifting; it just requires a lot of hardware to do so. If you're multiplying by a constant power of 2, shifting will always be as fast or faster. If you're multiplying by a non-power-of-2 constant, shifting and adding may be faster, and probably is if there are fairly few 1s in the binary representation. But if they have a good multiplier then mult may be faster than shift/add for a random
Re: (Score:2)
I was talking of multiplying by a power of two constant, of course. You're quite correct in saying that shift+add combinations may or may not be faster than multiplying by more complex constants, depending on the particular implementation. Usually, two shifts and one add is a fairly safe bet for simpler CPUs, but it can actually slow things down on modern superscalar CPUs where it creates undesirable dependencies in the pipeline.
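For a concrete (made-up) instance of the "two shifts and one add" case, multiplying by 10 decomposes into 8x + 2x:
#include <assert.h>
/* x * 10 rewritten as (x << 3) + (x << 1); whether this actually beats a
   plain multiply depends entirely on the CPU, as discussed above. */
static unsigned int mul10(unsigned int x)
{
    return (x << 3) + (x << 1);
}
int main(void)
{
    assert(mul10(7) == 70);
    return 0;
}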
Re: (Score:2)
Not taking the piss... I have no idea - I just noticed you didn't use any architecture-specific flags, so it's no doubt defaulted to dumb but compatible code?
Re:Premature optimization is evil... and stupid (Score:5, Informative)
This is one of the reasons that GCC sucks compared to ICC and VC++.
Let me give you the facts as they are today. In isolation, both the shift instructions and the multiply instructions have the same latency and throughput, and are also performed on the same execution units.
If this was the entire story, then they would be equal. But it's not the entire story.
The shift instructions only modify some of the flags in the flags register. Essentially, the shift instructions must do a read/modify/write on the flags. The multiplication instructions, however, alter the entire flags register, so they only perform a write.
"But Rockoon.. they are the same latency anyways, right?"
Essentially, one of the inputs to the shift instruction is the flags register, so all prior operations that modify the flags register must be completed first, and no instruction following the shift that also partially modifies the flags register can complete until that shift is completed.
In some code, it won't make any discernible difference, but in other code it will make a big difference.
As far as that GCC compiler output goes... that code is horrible, and not just because it's in AT&T syntax.
There are two alternatives for multiplying by 4 that should be in competition here, and neither uses a shift.
One is a straight multiplication (MASM syntax, CDECL):
main:
mov edx, [esp + 4] ; 32-bit version, so +4 skips the return address
imul eax, edx, 4
ret
The other is leveraging the LEA instruction (MASM syntax, CDECL):
main:
mov eax, [esp + 4] ; 32-bit version, so +4 skips the return address
lea eax, [eax * 4]
ret
The alternative LEA version on some processors (P4..), in isolation, is slower
GCC is great at folding constants and such, and even calculates constant loops at compile time... but it's big-time fail at code generation. GCC is one of the compilers that one optimization expert struggled with because he was trying to turn a series of shifts and adds into a single, far more efficient multiplication... the compiler converted it back into a series of shifts and adds on him. Fucking fail.
It's just outdated knowledge (Score:3, Informative)
People learn a trick way back when, or hear about the trick years later, and assume it is still valid. Not the case. Architectures change a lot and what used to be the best way might not be anymore.
Michael Abrash, one of the all-time greats of optimization, talks about this in relation to some of the old tricks he used to use. One was using XOR to clear a register on x86. XORing a register with itself gives 0, of course, and turned out to be faster than writing an immediate value of zero into the register.
Re: (Score:2)
I'm definitely no expert on x86, but my impression was that precisely because of this trick that everyone does, modern CPUs still do xor reg,reg at least as fast as moving 0. Because they want existing code to run as fast as possible, and in x86 compatibility-is-kin
Re:It's just outdated knowledge (Score:4, Informative)
You are correct. XOR reg,reg was such a common instruction on the x86 that it essentially became the special-case CLR instruction. Essentially, if you see a CLR instruction on an x86 assembly printout, it is the XOR instruction in disguise. The x86 has no CLR instruction.
Essentially, all current "simple" CPU instructions execute with the same speed. However, the XOR instruction is still faster than the MOV instruction because of instruction bandwidth and cache effects. Most code today is limited by cache and bandwidth limits, like the need to load instructions into the instruction decode pipeline immediately after a jump instruction. The MOV reg, 0 immediate move instruction is a two-byte instruction, and the XOR reg, reg instruction is a one-byte instruction. As such, in real code, the XOR instruction is usually slightly faster, because it results in smaller code.
Additionally, all of the modern x86 CPU implementations special case the XOR reg,reg instruction into a MOV reg, 0 immediate move instruction inside the instruction decode stage anyway. As such, no significant functional difference exists. The only case where a move instruction is quicker is when the condition codes are propagating a side-effect via the condition code registers. Thus, in theory:
ADD AL, AH
MOV CL, 0
JC somewhere
should execute quicker with a MOV instruction as opposed to a XOR instruction. However, in practice, this piece of code:
XOR CL, CL
ADD AL, AH
JC somewhere
executes with exactly the same speed, because the out-of-order execution units inside the x86 automatically optimize the code and make it equivalent. As such, you are best with the "short small" code, which means that the XOR reg, reg instruction is still the fastest way to do a register clear.
Re: (Score:3, Informative)
Actually, the reason us old fogies normally used XOR A, A rather than LD A, 0 wasn't because it was faster but rather because it was smaller - one byte rather than two (instruction + immediate operand). On the old memory-constrained 8-bitters, these assembly "tricks" were all about saving a byte here, another byte there...
Re:It's just outdated knowledge (Score:4, Interesting)
The smaller instructions are still worth it, not so much because of main RAM size constraints but because of cache size constraints. Staying in L1 is great if you can swing it; falling out of L2 blows your performance out of the water.
Most recently, just iterating over an array and doing a simple op on each entry became about 2x faster on my machine by going from an array of ints to an array of unsigned chars (all the entries are guaranteed in unsigned char range). Reason was, the array of ints was just about the total size of my L2... and the new array is 1/4 the size, which means there's space for other things too (like the code).
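Roughly what that change looks like (names and sizes are made up; the win comes purely from the smaller cache footprint):
#include <stddef.h>
#define N (1 << 20) /* pretend the int array is about the size of L2 */
static int big_tab[N]; /* 4 MB of ints */
static unsigned char small_tab[N]; /* 1 MB: same entries, 1/4 the cache lines */
long sum_ints(void) { long s = 0; for (size_t i = 0; i < N; i++) s += big_tab[i]; return s; }
long sum_chars(void) { long s = 0; for (size_t i = 0; i < N; i++) s += small_tab[i]; return s; }
int main(void) { return (int)(sum_ints() + sum_chars()); }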
Re: (Score:2)
so there's no reason to explicitly use shift in code (unless you're doing bit manipulation
Well, right. The general advice is to always write what you actually want the compiler to do and not how to do it, unless you have specific proof that the compiler's not optimizing it well.
In C/C++ shift is not the same as multiply/divide (Score:2, Interesting)
In C/C++ shift is not the same as multiply/divide by 2. Multiplication and division operators have a different precedence level than shift operators. Not only is there the possibility of poor optimization but such a substitution may lead to a computational error. For example mul/div has a higher precedence than add/sub, but shift has a lower precedence:
printf(" 3 * 2 + 1 = %d\n", 3 * 2 + 1);
printf(" 3 << 1
Re: (Score:2)
The opposite problem also exists though- by not thinking about performance you can make it expensive or impossible to improve things later without a substantial rewrite. Saying optimize at the end is just as stupid and just as costly. Learning when to care about what level is part of the art of programming. (Although on your specific examples I'll agree with you- especially since I would expect anything but a really old compiler to do mult->shift conversions for you, so you may as well use the more
Re: (Score:2)
Saying optimize at the end is just as stupid and just as costly.
There is an enormous difference between optimization and choosing appropriate algorithms. If you write a program well, it's almost always easy to optimize it later. If you write it poorly, it'll almost always be impossible to optimize at any point of its development. For example, I'd rather sort a big array with an unoptimized (but correct) quicksort than with an extremely clever (but insane) bogosort.
Re: (Score:3, Interesting)
"Not thinking about performance" is different from writing in high level first.
Get the algorithm right first, THEN optimise hot spots.
Starting out with ASM makes it a lot more time consuming/difficult to get many different algorithms written, debugged and tested. The time you spend doing that is time better spent testing/developing a better algorithm. Only once you get the
Re: (Score:2, Interesting)
I think that the premature optimization claims are way overdone. In the cases where performance does not matter, then sure, make the code as readable as possible and just accept the performance.
However, sometimes it is known from the beginning of a project that performance is critical and that achieving that performance will be a challenge. In such cases, I think that it makes sense to design for performance. That rarely means using shifts to multiply -- it may, however, mean that you design your data st
Re: (Score:2)
Interesting anecdote that has nothing to do with optimization and everything to do with bad design. Optimization is great for making your program run n% faster. Design is great for making your program run in O(log n) time instead of O(n^2) time. The important part is to come up with a good design, implement it, and address the specific problem areas. I can't think of a single justification for doing it any other way.
Re:Premature optimization is evil... and stupid (Score:5, Insightful)
Having spent 4 years being one of the primary developers of Apple's main performance analysis tools (CHUD, not Instruments) and having helped developers from nearly every field imaginable tune their applications for performance, I can honestly say that regardless of your performance criteria, you shouldn't be doing anything special for optimization when you first write a program. Some thought should be given to the architecture and overall data flow of the program and how that design might have some high-level performance limits, but certainly no code should be written using explicit vector operations and all loops should be written for clarity. Scalability by partitioning the work is one of those items that can generally be incorporated into the program's architecture if the program lends itself to it, but most other performance-related changes depend on specific usage cases. Trying to guess those while writing the application logic relies solely on intuition which is usually wrong.
After you've written and debugged the application, profiling and tracing is the prime way for finding _where_ to do optimization. Your experiences have been tainted by the poor quality of tools known by the larger OSS community, but many good tools are free (as in beer) for many OSes (Shark for OS X as an example) while others cost a bit (VTune for Linux or Windows). Even large, complex multi-threaded programs can be profiled and tuned with decent profilers. I know for a fact that Shark is used to tune large applications such as Photoshop, Final Cut Pro, Mathematica, and basically every application, daemon, and framework included in OS X.
What do you do if there really isn't much of a hotspot? Quake 3 was an example where the time was spread out over many C++ methods so no one hotspot really showed up. Using features available in the better profiling tools, the collected samples could be attributed up the stack to the actual algorithms instead of things like simple accessors. Once you do that, the problems become much more obvious.
What do you do after the application has been written and a major performance problem is found that would require an architectural change? Well, you change the architecture. The reason for not doing it during the initial design is that predicting performance issues is near impossible even for those of us who have spent years doing it as a full time job. Sure, you have to throw away some code or revisit the design to fix the performance issues, but that's a normal part of software design. You try an approach, find out why it won't work, and use that knowledge to come up with a new approach.
The largest failing I have seen from my experience has been the lack of understanding by management and engineers that performance work is a very iterative part of software design and that it happens late in the game. Frequently, schedules get set without consideration for the amount of time required to do performance analysis, let alone optimization. Then you have all the engineers who either try to optimize everything they encounter and end up wasting lots of time, or do the initial implementation and never do any profiling.
Ultimately, if you try to build performance into a design very early, you end up with a big, messy, unmaintainable code base that isn't actually all that fast. If you build the design cleanly and then optimize the sections that actually need it, you end up with a maintainable code base that meets the requirements. Be the latter.
Re:Premature optimization is evil... and stupid (Score:4, Insightful)
Over the years, I've grown to hate this meme. Not because it isn't right, but because it stops ten floors below the penthouse of human potential.
First of all, it's an incredible instance of cultural drift. In the mid 1980s, when this meme was halfway current, I worked on adding support for Asian characters to an Asian-made PC. On the "make it right" pass it took 15s to update the screen after pressing the page down key, and this from assembly language. Slower than YouTube over 300 baud. It was doing a lot of pixel swizzling it shouldn't have been, because the fonts were supplied in a format better suited to printing. This was an order of magnitude below an invitation to whiffle-ball training camp. This was Lance Armstrong during his chemotherapy years towing a baby trailer. Today you get 60fps with a 100 thousand or a 100 million polygons, I've sort of lost track.
Let's not shunt performance onto the side track of irrelevancy. While there's no good excuse, ever, for writing faulty code, an enlightened balance between starting out with an approach you can live with, and exploiting necessary cleverness *within your ability* goes a long way.
How about we update Knuth's arthritic maxim? Don't tweak what you don't grok. If you grok, use your judgement. Exploit your human potential. Live a little.
The books I've been reading lately about the evolution of skills in the work place suggest that painstaking reductive work processes are on their way to India. Job security in home world is greatly enhanced if you can navigate multiple agendas in tandem, exploiting more of that judgement thing.
One of the reasons Carmack became so successful is that he didn't waste his effort looking for excuses to deprive his co-workers of their oxygen bits. Instead he conducted shrewd excursions along the edge of the envelope in pursuit of the sweet spot between cleverness too oppressive to live with, and no performance at all.
In my day of deprecating my elders, I always knew where the pea was hidden under the mattress. These days, there are so many squishy mattresses stacked one upon the other, I have to plan my work day with a step ladder. Which I think is what this unwatchable cult-encoded video is on about: the ankle level view most of us never see any more.
Here's another thing. If you're going to be clever about how you code something, also be clever about how you do it. In other words, be equally clever at all levels of the solution process simultaneously: algorithm selection, implementation, commenting, software engineering, documentation, and unit test. Knuth got away with TeX, barely, for precisely this reason. Because of his cleverness, the extension to handle Asian languages was far from elegant. Because of his cleverness (in making everything else run extremely well), people actually wanted to extend TeX to handle Asian languages. So who's to say he was wrong? Despite his cleverness, he managed to keep his booboo score in single or low double digits. His bug tracking database fit nicely on an index card.
In the modern era, people quote the old "make it right before you make it faster" as the cure for the halitosis of ineptitude: you're feeble and irritating, so practice your social graces. Don't make me come over there and choke off your oxygen bit. It's a long ways from saying "you have a lot of human potential, and not much experience, so let me help you confront the challenges in a meaningful way". These sayings leak a lot of sentiment about social engagement.
Every so often I have to pull up a chair beside a junior resource and go "Dude, you're jousting at windmills here, let's roll that change back and try again. I know you can do better." Five minutes of war stories about how to shoot yourself in the foot six ways from Sunday is usually enough to rebalance the flywheel of self preservation.
It's not just x86 (Score:4, Informative)
Features like out-of-order execution, caches, and branch prediction/speculation are commonplace on many architectures, including the next-generation ARM Cortex A9 and many POWER, SPARC, and other RISC architectures. Even in-order designs like Atom, Cortex A8, or POWER6 have branch prediction and multi-level caches.
The most important thing for performance is to understand the memory hierarchy. Out-of-order execution lets you get away with a lot of stupid things, since many of the pipeline stalls you would otherwise create can be re-ordered around. In contrast, the memory subsystem can do relatively little for you if your working set is too large and you don't access memory in an efficient pattern.
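A throwaway illustration of "an efficient pattern" (array size is arbitrary): the two loops below do the same work, but the second strides through memory and throws away most of the benefit of the cache.
#include <stdio.h>
#define N 1024
static double m[N][N];
/* Row-major traversal: consecutive accesses stay within a cache line. */
double sum_rows(void)
{
    double s = 0;
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            s += m[i][j];
    return s;
}
/* Column-major traversal of the same data: each access lands N*8 bytes
   past the previous one, so almost every access can be a cache miss. */
double sum_cols(void)
{
    double s = 0;
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            s += m[i][j];
    return s;
}
int main(void) { printf("%f %f\n", sum_rows(), sum_cols()); return 0; }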
I hate flash video (Score:2)
I wish they'd all just use HTML5 or put it on YouTube so I can use youtube-dl or something. Otherwise it either doesn't work at all (my amd64 Linux boxes) or is slow and jerky (my Mac OSX box). It's really frustrating.
Kung-Fu and Ninjitsu...They're not dead! (Score:2)
This just in...Apparently Bruce Lee and Lee Van Cleef are alive and well and working for Intel, which likely accounts for all the "crazy kung-fu and ninjitsu" going on there...
rule of the code (Score:3, Informative)
Just write good, clean code that works properly first. The only time you optimize is after it has been profiled to see if there are troublesome spots. The way CPUs run and how compilers are designed, there is very little need to do optimization. Unless you have taken some serious courses on how current CPUs work, your efforts will mostly result in bad code that gains you nothing in terms of speed. Your time is better spent on writing CORRECT code.
The compilers are very intelligent about proper loop unrolling, rearranging branches, and moving instruction code around to keep the CPU pipeline full. They will also look for unnecessary/redundant instructions within a loop and move them to a better spot.
One of the courses I took was programming for parallelism. For extra credit, the instructor assigned a 27K x 27K matrix multiply; the person with the best time got a few extra points. A lot of the class worked hard in trying to optimize their code to get better times, I got the best time by playing with the compiler flags.
Re: (Score:3, Informative)
One of the courses I took was programming for parallelism. For extra credit, the instructor assigned a 27K x 27K matrix multiply; the person with the best time got a few extra points. A lot of the class worked hard in trying to optimize their code to get better times, I got the best time by playing with the compiler flags.
Really? Because I had a similar assignment (make Strassen's algorithm as fast as possible, in the 5-10k range) in my algorithms class a while back. I found that the key to a blazing fast program was careful memory layout: divide the matrix into tiles that fit into L1, transpose the matrix to avoid striding problems. Vectorizing the inner loops got another large factor. Compiling with -msse3 -march=native -O3 helped, but the other two were critical and took a fair amount of effort.
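Something like the blocking described above, sketched from memory rather than taken from the actual assignment (N and TILE are placeholders you would tune so a tile of each matrix fits in L1):
#define N 1024
#define TILE 32 /* must divide N evenly in this simple version */
/* C must be zeroed by the caller; the tiled loops just accumulate into it. */
void matmul_tiled(const double A[N][N], const double B[N][N], double C[N][N])
{
    for (int ii = 0; ii < N; ii += TILE)
        for (int kk = 0; kk < N; kk += TILE)
            for (int jj = 0; jj < N; jj += TILE)
                for (int i = ii; i < ii + TILE; i++)
                    for (int k = kk; k < kk + TILE; k++)
                        for (int j = jj; j < jj + TILE; j++)
                            C[i][j] += A[i][k] * B[k][j];
}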
...except for the uControllers I use. (Score:3, Interesting)
I watched about half of his presentation. I was amused because on a lot of the slides he says something like "except on really low end embedded CPUs." I spend a lot of my time programming (frequently in assembly) for these exact very low end CPUs. I haven't had to do much with 8-bit cores, fortunately, but I've been doing a lot of programming on a 16-bit microcontroller lately (EMC eSL).
I suspect the way I'm programming these chips is a lot like how you would have programmed a desktop CPU in about 1980, except that I get to run all the tools on a computer with a clock speed 100x the chip I'm programming (and at least 1000x the performance). I am constantly amazed by how little we pay for these devices: ~10 Mips, 32k RAM, 128k Program memory, 1MB data memory and they're $1.
But they do have a 3-stage pipeline, so I guess some of what Dr. Cliff says still applies.
What about prefetching? (Score:3, Interesting)
That was a fabulous presentation, and one that I'll likely hold onto a copy of, since it describes the issue of SMP memory ordering with a great example. I'll have to write "presenter notes" for those slides, since I can't get the video to come up, but that's OK. I understand what's going on there.
One thing I thought was notably absent was any discussion of data prefetch. With all of the emphasis on how performance is dominated by cache misses, you'd think he'd give at least a nod to both automatic hardware and compiler directed software prefetch. After all, he mentions CMT, which is a more exotic way to hide memory latency, IMHO.
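For reference, this is roughly what compiler-directed software prefetch looks like in source, using GCC's __builtin_prefetch (the 16-element look-ahead is a guess that would need tuning against the real miss latency):
#include <stddef.h>
long sum_with_prefetch(const long *a, size_t n)
{
    long s = 0;
    for (size_t i = 0; i < n; i++) {
        if (i + 16 < n)
            __builtin_prefetch(&a[i + 16]); /* start the miss early */
        s += a[i];
    }
    return s;
}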
On a different note: In the example on slides 23 - 30, he shows a case where speculation allowed two cache misses to pipeline, bringing the cost-per-miss down to about half. Dunno if he highlighted the synergy in the talk, because it isn't called out on the slides. It is useful to note, though, how overlapping cache misses reduces their cost. There can be even more synergy here than is otherwise obvious: In HPCA-14, there was a fascinating paper [utah.edu] (slides [utah.edu]) about how incorrect speculation can still speed up programs, due to misses on the incorrectly-speculated path still bringing in relevant cache lines.
Re:Could someone give me a crash course (Score:5, Funny)
Probably due to your x86 processor doing all sorts of monkeying with the code.
Re:Could someone give me a crash course (Score:5, Funny)
Spaghetti code can be hard to digest.
Sounds to me like someone is using stale Copypasta.
Re: (Score:2)
They made the meatballs out of DEADBEEF.
Re: (Score:2)
gotten at the 0xCAFE 0F DEAD BEEF
Re: (Score:3, Informative)
I hear they have nice 0xC0FFEE
Re:Code in high-level (Score:5, Insightful)
That's not entirely true. In performance-sensitive tight loops, it can still make sense to code in ASM to avoid pipeline bubbles and stalls in some very limited situations. Also, the compiler doesn't always take advantage of instructions that it could use.
However, determining that takes a lot of effort and a lot of instrumentation, and so you'd better really need that last bit of performance before you go after it.
Re: (Score:2, Interesting)
Yah sorry about that. :)
Part of the problem is that compilers have to support a variety of instruction sets, and if the majority of the customers are using an 8 year old revision of an instruction set, even if the newest revision offers Super Awesome Cool features that make code run a lot faster, well you end up with a chicken and egg problem where it makes sense for the compiler team to focus on the old architecture since th
Re: (Score:3, Informative)
Intel compilers have options to optimize for more than one target, and the runtime dispatcher uses the code path that was built for whichever CPU it finds. Sure, your binary is larger, but everyone is happy.
Re:Code in high-level (Score:4, Interesting)
That's not entirely true. In performance-sensitive tight loops, it can still make sense to code in ASM to avoid pipeline bubbles and stalls in some very limited situations. Also, the compiler doesn't always take advantage of instructions that it could use.
Yeah and the chip makers release software optimization guides regarding how to avoid such stalls or take advantage of other features, and it's really hard to do that at the C level, and it can be hard for the compiler to know that a certain situation calls for one of these optimizations.
However, determining that takes a lot of effort and a lot of instrumentation, and so you'd better really need that last bit of performance before you go after it.
Agreed, it's basically something you're going to do for the most performance critical part, like the kernel of an HPC algorithm for example.
Re:Code in high-level (Score:5, Informative)
Also either start with the assembly the compiler generates, or at the very least make sure to bench your own against what it makes. The Intel Compiler in particular is extremely good at what it does. As such, it is worth your while to see what its solution to your problem is, and then see if you can improve, rather than assuming you are smarter and can do everything better on your own.
Of course all that is predicated on using a profiler first to find out where the actual problem is. Abrash accurately pointed out years ago that programmers suck at that. They'll spend hours making a nice optimized function that ends up making no noticeable difference in execution time.
Re: (Score:2)
Also, some of
Re:Code in high-level (Score:4, Interesting)
Dealing with alignment is not that much of an assembler issue, if you are using C. Address arithmetic gets the job done. If you even want your globals aligned (and not just heap-allocated stuff) you *might* need some ASM, but just for the declarations of stuff that would be "extern struct whatever stuff" in C (and in a pinch, you write a bit of C code to suck in the headers defining "stuff", figure out the sizes, and emit the appropriate declarations in asm).
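A minimal sketch of that address-arithmetic approach (the function name is made up): over-allocate, then round the pointer up to the boundary yourself.
#include <stdint.h>
#include <stdlib.h>
/* Returns a 16-byte-aligned pointer into a slightly oversized malloc
   block; *raw_out keeps the original pointer so it can be free()d. */
void *alloc_aligned16(size_t size, void **raw_out)
{
    void *raw = malloc(size + 15);
    if (!raw)
        return NULL;
    *raw_out = raw;
    uintptr_t p = ((uintptr_t)raw + 15) & ~(uintptr_t)15;
    return (void *)p;
}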
Writing memmove/memcpy in assembler is a mixed bag. If you write it in C, you can preserve some tiny fraction of your sanity while dealing with all the different alignment combinations before you get to full-word loads and stores. HOWEVER, on the x86 all bets are off; the only way to tell for sure what is fastest is to write it and benchmark it.
Re:Code in high-level (Score:5, Informative)
One of the biggest drawbacks of a language like C (and even more C++, and even more Java), is that they don't give you a whole lot of control of how stuff is arranged in memory
I'd say this is more of a C/C++ problem than a Java problem. Or, rather, they are different problems. The problem with C and C++ is that they do give the programmer a whole lot of control about how things are arranged in memory. They don't, on the other hand, give the compiler a lot of freedom to rearrange things.
Java, on the other hand, uses the Smalltalk memory model and so the compiler (and/or JVM) is free to rearrange things in memory as much as it wants to (whether it does, of course, is a matter for the compiler writer). For example, a Java compiler that notices that you are doing the same operation on three instance variables is free to put them next to each other aligned on a 128-bit boundary with some padding at the end so that you can easily use vector instructions on them, even if they were originally declared in different classes. A C compiler can not do this with structure fields.
If you really care about alignment in C, you are free to use valloc() to align on a page boundary and then subdivide the memory yourself. Most of the time, however, it's not worth the effort.
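If you do care, the valloc() route looks something like this (struct size is arbitrary): one page-aligned allocation, subdivided so every 64-byte slot also lands on a cache-line boundary.
#include <stdlib.h>
struct slot { char data[64]; };
/* valloc() returns page-aligned memory; carving it into fixed 64-byte
   slots keeps each slot cache-line aligned as well. It's a legacy call;
   posix_memalign() is the modern spelling. */
struct slot *make_slots(size_t count)
{
    return (struct slot *)valloc(count * sizeof(struct slot));
}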
Re: (Score:2)
For example in geophysics there is a process of arranging data called "Pre Stack Time Migration" which can keep a small cluster busy for a week with relatively small datasets. In cases like that tiny improvements save hours. Only one percent of improvement saves more than an hour in a week.
Re: (Score:2)
I heard a rumor that there's some fundamental geophysical program that's been around for decades. It doesn't accumulate the results in an array, because memory was too expensive when fortran 66 was the hot new thing.
It has a write-to-disk instruction in an inner loop. But it works, and nobody wants to touch it.
A little micro-optimization there would grant a 1000x speedup.
Re: (Score:2)
Takes about 30 seconds to set up in most any modern OS, all but the cheapest and nastiest contemporary systems have enough RAM that you can safely carve out something larger than any HDD of the Fortran 66 era, and (while not as fast as using RAM properly) it should run like a bat out of hell compared to any actual disk...
Re: (Score:3, Informative)
It also depends on the compiler. GCC, for example, sucks at auto-vectorization, so it's easy to get 30% or more on loopy scientific code just by using SSE instructions properly.
In contrast, PGI or ICC is much harder to beat using assembly.
Re:Code in high-level (Score:5, Informative)
This will give you a 64-bit vector type, so you can fit one in an MMX register, or two in an SSE or AltiVec register. You can then create these and do simple operations on them. For example, if you wanted to add two together, you could do this:
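(The snippet itself didn't survive the copy, but from the description it was presumably something along these lines, using GCC's vector_size extension:)
typedef short v4hi __attribute__ ((vector_size (8))); /* four 16-bit lanes = 64 bits */
v4hi add_them(void)
{
    v4hi a = { 1, 2, 3, 4 };
    v4hi b = { 10, 20, 30, 40 };
    return a + b; /* folded at compile time here; with unknown a and b, GCC
                     emits four scalar adds or one 64-bit vector add */
}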
In this case, the add is constant so it will be evaluated at compile time, but in the case where a and b have unknown values GCC will emit either four scalar add operations or one 64-bit vector add.
You can also pass them as arguments to vector intrinsics, which are listed in the manual under target-specific builtins. These correspond directly to a single underlying vector instruction, so if you look in the assembly language reference for the target CPU then you will find a detailed explanation of what each one does.
Rather than declare vector types directly, it's often a good idea to declare unions of vector and array types. This lets you use the same value as both an array and a vector.
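For instance (again just a sketch, reusing the same v4hi type):
typedef short v4hi __attribute__ ((vector_size (8)));
union vec4 {
    v4hi v;     /* for whole-vector arithmetic */
    short s[4]; /* for indexing individual lanes */
};
short third_lane(union vec4 x)
{
    return x.s[2];
}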
I wrote a longer explanation a while ago [informit.com].
Re: (Score:2)
It also depends on the compiler. GCC, for example, sucks at auto-vectorization, so it's easy to get 30% or more on loopy scientific code just by using SSE instructions properly.
In contrast, PGI or ICC is much harder to beat using assembly.
ICC does a great job with auto-vectorization.
Yet (perhaps it's no longer true), ~3 years ago I had problems with ICC generating wrong code in certain situations. I went back to GCC.
Re: (Score:2)
In performance-sensitive tight loops, it can still make sense to code in ASM to avoid pipeline bubbles and stalls in some very limited situations.
And that will work until the next rev of the board's chip, which your hardware vendor will change when he wants to and not notify you about. You'll know about it when the customer complaints roll in about poor performance or during your next rev of the firmware when your performance stats go to hell. And, if you're trying to do this for COTS hardware, forget it
Re: (Score:3, Insightful)
That's *generally* true. It's not *always* true.
There are a lot of purely compute-bound applications (think simulations of various sorts, etc) for which the algorithmic optimizations have already been done- but it's still worth going for the last few percent of performance from "instruction fiddling". As another poster said: if your app runs for weeks at a time, 1% improvement becomes significant in terms of time saved- and throwing more hardware at the problem isn't always feasible.
Re: (Score:3, Insightful)
Sometimes it's just plain FUN FUN FUN to code in asm. You're right that most programmers will never have a need for it at all (with some exceptions, such as those messing with operating systems or embedded systems), although knowing some ASM can help a lot with debugging. I suppose one could (read: should) learn a little ASM to have a better idea of what the hardware is doing, this will allow you to optimize your code a little, or (more importantly) write it in such a way that makes it easier for the compil
Re:Code in high-level (Score:4, Informative)
Coding in x86 ASM is never fun. Weird and odd and masochistically pleasurable for some, maybe, but not fun. Other architectures, on the other hand (like ARM), can be fun. x86-64 manages to increase the "funness" value somewhat, but I still wouldn't quite qualify it as "fun".
On the other hand, it's very true that knowing some ASM can help you write code that the compiler will translate into better assembly code, without going through all of the trouble yourself.
Re:Code in high-level (Score:4, Interesting)
Coding in x86 ASM is never fun. Weird and odd and masochistically pleasurable for some, maybe, but not fun. Other architectures, on the other hand (like ARM), can be fun.
Coding assembly on RISC architectures is dead boring because all the instructions do what you expect them to and can be used on any general purpose register.
In the good old days, when x86 was 8086, there were no general-purpose registers. The BX register could be used for indexing, but AX, CX and DX couldn't. CX could be used for counts (bit shifts, loops, string moves), but AX, BX, and DX couldn't. SI and DI were index registers that you could add to BX when dereferencing, or could be used with CX for string moves. AX and DX could be used in a pair for a 32-bit value. If you wanted to multiply, you needed to use AX. If you wanted to divide, you needed to divide DX:AX by a 16-bit value and your result would end up in AX and the remainder in DX. Compared to the Z80 assembly language, we thought this was easy.
Being able to use %r2 for the same stuff you use %r1 for is just boring.
Re: (Score:2)
Coding in x86 ASM is never fun. Weird and odd and masochistically pleasurable for some, maybe, but not fun. Other architectures, on the other hand (like ARM), can be fun. x86-64 manages to increase the "funness" value somewhat, but I still wouldn't quite qualify it as "fun".
No need to go that far.
68k ASM is pure heaven compared to x86.
Re: (Score:3, Interesting)
That's a real shame! But my impression is that for a long time now, college-level assembly instruction has consisted almost entirely of indoctrinating the students to believe that assembly language programming is difficult and unpleasant and must be avoided at all costs. Which couldn't be more wrong -- it's AWESOME!
Even on the x86 with all its flaws, being able to have that kind of control makes everything more fun. The fact that your code runs like a bat out of hell (unless you're a BAD assembly program
Re: (Score:2)
I think you can legally get MASM (Microsoft Macro Assembler) somewhere on the internet for free. A good place to start would be Microsoft. Then you can do what real coders do, and teach yourself!
And to think I paid several hundred dollars for that, back in the day.
Re: (Score:2, Informative)
Or you could get NASM, which is open source :)
Re: (Score:2)
Or you could get WASM (part of the Open Watcom package at www.openwatcom.org) which is open-source AND uses something approaching standard syntax.
NASM unfortunately falls into the common trap of figuring that, since MASM-style syntax has a lot wrong with it, the syntax should be changed. But as with all such projects, the syntax is changed to fit someone's particular taste, and now you'll write source code which isn't compatible with anything. And IMHO NASM's syntax is no improvement over MASM anyway. AL
Re: (Score:2)
Will NASM let you write structured assembly, like MASM?
I picked up a used copy of Inner Loops [amazon.com] by Rick Booth, and it intrigued me enough to consider tracking down an old version of MASM.
Re: (Score:2)
I wanted to take ASM in college. I was the only student who showed up for the class and the class was canceled. Since most of the programming classes were Java-centric, no one wanted to get their hands dirty under the hood.
I'm probably going to need an asbestos suit for this post, but to be honest I don't think assembler is a good programming language for humans. My impression is that they absolutely don't want to pollute the instruction set with instructions unless there's a performance benefit to doing so. But what it means in practice is that anyone I've seen writing advanced assembly relies on lots and lots of macros to do essential things, because the combination of instructions is useful but there's no language construc
Re: (Score:2)
It's not a great language (family) for general use, but it is a good way to learn something about how CPUs work, what a function call actually is, etc.
Re: (Score:3, Interesting)
I think it depends on what kind of code you're trying to write. If a person desires to write applications then you are right, they might as well write it in a high level language and let the compiler do the work. On the other hand if the person is interested in vulnerability research or security work, then learning ASM might as well be considered a requisite. An understanding of low level programming and code execution provides a programmer with a solid foundation. It gives the potential insights into w
Re: (Score:3, Insightful)
But it's certainly better to code in a high-level language first, test, tweak the algorithm as much as you can, PROFILE and THEN start breaking out your assembler. No point optimising 99% of your code in super-fast asm if it only spends 1% of the CPU time in it. Even if you make all that code 10x as fast, you've only saved 0.9% CPU time. :)
Re: (Score:2)
In non-trivial single threaded application code on a modern processor, the CPU core is spending about 95% of its time waiting on memory transfers. To fix that problem, it can make sense to prefetch and reorder memory accesses. Chances are you know better than your compiler how to do that. It also makes sense to start more threads on a processor with multiple hardware threads so you can do things while waiting for memory.
Most programmers won't even bother to do that, because the processor is fast enou
Re: (Score:3, Insightful)
Someone has to write those tools.
Yeah, but they can be written in a HLL, too. You don't have to write a program in highly-tuned assembler to make it emit highly-tuned assembler.
Re: (Score:2)
You certainly need to know a lot about assembler and CPU architecture if you are going to write code that emits highly tuned assembler. Actually, you probably do have to write those tools in assembler for all intents and purposes. To really oversimplify: compilers are pretty much syntax checkers and search-tree engines. They take your code and replace it with a matching assembly listing or set of listings, substituting whichever registers happen to be free, etc.
Re: (Score:2)
You certainly need to know a lot about assembler and CPU architecture if you are going to write code that emits highly tuned assembler. Actually, you probably do have to write those tools in assembler for all intents and purposes.
That's news to GCC:
Translation: In GCC 4.2.1 as shipped with FreeBSD 8-STABLE, there are 869 .c and .h files with a total of 900KLOC, and 34 .asm files with 6KLOC. It seems that GCC itself isn't written with very much assembler.
Re: (Score:2)
My first "real" programming was using a machine language monitor on a C64, so I feel your pain.
Re: (Score:2)
I've heard that the first C compiler was written in C.
Re: (Score:2)
Try waiting for it to fully buffer?