Cliff Click's Crash Course In Modern Hardware 249

Posted by timothy on Thursday January 14, 2010 @07:19PM from the first-there-were-the-dinosaurs dept.

Lord Straxus writes "In this presentation (video) from the JVM Languages Summit 2009, Cliff Click talks about why it's almost impossible to tell what an x86 chip is really doing to your code due to all of the crazy kung-fu and ninjitsu it does to your code while it's running. This talk is an excellent drill-down into the internals of the x86 chip, and it's a great way to get an understanding of what really goes on down at the hardware and why certain types of applications run so much faster than other types of applications. Dr. Cliff really knows his stuff!"

This discussion has been archived. No new comments can be posted.

Fast forward... (Score:5, Informative)

by LostCluster ( 625375 ) * writes: on Thursday January 14, 2010 @07:20PM (#30772618)

I can't say I've WTFV like I usually RTFA before you get to see it... but I can tell you this: The first four minutes of the video are spent asking which topic the room wants to see. No need to watch that part. Then it gets more interesting.

- Re:Fast forward... (Score:5, Funny)
  
  by Jah-Wren Ryel ( 80510 ) writes: on Thursday January 14, 2010 @07:30PM (#30772738)
  
  The first four minutes of the video are spent asking which topic the room wants to see. No need to watch that part. Then it gets more interesting.
  That's just the branch predictor pre-loading the cache for each possible conditional result.
  
  - Re: (Score:2)
    
    by Ginger Unicorn ( 952287 ) writes:
    
    That is proper hard-fucking-core geek wit. Bravo.
- - Re: (Score:2)
    
    by Gazzonyx ( 982402 ) writes:
    
    You're lucky that you didn't get it to play; mine played to six minutes and then just stopped and won't play or let me skip past that.
  - - Re:Fast forward... (Score:5, Informative)
      
      by Brian Gordon ( 987471 ) writes: on Thursday January 14, 2010 @09:06PM (#30773782)
      
      A little javascript-fu reveals that the video player points to a file (at http://flv.thruhere.net/presentations/09-sep-JVMperformance.flv [thruhere.net]) on some poor guy's machine through a dynamic DNS service! I hope somebody grabbed a copy before he (or slashdot) took his server down.
      
      - Re:Fast forward... (Score:4, Informative)
        
        by pyrrhonist ( 701154 ) writes: on Thursday January 14, 2010 @11:42PM (#30775012)
        
        some poor guy's machine through a dynamic DNS service!
        Some poor guy? It's on an Amazon EC2 server!
        $ host flv.thruhere.net
        flv.thruhere.net has address 67.202.36.223
        $ host 67.202.36.223
        223.36.202.67.in-addr.arpa domain name pointer ec2-67-202-36-223.compute-1.amazonaws.com.
        
        
        Re:Fast forward... (Score:5, Informative)
        
        by Brian Gordon ( 987471 ) writes: on Thursday January 14, 2010 @11:47PM (#30775044)
        
        You've done it! Interested slashdotters can download the video file at this link:
        http://67.202.36.223/presentations/09-sep-JVMperformance.flv [67.202.36.223].
        Good detective work, partner!
        
        
        Re:Fast forward... (Score:4, Informative)
        
        by iammani ( 1392285 ) writes: on Friday January 15, 2010 @12:39AM (#30775328)
        
        Or type http://www.infoq.com/resource/presentations/click-crash-course-modern-hardware/en/slides/1.swf [infoq.com]
        
        And change the number at the end to change slides.
        
        PS: I am no good at JavaScript, but the above, in FF 3.5/Linux, just displayed a page saying "true".
        
      - Re:Fast forward... (Score:4, Informative)
        
        by iammani ( 1392285 ) writes: on Friday January 15, 2010 @12:30AM (#30775284)
        
        Mirror available at http://www.mediafire.com/?j21t2ynnnzn [mediafire.com]
        
        And please stop hitting the server linked in GP's post. The poor server hardly sustains 30 KB/s.
        
- - Re: (Score:2)
    
    by Brian Gordon ( 987471 ) writes:
    
    You have to admit it's pretty nice to have the presentation slides automatically display and advance below the video as you watch..
Premature optimization is evil... and stupid (Score:2, Insightful)

by Just Some Guy ( 3352 ) writes:

That's the main reason why I want to shoot people who write "clever" code on the first pass. Always make the rough draft of a program clean and readable. If (and only if!) you need to optimize it, use a profiler to see what actually needs work. If you do things like manually unroll loops where the body is only executed 23 times during the program's whole lifetime, or use shift to multiply because you read somewhere that it's fast, then don't be surprised when your coworkers revoke your oxygen bit.
- Re:Premature optimization is evil... and stupid (Score:5, Funny)
  
  by RightSaidFred99 ( 874576 ) writes: on Thursday January 14, 2010 @07:44PM (#30772906)
  
  And messy and embarrassing. Oh, wait...
  
- Re: (Score:2)
  
  by Monkeedude1212 ( 1560403 ) writes:
  
  If (and only if!)
  
  Compiler Error: Numerous Syntax Errors.
  Line 1, 4; Object Expected
  Line 1, 15; '(' Expected
  Line 1, 16; Condition Expected
  Line 1, 17; 'Then' Expected
  - Re: (Score:2)
    
    by Just Some Guy ( 3352 ) writes:
    
    That was Lisp. You should parse it as If(only && !if).
- Re:Premature optimization is evil... and stupid (Score:4, Interesting)
  
  by marcansoft ( 727665 ) writes: <hector@nOsPaM.marcansoft.com> on Thursday January 14, 2010 @08:00PM (#30773114) Homepage
  
  Using shift to multiply is often a great idea on most CPUs. On the other hand, just about every compiler will do that for you (even with optimization turned off I bet), so there's no reason to explicitly use shift in code (unless you're doing bit manipulation, or multiplying by 2^n where n is more convenient to use than 2^n). However, a much more important thing is to correctly specify signed/unsigned where needed. Signed arithmetic can make certain optimizations harder and in general it's harder to think about. One of my gripes about C is defaulting to signed for integer types, when most integers out there are only ever used to hold positive values.
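
A minimal sketch of the two points above, written as plain C helpers. The function names are hypothetical and exist only for illustration. It assumes a C99 or later compiler, where signed division truncates toward zero; which instructions a given compiler actually emits is not shown here and will vary by target and flags.

```c
#include <stdint.h>

/* Written the obvious way: an optimizing compiler is free to emit a shift,
   an LEA, or a multiply, whichever suits the target. */
uint32_t times4(uint32_t x) { return x * 4u; }

/* Hand-written shift: same result for unsigned values, but no clearer. */
uint32_t times4_shift(uint32_t x) { return x << 2; }

/* Unsigned division by 2 is exactly a logical right shift by one. */
uint32_t half_unsigned(uint32_t x) { return x / 2u; }

/* Signed division truncates toward zero, so -7 / 2 is -3. A bare
   arithmetic shift would round toward negative infinity instead, so the
   compiler has to add a fix-up for negative inputs. */
int32_t half_signed(int32_t x) { return x / 2; }
```

Both multiply helpers agree for every unsigned input, including values that wrap around, which is why the readable form costs nothing. The signed helper shows the kind of extra work that signed types can force on the optimizer.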
  
  - Re:Premature optimization is evil... and stupid (Score:4, Informative)
    
    by Rockoon ( 1252108 ) writes: on Thursday January 14, 2010 @08:11PM (#30773220)
    
    Using shift to multiply is often a great idea on most CPUs.
    Which CPUs are those? The fastest way to multiply today on AMD/Intel is to use the multiply instructions.
    
    Didn't know that? Yeah... it seems like only assembly language programmers know this.
    
    - Re:Premature optimization is evil... and stupid (Score:5, Informative)
      
      by marcansoft ( 727665 ) writes: <hector@nOsPaM.marcansoft.com> on Thursday January 14, 2010 @09:00PM (#30773736) Homepage
      
      Which CPUs are those?
      Those with a barrel shifter.
      The fastest way to multiply today on AMD/Intel is to use the multiply instructions.
      Then someone needs to beat the GCC developers with a cluestick.
      $ cat test.c
      int main(int argc, char **argv)
      {
          return 4*(unsigned int)argc;
      }
      $ gcc -march=core2 test.c -o test
      $ objdump -d test
      ...
      00000000004004ec <main>:
        4004ec: 55                    push   %rbp
        4004ed: 48 89 e5              mov    %rsp,%rbp
        4004f0: 89 7d fc              mov    %edi,-0x4(%rbp)
        4004f3: 48 89 75 f0           mov    %rsi,-0x10(%rbp)
        4004f7: 8b 45 fc              mov    -0x4(%rbp),%eax
        4004fa: c1 e0 02              shl    $0x2,%eax
        4004fd: c9                    leaveq
        4004fe: c3                    retq
        4004ff: 90                    nop
      Yeah... it seems like only assembly language programmers know this.
      I program in assembly language, but not for x86. I usually program for ARM, which always has a barrel shifter. I guarantee shifts are faster than multiplies there.
      
      - Re: (Score:3, Insightful)
        
        by AuMatar ( 183847 ) writes:
        
        It depends on where they spend their hardware, and what you're multiplying by. You can make a multiplier faster than shifting, it just requires a lot of hardware to do so. If you're multiplying by a constant power of 2, shifting will always be as fast or faster. If you're multiplying by a non power of 2 constant, shifting and adding may be faster, and probably is if there's fairly few 1s in the binary representation. But if they have a good multiplier then mult may be faster than shift/add for a random
        
        Re: (Score:2)
        
        by marcansoft ( 727665 ) writes:
        
        I was talking of multiplying by a power of two constant, of course. You're quite correct in saying that shift+add combinations may or may not be faster than multiplying by more complex constants, depending on the particular implementation. Usually, two shifts and one add is a fairly safe bet for simpler CPUs, but it can actually slow things down on modern superscalar CPUs where it creates undesirable dependencies in the pipeline.
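
To make the shift+add idea concrete, here is a small sketch of multiplying by the constant 10 as two shifts and one add. The function names are hypothetical, and nothing here claims either form is faster on a particular CPU; as the comments above note, that depends on the implementation.

```c
#include <stdint.h>

/* x * 10 == x * 8 + x * 2, i.e. two shifts and one add. */
uint32_t times10_shift_add(uint32_t x) { return (x << 3) + (x << 1); }

/* The plain form: the compiler may choose a multiply, LEA, or shifts. */
uint32_t times10(uint32_t x) { return x * 10u; }
```

The two functions return the same value for every 32-bit unsigned input, including inputs that overflow, because unsigned arithmetic wraps modulo 2^32 in both forms.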
        
        Re: (Score:3, Informative)
        
        by TheRaven64 ( 641858 ) writes:
        
        I actually did a benchmark of this a few months ago. For a single shift, there wasn't much in it (on a Core 2); both were decoded into the same micro-ops. For more than one shift and add, the multiply was faster because the micro-op fusion engine wasn't clever enough to reassemble the multiply (and even if it were, you're still burning i-cache for no reason). GCC used to emit shift-and-add sequences for all constant multiplies until someone benchmarked it on an Athlon (which had two multiply units and on
      - Re: (Score:2)
        
        by smash ( 1351 ) writes:
        
        Does that code change if you use the arch flags for GCC to generate AMD64 or at least i686 code?
        Not taking the piss... I have no idea. I just noticed you didn't use any architecture-specific flags, so it's no doubt defaulted to dumb but compatible code?
        
        Re: (Score:2)
        
        by smash ( 1351 ) writes:
        
        UH... delete that comment, i didn't see -march=core2. Sorry....
      - Re:Premature optimization is evil... and stupid (Score:5, Informative)
        
        by Rockoon ( 1252108 ) writes: on Friday January 15, 2010 @09:02AM (#30777752)
        
        GCC is a big offender, that's true.
        
        This is one of the reasons that GCC sucks compared to ICC and VC++.
        
        Let me give you the facts as they are today. In isolation, both the shift instructions and the multiply instructions have the same latency and throughput, and are also performed on the same execution units.
        
        If this was the entire story, then they would be equal. But it's not the entire story.
        
        The shift instructions only modify some of the flags in the flags register. Essentially, the shift instructions must do a read/modify/write on the flags. The multiplication instructions, however, alter the entire flags register, so only perform a write.
        
        "But Rockoon.. they are the same latency anyways, right?" .. yes, in isolation. But that read/modify/write cycle on the flags register prevents a hell of a lot of out-of-order execution.
        
        Essentially, one of the inputs to the shift instruction is the flags register, so all prior operations that modify the flags register must be completed first, and no instruction following the shift that also partially modifies the flags register can be completed until that shift is completed.
        
        In some code, it won't make any discernible difference, but in other code it will make a big difference.
        
        As far as that GCC compiler output.. that code is horrible, and not just because it's AT&T syntax.
        
        There are two alternatives here for multiplying by 4 that should be in competition here, and neither uses a shift.
        
        One is a straight multiplication (MASM syntax, CDECL):
        
        main:
        mov edx, [esp + 4] ; 32-bit version, so +4 skips the return address
        imul eax, edx, 4
        ret
        
        The other is leveraging the LEA instruction (MASM syntax, CDECL):
        
        main:
        mov eax, [esp + 4] ; 32-bit version, so +4 skips the return address
        lea eax, [eax * 4]
        ret
        
        The alternative LEA version on some processors (P4..), in isolation, is slower .. but it has the advantage that it uses different execution units on those very same processors, so might pair better with other stuff in the pipeline, and it doesn't touch the flags register at all.
        
        GCC is great at folding constants and such, even calculates constant loops at compile time.. but it's a big-time fail at code generation. GCC is one of the compilers that one optimization expert struggled with because he was trying to turn a series of shifts and adds into a single far more efficient multiplication.. the compiler converted it back into a series of shifts and adds on him. Fucking fail.
        
    - It's just outdated knowledge (Score:3, Informative)
      
      by Sycraft-fu ( 314770 ) writes:
      
      People learn a trick way back when, or hear about the trick years later, and assume it is still valid. Not the case. Architectures change a lot and what used to be the best way might not be anymore.
      Michael Abrash, one of the all time greats of optimization, talks about this in relation to some of the old tricks he used to use. One was to use XOR to clear a register on x86. XORing a register with itself gives 0, of course, and turned out to be faster than writing an immediate value of zero into the register
      - Re: (Score:2)
        
        by marcansoft ( 727665 ) writes:
        
        Ok well that's changed now. Our more complex modern CPUs have special logic for clears, and doing a move to the register with 0 is faster. So it was a time limited trick, useful back when he started doing it, but no longer something worth trying.
        I'm definitely no expert on x86, but my impression was that precisely because of this trick that everyone does, modern CPUs still do xor reg,reg at least as fast as moving 0. Because they want existing code to run as fast as possible, and in x86 compatibility-is-kin
        
        Re:It's just outdated knowledge (Score:4, Informative)
        
        by Cassini2 ( 956052 ) writes: on Thursday January 14, 2010 @09:37PM (#30774052)
        
        I'm definitely no expert on x86, but my impression was that precisely because of this trick that everyone does, modern CPUs still do xor reg,reg at least as fast as moving 0.
        You are correct. XOR reg,reg was such a common instruction on the x86, that essentially it became the special case CLR instruction. Essentially, if you see a CLR instruction on an x86 assembly printout, it is the XOR instruction in disguise. The x86 has no CLR instruction.
        Ok well that's changed now. Our more complex modern CPUs have special logic for clears, and doing a move to the register with 0 is faster. So it was a time limited trick, useful back when he started doing it, but no longer something worth trying.
        
        Essentially, all current "simple" CPU instructions execute with the same speed. However, the XOR instruction is still faster than the MOV instruction because of instruction bandwidth and cache effects. Most code today is limited by cache and bandwidth limits, like the need to load instructions into the instruction decode pipeline immediately after a jump instruction. With a 32-bit register, the MOV reg, 0 immediate move is a five-byte instruction (opcode plus a 32-bit immediate), and the XOR reg, reg instruction is a two-byte instruction. As such, in real code, the XOR instruction is usually slightly faster, because it results in smaller code.
        Additionally, all of the modern x86 CPU implementations special case the XOR reg,reg instruction into a MOV reg, 0 immediate move instruction inside the instruction decode stage anyway. As such, no significant functional difference exists. The only case where a move instruction is quicker is when the condition codes are propagating a side-effect via the condition code registers. Thus, in theory:
        ADD AL, AH
        MOV CL, 0
        JC somewhere
        should execute quicker with a MOV instruction as opposed to a XOR instruction. However, in practice, this piece of code:
        XOR CL, CL
        ADD AL, AH
        JC somewhere
        executes with exactly the same speed, because the out-of-order execution units inside the x86 automatically optimize the code and make it equivalent. As such, you are best with the "short small" code, which means that the XOR reg, reg instruction is still the fastest way to do a register clear.
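
The code-size argument can be checked against the standard encodings for a 32-bit register. The sketch below simply stores the machine-code bytes for `mov eax, 0` and `xor eax, eax` in arrays and compares their lengths. The array and function names are hypothetical, and the bytes are data only; nothing here executes them.

```c
#include <stddef.h>

/* Machine-code bytes for two ways to zero EAX (32-bit or 64-bit mode). */
static const unsigned char mov_eax_0[]   = { 0xB8, 0x00, 0x00, 0x00, 0x00 }; /* mov eax, 0   */
static const unsigned char xor_eax_eax[] = { 0x31, 0xC0 };                   /* xor eax, eax */

/* Bytes saved per register clear by using the XOR idiom. */
size_t bytes_saved_by_xor(void)
{
    return sizeof mov_eax_0 - sizeof xor_eax_eax;
}
```

Three bytes per clear is small, but register clears are frequent, and smaller code leaves more room in the instruction cache.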
        
      - Re: (Score:3, Informative)
        
        by SpinyNorman ( 33776 ) writes:
        
        Actually the reason us old fogies normally used XOR A, A rather than LD A, 0 wasn't because it was faster but rather because it was smaller - 1 byte rather than two bytes (instruction + immediate operand). On the old memory constrained 8-bitters, these assembly "tricks" were all about saving a byte here, another byte there...
        
        Re:It's just outdated knowledge (Score:4, Interesting)
        
        by BZ ( 40346 ) writes: on Friday January 15, 2010 @12:06AM (#30775154)
        
        The smaller instructions are still worth it, not so much because of main RAM size constraints but because of cache size constraints. Staying in L1 is great if you can swing it; falling out of L2 blows your performance out of the water.
        Most recently, just iterating over an array and doing a simple op on each entry became about 2x faster on my machine by going from an array of ints to an array of unsigned chars (all the entries are guaranteed in unsigned char range). Reason was, the array of ints was just about the total size of my L2... and the new array is 1/4 the size, which means there's space for other things too (like the code).
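
A rough sketch of that change, assuming fixed-width types so the sizes are predictable. The element count `N` and all names are hypothetical; the original array size and the actual operation performed are not given in the comment, so a simple sum stands in for it.

```c
#include <stddef.h>
#include <stdint.h>

#define N 1024  /* hypothetical element count, for illustration only */

/* The same values stored at two widths: 4 bytes vs 1 byte per entry. */
static uint32_t wide[N];
static uint8_t  narrow[N];

/* Fill both arrays with identical values, all within unsigned char range. */
void fill(void)
{
    for (size_t i = 0; i < N; i++) {
        wide[i]   = (uint32_t)(i % 256);
        narrow[i] = (uint8_t)(i % 256);
    }
}

uint64_t sum_wide(void)
{
    uint64_t s = 0;
    for (size_t i = 0; i < N; i++) s += wide[i];
    return s;
}

uint64_t sum_narrow(void)
{
    uint64_t s = 0;
    for (size_t i = 0; i < N; i++) s += narrow[i];
    return s;
}

/* How many times more memory the wide array occupies. */
size_t footprint_ratio(void) { return sizeof wide / sizeof narrow; }
```

Both loops compute the same result, but the narrow array touches a quarter of the memory. Whether that translates into a speedup depends on whether the wide array fits in cache on the machine in question, so it is worth measuring rather than assuming.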
        
  - Re: (Score:2)
    
    by Just Some Guy ( 3352 ) writes:
    
    so there's no reason to explicitly use shift in code (unless you're doing bit manipulation
    Well, right. The general advice is to always write what you actually want the compiler to do and not how to do it, unless you have specific proof that the compiler's not optimizing it well.
  - In C/C++ shift is not the same as multiply/divide (Score:2, Interesting)
    
    by perpenso ( 1613749 ) writes:
    
    Using shift to multiply is often a great idea on most CPUs.
    In C/C++ shift is not the same as multiply/divide by 2. Multiplication and division operators have a different precedence level than shift operators. Not only is there the possibility of poor optimization but such a substitution may lead to a computational error. For example mul/div has a higher precedence than add/sub, but shift has a lower precedence:
    
    printf(" 3 * 2 + 1 = %d\n", 3 * 2 + 1); printf(" 3 << 1
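
The example above is cut off in the archive. A minimal sketch of the precedence issue it appears to be illustrating follows; the second and third expressions are an assumption about where the comment was heading, not a reconstruction of the original text, and the function names are hypothetical.

```c
/* '*' binds tighter than '+', so this is (3 * 2) + 1. */
int mul_then_add(void) { return 3 * 2 + 1; }

/* '<<' binds looser than '+', so this is 3 << (1 + 1). */
int shift_unparenthesized(void) { return 3 << 1 + 1; }

/* Parentheses restore the intended meaning: (3 << 1) + 1. */
int shift_parenthesized(void) { return (3 << 1) + 1; }
```

The first and third functions return 7, while the unparenthesized shift returns 12. Some compilers warn about the middle expression when warnings are enabled, which is a useful hint that a textual swap of `* 2` for `<< 1` is not safe.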
- Re: (Score:2)
  
  by AuMatar ( 183847 ) writes:
  
  The opposite problem also exists though- by not thinking about performance you can make it expensive or impossible to improve things later without a substantial rewrite. Saying optimize at the end is just as stupid and just as costly. Learning when to care about what level is part of the art of programming. (Although on your specific examples I'll agree with you- especially since I would expect anything but a really old compiler to do mult->shift conversions for you, so you may as well use the more
  - Re: (Score:2)
    
    by Just Some Guy ( 3352 ) writes:
    
    Saying optimize at the end is just as stupid and just as costly.
    There is an enormous difference between optimization and choosing appropriate algorithms. If you write a program well, it's almost always easy to optimize it later. If you write it poorly, it'll almost always be impossible to optimize at any point of its development. For example, I'd rather sort a big array with an unoptimized (but correct) quicksort than with an extremely clever (but insane) bogosort.
  - Re: (Score:3, Interesting)
    
    by smash ( 1351 ) writes:
    
    by not thinking about performance you can make it expensive or impossible to improve things later without a substantial rewrite.
    
    "Not thinking about performance" is different from writing in high level first.
    Get the algorithm right first, THEN optimise hot spots.
    Starting out with ASM makes it a lot more time consuming/difficult to get many different algorithms written, debugged and tested. The time you spend doing that is time better spent testing/developing a better algorithm. Only once you get the
- Re: (Score:3, Informative)
  
  by tomtefar ( 935007 ) writes:
  
  I have the following sticker on top of my display: "Make it work before you make it fast!" Saved me many hours of work.
- Re: (Score:2, Interesting)
  
  by Anonymous Coward writes:
  
  I think that the premature optimization claims are way overdone. In the cases where performance does not matter, then sure, make the code as readable as possible and just accept the performance.
  However, sometimes it is known from the beginning of a project that performance is critical and that achieving that performance will be a challenge. In such cases, I think that it makes sense to design for performance. That rarely means using shifts to multiply -- it may, however, mean that you design your data st
  - Re: (Score:2)
    
    by Just Some Guy ( 3352 ) writes:
    
    Interesting anecdote that has nothing to do with optimization and everything to do with bad design. Optimization is great for making your program run n% faster. Design is great for making your program run in O(log n) time instead of O(n^2) time. The important part is to come up with a good design, implement it, and address the specific problem areas. I can't think of a single justification for doing it any other way.
  - Re: (Score:2)
    
    by smash ( 1351 ) writes:
    
    On the contrary, I'd be more concerned that the medical software is CORRECT. You can throw more hardware at the problem to make it faster. You can't throw more hardware at the problem to correct bugs.
  - Re:Premature optimization is evil... and stupid (Score:5, Insightful)
    
    by kc8apf ( 89233 ) writes: <kc8apf@kc[ ]f.net ['8ap' in gap]> on Friday January 15, 2010 @03:12AM (#30776028) Homepage
    
    Having spent 4 years being one of the primary developers of Apple's main performance analysis tools (CHUD, not Instruments) and having helped developers from nearly every field imaginable tune their applications for performance, I can honestly say that regardless of your performance criteria, you shouldn't be doing anything special for optimization when you first write a program. Some thought should be given to the architecture and overall data flow of the program and how that design might have some high-level performance limits, but certainly no code should be written using explicit vector operations and all loops should be written for clarity. Scalability by partitioning the work is one of those items that can generally be incorporated into the program's architecture if the program lends itself to it, but most other performance-related changes depend on specific usage cases. Trying to guess those while writing the application logic relies solely on intuition which is usually wrong.
    After you've written and debugged the application, profiling and tracing is the prime way for finding _where_ to do optimization. Your experiences have been tainted by the poor quality of tools known by the larger OSS community, but many good tools are free (as in beer) for many OSes (Shark for OS X as an example) while others cost a bit (VTune for Linux or Windows). Even large, complex multi-threaded programs can be profiled and tuned with decent profilers. I know for a fact that Shark is used to tune large applications such as Photoshop, Final Cut Pro, Mathematica, and basically every application, daemon, and framework included in OS X.
    What do you do if there really isn't much of a hotspot? Quake 3 was an example where the time was spread out over many C++ methods so no one hotspot really showed up. Using features available in the better profiling tools, the collected samples could be attributed up the stack to the actual algorithms instead of things like simple accessors. Once you do that, the problems become much more obvious.
    What do you do after the application has been written and a major performance problem is found that would require an architectural change? Well, you change the architecture. The reason for not doing it during the initial design is that predicting performance issues is near impossible even for those of us who have spent years doing it as a full time job. Sure, you have to throw away some code or revisit the design to fix the performance issues, but that's a normal part of software design. You try an approach, find out why it won't work, and use that knowledge to come up with a new approach.
    The largest failing I have seen in my experience has been the lack of understanding by management and engineers that performance is a very iterative part of software design and that it happens late in the game. Frequently, schedules get set without consideration for the amount of time required to do performance analysis, let alone optimization. Then you have all the engineers who either try to optimize everything they encounter and end up wasting lots of time, or they do the initial implementation and never do any profiling.
    Ultimately, if you try to build performance into a design very early, you end up with a big, messy, unmaintainable code base that isn't actually all that fast. If you build the design cleanly and then optimize the sections that actually need it, you have a more maintainable code base that meets the requirements. Be the latter.
    
- Re:Premature optimization is evil... and stupid (Score:4, Insightful)
  
  by epine ( 68316 ) writes: on Friday January 15, 2010 @08:15AM (#30777476)
  
  That's the main reason why I want to shoot people who write "clever" code on the first pass.
  Over the years, I've grown to hate this meme. Not because it isn't right, but because it stops ten floors below the penthouse of human potential.
  First of all, it's an incredible instance of cultural drift. In the mid 1980s, when this meme was halfway current, I worked on adding support for Asian characters to an Asian-made PC. On the "make it right" pass it took 15s to update the screen after pressing the page down key, and this from assembly language. Slower than YouTube over 300 baud. It was doing a lot of pixel swizzling it shouldn't have been, because the fonts were supplied in a format better suited to printing. This was an order of magnitude below an invitation to whiffle-ball training camp. This was Lance Armstrong during his chemotherapy years towing a baby trailer. Today you get 60fps with a 100 thousand or a 100 million polygons, I've sort of lost track.
  Let's not shunt performance onto the side track of irrelevancy. While there's no good excuse, ever, for writing faulty code, an enlightened balance between starting out with an approach you can live with, and exploiting necessary cleverness *within your ability* goes a long way.
  How about we update Knuth's arthritic maxim? Don't tweak what you don't grok. If you grok, use your judgement. Exploit your human potential. Live a little.
  The books I've been reading lately about the evolution of skills in the work place suggest that painstaking reductive work processes are on their way to India. Job security in home world is greatly enhanced if you can navigate multiple agendas in tandem, exploiting more of that judgement thing.
  One of the reasons Carmack became so successful is that he didn't waste his effort looking for excuses to deprive his co-workers of their oxygen bits. Instead he conducted shrewd excursions along the edge of the envelope in pursuit of the sweet spot between cleverness too oppressive to live with, and no performance at all.
  In my day of deprecating my elders, I always knew where the pea was hidden under the mattress. These days, there are so many squishy mattresses stacked one upon the other, I have to plan my work day with a step ladder. Which I think is what this unwatchable cult-encoded video is on about: the ankle level view most of us never see any more.
  Here's another thing. If you're going to be clever about how you code something, also be clever about how you do it. In other words, be equally clever at all levels of the solution process simultaneously: algorithm selection, implementation, commenting, software engineering, documentation, and unit test. Knuth got away with TeX, barely, for precisely this reason. Because of his cleverness, the extension to handle Asian languages was far from elegant. Because of his cleverness (in making everything else run extremely well), people actually wanted to extend TeX to handle Asian languages. So who's to say he was wrong? Despite his cleverness, he managed to keep his booboo score in single or low double digits. His bug tracking database fit nicely on an index card.
  In the modern era, people quote the old "make it right before you make it faster" as the cure for the halitosis of ineptitude: you're feeble and irritating, so practice your social graces. Don't make me come over there and choke off your oxygen bit. It's a long ways from saying "you have a lot of human potential, and not much experience, so let me help you confront the challenges in a meaningful way". These sayings leak a lot of sentiment about social engagement.
  Every so often I have to pull up a chair beside a junior resource and go "Dude, you're jousting at windmills here, let's roll that change back and try again. I know you can do better." Five minutes of war stories about how to shoot yourself in the foot six ways from Sunday is usually enough to rebalance the flywheel of self preservation.
It's not just x86 (Score:4, Informative)

by RzUpAnmsCwrds ( 262647 ) writes: on Thursday January 14, 2010 @08:41PM (#30773542)

Features like out-of-order execution, caches, and branch prediction/speculation are commonplace on many architectures, including the next generation ARM Cortex-A9 and many POWER, SPARC, and other RISC architectures. Even in-order designs like Atom, Cortex-A8, or POWER6 have branch prediction and multi-level caches.
The most important thing for performance is to understand the memory hierarchy. Out-of-order execution lets you get away with a lot of stupid things, since many of the pipeline stalls you would otherwise create can be re-ordered around. In contrast, the memory subsystem can do relatively little for you if your working set is too large and you don't access memory in an efficient pattern.
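
As an illustration of what an efficient access pattern means in practice, the sketch below sums the same 2D array in two traversal orders. The array dimensions and function names are hypothetical, and the array here is deliberately tiny; any timing difference only shows up once the data is much larger than the cache.

```c
#include <stddef.h>

#define ROWS 64  /* hypothetical sizes, for illustration only */
#define COLS 64

static int grid[ROWS][COLS];

/* Fill the grid with deterministic values. */
void fill_grid(void)
{
    for (size_t r = 0; r < ROWS; r++)
        for (size_t c = 0; c < COLS; c++)
            grid[r][c] = (int)(r + c);
}

/* Row-major order: walks memory sequentially, one cache line after another. */
long sum_row_major(void)
{
    long s = 0;
    for (size_t r = 0; r < ROWS; r++)
        for (size_t c = 0; c < COLS; c++)
            s += grid[r][c];
    return s;
}

/* Column-major order: jumps COLS * sizeof(int) bytes between accesses. */
long sum_col_major(void)
{
    long s = 0;
    for (size_t c = 0; c < COLS; c++)
        for (size_t r = 0; r < ROWS; r++)
            s += grid[r][c];
    return s;
}
```

Both functions return the same sum. The difference is purely in the order memory is touched, which is exactly the part that out-of-order execution cannot hide once the working set no longer fits in cache.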

I hate flash video (Score:2)

by Omnifarious ( 11933 ) * writes:

I wish they'd all just use HTML5 or put it on YouTube so I can use youtube-dl or something. Otherwise it either doesn't work at all (my amd64 Linux boxes) or is slow and jerky (my Mac OSX box). It's really frustrating.
Kung-Fu and Ninjitsu...They're not dead! (Score:2)

by geekmux ( 1040042 ) writes:

This just in...Apparently Bruce Lee and Lee Van Cleef are alive and well and working for Intel, which likely accounts for all the "crazy kung-fu and ninjitsu" going on there...
rule of the code (Score:3, Informative)

by Bork ( 115412 ) writes: on Thursday January 14, 2010 @09:38PM (#30774060) Homepage

Just write good clean code that works properly first. The only time you optimize is after it has been profiled to see if there are troublesome spots. Given the way CPUs run and how compilers are designed, there is very little need to do optimization. Unless you have taken some serious courses on how current CPUs work, your efforts will mostly result in bad code that gains you nothing with respect to speed. Your time is better spent on writing CORRECT code.
The compilers are very good at proper loop unrolling, rearranging branches, and moving instructions around to keep the CPU pipeline full. They will also look for unnecessary/redundant instructions within a loop and move them to a better spot.
One of the courses I took was programming for parallelism. For extra credit, the instructor assigned a 27K x 27K matrix multiply; the person with the best time got a few extra points. A lot of the class worked hard in trying to optimize their code to get better times; I got the best time by playing with the compiler flags.

- Re: (Score:3, Informative)
  
  by XMunkki ( 533952 ) writes:
  
  I agree that many low-level programming methods aren't that necessary anyhow, but there is one big point where the compiler cannot help much, and that is data layout. Big hits come from all levels of cache misses, and it's good for the programmer to be aware of this and benchmark the memory access patterns and try to make them good (predictable, linear, clumping frequently used data, etc). Also, on some hardware, load-hit-stores are something to be aware of as well. A reasonable thing to do, when optimizin
- Re: (Score:3, Informative)
  
  by wirelessbuzzers ( 552513 ) writes:
  
  One of the courses I took was programming for parallelism. For extra credit, the instructor assigned a 27K x 27K matrix multiply; the person with the best time got a few extra points. A lot of the class worked hard in trying to optimize their code to get better times; I got the best time by playing with the compiler flags.
  Really? Because I had a similar assignment (make Strassen's algorithm as fast as possible, in the 5-10k range) in my algorithms class a while back. I found that the key to a blazing fast program was careful memory layout: divide the matrix into tiles that fit into L1, transpose the matrix to avoid striding problems. Vectorizing the inner loops got another large factor. Compiling with -msse3 -march=native -O3 helped, but the other two were critical and took a fair amount of effort.
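For anyone wondering what "tile it and transpose it" looks like in practice, here is a minimal sketch. It is a plain O(n^3) multiply, not Strassen's algorithm and not the code from that assignment; the `TILE` value and the function names are assumptions for illustration. The idea is that B is transposed once up front so the inner loop reads both operands with unit stride, and the loops are blocked so each tile gets reused while it is still in cache.

```c
#include <stddef.h>

#define TILE 32  /* assumed tile edge; tune it so the working tiles fit in L1 */

/* out = transpose of in; both are n x n, row-major. */
void transpose(const double *in, double *out, size_t n)
{
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n; j++)
            out[j * n + i] = in[i * n + j];
}

/* c = a * b, where bt is b already transposed. All n x n, row-major. */
void matmul_tiled(const double *a, const double *bt, double *c, size_t n)
{
    for (size_t i = 0; i < n * n; i++)
        c[i] = 0.0;

    for (size_t ii = 0; ii < n; ii += TILE)
        for (size_t jj = 0; jj < n; jj += TILE)
            for (size_t kk = 0; kk < n; kk += TILE) {
                /* Clamp tile bounds so n need not be a multiple of TILE. */
                size_t i_end = ii + TILE < n ? ii + TILE : n;
                size_t j_end = jj + TILE < n ? jj + TILE : n;
                size_t k_end = kk + TILE < n ? kk + TILE : n;

                for (size_t i = ii; i < i_end; i++)
                    for (size_t j = jj; j < j_end; j++) {
                        double sum = 0.0;
                        /* Row i of a and row j of bt are both read sequentially. */
                        for (size_t k = kk; k < k_end; k++)
                            sum += a[i * n + k] * bt[j * n + k];
                        c[i * n + j] += sum;
                    }
            }
}
```

The unit-stride inner loop is also the shape a vectorizer (or hand-written SSE) wants, which is presumably why the layout work had to come before the vectorization paid off.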
...except for the uControllers I use. (Score:3, Interesting)

by podom ( 139468 ) writes: on Friday January 15, 2010 @01:26AM (#30775570) Homepage

I watched about half of his presentation. I was amused because on a lot of the slides he says something like "except on really low end embedded CPUs." I spend a lot of my time programming (frequently in assembly) for these exact very low end CPUs. I haven't had to do much with 8-bit cores, fortunately, but I've been doing a lot of programming on a 16-bit microcontroller lately (EMC eSL).
I suspect the way I'm programming these chips is a lot like how you would have programmed a desktop CPU in about 1980, except that I get to run all the tools on a computer with a clock speed 100x the chip I'm programming (and at least 1000x the performance). I am constantly amazed by how little we pay for these devices: ~10 Mips, 32k RAM, 128k Program memory, 1MB data memory and they're $1.
But they do have a 3-stage pipeline, so I guess some of what Dr. Cliff says still applies.

What about prefetching? (Score:3, Interesting)

by Mr Z ( 6791 ) writes: on Friday January 15, 2010 @11:09AM (#30779052) Homepage Journal

That was a fabulous presentation, and one that I'll likely hold onto a copy of, since it describes the issue of SMP memory ordering with a great example. I'll have to write "presenter notes" for those slides, since I can't get the video to come up, but that's OK. I understand what's going on there.
One thing I thought was notably absent was any discussion of data prefetch. With all of the emphasis on how performance is dominated by cache misses, you'd think he'd give at least a nod to both automatic hardware and compiler directed software prefetch. After all, he mentions CMT, which is a more exotic way to hide memory latency, IMHO.
On a different note: In the example on slides 23 - 30, he shows an example where speculation allowed two cache misses to pipeline, bringing the cost-per-miss down to about half. Dunno if he highlighted the synergy here in the talk, because it wasn't highlighted in the presentation. It is useful to note, though, how overlapping cache misses reduces their cost. There can be even more synergy here than is otherwise obvious: In HPCA-14, there was a fascinating paper [utah.edu] (slides [utah.edu]) about how incorrect speculation can still speed up programs due to misses on the incorrectly-speculated path still bringing in relevant cache lines.
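For readers who have not seen compiler-directed software prefetch, here is a minimal sketch using GCC's `__builtin_prefetch` (also accepted by Clang). The function name and the `PREFETCH_DISTANCE` value are assumptions for illustration; the right look-ahead is workload-specific and has to be measured. The loop is an indirect gather, a pattern where the next address is not a simple stride from the last one.

```c
#include <stddef.h>

#define PREFETCH_DISTANCE 16  /* assumed look-ahead in elements; tune by measurement */

/* Sums values[index[i]] for i in [0, count), asking for a later
   element to be fetched while the current one is being used. */
long gather_sum(const long *values, const size_t *index, size_t count)
{
    long total = 0;
    for (size_t i = 0; i < count; i++) {
        if (i + PREFETCH_DISTANCE < count) {
            /* A hint only. Arguments: address, rw (0 = read), locality (0..3). */
            __builtin_prefetch(&values[index[i + PREFETCH_DISTANCE]], 0, 1);
        }
        total += values[index[i]];
    }
    return total;
}
```

The prefetch never changes the result, only (possibly) the timing, so it is safe to add and remove while benchmarking.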

- Re:Could someone give me a crash course (Score:5, Funny)
  
  by Lunix Nutcase ( 1092239 ) writes: on Thursday January 14, 2010 @07:26PM (#30772674)
  
  Probably due to your x86 processor doing all sorts of monkeying with the code.
  
  - Re: (Score:3, Funny)
    
    by __aaclcg7560 ( 824291 ) writes:
    
    Spaghetti code can be hard to digest.
    - Re:Could someone give me a crash course (Score:5, Funny)
      
      by Icegryphon ( 715550 ) writes: on Thursday January 14, 2010 @08:00PM (#30773126)
      
      Spaghetti code can be hard to digest.
      Sounds to me like someone is using stale Copypasta.
      
      - Re: (Score:2)
        
        by funwithBSD ( 245349 ) writes:
        
        They made the meatballs out of DEADBEEF.
        
        Re: (Score:2)
        
        by networkBoy ( 774728 ) writes:
        
        gotten at the 0xCAFE 0F DEAD BEEF
        
        Re: (Score:3, Informative)
        
        by Eudial ( 590661 ) writes:
        
        I hear they have nice 0xC0FFEE
  - Re: (Score:2)
    
    by TeknoHog ( 164938 ) writes:
    
    Incidentally, my most reliable Flash player is found on a Nokia N800, running Linux on ARM. Fortunately there are ways to download the video file in many cases.
- Re:Code in high-level (Score:5, Insightful)
  
  by caerwyn ( 38056 ) writes: on Thursday January 14, 2010 @07:29PM (#30772710)
  
  That's not entirely true. In performance-sensitive tight loops, it can still make sense to code in ASM to avoid pipeline bubbles and stalls in some very limited situations. Also, the compiler doesn't always take advantage of instructions that it could use.
  However, determining that takes a lot of effort and a lot of instrumentation, and so you'd better really need that last bit of performance before you go after it.
  
  - Re: (Score:2, Interesting)
    
    by Com2Kid ( 142006 ) writes:
    
    Also, the compiler doesn't always take advantage of instructions that it could use.
    Yah sorry about that. :)
    Part of the problem is that compilers have to support a variety of instruction sets, and if the majority of the customers are using an 8 year old revision of an instruction set, even if the newest revision offers Super Awesome Cool features that make code run a lot faster, well you end up with a chicken and egg problem where it makes sense for the compiler team to focus on the old architecture since th
    - Re: (Score:3, Informative)
      
      by cheekyboy ( 598084 ) writes:
      
      Intel's compilers have options to optimize for more than one target, and the runtime dispatcher picks the code path that was built for the CPU it is running on. Sure, your binary is larger, but everyone is happy.
  - Re:Code in high-level (Score:4, Interesting)
    
    by Chris Burke ( 6130 ) writes: on Thursday January 14, 2010 @07:46PM (#30772928) Homepage
    
    That's not entirely true. In performance-sensitive tight loops, it can still make sense to code in ASM to avoid pipeline bubbles and stalls in some very limited situations. Also, the compiler doesn't always take advantage of instructions that it could use.
    Yeah and the chip makers release software optimization guides regarding how to avoid such stalls or take advantage of other features, and it's really hard to do that at the C level, and it can be hard for the compiler to know that a certain situation calls for one of these optimizations.
    However, determining that takes a lot of effort and a lot of instrumentation, and so you'd better really need that last bit of performance before you go after it.
    Agreed, it's basically something you're going to do for the most performance critical part, like the kernel of an HPC algorithm for example.
    
  - Re:Code in high-level (Score:5, Informative)
    
    by Sycraft-fu ( 314770 ) writes: on Thursday January 14, 2010 @07:50PM (#30772966)
    
    Also either start with the assembly the compiler generates, or at the very least make sure to bench your own against what it makes. The Intel Compiler in particular is extremely good at what it does. As such, it is worth your while to see what its solution to your problem is, and then see if you can improve, rather than assuming you are smarter and can do everything better on your own.
    Of course all that is predicated on using a profiler first to find out where the actual problem is. Abrash accurately pointed out years ago that programmers suck at that. They'll spend hours making a nice optimized function that ends up making no noticeable difference in execution time.
    
    - Re: (Score:3, Insightful)
      
      by afidel ( 530433 ) writes:
      
      In general, modern compilers are good enough that you are much more likely to get better performance by spending the time finding a better algorithm than you are by hand-optimizing the code. Obviously for things like H.264 where the algorithm is already set this is not true, but that's a very small fraction of the code out there.
  - Re: (Score:2)
    
    by phantomfive ( 622387 ) writes:
    
    One of the biggest drawbacks of a language like C (and even more C++, and even more Java), is that they don't give you a whole lot of control of how stuff is arranged in memory. One of the biggest processor slowdowns, especially if you are dealing with a lot of data, is cache misses. If you can align your data in memory on cache lines, then you can make huge performance gains. Since C doesn't give you much control over this, if you really want to optimize it you have to go to assembly.
    
    Also, some of
    - Re:Code in high-level (Score:4, Interesting)
      
      by dr2chase ( 653338 ) writes: on Thursday January 14, 2010 @08:21PM (#30773324) Homepage
      
      Dealing with alignment is not that much of an assembler issue, if you are using C. Address arithmetic gets the job done. If you even want your globals aligned (and not just heap-allocated stuff) you *might* need some ASM, but just for the declarations of stuff that would be "extern struct whatever stuff" in C (and in a pinch, you write a bit of C code to suck in the headers defining "stuff", figure out the sizes, and emit the appropriate declarations in asm).
      Writing memmove/memcpy in assembler is a mixed bag. If you write it in C, you can preserve some tiny fraction of your sanity dealing with all the different alignment combinations before you get to full-word loads and stores. HOWEVER, on the x86, all bets are off; the only way to tell for sure what is fastest is to write it and benchmark it.
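The "address arithmetic gets the job done" part can be sketched in a few lines of C. This is only one way to do it, and the helper names (`align_up`, `alloc_aligned`) are made up for this example: over-allocate from `malloc`, round the address up to the next multiple of the alignment, and keep the original pointer around for `free`.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Rounds p up to the next multiple of align.
   align must be a power of two. */
void *align_up(void *p, size_t align)
{
    uintptr_t addr = (uintptr_t)p;
    uintptr_t mask = (uintptr_t)align - 1;
    return (void *)((addr + mask) & ~mask);
}

/* Over-allocates by align - 1 bytes and returns an aligned pointer
   inside the block. *raw receives the pointer that must later be
   passed to free(). Returns NULL if the allocation fails. */
void *alloc_aligned(size_t size, size_t align, void **raw)
{
    *raw = malloc(size + align - 1);
    if (*raw == NULL)
        return NULL;
    return align_up(*raw, align);
}
```

Calling `alloc_aligned(bytes, 64, &raw)` would give a buffer starting on a 64-byte boundary, assuming 64 bytes is the cache line size on the target; that figure varies by CPU and should be checked rather than trusted.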
      
    - Re:Code in high-level (Score:5, Informative)
      
      by TheRaven64 ( 641858 ) writes: on Thursday January 14, 2010 @09:44PM (#30774106) Journal
      
      One of the biggest drawbacks of a language like C (and even more C++, and even more Java), is that they don't give you a whole lot of control of how stuff is arranged in memory
      I'd say this is more of a C/C++ problem than a Java problem. Or, rather, they are different problems. The problem with C and C++ is that they do give the programmer a whole lot of control about how things are arranged in memory. They don't, on the other hand, give the compiler a lot of freedom to rearrange things.
      Java, on the other hand, uses the Smalltalk memory model and so the compiler (and/or JVM) is free to rearrange things in memory as much as it wants to (whether it does, of course, is a matter for the compiler writer). For example, a Java compiler that notices that you are doing the same operation on three instance variables is free to put them next to each other aligned on a 128-bit boundary with some padding at the end so that you can easily use vector instructions on them, even if they were originally declared in different classes. A C compiler can not do this with structure fields.
      If you really care about alignment in C, you are free to use valloc() to align on a page boundary and then subdivide the memory yourself. Most of the time, however, it's not worth the effort.
      
  - Re: (Score:2)
    
    by dbIII ( 701233 ) writes:
    
    Also there is code that is used a lot for a long time.
    For example in geophysics there is a process of arranging data called "Pre Stack Time Migration" which can keep a small cluster busy for a week with relatively small datasets. In cases like that tiny improvements save hours. Only one percent of improvement saves more than an hour in a week.
    - Re: (Score:2)
      
      by wisty ( 1335733 ) writes:
      
      I heard a rumor that there's some fundamental geophysical program that's been around for decades. It doesn't accumulate the results in an array, because memory was too expensive when fortran 66 was the hot new thing.
      It has a write-to-disk instruction in an inner loop. But it works, and nobody wants to touch it.
      A little micro-optimization there would grant a 1000x speedup.
      - Re: (Score:2)
        
        by fuzzyfuzzyfungus ( 1223518 ) writes:
        
        Presumably, if faced with such a program and unwilling to alter it, wouldn't a ramdisk be the logical course of action?
        
        Takes about 30 seconds to set up in most any modern OS, all but the cheapest and nastiest contemporary systems have enough RAM that you can safely carve out something larger than any HDD of the fortran66 era, and (while not as fast as using RAM properly) it should run like a bat out of hell compared to any actual disk....
  - Re: (Score:3, Informative)
    
    by RzUpAnmsCwrds ( 262647 ) writes:
    
    It also depends on the compiler. GCC, for example, sucks at auto-vectorization, so it's easy to get 30% or more on loopy scientific code just by using SSE instructions properly.
    In contrast, PGI or ICC is much harder to beat using assembly.
    - Re:Code in high-level (Score:4, Interesting)
      
      by TheRaven64 ( 641858 ) writes: on Thursday January 14, 2010 @09:47PM (#30774128) Journal
      
      Note that even with GCC, the choices aren't just autovectorisation and assembly. GCC provides (portable) vector types, and if you declare your variables as these then it just has to try to use SSE / AltiVec / Whatever instructions for the operations, and it can easily because your variables are aligned. Primitive operations (i.e. the ones you get on scalars in C) are defined on vectors and so you can do 2^n of them in parallel and GCC will emit the relevant instructions depending on your target CPU. Going a step further, there are intrinsic functions that are specific to a particular vector ISA and can be used with these. Then you get to tell GCC exactly which instruction to use, but it still does all of the register allocation for you.
      
      - Re:Code in high-level (Score:5, Informative)
        
        by TheRaven64 ( 641858 ) writes: on Friday January 15, 2010 @08:52AM (#30777694) Journal
        
        The GCC manual tells you everything you need to know. First you declare a vector type, so if you want four shorts representing an RGBA colour value, you declare a type like this:
        typedef short colour_t __attribute__ ((vector_size (4 * sizeof(short))));
        
        This will give you a 64-bit vector type, so you can fit one in an MMX register, or two in an SSE or AltiVec register. You can then create these and do simple operations on them. For example, if you wanted to add two together, you could do this:
        colour_t a = {1,2,3,4}; colour_t b = {1,2,3,4}; colour_t c = a + b;
        
        In this case, the add is constant so it will be evaluated at compile time, but in the case where a and b have unknown values GCC will emit either four scalar add operations or one 64-bit vector add.
        You can also pass them as arguments to vector intrinsics, which are listed in the manual under target-specific builtins. These correspond directly to a single underlying vector instruction, so if you look in the assembly language reference for the target CPU then you will find a detailed explanation of what each one does.
        Rather than declare vector types directly, it's often a good idea to declare unions of vector and array types. This lets you use the same value as both an array and a vector.
        I wrote a longer explanation a while ago [informit.com].
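The union trick mentioned above can be sketched roughly as follows. The `colour_u` type and `colour_add` function are names invented for this example, and the code relies on the GCC vector extension (Clang accepts it too), so treat it as an illustration rather than portable C.

```c
/* Four shorts packed into one 64-bit vector, as in the typedef above. */
typedef short colour_t __attribute__ ((vector_size (4 * sizeof(short))));

/* The same storage viewed either as one vector or as four channels. */
typedef union {
    colour_t vec;
    short    chan[4];
} colour_u;

/* Adds two colours with a single vector expression; the caller can
   then read individual channels back through the array member. */
colour_u colour_add(colour_u a, colour_u b)
{
    colour_u out;
    out.vec = a.vec + b.vec;
    return out;
}
```

With this you can fill in `chan[0]` through `chan[3]` one at a time, do the arithmetic on `vec`, and read the channels back without any explicit unpacking code.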
        
    - Re: (Score:2)
      
      by keeboo ( 724305 ) writes:
      
      It also depends on the compiler. GCC, for example, sucks at auto-vectorization, so it's easy to get 30% or more on loopy scientific code just by using SSE instructions properly.
      In contrast, PGI or ICC is much harder to beat using assembly.
      ICC does great work with auto-vectorization.
      Yet (perhaps it's no longer true), ~3 years ago I had problems with ICC generating wrong code in certain situations. I went back to GCC.
  - Re: (Score:2)
    
    by frank_adrian314159 ( 469671 ) writes:
    
    In performance-sensitive tight loops, it can still make sense to code in ASM to avoid pipeline bubbles and stalls in some very limited situations.
    And that will work until the next rev of the board's chip, which your hardware vendor will change when he wants to and not notify you about. You'll know about it when the customer complaints roll in about poor performance or during your next rev of the firmware when your performance stats go to hell. And, if you're trying to do this for COTS hardware, forget it
    - Re: (Score:3, Insightful)
      
      by The_Wilschon ( 782534 ) writes:
      
      It all depends on your problem domain. As a high energy physicist, I write plenty of code that only I, a postdoc, and maybe a couple of other grad students will ever see, and probably I'm the only one that will actually ever use it. I'm designing a small cluster that will get built here in a month or few, and some of my code will take up about 2 months of solid run time on it, then never see the light of day again. If I can spend 2 days getting a 5% performance improvement, even at the expense of locking the co
  - - Re: (Score:3, Insightful)
      
      by caerwyn ( 38056 ) writes:
      
      That's *generally* true. It's not *always* true.
      There are a lot of purely compute-bound applications (think simulations of various sorts, etc) for which the algorithmic optimizations have already been done- but it's still worth going for the last few percent of performance from "instruction fiddling". As another poster said: if your app runs for weeks at a time, 1% improvement becomes significant in terms of time saved- and throwing more hardware at the problem isn't always feasible.
- Re: (Score:3, Insightful)
  
  by Thiez ( 1281866 ) writes:
  
  Sometimes it's just plain FUN FUN FUN to code in asm. You're right that most programmers will never have a need for it at all (with some exceptions, such as those messing with operating systems or embedded systems), although knowing some ASM can help a lot with debugging. I suppose one could (read: should) learn a little ASM to have a better idea of what the hardware is doing, this will allow you to optimize your code a little, or (more importantly) write it in such a way that makes it easier for the compil
  - Re:Code in high-level (Score:4, Informative)
    
    by marcansoft ( 727665 ) writes: <hector@nOsPaM.marcansoft.com> on Thursday January 14, 2010 @07:56PM (#30773066) Homepage
    
    Coding in x86 ASM is never fun. Weird and odd and masochistically pleasurable for some, maybe, but not fun. Other architectures, on the other hand (like ARM), can be fun. x86-64 manages to increase the "funness" value somewhat, but I still wouldn't quite qualify it as "fun".
    On the other hand, it's very true that knowing some ASM can help you write code that the compiler will translate into better assembly code, without going through all of the trouble yourself.
    
    - Re:Code in high-level (Score:4, Interesting)
      
      by SETIGuy ( 33768 ) writes: on Thursday January 14, 2010 @10:34PM (#30774502) Homepage
      
      Coding in x86 ASM is never fun. Weird and odd and masochistically pleasurable for some, maybe, but not fun. Other architectures, on the other hand (like ARM), can be fun.
      Coding assembly on RISC architectures is dead boring because all the instructions do what you expect them to and can be used on any general purpose register.
      In the good old days, when x86 was 8086 there were no general purpose registers. The BX register could be used for indexing, but AX, CX and DX couldn't. CX could be used for counts (bit shifts, loops, string moves), but AX, BX, and DX couldn't. SI and DI were index registers that you could add to BX when dereferencing or could be used with CX for string moves. AX and DX could be used in a pair for a 32 bit value. If you wanted to multiply, you needed to use AX. If you wanted to divide, you needed to divide DX:AX by a 16 bit value and your result would end up in AX and the remainder in DX. Compared to the Z80 assembly language, we thought this was easy.
      Being able to use %r2 for the same stuff you use %r1 for is just boring.
      
    - Re: (Score:2)
      
      by keeboo ( 724305 ) writes:
      
      Coding in x86 ASM is never fun. Weird and odd and masochistically pleasurable for some, maybe, but not fun. Other architectures, on the other hand (like ARM), can be fun. x86-64 manages to increase the "funness" value somewhat, but I still wouldn't quite qualify it as "fun".
      No need to go that far.
      68k ASM is pure heaven compared to x86.
    - - Re:Code in high-level (Score:4, Informative)
        
        by TheRaven64 ( 641858 ) writes: on Thursday January 14, 2010 @09:49PM (#30774146) Journal
        
        The calling convention is complicated, but it's nowhere near as different as IA32 calling conventions between platforms. Linux and FreeBSD, for example, use different rules for when to return a structure on the stack and when to return it in registers on IA32, but they use exactly the same conventions (the SysV ABI) on x86-64.
        
- Re:Code in high-level (Score:4, Interesting)
  
  by __aaclcg7560 ( 824291 ) writes: on Thursday January 14, 2010 @07:34PM (#30772784)
  
  I wanted to take ASM in college. I was the only student who showed up for the class and the class was canceled. Since most of the programming classes were Java-centric, no one wanted to get their hands dirty under the hood.
  
  - Re: (Score:3, Interesting)
    
    by KC1P ( 907742 ) writes:
    
    That's a real shame! But my impression is that for a long time now, college-level assembly instruction has consisted almost entirely of indoctrinating the students to believe that assembly language programming is difficult and unpleasant and must be avoided at all costs. Which couldn't be more wrong -- it's AWESOME!
    Even on the x86 with all its flaws, being able to have that kind of control makes everything more fun. The fact that your code runs like a bat out of hell (unless you're a BAD assembly program
    - Re: (Score:2)
      
      by s73v3r ( 963317 ) writes:
      
      Wow, that sucks. My college ASM class was AWESOME! Granted, it was probably only there to give us a feeling for what was going on under the hood, not to actually learn x86 assembly, but it was taught by a guy who not only was very knowledgeable about the subject, but was also really enthusiastic (even for being upwards of 70!).
  - Re: (Score:2)
    
    by Dunbal ( 464142 ) writes:
    
    I think you can legally get MASM (Microsoft Macro Assembler) somewhere on the internet for free. A good place to start would be Microsoft. Then you can do what real coders do, and teach yourself!
    And to think I paid several hundred dollars for that, back in the day.
    - Re: (Score:2, Informative)
      
      by Anonymous Coward writes:
      
      Or you could get NASM, which is open source :)
      - Re: (Score:2)
        
        by __aaclcg7560 ( 824291 ) writes:
        
        Sweet! The last time that I looked at ASM, I had to run a DOS box under Windows XP, which didn't work out too well.
        
        Re: (Score:2)
        
        by KC1P ( 907742 ) writes:
        
        Or you could get WASM (part of the Open Watcom package at www.openwatcom.org) which is open-source AND uses something approaching standard syntax.
        NASM unfortunately falls into the common trap of figuring that, since MASM-style syntax has a lot wrong with it, the syntax should be changed. But as with all such projects, the syntax is changed to fit someone's particular taste, and now you'll write source code which isn't compatible with anything. And IMHO NASM's syntax is no improvement over MASM anyway. AL
      - Re: (Score:2)
        
        by mfnickster ( 182520 ) writes:
        
        Will NASM let you write structured assembly, like MASM?
        I picked up a used copy of Inner Loops [amazon.com] by Rick Booth, and it intrigued me enough to consider tracking down an old version of MASM.
  - Re: (Score:2)
    
    by Kjella ( 173770 ) writes:
    
    I wanted to take ASM in college. I was the only student who showed up for the class and the class was canceled. Since most of the programming classes were Java-centric, no one wanted to get their hands dirty under the hood.
    I'm probably going to need an asbestos suit for this post, but to be honest I don't think assembler is a good programming language for humans. My impression is that they absolutely don't want to pollute the instruction set with instructions unless there's a performance benefit to doing so. But what it means in practice is that anyone I've seen writing advanced assembly relies on lots and lots of macros to do essential things, because the combination of instructions is useful but there's no language construc
    - Re: (Score:2)
      
      by AdamHaun ( 43173 ) writes:
      
      It's not a great language (family) for general use, but it is a good way to learn something about how CPUs work, what a function call actually is, etc.
    - Re: (Score:2)
      
      by ChrisMaple ( 607946 ) writes:
      
      Not many compilers are aware of the video extensions (SSE, etc.), nor are they able to turn even simple loops into code using those parallel extensions. Speedups of 2X, 3X, or more are possible in certain cases.
  - - Re: (Score:2)
      
      by __aaclcg7560 ( 824291 ) writes:
      
      I was learning computer programming at the local community college while working as a lead video game tester. Two-thirds of my classes were Java-centric. When C++ became available again after the college got the money for a renewed Microsoft site license, I took the remaining classes in that language. Ironically, the instructor didn't like the new version of Microsoft Visual Studio and we switched to Linux.
- Re: (Score:3, Interesting)
  
  by dave562 ( 969951 ) writes:
  
  I think it depends on what kind of code you're trying to write. If a person desires to write applications then you are right, they might as well write it in a high level language and let the compiler do the work. On the other hand if the person is interested in vulnerability research or security work, then learning ASM might as well be considered a requisite. An understanding of low level programming and code execution provides a programmer with a solid foundation. It gives the potential insights into w
- Re: (Score:3, Insightful)
  
  by smash ( 1351 ) writes:
  
  Not quite.
  But it's certainly better to code in a high level language first, test, tweak the algorithm as much as you can, PROFILE and THEN start breaking out your assembler. No point optimising 99% of your code in super fast asm if it only spends 1% of the cpu time in it. Even if you make all that code 10x as fast, you've only saved 0.9% cpu time. :)
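The 0.9% figure is easy to check. Here is a small sketch of the arithmetic in C (the function names are invented for this example); the second function is the usual Amdahl's law form of the same idea.

```c
/* Fraction of total run time saved when the part of the program that
   accounts for `fraction` of the time is made `speedup` times faster. */
double time_saved(double fraction, double speedup)
{
    return fraction * (1.0 - 1.0 / speedup);
}

/* Overall speedup of the whole program for the same change
   (Amdahl's law). */
double overall_speedup(double fraction, double speedup)
{
    return 1.0 / ((1.0 - fraction) + fraction / speedup);
}
```

With `fraction = 0.01` and `speedup = 10.0`, `time_saved` comes out at about 0.009, i.e. the 0.9% quoted above, and the whole program ends up less than 1% faster. Put the same 10x into code that takes 90% of the run time and the saving is about 81%.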
- Re: (Score:2)
  
  by SETIGuy ( 33768 ) writes:
  
  In non-trivial single threaded application code on a modern processor, the CPU core is spending about 95% of its time waiting on memory transfers. To fix that problem, it can make sense to prefetch and reorder memory accesses. Chances are you know better than your compiler how to do that. It also makes sense to start more threads on a processor with multiple hardware threads so you can do things while waiting for memory.
  Most programmers won't even bother to do that, because the processor is fast enou
- - Re: (Score:3, Insightful)
    
    by Just Some Guy ( 3352 ) writes:
    
    Someone has to write those tools.
    Yeah, but they can be written in a HLL, too. You don't have to write a program in highly-tuned assembler to make it emit highly-tuned assembler.
    - Re: (Score:2)
      
      by DarkOx ( 621550 ) writes:
      
      You certainly need to know a lot about assembler and CPU architecture if you are going to write code that emits highly tuned assembler. Actually you probably do have to write those tools in assembler for all intents and purposes. To really oversimplify: compilers are pretty much syntax checkers and search tree engines. They take your code and replace it with a matching assembly listing or set of listings, substituting whichever registers happen to be free, etc.
      - Re: (Score:2)
        
        by Just Some Guy ( 3352 ) writes:
        
        You certainly need to know a lot about assembler and CPU architecture if you are going to write code that emits highly tuned assembler. Actually you probably do have to write those tools in assembler for all intents and purposes.
        That's news to GCC:
        $ cd /usr/src/contrib/gcc
        $ find . -name '*.[ch]' | wc -l
        869
        $ find . -name '*.[ch]' | xargs cat | wc -l
        895866
        $ find . -name '*.asm' | wc -l
        34
        $ find . -name '*.asm' | xargs cat | wc -l
        6520
        Translation: In GCC 4.2.1 as shipped with FreeBSD 8-STABLE, there are 869 .c and .h files with a total of 900KLOC, and 34 .asm files with 6KLOC. It seems that GCC itself isn't written with very much assembler.
        
        Re: (Score:2)
        
        by Just Some Guy ( 3352 ) writes:
        
        My first "real" programming was using a machine language monitor on a C64, so I feel your pain.
    - Re: (Score:2)
      
      by WilyCoder ( 736280 ) writes:
      
      I've heard that the first C compiler was written in C.
      - Re: (Score:2)
        
        by __aaclcg7560 ( 824291 ) writes:
        
        Uh, no. C was written in B. B was written in A. A was written in leftover naughty bits. :P
- Re: (Score:2)
  
  by MaskedSlacker ( 911878 ) writes:
  
  Try waiting for it to fully buffer?
