The Impact of Memory Latency Explored 162
EconolineCrush writes "Memory module manufacturers have been pushing high-end DIMMs for a while now, complete with fancy heat spreaders and claims of better performance through lower memory latencies. Lowering memory latencies is a good thing, of course, but low-latency modules typically cost twice as much as standard DIMMs. The Tech Report has explored the performance benefits of low-latency memory modules, and the results are enlightening. They could even save you some money."
Re:Link crashed Firefox (Score:5, Interesting)
Can't Read the Article (Score:2, Interesting)
Ed Almos
Not to harp on the obvious (Score:1, Interesting)
What about cache? (Score:4, Interesting)
Also, outside of the HPC world, it seems very few programmers optimize their cache usage. Are there any tools (open source or otherwise) that can actually help you locate/fix inefficient uses of cache?
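(Tools like Valgrind's Cachegrind do exist for attributing cache misses to source lines. As for what "inefficient use of cache" looks like, here is a minimal sketch in C: the same matrix summed in two traversal orders. The `N`, function names, and the self-check helper are just illustrative choices, not anything from the article.)

```c
#include <stddef.h>

#define N 512

/* Row-major traversal touches memory sequentially, so every byte of
 * each fetched cache line is used before eviction. */
long sum_row_major(const long m[N][N]) {
    long total = 0;
    for (size_t i = 0; i < N; i++)
        for (size_t j = 0; j < N; j++)
            total += m[i][j];
    return total;
}

/* Column-major traversal strides N longs per access, so only one
 * element of each fetched line is used -- same result, far more misses. */
long sum_col_major(const long m[N][N]) {
    long total = 0;
    for (size_t j = 0; j < N; j++)
        for (size_t i = 0; i < N; i++)
            total += m[i][j];
    return total;
}

/* Self-check: both orders compute the identical sum; only the cache
 * behavior differs. */
int traversal_sums_match(void) {
    static long m[N][N];
    for (size_t i = 0; i < N; i++)
        for (size_t j = 0; j < N; j++)
            m[i][j] = (long)(i + j);
    return sum_row_major(m) == sum_col_major(m);
}
```

A cache profiler run over the two functions would show identical instruction counts but wildly different miss rates, which is exactly the kind of thing source-level inspection tends not to reveal.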
So did ExtremeTech - and they included A64 and P4 (Score:5, Interesting)
Re:The underestimated impact of latency. (Score:5, Interesting)
The kind of changes you're talking about require vastly faster memory. Not the kind of latency differences being discussed here at all. Both of these are "high latency" compared to what would be needed for your theoretical redesign of the entire software stack. And even then, you just become utterly and completely screwed if you have to hit virtual memory, possibly more so than you are now, because you've re-orchestrated everything around the idea that latency is a non-issue.
Oh, and latency is getting worse, not better, and has been for a long, long time. CPU speeds long ago outstripped the speeds of our fastest memory (well, fastest while still not costing absurd amounts of money...), and the newer memory formats (DDR, DDR2, DDR3, RDRAM, etc) have higher latencies in exchange for greater bandwidth.
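(The latency-vs-bandwidth tradeoff the parent describes is easiest to see with a pointer chase, sketched below in C. Every load depends on the previous one, so the CPU can't pipeline or prefetch them: total time is roughly steps times memory latency, whereas a streaming sum over the same array would be bandwidth-bound. The names and sizes here are arbitrary illustrations, not from the article.)

```c
#include <stdlib.h>

/* Follow a chain of indices: each load's address comes from the
 * previous load, serializing the memory accesses. */
size_t chase(const size_t *next, size_t start, long steps) {
    size_t i = start;
    while (steps--)
        i = next[i];
    return i;
}

/* Build a simple cycle over n slots. (To actually benchmark latency
 * you would shuffle the cycle so the hardware prefetcher can't guess
 * the next line; a sequential cycle keeps this sketch verifiable.) */
size_t *make_cycle(size_t n) {
    size_t *next = malloc(n * sizeof *next);
    if (!next)
        return NULL;
    for (size_t i = 0; i < n; i++)
        next[i] = (i + 1) % n;
    return next;
}
```

Timing `chase` over an array much larger than cache, divided by the step count, gives an estimate of effective memory latency per access.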
What does this mean? (Score:2, Interesting)
So talking about optimisation for low-latency RAM is, I suspect, nonsense. What we are surely seeing here is that the actual limitation on memory bandwidth is somewhere else - in the memory controller, in the cache controller, in the CPU fetch rate, in the rate at which stuff is being fetched from hard disk, in bus contention. Overclocking - speeding up memory controllers and buses - will have an effect. Reducing the number of wait states on the memory bus will not have much effect on performance if the total number of active memory cycles in a given period is largely unchanged.
If you had a need for real speed in an application which was not dependent on the graphics subsystem or access to network and HDD, I am sure you could get much more performance out of low-wait state RAM, but you would do it by HARDWARE design, not by software optimisation.
As a simple example from the dim and distant past when I was building hardware, TI used to have a microcontroller called the TMS9995 which ran at, for the day, a hefty 12MHz. With the slow DRAM of the time, it always needed a wait state and this meant that it could manage, as I recall, two memory accesses per microsecond. With static RAM, it could manage 3. The 9995 actually stored its working registers in external memory and so this meant a real world speedup of nearly 30%. The 8088, on the other hand, kept its working registers on-chip and had a limited instruction pipeline. As a result, the equivalent speedup was nothing like 30%. This was due to hardware differences not software differences.
In fact, the applications which really test out the memory subsystem are not games - they are databases and webservers, which hardly use the graphics system at all. And in these cases, for low-end systems, the big beast in the equation is cache. It's quite astonishing how a Pentium M can churn through a badly designed join while a low-end Athlon 64 struggles, simply because one has 2MB of cache and the other has only 512K. As a result, for ordinary technical laptop and desktop work, I now specify the Pentium M, the Athlon 64 with 1MB of cache, or the Pentium D with 1MB per core. You know it makes sense. (And now everyone can explain why I'm wrong, in my turn.)
Re:Just stick a few blue LEDs on it... (Score:4, Interesting)
It's thanks to them that the rest of us can get normal gear at such reasonable prices...
Re:Just stick a few blue LEDs on it... (Score:5, Interesting)
Conclusions
Let's start by talking about the Athlon 64 X2 4200+. This CPU generally offers better performance than its direct competitor from Intel, the Pentium D 840. Most notably, the X2 4200+ doesn't share the Pentium D's relatively weak performance in single-threaded tasks like our 3D gaming benchmarks. The Athlon 64 X2 4200+ also consumes less power, at the system level, than the Pentium D 840--just a little bit at idle (even without Cool'n'Quiet) but over 100W under load. That's a very potent combo, all told.
In fact, the X2 4200+ frequently outperforms the Pentium Extreme Edition 840, which costs nearly twice as much. Thanks to its dual-core config, the X2 4200+ also embarrasses some expensive single-core processors, like the Athlon 64 FX-55 and the Pentium 4 Extreme Edition 3.73GHz. Personally, I don't think there's any reason to pay any more for a CPU than the $531 that AMD will be asking for the Athlon 64 X2 4200+.
If you must pay more for some reason, the Athlon 64 X2 4800+ will give you the best all-around performance we've ever seen from a "single" CPU. The X2 4800+ beats out the Pentium Extreme Edition 840 virtually across the board, even in tests that use four threads to take best advantage of the Extreme Edition 840's Hyper-Threading capabilities. The difference becomes even more pronounced in single-threaded applications, including games, where the Pentium XE 840 is near the bottom of the pack and the X2 4800+ is constantly near the top. The X2 4800+ also consumes considerably less power, both at idle and under load.
The X2 4800+ gives up 200MHz to its fastest single-core competitor, the Athlon 64 FX-55, but gains most of the performance back in single-threaded apps thanks to AMD's latest round of core enhancements, included in the X2 chips. The X2 4800+ also matches the Opteron 152 in many cases thanks to Socket 939's faster memory subsystem. Remarkably, our test system consumes the same amount of power under load with an X2 4800+ in its socket as it does with an Athlon 64 FX-55, even though the X2 is running two rendering threads and doing nearly twice the work. Amazing.
There's not much to complain about here, but that won't stop me from trying. I would like to see AMD extend the X2 line down two more notches by offering a couple of Athlon 64 X2 variants at 2GHz clock speeds and lower prices. I realize that by asking for this, I may sound like a bit of a freeloader or something, but hey--Intel's doing it. No, the performance picture for Intel's dual-core chips isn't quite so rosy, but the lower-end Pentium D models will make the sometimes-substantial benefits of dual-core CPU technology more widely accessible. If AMD doesn't follow suit, lots of folks will be forced to choose between one fast AMD core or two relatively slower Intel cores. I'm not so sure I won't end up recommending the latter more often than the former.
Beyond that, the giant question looming over the Athlon 64 X2 is about availability, as in, "When can I get one?" Let's hope the answer is sooner rather than later, because these things are sweet.
Scientific computing benefits from this (Score:3, Interesting)
Memory latency and memory bandwidth both impact how long it takes my simulations to complete. Let's say it is the difference between a simulation taking a week vs. five days... this is significant to me and how much I can get done. With these heavy-duty scientific models and such, you really can see a noticeable benefit with the fancier hardware, and clock speed is certainly not the only factor to consider, by a long shot.
Re:The underestimated impact of latency. (Score:2, Interesting)
Word/Excel isn't going to bother, but for a game it might be worth stuffing in a few versions of tweaked loops that are selected by a loop invariant, or feeding the functions some data ahead of time to help guide them toward the best data sizes they can use.
This isn't unlike memory alignment for structures, and taking a massive performance hit for the data not being "easy" for the assembly instructions to process.
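(For anyone who hasn't run into the alignment point: a sketch in C of the same three fields laid out two ways. The struct names are made up for illustration; exact sizes depend on the ABI, which is why the comments hedge with "typically".)

```c
#include <stdint.h>

/* Poor member ordering forces the compiler to insert padding so each
 * field sits at its natural alignment. */
struct padded {
    uint8_t  a;  /* 1 byte, then padding before b */
    uint64_t b;  /* 8 bytes, 8-byte aligned on most ABIs */
    uint8_t  c;  /* 1 byte, then tail padding */
};               /* typically 24 bytes on x86-64 */

/* Ordering members largest-first packs them tightly. */
struct packed_order {
    uint64_t b;  /* 8 bytes */
    uint8_t  a;  /* 1 byte */
    uint8_t  c;  /* 1 byte, then tail padding only */
};               /* typically 16 bytes on x86-64 */
```

Beyond wasted RAM, the padded layout means fewer structs fit per cache line, which is where the performance hit the parent mentions comes from.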
One example is the ability to loop-unroll the innermost butterflies of an FFT on the x86-64 extension using the extra registers that are available there. That WILL get you a noticeable increase in performance.
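(An actual FFT butterfly is too long to inline here, but the same register-pressure idea shows up in any unrolled inner loop; below is a dot product as a stand-in sketch in C. The four independent accumulators can each live in their own register, breaking the dependency chain - something the extra x86-64 registers make practical where 32-bit x86 would spill to the stack. Function and variable names are just illustrative.)

```c
#include <stddef.h>

/* Unrolled by four with independent accumulators, so the four
 * multiply-adds per iteration don't serialize on one register. */
double dot_unrolled(const double *x, const double *y, size_t n) {
    double s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += x[i]     * y[i];
        s1 += x[i + 1] * y[i + 1];
        s2 += x[i + 2] * y[i + 2];
        s3 += x[i + 3] * y[i + 3];
    }
    for (; i < n; i++)  /* handle the remainder, if n % 4 != 0 */
        s0 += x[i] * y[i];
    return (s0 + s1) + (s2 + s3);
}
```

Note that reassociating the partial sums changes floating-point rounding slightly, which is why compilers won't do this transformation on their own without flags like `-ffast-math`.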
But these are always the last 20% kinds of increases...