Hardware Technology

The Impact of Memory Latency Explored

EconolineCrush writes "Memory module manufacturers have been pushing high-end DIMMs for a while now, complete with fancy heat spreaders and claims of better performance through lower memory latencies. Lowering memory latencies is a good thing, of course, but low-latency modules typically cost twice as much as standard DIMMs. The Tech Report has explored the performance benefits of low-latency memory modules, and the results are enlightening. They could even save you some money."
This discussion has been archived. No new comments can be posted.

Comments Filter:
  • by Maljin Jolt ( 746064 ) on Wednesday November 02, 2005 @12:10PM (#13932714) Journal
    Beware, one of the banner advertisers on that page (netshelter.net) is trying to trigger a buffer overflow with a strangely crafted cookie. Hope you don't run your Firefox on Windows...
  • by Ed Almos ( 584864 ) on Wednesday November 02, 2005 @12:11PM (#13932728)
    I'm running Firefox 1.0.7 under Ubuntu. When I click on the link, Firefox exits. Am I the only one having this problem?

    Ed Almos
  • by Anonymous Coward on Wednesday November 02, 2005 @12:31PM (#13932919)
    But they're doing this on an AMD-64 platform...
  • What about cache? (Score:4, Interesting)

    by antifoidulus ( 807088 ) on Wednesday November 02, 2005 @12:35PM (#13932947) Homepage Journal
    Improvements in memory speed crawl compared to improvements in CPU speed. Larger caches can mitigate this problem to a certain extent, so why does growth in cache size also crawl? The Apple G5 updates FINALLY gave us 1 MB of L2 cache per core (and of course the industry-standard 64 KB of L1 cache per core), and while the Intel/AMD world is slightly better in this regard, it's not by much. So why is it so hard to increase cache size (and of course you'd need good cache allocation/replacement policies to go with it)? I'm not trolling, I honestly want to know. I realize that the people who design these chips are a lot smarter than I am, but so far I haven't really seen a good reason why they don't increase cache size.
    Also, outside of the HPC world, it seems very few programmers optimize their cache usage. Are there any tools (open source or otherwise) that can actually help you locate/fix inefficient uses of cache? (A sketch of the kind of access pattern such tools flag follows below.)
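    On the tools question: Valgrind's cachegrind tool (valgrind --tool=cachegrind ./prog) simulates the cache and reports per-line miss counts, and profilers such as OProfile or Intel's VTune can read the hardware performance counters. Below is a minimal C sketch of the kind of access-pattern problem they point you at; the array size and function names are invented for illustration, and the point is only that traversal order against a row-major array changes the miss rate dramatically.

        #include <stdio.h>

        #define N 2048
        static double a[N][N];

        /* Column-major walk of a row-major C array: each access touches a
           new cache line, so nearly every reference misses. */
        static double sum_column_order(void)
        {
            double s = 0.0;
            for (int j = 0; j < N; j++)
                for (int i = 0; i < N; i++)
                    s += a[i][j];
            return s;
        }

        /* Row-major walk: consecutive accesses share cache lines, so misses
           drop to roughly one per line instead of one per element. */
        static double sum_row_order(void)
        {
            double s = 0.0;
            for (int i = 0; i < N; i++)
                for (int j = 0; j < N; j++)
                    s += a[i][j];
            return s;
        }

        int main(void)
        {
            printf("%f %f\n", sum_column_order(), sum_row_order());
            return 0;
        }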
  • by freidog ( 706941 ) on Wednesday November 02, 2005 @12:43PM (#13933023)
    ExtremeTech Article [extremetech.com]
  • by Zathrus ( 232140 ) on Wednesday November 02, 2005 @01:05PM (#13933256) Homepage
    Sorry, I call BS on your entire post. The difference in latencies here is minuscule -- it's not like we're talking about having the CPU wait 2 clock cycles vs. 30 clock cycles. It's closer to 13 vs. 25 (not exact, but the magnitude of difference is close). That just doesn't matter that much -- the reality is that if you have a cache miss, then you're looking at 20-30 cycles (or, more likely, 40-60 cycles) of stall while you fetch the data from main memory. (The rough average-access-time calculation below puts numbers on this.)

    The kind of changes you're talking about require vastly faster memory, not the kind of latency differences being discussed here at all. Both of these are "high latency" compared to what would be needed for your theoretical redesign of the entire software stack. And even then, you're utterly and completely screwed if you have to hit virtual memory -- possibly more so than you are now, because you've re-orchestrated everything around the idea that latency is a non-issue.

    Oh, and latency is getting worse, not better, and has been for a long, long time. CPU speeds long ago outstripped the speeds of our fastest memory (well, fastest while still not costing absurd amounts of money...), and the newer memory formats (DDR, DDR2, DDR3, RDRAM, etc.) trade higher latencies for greater bandwidth.
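    A rough average-memory-access-time calculation makes the point concrete. The hit rate and latencies below are illustrative assumptions, not numbers from the article, but they show why even a near-2x change in miss penalty barely moves the average once the cache absorbs most accesses.

        #include <stdio.h>

        int main(void)
        {
            double hit_rate      = 0.98;  /* assumed fraction of accesses served from cache */
            double cache_latency = 3.0;   /* assumed cache hit cost, in CPU cycles */
            double mem_fast      = 13.0;  /* assumed miss penalty with "low-latency" DIMMs, cycles */
            double mem_slow      = 25.0;  /* assumed miss penalty with standard DIMMs, cycles */

            /* AMAT = hit time + miss rate * miss penalty */
            double amat_fast = cache_latency + (1.0 - hit_rate) * mem_fast;
            double amat_slow = cache_latency + (1.0 - hit_rate) * mem_slow;

            printf("AMAT, fast RAM: %.2f cycles\n", amat_fast);  /* ~3.26 */
            printf("AMAT, slow RAM: %.2f cycles\n", amat_slow);  /* ~3.50 */
            /* Roughly a 7% difference in average access time, despite the
               near-2x gap in miss penalty. */
            return 0;
        }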
  • What does this mean? (Score:2, Interesting)

    by Flying pig ( 925874 ) on Wednesday November 02, 2005 @01:27PM (#13933451)
    All memory has an access time, and the further you get from the CPU the longer it is going to be. CPU registers have the shortest access time, with (nowadays) sub-nanosecond access. L1 cache comes next, then L2, then external RAM, then the HDD, and finally the slow backing store represented nowadays by CD and DVD. This hierarchical memory architecture changes with time mostly in that the caches grow bigger, so the 640K of RAM from DOS days now fits into the cache of each processor in a Pentium D with room to spare, and a Pentium M could in theory run DOS with extended and expanded memory without needing any external RAM at all. (I'd almost like to try that.)

    So talking about optimisation for low-latency RAM is, I suspect, nonsense. What we are surely seeing here is that the actual limitation on memory bandwidth is somewhere else -- in the memory controller, in the cache controller, in the CPU fetch rate, in the rate at which stuff is being fetched from hard disk, in bus contention. Overclocking -- speeding up memory controllers and buses -- will have an effect. Reducing the number of wait states on the memory bus will not have much effect on performance if the total number of active memory cycles in a given period is largely unchanged.

    If you had a need for real speed in an application which was not dependent on the graphics subsystem or access to network and HDD, I am sure you could get much more performance out of low-wait state RAM, but you would do it by HARDWARE design, not by software optimisation.

    As a simple example from the dim and distant past when I was building hardware, TI used to have a microcontroller called the TMS9995 which ran at, for the day, a hefty 12 MHz. With the slow DRAM of the time it always needed a wait state, and this meant that it could manage, as I recall, two memory accesses per microsecond. With static RAM, it could manage three. The 9995 actually stored its working registers in external memory, so this meant a real-world speedup of nearly 30%. The 8088, on the other hand, kept its working registers on-chip and had a limited instruction pipeline; as a result, the equivalent speedup was nothing like 30%. This was due to hardware differences, not software differences.

    In fact, the applications which really test out the memory subsystem are not games -- they are databases and webservers, which hardly use the graphics system at all. And in these cases, for low-end systems, the big beast in the equation is cache. It's quite astonishing how a Pentium M can churn through a badly designed join while a low-end AMD64 struggles, simply because one has 2 MB of cache and the other has only 512 KB. As a result, for ordinary technical laptop and desktop work, I now specify the Pentium M, the AMD64 with 1 MB of cache, or the Pentium D with 1 MB per core. You know it makes sense. (And now everyone can explain why I'm wrong, in my turn.)

  • by vmcto ( 833771 ) * on Wednesday November 02, 2005 @01:43PM (#13933594) Homepage Journal
    Hey, don't knock gamers who spend tons of money on computer gear.

    It's thanks to them that the rest of us can get normal gear at such reasonable prices...
  • Have to agree with the AC on the CPU issue. Taken from http://techreport.com/reviews/2005q2/athlon64-x2/index.x?pg=16 [techreport.com]:

    Conclusions
    Let's start by talking about the Athlon 64 X2 4200+. This CPU generally offers better performance than its direct competitor from Intel, the Pentium D 840. Most notably, the X2 4200+ doesn't share the Pentium D's relatively weak performance in single-threaded tasks like our 3D gaming benchmarks. The Athlon 64 X2 4200+ also consumes less power, at the system level, than the Pentium D 840 -- by just a little bit at idle (even without Cool'n'Quiet) but by over 100W under load. That's a very potent combo, all told.

    In fact, the X2 4200+ frequently outperforms the Pentium Extreme Edition 840, which costs nearly twice as much. Thanks to its dual-core config, the X2 4200+ also embarrasses some expensive single-core processors, like the Athlon 64 FX-55 and the Pentium 4 Extreme Edition 3.73GHz. Personally, I don't think there's any reason to pay any more for a CPU than the $531 that AMD will be asking for the Athlon 64 X2 4200+.

    If you must pay more for some reason, the Athlon 64 X2 4800+ will give you the best all-around performance we've ever seen from a "single" CPU. The X2 4800+ beats out the Pentium Extreme Edition 840 virtually across the board, even in tests that use four threads to take best advantage of the Extreme Edition 840's Hyper-Threading capabilities. The difference becomes even more pronounced in single-threaded applications, including games, where the Pentium XE 840 is near the bottom of the pack and the X2 4800+ is constantly near the top. The X2 4800+ also consumes considerably less power, both at idle and under load.

    The X2 4800+ gives up 200MHz to its fastest single-core competitor, the Athlon 64 FX-55, but gains most of the performance back in single-threaded apps thanks to AMD's latest round of core enhancements, included in the X2 chips. The X2 4800+ also matches the Opteron 152 in many cases thanks to Socket 939's faster memory subsystem. Remarkably, our test system consumes the same amount of power under load with an X2 4800+ in its socket as it does with an Athlon 64 FX-55, even though the X2 is running two rendering threads and doing nearly twice the work. Amazing.

    There's not much to complain about here, but that won't stop me from trying. I would like to see AMD extend the X2 line down two more notches by offering a couple of Athlon 64 X2 variants at 2GHz clock speeds and lower prices. I realize that by asking for this, I may sound like a bit of a freeloader or something, but hey--Intel's doing it. No, the performance picture for Intel's dual-core chips isn't quite so rosy, but the lower-end Pentium D models will make the sometimes-substantial benefits of dual-core CPU technology more widely accessible. If AMD doesn't follow suit, lots of folks will be forced to choose between one fast AMD core or two relatively slower Intel cores. I'm not so sure I won't end up recommending the latter more often than the former.

    Beyond that, the giant question looming over the Athlon 64 X2 is about availability, as in, "When can I get one?" Let's hope the answer is sooner rather than later, because these things are sweet.

  • by Orp ( 6583 ) on Wednesday November 02, 2005 @02:49PM (#13934233) Homepage
    I do large 3D thunderstorm simulations. With some of the larger simulations I am integrating lots of quantities, contained in 3D floating-point arrays, over 1 billion or more grid points (using distributed computing, such as a Beowulf cluster made up of dual Xeons or an SGI Altix system). Each scientific calculation requires accessing floating-point values stored in these arrays, doing some math, and updating another array.

    Memory latency and memory bandwidth both impact how long it takes my simulations to complete. Let's say it is the difference between a simulation taking a week vs. five days... that is significant to me and to how much I can get done. With these heavy-duty scientific models you really can see a noticeable benefit from the fancier hardware, and clock speed is certainly not the only factor to consider, by a long shot. (A rough sketch of this kind of inner loop follows below.)
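    Not the parent's actual code, of course, but here is a minimal, hypothetical C sketch of the kind of 3D grid update being described: read a few large arrays, do a little arithmetic per point, write another array. The dimensions and the update formula are invented for illustration; the point is that with hundreds of millions of points the arrays dwarf any cache, so the loop is dominated by memory traffic rather than arithmetic.

        #include <stddef.h>

        #define IDX(i, j, k) ((size_t)(i) * ny * nz + (size_t)(j) * nz + (size_t)(k))

        /* Hypothetical finite-difference style update: out = u + dt * (laplacian(u) + v). */
        void update_field(int nx, int ny, int nz, double dt,
                          const double *u, const double *v, double *out)
        {
            for (int i = 1; i < nx - 1; i++)
                for (int j = 1; j < ny - 1; j++)
                    for (int k = 1; k < nz - 1; k++) {
                        double lap = u[IDX(i + 1, j, k)] + u[IDX(i - 1, j, k)]
                                   + u[IDX(i, j + 1, k)] + u[IDX(i, j - 1, k)]
                                   + u[IDX(i, j, k + 1)] + u[IDX(i, j, k - 1)]
                                   - 6.0 * u[IDX(i, j, k)];
                        out[IDX(i, j, k)] = u[IDX(i, j, k)] + dt * (lap + v[IDX(i, j, k)]);
                    }
        }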
  • by Woody77 ( 118089 ) on Wednesday November 02, 2005 @06:26PM (#13936151)
    However, if you have algorithmically intensive software (spending lots of time in the same loops or crunching large amounts of data), it's worthwhile to instrument your code and see how you're doing for cache hits/misses. You might discover that by tweaking the innermost loops or the size of the blocks you crunch, you can better fit the cache of the target processor (see the blocking sketch at the end of this comment).

    Word/Excel isn't going to bother, but in a game it might be worth stuffing in a few versions of tweaked loops that are selected by a loop invariant, or feeding the functions some data ahead of time to help guide them toward the best data sizes they can use.

    This isn't unlike memory alignment for structures, and taking a massive performance hit for the data not being "easy" for the assembly instructions to process.

    One example is the ability to loop-unroll the innermost butterflies of an FFT on the x86-64 extension using the extra registers that are available there. That WILL get you a noticeable increase in performance.

    But these are always the last 20% kinds of increases...
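    A minimal sketch of the blocking idea mentioned above, assuming a plain row-major transpose as the workload; the BLOCK size is a hypothetical tuning knob you would pick (or select at runtime from a few compiled variants) to fit the target processor's L1/L2.

        #include <stddef.h>

        #define BLOCK 64  /* assumed tile edge; tune per processor */

        static inline int min_int(int a, int b) { return a < b ? a : b; }

        /* out = transpose(in); both are n x n, row-major. Working on one
           BLOCK x BLOCK tile at a time keeps the tile's source rows and
           destination columns resident in cache while they are reused. */
        void transpose_blocked(int n, const double *in, double *out)
        {
            for (int ii = 0; ii < n; ii += BLOCK)
                for (int jj = 0; jj < n; jj += BLOCK)
                    for (int i = ii; i < min_int(ii + BLOCK, n); i++)
                        for (int j = jj; j < min_int(jj + BLOCK, n); j++)
                            out[(size_t)j * n + i] = in[(size_t)i * n + j];
        }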

"Ninety percent of baseball is half mental." -- Yogi Berra

Working...