Hardware

The Fundamentals Of Cache 60

Dave wrote to us with an article currently running on SystemLogic that delves into caching, with specific examples taken from the Athlon and PIII processors. It also talks about the different types of cache - fairly technical, but an all around good read.
This discussion has been archived. No new comments can be posted.


  • You cheeky man you!

    In some ways you are demonstrating the effect of the cache, yes. However you assume 2-way associativity. Some processors are 4-way (increase the number of accesses to be 5 with the same low 12 bits).

    _BUT_ you're also forcing the "cache unfriendly version" to have several other speed problems:

    - It isn't fair to certain processors as there are restrictions on using addresses with the same low 4 address bits back-to-back (original Pentium certainly).
    - You've chosen to access non-aligned data (accessing 16 bit values on an odd, and cache line spanning, address)

    References:
    Abrash - The Zen of Assembly Language Optimisation
    Agner Fog - Pentopt (www.agner.org methinks)

    FatPhil
  • by Cenotaph ( 68736 ) on Tuesday October 17, 2000 @03:44AM (#700100)
    Actually, the article is correct.

    Modern processors have all sorts of things to deal with latency: branch prediction, multiple issue slots, out-of-order execution, etc. However, this hardware generally requires a group of instructions to work on - the more, the better. Keeping a large number of instructions around for the processor to use requires more bandwidth.

    That being said, when you have adequate bandwidth, latency does become more of a problem. That's why chip multiprocessors (CMPs) and Simultaneous Multithreading Processors (SMTs) are becoming a large focus of current processor research.


    --
    "You can put a man through school,
    But you cannot make him think."

  • Caches are extremely useful when you are out there in the wilds. You can't really be expected to carry all sorts of crap with you all the time, so it's really convenient to have someone else put things into caches for you to find. The ultimate strategy for colonizing Mars [mars.com], I presume.
  • First: Nice troll account. (s/Shooboy/Shoeboy/, eh?).

    Second, there is a lot of research out there on cache technologies, including such out-there thinking. One that's on perhaps the "lip" of the box (not quite in the box, but not really out of the box) is the stride-predicting cache, which tries to prefetch data based on CPU access patterns. One that I think is really out-of-the-box is value prediction, that is, guessing what a value read from memory might be based on previous executions of the same instruction. (You'd be surprised how effective "guess zero" is!)

    The thing is, you're not going to see many of these new features on mainstream processors for a number of years, because many of these ideas take time to really reach maturity. Remember, the Tomasulo algorithm (also known as register renaming) was developed in the 60s, but didn't show up on the desktop until x86's sixth generation (the PentiumPro).

    --Joe
    --
  • I am intimately familiar with x86 CPU

    Sigh. Meant to say that I am NOT intimately familiar with x86 CPUs, but...

    --

  • From another perspective, a level of cache cuts the bandwidth requirement by about an order of magnitude. (VERY rough approximation, I agree.) In the 386 world, we saw L1 cache beginning at about 25MHz, though there were some uncached 33MHz designs. When we moved to the 486, in addition to the L1 being moved on-chip, the cycles-per-instruction improved, so it worked faster at a given clock speed, and required a better memory system behind it. So round about 66MHz, we saw the L2 cache start becoming normal. This continued with Pentium, though there were a few miserable failures that attempted to do without the L2.

    At the same time, main memory is improving. Straight full-RAS-access begat page mode, which begat fast-page mode, which begat EDO, which begat SDRAM, which begat DDR. But at the same time as main memory moved from 150nS access, 300nS cycle to 40nS access, 70nS cycle, with 7.5nS burst rate, CPUs have moved from 4.77MHz to 1.1 GHz.

    I've already heard of an experimental Micron Northbridge chip that incorporates a bunch of fast DRAM-based L3 in otherwise unused area. Northbridge chip sizes are driven by pincount, not circuit area. They are prone to waste silicon just to get enough I/O pins bonded out.

    I anticipate the widespread deployment of L3, any month now.
  • That's what a cache is: fast memory that keeps items that may be needed again. There are constant improvements being made in the strategies by which a cache decides which things are going to be most useful (which depends largely on what you're using the machine for). There are other ways to speed up processors, but seeing as the article is about caches, that's what it talks about.
  • by mav[LAG] ( 31387 ) on Tuesday October 17, 2000 @04:53AM (#700106)
    Gah - that will teach me to use the preview button :)

    It isn't fair to certain processors as there are restrictions on using addresses with the same low 4 address bits back-to-back (original Pentium certainly).

    True - this example is meant to show how the Pentium (PPLain in Agner's docs) can be tripped up.

    - You've chosen to access non-aligned data (accessing 16 bit values on an odd, and cache line spanning, address)

    Very deliberately I might add :) The point I was making was that cache considerations play a big role in optimising inner loops. The above code is an example written by Niklas Beisert a.k.a Pascal of Cubic Team to show the effects of cache misses when doing snazzy bitmap effects.

  • What?
  • To be a cache, you must actually cache data that exists in another memory level. That 'exclusive level 2 cache' contains data from main memory. The level 1 cache contains data from the main memory. Both contain data from the same level, so both are level 1 cache, and there is no level 2 cache.

    Quote from Computer systems design and architecture by Vincent P. Heuring and Harry F. Jordan:
    Where there are two cache levels between the processor and main memory, the faster level is called the primary cache and the slower level is called the secondary cache.

    Actually, this secondary-level cache is not another level; it is only a slower extension of the primary cache used (logically) for the least useful part of the first-level cache.

    I don't think that the difference in speed between these two parts of the primary cache could make them two cache levels.

    phobos% cat .sig
  • Are you saying that forking of the X86 architecture and software written for it is inevitable?
  • Latency and bandwidth go hand-in-hand. They affect one another directly.

    If something is hogging the bandwidth of the RAM bus, this means that other requests for RAM access must be delayed until the hog is finished. This directly causes the latency of the other request(s) to increase.

    Developing an architecture that minimizes bandwidth usage and latency is really tricky, and you can end up with all kinds of crazy mechanisms and whacky access protocols to facilitate this.

    SirPoopsalot
    SPARC64 L3 Cache Architect
    HAL Computer Systems [hal.com]

  • Oooh, song-time!

    All you need is cache!

    Brother, can you spare a time(slice)?

    What do you want, Pavarotti?

    Fiiiiifo, fifo, fifo, fifo, fiiiiiiiiiiifo! (lamely to the tune of 'Figaro')

    Sorry, folks, just couldn't resist. :)

    ---
    Hold the mold, Klunk.
  • I don't mean that level 1 cache doesn't contain data from the main memory when there is level 2 cache... I mean that the level 1 memory caches the level 2 memory, and therefore it contains data from the main memory, but only because the level 2 cache contains data from the main memory.

    phobos% cat .sig
  • This sounds like a "volatile" datum. It shouldn't be in the cache at all. DSP architectures I know have the "Direct RAM Load/Store" operations to bypass the cache, which solves this problem.

    FatPhil
  • (Some, such as Alpha, go to Herculean extremes with a gigantic reorder buffer and a cache which allows four or five outstanding misses to pend while still allowing hits in the cache.)

    You want to hear Herculean?

    The processor I am helping design allows this:

    • The 512kB L2-Data cache can have 16 outstanding misses while still servicing HITs and non-cacheable transactions.
    • The 512kB L2-Instruction cache can have 8 outstanding misses while still servicing HITs and non-cacheables.
    • The 64MB (!!) L3-Unified cache can have all 24 of the L2 misses also outstanding, while still servicing HITs, victims from the L2D cache, as well as copy-backs, non-cacheables, system controller requests, and non-cacheable transactions.
    • Request order is not maintained; rather, we service the requests as soon as possible.

    Sound like overkill? Maybe... but schemes like this explain why my company's 350MHz chips out-perform Sun's 450MHz equivalent. We spend a hell of a lot less time accessing memory on the UPA interface, and we can have more outstanding misses on that interface as well.

    SirPoopsalot
    HAL Computer Systems [hal.com]

  • Ah, another informative link with which to fatten out my backflip.com collection [backflip.com]. This is the official reason I spend so much time on slashdot. The real reason is that it's an excuse to share war stories. Like this one.

    Once upon a time, a programmer couldn't figure out why an SGI Origin IPC app was slower with 4 meg buffers than with 1 meg buffers. Buffer size is directly related to I/O performance, right? An impatient SGI development engineer said, "Cache blowout! Next!" A slow-witted tech writer couldn't quite follow what the engineer was saying, and also had his breed's superstitious aversion to undefined jargon. "If a programmer thinks you can increase buffer size forever, he's as ignorant about 'cache blowout' as I am, right?" After a friendly shouting match and a few mild death threats, the engineer finally explained the concept. And they all became friends again -- until the next bug.

    You can read the result at support.sgi.com [sgi.com] (free registration required; ignore the request for an SGI serial number).

    __________

  • Well, optimization is generally non-portable :)

    But you ask a good question that deserves a serious answer: Yes, it should be possible to implement portable run-time cache blocking. Two parts: First the cache detection; second, break the code into variable sized blocks.

    Portable cache detection isn't quick, but it is easy: Allocate a big hunk'o'RAM bigger than expected caches. Initialize it to get off the CoW zeropage. Time repeated (10 minimum) variable length rips. Start with 1KB so you will be on-scale with gettimeofday(). Step up by powers of two, keeping careful note of gettimeofday() results. There will be big steps when you cross cachesize boundaries. Even a simple programmed algorithm can find these. The whole process will cost at least one second, and perhaps 10.

    The second part, variable sized blocks, complicates the code considerably. You nest the base code inside a loop; that base code has a smaller data span plus an offset updated in the outer loop. Inner and outer loop counts are tuned to cache size by the automagic cache detection.
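    For what it's worth, here's a rough C sketch of that timing sweep (the sizes, the 32-byte line stride and the output format are all made-up illustrations, not from any particular library; a real probe would also have to keep the optimiser from eliding the reads):

    /* Crude cache-size probe: time repeated passes over working sets of
       increasing size and look for jumps in the time per KB.  Assumes a
       POSIX gettimeofday(). */
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/time.h>

    static double seconds(void)
    {
        struct timeval tv;
        gettimeofday(&tv, NULL);
        return tv.tv_sec + tv.tv_usec / 1e6;
    }

    int main(void)
    {
        const size_t max = 8 * 1024 * 1024;   /* bigger than the caches we expect */
        volatile char *buf = malloc(max);
        size_t size, i;
        int pass;

        for (i = 0; i < max; i++)             /* initialise: get off the CoW zero page */
            buf[i] = (char)i;

        for (size = 1024; size <= max; size *= 2) {
            double t0 = seconds();
            for (pass = 0; pass < 10; pass++)          /* 10 passes minimum */
                for (i = 0; i < size; i += 32)         /* one read per assumed 32-byte line */
                    (void)buf[i];
            printf("%7lu KB  %8.3f us/KB\n", (unsigned long)(size / 1024),
                   (seconds() - t0) * 1e6 / (10.0 * (size / 1024.0)));
        }
        free((void *)buf);
        return 0;
    }
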
  • Latency is important to the extent that it limits bandwidth. As other posters point out, modern CPUs have many mechanisms for dealing with latency to a certain extent. (Some, such as Alpha, go to Herculean extremes with a gigantic reorder buffer and a cache which allows four or five outstanding misses to pend while still allowing hits in the cache.)

    In the end, the raw amount of work performed is measured in terms of bandwidth. To process N items, you need to touch N items, and how quickly you touch those N items is expressed as bandwidth.

    As a person who programs a deeply pipelined CPU [ti.com], I can attest that latency can affect some algorithms (especially general purpose control algorithms) more than others, since it limits how quickly you can process a given non-bandwidth-limited task. However, for raw calculation (eg. all those graphics tasks and huge matrix crunching tasks the numbers folks like to run), those tasks are fairly latency tolerant and just need bandwidth.

    This is why number-crunching jobs might work well with, say, RAMBUS, but desktop jobs might work better with DDR SDRAM, even if the two are at the same theoretical bandwidth node.

    --Joe
    --
  • "The point I was making was that cache considerations play a big role in optimising inner loops. The above code is an example written by Niklas Beisert a.k.a Pascal of Cubic Team to show the effects of cache misses when doing snazzy bitmap effects."

    Another point that should be made is the importance of processor-specific optimising compilers. Saying that one processor runs code slower than another is moot; it's the fault of the compiler-generated code. However, saying that one cache architecture supports the majority of current compiler optimisations better _is_ a valid point.

    But if we're going to start talking on the level of new hardware (associativity and prediction etc.), then we have to start considering new compilers that take advantage of the different hardware as well.

  • A problem with this occurs when you have a multiprocessor system. If you have a line in an L1 cache that you want to write to (from the other processor, to make things interesting), you need to hunt down the line in not just an L2 or L3 cache, but also L1 and pull it into your own L1. This'll increase your latency because of the time it takes to find that line. You can see how this could be a pretty vicious thing if you have a heavily-contended lock in a particular cache line (and it's almost guaranteed to make a 1-by system perform better than a 2-by or greater system).

    You can have directory mechanisms (that could add up to gigabytes of additional memory on larger systems, not used for storage) to quickly look up locations in a centralized place. You won't find this sort of thing on a desktop PC (usually a 1-by or 2-by, anyway), but on the larger machines (e.g. 8 processors or more, 16GB of RAM or more) it's definitely an issue.
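    To make that a bit more concrete, here's a toy C sketch of what one directory entry might hold (a simplified MESI-ish model of my own invention, not any particular machine's format):

    /* Toy directory entry for one memory line in a directory-based
       coherence scheme -- illustrative only. */
    #include <stdint.h>

    enum line_state { INVALID, SHARED, MODIFIED };

    struct dir_entry {
        enum line_state state;  /* global state of the line             */
        uint64_t sharers;       /* bit i set => node i may hold a copy  */
        int owner;              /* meaningful only when state==MODIFIED */
    };

    /* Stub: a real system would send a message over the interconnect. */
    static void send_invalidate(int node) { (void)node; }

    /* Before node 'writer' may modify the line, every other sharer must
       be invalidated -- this lookup is the point of the directory: no
       need to broadcast-snoop every cache in the machine. */
    static void handle_write_request(struct dir_entry *e, int writer)
    {
        for (int i = 0; i < 64; i++)
            if (((e->sharers >> i) & 1) && i != writer)
                send_invalidate(i);
        e->sharers = 1ull << writer;
        e->owner   = writer;
        e->state   = MODIFIED;
    }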

  • It's true that most optimizations are not portable, but there is something that can be done about this. Properly ifdef'd code can be optimized for the appropriate iteration of the platform by any moderately well-designed compiler. You don't want to be using runtime variables more than you have to, and certainly not for memory requests for caching, but you can handle it in the source code, provided that you actually let your users have the code so they can compile it themselves.
  • by n3rd ( 111397 ) on Tuesday October 17, 2000 @03:11AM (#700121)
    But surely it's latency that is of primary importance and bandwidth takes second place?

    At first I thought you were right, but the more I thought about it, the more I feel the article is correct.

    The best way to think of this is on high end systems. Think of a Sun Ultra Enterprise 450 with a couple gigs of RAM, a couple of processors and a bunch of large, memory- and CPU-intensive programs running.

    You have 20 processes, each of which needs a slice of CPU time. Each time a process runs on the CPU, parts of the process are copied from RAM to the CPU to execute. That process executes for its specified time slice, the kernel stops it, copies the results back to RAM, and then does it again with another process. Now imagine this happening hundreds of times per second! The memory bus gets even more saturated with more processors, since there are more RAM-to-CPU copies and vice-versa.

    This is where cache comes in. Part of the program that just executed is kept in the cache so the next time its time slice comes around (i.e. it's time to run on the CPU), there won't have to be a copy from RAM to the CPU. The CPU simply grabs it out of its cache, thus freeing up bandwidth on the memory bus.
  • An excellent article, and somewhat deeper than some of the stories that go by. I liked the pyramid representation of the general balance between speed/cost. The version I was taught (a few years ago) placed cpu registers at the top of the pyramid.
  • My whole "Computer architecture and design" lectures are coming back to me. It makes a great project as well: writing a program that takes an input file of memory addresses that might be used in the course of a program's execution, and watching as different blocks are added to and taken away from your little simulated cache. A good way of learning! (A rough sketch of such a simulator follows below.)

    dnnrly
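    Something like this, perhaps - a bare-bones direct-mapped simulator in C (the 8KB/32-byte-line geometry is an arbitrary choice of mine; a real assignment would add associativity and replacement policies):

    /* Tiny direct-mapped cache simulator: reads one hex address per line
       from stdin and reports hits/misses.  Cache geometry is arbitrary. */
    #include <stdio.h>

    #define LINE_SIZE 32            /* bytes per cache line */
    #define NUM_LINES 256           /* 8 KB direct-mapped   */

    int main(void)
    {
        unsigned long tags[NUM_LINES];
        int valid[NUM_LINES] = {0};
        unsigned long addr, hits = 0, misses = 0;

        while (scanf("%lx", &addr) == 1) {
            unsigned long block = addr / LINE_SIZE;
            unsigned long set   = block % NUM_LINES;
            unsigned long tag   = block / NUM_LINES;
            if (valid[set] && tags[set] == tag) {
                hits++;
            } else {                /* miss: the old line is evicted */
                misses++;
                valid[set] = 1;
                tags[set] = tag;
            }
        }
        printf("hits: %lu  misses: %lu\n", hits, misses);
        return 0;
    }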

  • The amusing thing is when you get CD-writer manufacturers going 'look at our new product! No more buffer underruns!' and giving it a trademark and shit. Oh, you've added a few K of RAM and implemented.... a buffer?! You pioneers! You just don't care do you, that's proper blue-sky research that is!
  • I think maybe the reason for the lack of innovation is that designers are trapped in that pyramid drawing the article shows. The faster the cache, the more expensive it gets, thus limiting the amount you can practically have.

    Plus, each increase in cache size brings diminishing returns.
  • This is one simple and straight article and can be understood by anyone. Wish people would follow this principle of simplicity rather than confuse everyone with super-technical jargon.
  • I cache my hifi remote control on the bed-side table.
    My girlfriend flushes the cache, and puts it back to next to the hifi (this leads me to believe she has twice as many X chromosomes as necessary)

    In this process, whatever state the remote is in, one thing remains constant - there is only _one_ remote control.

    In the olden days, RAM caches were not like that.
    The cached information was duplicated in all lower cache levels. But, fortunately, if you read on in the article, you get to this:
    "
    Exclusive cache designs mean that the information contained within one layer is not contained within the layer above it. In the Thunderbird and Duron, this means that the information in the L2 cache is not contained within the L1 cache, and vice versa.
    "

    Now _that's_ more like the paradigm I'm used to.
    I've been told it's a fair sized win for AMD's chips, but I'll reserve judgement until I get my hands on one. However, I'll say in public - I am a believer...

    FatPhil
  • by mav[LAG] ( 31387 ) on Tuesday October 17, 2000 @03:56AM (#700128)
    fast:
        mov dx,12
    l1:
        mov cx,32768
    l2:
        mov ax,[0]      ; same address every time: after the first
        mov ax,[0]      ; read the line sits in the cache, so these
        mov ax,[0]      ; accesses always hit
        dec cx
        jnz l2
        dec dx
        jnz l1

    slow:
        mov dx,12
    l3:
        mov cx,32768
    l4:
        mov ax,[4095]   ; three word reads 4K apart: same low address
        mov ax,[8191]   ; bits, so they fight over the same cache set
        mov ax,[12287]  ; (and each read is misaligned, spanning lines)
        dec cx
        jnz l4
        dec dx
        jnz l3

    Nearly identical loops, except the first one flies and the second one thrashes because it misses the cache 100% of the time. Can you say 10-50 times slower?
  • I thought that was what Slashdot was all about..?
  • I've been waiting for this topic to come up!

    I'm only a nerd-wannabe, so bear with me. My question is: will there ever come a day when cache, RAM, and hard drives become one? More to the point, I was thinking about RAM disks and how great they are, except you have to be careful to copy the contents back when you shut down the computer. When I say will they become "one", I know they're separate for efficiency, but maybe what I mean to ask is: will I always have to lose the RAM contents when I turn the power off? Why not have the whole drive stored in RAM? Hmmm. Good suggestion. Take us back to the Atari XL series... Oh well, whatever.

  • by kinnunen ( 197981 ) on Tuesday October 17, 2000 @04:03AM (#700131)
    Context switching doesn't really take insane amounts of bandwidth. You copy the contents of all the registers to memory (including all registers not visible to the programmer), then copy the registers of the new process from RAM to the CPU. I am intimately familiar with x86 CPU, but I'd guesstimate this shouldn't take more than maybe half a kilobyte of memory transfers total. Do that 400 times a second and we are talking about one fifth of a megabyte of used bandwidth per second. I can live with that.

    Of course you are right in that cache does increase available bandwidth and that this is a good thing, but latency really is the thing we want to cure. For the original Athlon (though I'm guessing this is one of the faster models with a 1/3 L2 divisor), 24 clock cycles go to waste even if the data is in L2, and even that is considerably faster than DRAM. Add to this that x86 is a really bad architecture because almost every instruction can (and will) reference memory at least once (in addition to the mandatory instruction fetch).

    --

  • Maybe the definition of level 2 cache is different now with different approaches, but level 2 cache is usually (normally?) bigger than level 1 cache because it is a cache level... if I split the level 1 cache on two chips it stays level 1 cache. If I cache the memory cache (becomes complicated;) then there is another level of cache.

    What I mean is that saying that the part of level 2 cache that is already contained in level 1 cache is lost memory that should be used for a different purpose is an error... because a cache by definition contains a copy of data. Putting a section of L1 contents on another chip doesn't make it another cache level! Does it?

    phobos% cat .sig
  • Unfortunately yes. If we want manufacturers to come up with new ways of improving performance and programmability, then they will end up diverging over the short term. This isn't what we want or need to happen, but with the usual "I thought of it first, I'll sue you now" childish arguments...
  • I've got some questions, however:
    1. I've heard about stackable processors a while ago. Anybody know what's happened to 'em? This would let us use multichip solutions without expensive cartridges. The only reason against this I can see is a cooling problem.
    2. Anybody know how exactly the cache searching works? What's wrong with a totally-associative cache (except for the possibly large number of transistors)?
    3. What's good about the Harvard architecture (separate instructions/data)? Never understood this either.
    4. Why not start searching in all caches at the same time?
    5. Why didn't AMD reserve some space in the exclusive cache system to save from the 20-tick delay mentioned?
    6. Why don't they make planes with the same technology as flight recorders, and does air rotate in tires? :-)))

    ---
    Every secretary using MSWord wastes enough resources
  • The really worrying thing I found while reading the article, was how little innovation there is in cache design.

    Mebbe we should get Bill Gates to take a look at the problem.

  • Urgl... the link got bogarted. Again, that's a deeply pipelined CPU [ti.com]. To give you an idea of how deeply pipelined it is, a "Load Word" instruction takes five cycles, meaning I issue the "Load Word", wait four cycles, and then I get to use the result. :-)

    --Joe
    --
  • Don't be so pessimistic... If the forking is truly inevitable then it's possible for a consortium to form, converge and drive the innovation forward. The industry proved a model like this is possible with the formation of groups like ASCII, JEDEC, IEEE and a bunch of others with even more arcane acronyms.

    Yeah, I know everyone is about 100x more litigious than back in "those days", but I don't think it can go on like this forever.

    Is this off topic or what?? :)

  • Absolutely! This technique is called "cache blocking" -- break up data into cache-sized chunks and process a sub-frame then stitch subframes together. Not easy programming, but definitely worthwhile.

    Much faster than any "natural order" long array ripping unless the processing is really minor in which case main RAM bandwidth rulez.
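    In C the skeleton of the idea looks something like the tiled transpose below (N and BLOCK are invented numbers; in practice BLOCK gets tuned to the cache at hand, and N is assumed to be a multiple of BLOCK):

    #define N     1024
    #define BLOCK 64      /* tile chosen so one tile of src + dst fits in cache */

    /* Cache-blocked (tiled) matrix transpose: each BLOCK x BLOCK tile is
       finished before moving on, instead of striding across the whole
       array and thrashing the cache on every row. */
    void transpose_blocked(const double *src, double *dst)
    {
        for (int ii = 0; ii < N; ii += BLOCK)
            for (int jj = 0; jj < N; jj += BLOCK)
                for (int i = ii; i < ii + BLOCK; i++)
                    for (int j = jj; j < jj + BLOCK; j++)
                        dst[j * N + i] = src[i * N + j];
    }
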
  • Funny you should bring up SPARC processors.

    fwiw, SPARC chips have "register windows" which save you from pushing/popping them all to the stack across calls. Instead, you just do a window-shift. You only have to put the registers to ram in the event you have a "window overflow" (there are no more register "sets" left)

    This is of course all within a process. SPARC CPUs also have something called "hardware contexts", which I unfortunately haven't read much on, but my assumption is that these are different from the register windows and provide some sort of optimization for context switching in the metal.

    Cache _latency_ is the important point. Bandwidth is nice, but it's latency that kills you. Super-pipelining is _defined_ as having a clock freq faster than the time to retrieve from L1 cache. If you've got an L1 that's running at 1/2 your pipeline speed, essentially for every l/s you've got 1 extra cycle before the result can be used in the BEST CASE (L1 hit). That effectively means you'd need something like:

    Load x
    NOOP
    do something with x

    Of course, this sucks, and so controllers try and do reordering and all that other shit they do now just to get around all the delays incurred by having the memory subsystem so vastly slower than the CPU.

    The obnoxious tricks required to get the processor to not spend all of its time stalling have contributed to the bloat of the RISC controller logic, so much so that there really aren't RISC chips any more - they're all complex and nasty and the controllers are responsible.

    This is the appeal of LIW and/or VLIW. Force the compiler to handle all instruction ordering, all data hazards, and all pipeline delay issues. The CPU controller gets much, much simpler and the CPU gets that much faster, not to mention you get the benefit of having _lots_ of EUs instead of just the few (1-4; an IBM chip has 6) you're limited to in a superscalar when you're resolving all these issues in real time in the controller.
  • Despite the common misconception that electricity flows at the speed of light, it does not.

    Hmmm... this depends on how you look at it, I guess. Signals propagate along the wire at the speed of light. Apply voltage to a wire and that voltage will propagate at the speed of light. HOWEVER, individual electrons do not move at the speed of light; I've read that they can be as slow as 1 inch / second (that's 1 cm / second, for you metric types - bonus points if you can figure out why the conversion factors don't matter).

  • > Am I missing something??

    Yes. The lines in the exclusive L2 got there because they were kicked out of L1 by a miss. But there's still a chance that they'll be needed again soon. That's why they're kept in L2. When a line gets thrashed out of L1 but manages to stay in L2, it can be quickly retrieved when L1 needs it again.

    The same is essentially true of a normal, 'inclusive' L2 cache.

    The reason an exclusive L2 *now* makes sense is that they're both on the same chip! That's the fundamental difference, because now it's cheap in terms of silicon and circuitry to do the bookkeeping between L1 and L2. This lets you avoid keeping a second copy in L2 of the data that's already in L1.
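    If it helps, here's that bookkeeping on an L1 miss in a toy exclusive design, in C (fully associative with random replacement purely to keep the sketch short - this is my simplified model, not AMD's actual mechanism):

    #include <stdlib.h>

    #define L1_LINES 4        /* toy sizes */
    #define L2_LINES 16
    #define EMPTY    (-1L)

    static long l1[L1_LINES], l2[L2_LINES];   /* each entry holds a line address */

    void caches_init(void)
    {
        for (int i = 0; i < L1_LINES; i++) l1[i] = EMPTY;
        for (int i = 0; i < L2_LINES; i++) l2[i] = EMPTY;
    }

    static int find(const long *cache, int n, long line)
    {
        for (int i = 0; i < n; i++)
            if (cache[i] == line)
                return i;
        return -1;
    }

    /* Returns 1 on an L1 hit, 0 on a miss.  A line lives in L1 *or* L2,
       never both: whatever gets evicted from L1 drops down into L2. */
    int access_line(long line)
    {
        if (find(l1, L1_LINES, line) >= 0)
            return 1;                          /* L1 hit, nothing to do      */

        int v = rand() % L1_LINES;             /* pick an L1 victim          */
        long victim = l1[v];
        int hit2 = find(l2, L2_LINES, line);

        if (hit2 >= 0)
            l2[hit2] = victim;                 /* swap: line moves up to L1, */
                                               /* victim takes its L2 slot   */
        else
            l2[rand() % L2_LINES] = victim;    /* L2 miss: fetch from RAM;   */
                                               /* victim still drops to L2   */
        l1[v] = line;
        return 0;
    }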

  • I hope I understand your problem. Your question is whether an exclusive L2 cache can really be considered L2 cache vs. being considered a non-uniform L1 cache. I think the key here is what data is contained in the two caches. The fast L1 cache contains the most recently used data. The slower L2 cache contains data that was used less recently and got booted out of L1; from this perspective I think you can consider the exclusive L2 an L2 cache.

    Dastardly
  • The AS/400 works like this. There is only one address space. It's stored on the disk. (The address space is so freaking huge that no practical amount of disk could cover it.) You can think of RAM as a cache for the disk - when you want to work on something (program segment, database record, whatever), it gets read into memory so it's closer to the processor, but it's still sitting at its address, and it gets written back out if it gets changed (managed by the hardware).


    ...phil
  • The short answer to #2 is that each layer of associativity (2-way, 4-way, 8-way, etc.) introduces a *theoretical minimum* of one gate delay to the cache logic, and actually it's more. Also there are diminishing returns involved. There's a similar problem with just making larger caches, where each successive doubling of cache size has the consequence of slowing the responsiveness of the cache, because of a similar added-gates-to-critical-path problem. Size and associativity both increase latency. Optimal cache design takes into account how much latency can be afforded, then balances both the associativity and the cache size to maximize the cache hit rate as much as possible, in as many different circumstances as possible.

    The short answer to #3 is that the odds of reusing instructions are much higher than the odds of reusing data. memcpy is the ultimate example of this, where you have a tight loop doing nothing but reading and writing huge swathes of data. If your instruction cache is separate from your data cache, your instructions will never be bumped out by the reams of data that they are processing.

    The short answer to #4 is that it distracts the slower caches too much. They should be available to handle whatever glacial tasks they already have queued up (like writing back dirty cache lines, etc) rather than waste their time doing lookups before they're *sure* that they need to.

    I don't have any short answers to the other questions.
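    A concrete (made-up) example for #2: a 16KB, 4-way cache with 32-byte lines has 16384 / (4 * 32) = 128 sets, and an address splits up as below - the cost of more ways is more tag comparators and muxing on the critical path:

    #include <stdint.h>

    #define LINE_SIZE 32
    #define WAYS      4
    #define SETS      128   /* 16 KB / (WAYS * LINE_SIZE) */

    /* The set index selects one set; the WAYS tags stored in that set
       are then compared against tag_of(addr) in parallel. */
    static inline uint32_t offset_of(uint32_t addr) { return addr % LINE_SIZE; }
    static inline uint32_t set_of(uint32_t addr)    { return (addr / LINE_SIZE) % SETS; }
    static inline uint32_t tag_of(uint32_t addr)    { return addr / (LINE_SIZE * SETS); }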

  • From the article:
    /* From this definition, it follows that a cache-line is also the smallest unit of memory that can be transferred between the cache(s) and the processor */

    I beg your pardon? A typical cache line is (e.g.) 32 processor words... if you told me your processor architect made decisions such that I had to load 32 words to the CPU every time I accessed a memory location, I'd tell you you need a new architect! :-) Though the rest of the article seems fairly accurate (I'm not done reading yet), this is definitely wrong. The minimum amount of data that can be transferred between the cache and the processor depends on the bus width between the cache and CPU and whether or not the CPU will ignore some of the lines, for example if you have a 32-bit bus but the processor chooses to only pay attention to the low- or high-order 16. Anyway, this is silliness and clearly does NOT follow from the quote in the article.

    --denim
  • In terms of non-portability concerns, knowing the cache of the device you are using is good. If the software has some way of probing /proc/ or the kernel to learn how big the caches are, would it be too difficult to make the algorithms scale to what you have?

    It sounds painful to discover what it is unless you have some tuning software to find what size of data blocks is most efficient; that way you are considering other possible bottlenecks as well.
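    On Linux the /proc route can be as simple as the sketch below (the "cache size" line is what x86 kernels print in /proc/cpuinfo; other architectures format it differently, so treat this as a best-effort probe rather than a portable interface):

    /* Look for the "cache size" line in /proc/cpuinfo. */
    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        char line[256];
        FILE *f = fopen("/proc/cpuinfo", "r");
        if (!f) {
            perror("/proc/cpuinfo");
            return 1;
        }
        while (fgets(line, sizeof line, f))
            if (strncmp(line, "cache size", 10) == 0) {
                fputs(line, stdout);    /* e.g. "cache size   : 512 KB" */
                break;
            }
        fclose(f);
        return 0;
    }
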
  • Of course, in days of yore, we were concerned with reading a block from tape that was close to a convenient memory size, so we often wasted tape space rather than deal with the overlap.
  • I don't need no stinking cache...I have a creditcard.
  • The really worrying thing I found while reading the article, was how little innovation there is in cache design. Small improvements yes, but little in the way of real root and branch rethinking.

    The basic idea of a cache - keeping the most used objects sitting within easy reach - is as old as humanity.

    --Shoeboy

  • All you need is cache!

    All you need is cache!

    All you need is cache, cache!

    Cache is all you need!

    Ok, that was lame. But it's 9am and I haven't had my caffeine yet. What do you want, Pavarotti?

  • How many of you independent web developers have been frustrated by a myriad of caches between you and your hosting provider? And how many times have your customers said "we can't see the changes!" when all they have to do is clear their local browser cache?

    Even more aggravating is when you have a customer who is signed on with a certain phone company's high speed ADSL service. They proxy the living hell out of it, so you can't run a server off it and they keep caches for up to 6 hours. You can see the reflected change, and you know everyone else can see the reflected change, but the sap you have on the phone keeps saying "Huh? What money, what do you need cash for? I have a maintenance contract"

  • by WSSA ( 27914 ) on Tuesday October 17, 2000 @02:59AM (#700152)
    The article starts out talking of "bandwidth starved" processors. But surely it's latency that is of primary importance and bandwidth takes second place?
  • Of course there haven't been many improvements beyond "keep it close". The definition of a cache (in the non-computer sense) is someplace you keep stuff that you might need later. There have been few ground-breaking improvements on such popular caches as "the hole in the ground" and "that-drawer-on-the-left-of-my-desk-near-the-bottom-that-I-put-"stuff"-in". Outside of giving explicit control of caching to programs (which creates all kinds of sticky issues involving the ISA boundary, which means you'd probably need a new ISA), what else do you expect? Why is that really worrying??

    God does not play dice with the universe. Albert Einstein

  • by fatphil ( 181876 ) on Tuesday October 17, 2000 @04:37AM (#700154) Homepage
    Unhappy with the performance of someone else's memory-hungry code (50-100MB working RAM footprint), I wrote my own version of the utility. It was only marginally faster than the original. However, I increased the performance by a factor of 10 when I realised that I could cache up jobs to do, then in turn perform those jobs on each 4MB chunk of data (DEC Alpha with 4MB L3 cache). I managed to increase performance even further by aiming the code at the L2 cache instead! The total number of bytes read/written was identical, but simply changing the order in which they were done increased performance 12-15 times.

    However, as soon as you do take into account caching issues, you sometimes start making non-portable decisions. (not always though, as generally most architectures have the problem but the lines are simply drawn in different places).

    FatPhil
  • Just because there's no innovation doesn't mean it's a bad idea; large chunks of processor design and algorithms still come from the early 70's and are still valid. That said, there may be a revolution in cache design just around the corner...
    --
  • Tell me, does the concept "burnproof" mean anything to you? Buffer underruns are a thing of the past.
  • Some years ago, before intelligent caches, pipelining and branch prediction, it used to be possible to get a human to optimize compiler-generated asm code.
    These days compilers have to have more and more knowledge about their target CPU to provide efficient code, and due to all the different permutations of optimization and processor features it's getting harder and harder to hand-optimize code. Which makes me wonder more and more how efficient our compilers are that attempt to provide support for AMD, Intel, et al...
  • Without it, we'd still be working on computers that couldn't do all that much. It helps access memory quicker, and as any computer geek can tell you, the slowest processor function is moving stuff back and forth from memory. I definitely think that we need to somehow fix this problem. Any Computer Engineering PhD out there wanna give this a shot?

"Everything should be made as simple as possible, but not simpler." -- Albert Einstein

Working...