
Intel's 128MB L4 Cache May Be Coming To Broadwell and Other Future CPUs

Posted by timothy
from the now-read-some-old-prices-and-get-offa-my-lawn dept.
MojoKid writes "When Intel debuted Haswell this year, it launched its first mobile processor with a massive 128MB L4 cache. Dubbed "Crystal Well," this on-package (not on-die) pool of memory wasn't just a graphics frame buffer, but a giant pool of RAM for the entire core to utilize. The performance impact from doing so is significant, though the Haswell processors that utilize the L4 cache don't appear to account for very much of Intel's total CPU volume. Right now, the L4 cache pool is only available on mobile parts, but that could change next year. Apparently Broadwell-K will change that. The 14nm desktop chips aren't due until the tail end of next year but we should see a desktop refresh in the spring with a second-generation Haswell part. Still, it's a sign that Intel intends to integrate the large L4 as standard on a wider range of parts. Using EDRAM instead of SRAM allows Intel's architecture to dedicate just one transistor per cell instead of the 6T configurations commonly used for L1 or L2 cache. That means the memory isn't quite as fast but it saves an enormous amount of die space. At 1.6GHz, L4 latencies are 50-60ns which is significantly higher than the L3 but just half the speed of main memory."
This discussion has been archived. No new comments can be posted.

  • I have a Retina MacBook Pro with this Crystal Well processor. What advantages does it really bring?

    Unsure of any real world benchmarks compared to standard Haswell processors.

    • by SimonTheSoundMan (1012395) on Saturday November 23, 2013 @08:19AM (#45500111) Homepage

      The only benchmarks I have found are from SiSoftware. http://www.sisoftware.co.uk/?d=qa&f=mem_hsw [sisoftware.co.uk]

      But how is this going to affect Firefox, Photoshop, or video conversion?

      Does it have an effect on battery life?

      • by K. S. Kyosuke (729550) on Saturday November 23, 2013 @08:28AM (#45500145)
        On laptops? Perhaps it could. I suspect that an eDRAM cache plus slower main memory could have lower total power consumption at the same performance level than a faster main memory, especially if you have more of it. I believe that the major power usage component for main memory DRAMs is actually using the memory (as in, transferring the data).
      • by muridae (966931) on Saturday November 23, 2013 @09:23AM (#45500301)

        Photoshop? Considering Adobe RGB or other color spaces, combined with the file sizes of some of the larger images coming out of cameras, your gains in latency would really depend on Photoshop and the OS being able to handle the L4 cache and keep the right part of the image in the cache. Video editing, with file sizes into the gigabyte range, would probably see no gains at all. Video conversion, with a program that keeps a reasonably sized buffer, should see a good performance gain; but it would require code that knows the L4 is available or the OS to predict that L4 is a good place to put a 10-50-100MB buffer. The real gain will be in common things: playing a video, browsing the web (seen how much memory a bit of JavaScript or the JRE can eat up lately? Or Silverlight/Flash?) and email clients (cache all your email in L4 for faster searching).

        As for battery life, I have no idea. It might use more power, since DRAM requires constant power to refresh data where SRAM is pretty stable; but the lower leakage of using a single transistor instead of 6 might prove to be a benefit. It would take a good bit of time and some pretty good test code to figure the difference, I suspect.

        • by neokushan (932374)

          I'm not an expert by a long shot, but I'm pretty sure that modern day applications don't go anywhere near that low a level and instead leave memory management up to the system.

          • by windwalkr (883202) on Saturday November 23, 2013 @07:48PM (#45504109)

            Yes and no. Applications can't typically "put things into the cache", but algorithms can be (and often are, when it comes to image processing) tuned to suit a particular cache size. Processing the image in an appropriate order, breaking the image into cache-sized chunks, and so on can all be effective strategies which pay off big-time in terms of performance.
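
            To make the "cache-sized chunks" idea concrete, here is a minimal Python sketch, not anything taken from a real image editor. The names `tile_edge_for_cache` and `process_in_tiles`, the 128 MB budget, the 4 bytes per pixel, and the "use half the cache" fraction are all assumptions for illustration. Pure Python won't show the speedup itself; the point is only the access order, where every pass runs on one tile before the next tile is touched.

```python
import math

def tile_edge_for_cache(cache_bytes, bytes_per_pixel, fraction=0.5):
    """Largest square tile edge (in pixels) whose data fits in `fraction` of the cache."""
    budget_pixels = int(cache_bytes * fraction) // bytes_per_pixel
    return max(1, math.isqrt(budget_pixels))

def process_in_tiles(image, tile_edge, passes):
    """Apply each per-pixel function in `passes` to `image`, tile by tile.

    `image` is a list of rows (lists of numbers). The result equals running
    each pass over the whole image in turn; only the access order differs.
    """
    height = len(image)
    width = len(image[0]) if height else 0
    out = [row[:] for row in image]
    for top in range(0, height, tile_edge):
        for left in range(0, width, tile_edge):
            bottom = min(top + tile_edge, height)
            right = min(left + tile_edge, width)
            # Run every pass on this tile before touching the next tile,
            # so the tile's pixels are reused while they are still cached.
            for op in passes:
                for y in range(top, bottom):
                    row = out[y]
                    for x in range(left, right):
                        row[x] = op(row[x])
    return out

# Hypothetical numbers: 128 MB cache, 4 bytes per pixel, use half the cache.
edge = tile_edge_for_cache(128 * 1024 * 1024, 4)
print("tile edge in pixels:", edge)
```

            Real code would do the same thing in C or with SIMD, and would pick the tile size from whichever cache level it is targeting.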

        • by fa2k (881632)

          Even if the whole files take up more than the cache, the filters and algorithms running on them may need to access only a part of the image/video (e.g. access a frame of the video multiple times). The benefit of caching is highly dependent on the algorithms used.

        • by Bengie (1121981)
          CPUs have had streaming instructions for a long time that can tell the CPU to load data directly from main memory to L1 cache and not use L2/L3/L# cache at all. This reduces cache eviction for data that is transient.
    • by fuzzyfuzzyfungus (1223518) on Saturday November 23, 2013 @08:38AM (#45500187) Journal
      At least as marketed, the main advantage is allowing the GPU some RAM that isn't DDR3 stolen from the main system a couple of hops away (which has traditionally been one of the things that make integrated graphics really suck, and cheap discrete parts that use DDR instead of GDDR, and/or an excessively narrow or slow memory bus kind of suck).

      Even Intel's marketing optimists don't say much about CPU performance. It's also a mobile-only feature: you can't even buy a non-BGA part expensive enough to have it, which would be unusual if it actually improved CPU performance enough to get enthusiasts worked up, but is downright sensible if the target market is laptops sufficiently size/power constrained not to have discrete GPUs, but where pure shared memory was dragging GPU performance down.
      • by Bengie (1121981)
        128MB of L4 cache and Transactional Memory instructions will make it great for routers.
        • Interesting that the gigabit Ethernet controller on the latest Apple Macs has 512MiB of DDR3. Any idea what this is for?

          • by Bengie (1121981)
            No idea. I was under the impression that most NICs have just enough onboard memory to buffer potential bursts of data, but otherwise write to system memory via DMA and interrupt the CPU to notify it when the data is ready. 512MB sounds like a lot of buffer for just a 1Gb NIC.
            • by thejynxed (831517)

              Actually, that amount on a NIC would be a great boon in keeping all network processing on the NIC instead of having to CPU/system memory-offload, especially when you turn on the bells and whistles like jumbo frames, etc. I can also see it helping out quite a bit when processing HD video packets when streaming video where it's pretty important to get them processed as quickly and efficiently as possible before passing them off to the main system. These packets tend to have a decent amount of overhead, etc an

        • I don't doubt that it either does or will have uses beyond graphics, I just find Intel's marketing, labelling, and packaging choices utterly inscrutable if non-graphics uses are actually ready for prime time.

          The only sign, unless you delve into the part numbering alphabet soup, that you even have it, is a change in the designation of the graphics "Iris Pro 5200" vs "Iris Pro 5100", and it's only available in the highest-price laptop parts. I have no reason to suspect that it'll hurt performance on the CP
    • by SuricouRaven (1897204) on Saturday November 23, 2013 @09:51AM (#45500429)

      Cache performance impact is very heavily dependent upon application characteristics. Specifically, active memory.

      Best case, when you're working with an active set that's larger than L3 but under L4 - around 100MB or so - and you're accessing it on a repeating pattern, and the compiler hasn't found any tweaks to help, and you're not multitasking, and the OS isn't swapping you out every slice, and the stars are aligned in your favor... the theoretical maximum performance gain can be up to 2x. It's very rare you'll find a program that benefits that much, though. Closest I can think of is image processing.

      So in the real world, anywhere from 'no benefit' to 'double the speed' depending on application.
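
      A back-of-envelope model shows where the "up to 2x" bound comes from. This is only a sketch under stated assumptions: every access has already missed L3, an L4 hit costs about 55 ns (the middle of the 50-60 ns quoted in the summary), main memory costs twice that, and the program does nothing but wait on memory. The function names are made up for illustration.

```python
def avg_latency_ns(l4_hit_rate, l4_ns=55.0, dram_ns=110.0):
    """Average latency of an access that already missed L3.

    Assumed simple model: an L4 hit costs l4_ns, an L4 miss costs dram_ns
    (the extra L4 lookup time on a miss is ignored).
    """
    return l4_hit_rate * l4_ns + (1.0 - l4_hit_rate) * dram_ns

def speedup_vs_no_l4(l4_hit_rate, l4_ns=55.0, dram_ns=110.0):
    """Speedup of a purely memory-latency-bound loop; an upper bound for real code."""
    return dram_ns / avg_latency_ns(l4_hit_rate, l4_ns, dram_ns)

for rate in (0.0, 0.5, 0.95, 1.0):
    print(f"L4 hit rate {rate:.0%}: {speedup_vs_no_l4(rate):.2f}x")
```

      With these assumed latencies the result runs from 1x (nothing hits in L4) to 2x (everything hits). Any time spent computing instead of waiting on memory pulls a real program further below that ceiling.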

      • Closest I can think of is image processing.

        What you've quoted sounds more like a case for random accesses. Trees, graphs (!), and other complicated data structures, I'd guess. I believe that image processing can take care of itself most of the time by simple prefetching.

    • by MrKaos (858439) on Saturday November 23, 2013 @09:22PM (#45504507) Journal

      I have a Retina MacBook Pro with this Crystal Well processor. What advantages does it really bring?

      Unsure of any real world benchmarks compared to standard Haswell processors.

      I've written papers on the effect; however, I am unable to share them here. The bottom line is the application should be exposed to reduced minor page faulting and, if all goes well, improved context switching, all dependent on the way the CPU scheduler is configured, of course.

      IMHO an L4 cache will alleviate the cache miss penalty when the CPU scheduler looks for data in L1-3; however, any increase in the penalty due to a cache miss will be highly dependent on the application and the way the CPU scheduler is configured.

      The idea is to try and keep the L1-3 Cache as hot as possible, really it's because as programmers, many of us still have a long way to go to writing code that scales to parallel processing well (in the 21st century!!!) plus there is a lot of code out there already.

      For Linux and Apple based systems (I can examine the code of these CPU schedulers, just not Microsoft's, as it is proprietary) this should mean that the amount of time the CPU spends on application tasks, as opposed to OS tasks, is increased, essentially boiling down to reduced application latency and improved "responsiveness". I don't mean to use such wishy-washy terms; however, at this level CPU instructions are carried out in the nanosecond range, and the duration imposed by a cache miss penalty and a context switch will also be dependent on the RAM installed, which is another factor in the duration of a minor page fault.

      Assuming the schedulers are in a "fair and balanced" configuration, I expect the following. For code that scales to parallelism you should see improvements, because a task will spread across multiple cores well and not incur penalties for hogging the CPU, resulting in the L1-3 caches staying hot with application data longer (ideally, with threads running on multiple cores). For code that doesn't, I expect it to hog a core, get pushed back to RAM by the scheduler and be exposed to all of the performance penalties that come as a result.

      Personally I have always thought it's a contest between Cycles and Cache. There is no direct effect on battery life or power consumption; however, if the CPU is spending more time on the application than the OS then you are closer to what the original Amdahl's law [wikipedia.org] sought to show, if your application allows it.

  • . . .that Broadwell broad, well, is a broad well into which you could throw your entire career.
    Just say no, David.
  • by Anonymous Coward on Saturday November 23, 2013 @08:14AM (#45500093)

    "At 1.6GHz, L4 latencies are 50-60ns which is significantly higher than the L3 but just half the speed of main memory."

    WTF? The correct wording would be, I think, half the latency of main memory...

    • by Anonymous Coward
      Or double the speed.
    • It's also pretty poor, for a cache. That's why it's an L4 cache, rather than replacing the L3 or L2.

  • by GiantRobotMonster (1159813) on Saturday November 23, 2013 @08:18AM (#45500105)

    At 1.6GHz, L4 latencies are 50-60ns which is significantly higher than the L3 but just half the speed of main memory.

    Hmmm. L4 cache runs at half the speed of main memory? That doesn't seem right. Why bother reading these summaries? The people posting them certainly don't.

    • by Anonymous Coward

      Umm, they are clearly using latency as their measure of speed, so yes, "half the speed of main memory" does seem right. Sure it's not worded as well as it could be, but you should be able to understand and not bitch about it.

      • by Anonymous Coward
        Technical articles should be written carefully.
      • Oh please! It should be twice the speed if it has half the latency, not half the speed. Speed and latency are related, but not interchangeable synonyms!
        If a cache has "half the speed" of your uncached memory, you need to disable that cache ASAP!

      • by Bengie (1121981)
        It is actually 1/2 the latency and 2x the bandwidth. Some benchmarks have shown the L4 getting 42GB/s while current and prior gen CPUs were getting about 17GB/s out to main memory.
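
        Latency and bandwidth matter at different transfer sizes, which a rough model makes visible. This sketch uses the 42 GB/s and 17 GB/s figures above and a ~55 ns L4 latency; the 110 ns main-memory latency is an assumption (twice the L4 figure), and the "latency plus size over bandwidth" formula is a simplification, not a measurement.

```python
def transfer_time_ns(size_bytes, latency_ns, bandwidth_gb_per_s):
    """Time to fetch one block: fixed latency plus size divided by bandwidth.

    1 GB/s is taken as 1e9 bytes/s, which is exactly 1 byte per nanosecond.
    """
    return latency_ns + size_bytes / bandwidth_gb_per_s

# Figures quoted in the thread; the 110 ns DRAM latency is an assumption
# (twice the ~55 ns L4 latency).
L4 = dict(latency_ns=55.0, bandwidth_gb_per_s=42.0)
DRAM = dict(latency_ns=110.0, bandwidth_gb_per_s=17.0)

for size in (64, 4096, 1024 * 1024):
    t_l4 = transfer_time_ns(size, **L4)
    t_dram = transfer_time_ns(size, **DRAM)
    print(f"{size:>8} bytes: L4 {t_l4:10.1f} ns, DRAM {t_dram:10.1f} ns, "
          f"ratio {t_dram / t_l4:.2f}")
```

        For a single 64-byte cache line the ratio is set almost entirely by latency (about 2x); for a 1 MB streaming transfer it is set by bandwidth (about 2.5x, i.e. 42/17).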
    • Half the latency is meant, but yes, very confusing wording.
    • by Nite_Hawk (1304)

      I suspect they meant half the latency

    • Why bother reading these summaries?

      It's a puzzle. Is the summary wrong because of stupidity, or is it crafted that way for click bait?

  • by Anonymous Coward

    Broadwell represents a miniaturization step from 22 to 14 nm structures. Why do they keep the capacity of the Crystalwell L4 cache at 128 MB? They could put twice that memory onto a die with the same area as the 22 nm Crystalwell version. Is the Crystalwell die for the Haswell CPUs so large and expensive that they have to reduce its size?

    • It's in the same package, but not made in the same silicon or process. The package contains several pieces of silicon. Look at it as a miniature circuit board with several individual chips on it.
    • Re:Why only 128 MB? (Score:5, Informative)

      by Kjella (173770) on Saturday November 23, 2013 @10:32AM (#45500591) Homepage

      Broadwell represents a miniaturization step from 22 to 14 nm structures. Why do they keep the capacity of the Crystalwell L4 cache at 128 MB? They could put twice that memory onto a die with the same area as the 22 nm Crystalwell version. Is the Crystalwell die for the Haswell CPUs so large and expensive that they have to reduce its size?

      From Anandtech's article on Crystalwell [anandtech.com]:

      There's only a single size of eDRAM offered this generation: 128MB. Since it's a cache and not a buffer (and a giant one at that), Intel found that hit rate rarely dropped below 95%. It turns out that for current workloads, Intel didn't see much benefit beyond a 32MB eDRAM however it wanted the design to be future proof. Intel doubled the size to deal with any increases in game complexity, and doubled it again just to be sure. I believe the exact wording Intel's Tom Piazza used during his explanation of why 128MB was "go big or go home". It's very rare that we see Intel be so liberal with die area, which makes me think this 128MB design is going to stick around for a while.

      I get the impression that the plan might be to keep the eDRAM on a n-1 process going forward. When Intel moves to 14nm with Broadwell, it's entirely possible that Crystalwell will remain at 22nm. Doing so would help Intel put older fabs to use, especially if there's no need for a near term increase in eDRAM size. I asked about the potential to integrate eDRAM on-die, but was told that it's far too early for that discussion. Given the size of the 128MB eDRAM on 22nm (~84mm^2), I can understand why. Intel did float an interesting idea by me though. In the future it could integrate 16 - 32MB of eDRAM on-die for specific use cases (e.g. storing the frame buffer).
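
      For the original question about doubling capacity at 14nm, the ideal-scaling arithmetic looks like this. It is a sketch under a loud assumption: that area shrinks with the square of the node name, which real processes (and eDRAM cells in particular) do not strictly follow. The 84 mm^2 and 128 MB figures are the ones quoted above; the function name is made up.

```python
def ideal_area_scale(old_nm, new_nm):
    """Ideal density gain if every feature shrank linearly with the node name."""
    return (old_nm / new_nm) ** 2

scale = ideal_area_scale(22, 14)        # density gain, 22nm -> 14nm
area_14nm_mm2 = 84 / scale              # same 128 MB, ideally shrunk
capacity_same_area_mb = 128 * scale     # same ~84 mm^2, ideally refilled

print(f"ideal density gain: {scale:.2f}x")
print(f"128 MB at 14nm: about {area_14nm_mm2:.0f} mm^2")
print(f"84 mm^2 at 14nm: about {capacity_same_area_mb:.0f} MB")
```

      Under that idealized assumption the gain is roughly 2.5x, so "twice the memory in the same area" is plausible on paper; the quoted article suggests Intel may simply not need it yet.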

  • Win95? (Score:4, Interesting)

    by mwvdlee (775178) on Saturday November 23, 2013 @08:46AM (#45500205) Homepage

    With this 128MB cache, shouldn't this CPU be able to run an OS like Win95 or an older Linux without additional memory?

    • Not necessarily. It's not just a shadow copy of RAM, but some kind of multipurpose pool. We don't exactly know what the CPU does with it.
      • by muridae (966931)
        If you are writing the OS and your code is down at the machine level, you do know what's going on in the different cache pools. You can abstract it away and trust your compiler to get it right, or you can fiddle the bits yourself; it isn't magic contained in the blue smoke of ICs.
      • by Anonymous Coward

        Maybe they will get it right next time with the L5 cache.

        • Formerly, L4 cache was main memory, a cache for the L5 (disk) and L6 (network). This new L4 cache pushes main memory, disk, and network out to L5, L6, and L7 respectively.
    • by aiadot (3055455)
      Win95? My first laptop had 128MB of RAM and was capable of running XP.

      But the answer is no. The OS just wasn't designed to use that function of the processor in such a way. Maybe if you wrote a VM that emulated RAM on the L4 cache, but the only purpose of this approach would be to satisfy curiosity.
      • Did your 128 MB laptop continue to run Windows XP well even after having installed the service packs that increased how much RAM it uses? Even under Windows 2000, printing certain documents filled RAM on my old 128 MB desktop PC.
        • by aiadot (3055455)
          No it didn't. If my memory serves me right it took almost 5 minutes from boot to usable "smooth" state. Can't say anything about printing large professional documents because I was just a high schooler that managed to buy a crappy computer for traveling after saving lots of money. However, thanks to that, I learnt how to linux.
    • Yeah*, but what's the point?

      *Assuming the OS doesn't freak out - which will definitely happen. Let's just say there's no technical barrier to overcome.

    • by archen (447353)

      Depends on what you're doing with it. I have a laptop (Pentium 3 / 128MB RAM) with FreeBSD 10 on it. It works well but the application options running in X are limited unless you want to go into swap. A huge portion of what people consider regular computer usage is "browse the internet". Good luck doing that these days with 128MB RAM.

  • This is making me feel old as I recall how happy I was to have once maxed a board with 32 MB of RAM, a previous one with 8 MB, another with 4 MB and so on. I love that about technology, it pretty much always gets better until DRM and politics get into the mix...

    /get off my lawn

    //not really

    • Re:128 MB L4 cache (Score:4, Interesting)

      by Gravis Zero (934156) on Saturday November 23, 2013 @10:13AM (#45500523)

      you can revisit those nostalgic 8MB and 4MB days again with the latest AMD chips [wikipedia.org] as L2 cache. :)

      just use a modified version of coreboot [coreboot.org] to bypass those silly POST tests and load to the CPU cache directly with Windows 3.11 :)

    • by neokushan (932374)

      I got the same feeling when I got my first Android phone. 576MB of RAM...in a phone. I've recently upgraded and my new device has 3GB of RAM. It feels like only recently that I hit that amount in a desktop computer, now I have it in a device that fits in my pocket - never mind the quad-core CPU or 64GB of internal storage.

      10 years ago, that would have been a reasonably powerful desktop machine.

      • by fisted (2295862)
        It's still [like] a reasonably powerful desktop machine, if you avoid running a bloated OS on it.
        • by Shinobi (19308)

          It would still be a "reasonably powerful desktop machine" if your use case is still the same as 10 years ago.

          However, contrary to what many geeks think, people don't just browse, do email, watch YouTube etc. A fair amount of non-geeks do CAD, image/video editing, 3D graphics, create music etc. with their desktop machines, and routinely have workloads that would bring that 10 year old computer into thrashing hell...

          In fact, I think the whole "oh, ordinary people just need enough power to browse, email etc" i

          • by fisted (2295862)

            It would still be a "reasonably powerful desktop machine" if your use case is still the same as 10 years ago.

            Yeah, I was a hard- and software developer 10 years ago, my use case didn't shift too much. OTOH, I happen to /do/ CAD on this nearly 10 year old computer (single-core and all that, can you believe it?), so my information is first-hand.

            [meaningless windows-centric gibberish excusing bloatware]

            Whatever.

          • by surd1618 (1878068)
            So true. Nerds who are not computer nerds often have the highest computing needs. I can do everything I usually want to do with an ancient laptop because I don't do graphic design or record music or make 3D models. Shoot, if I am going to play a video game it's probably Doom or Starcraft. A computer that plays youtube videos reliably will do anything I want in IDLE or Emacs and even runs small VMs okay. The only thing I'd want a modern desktop for would be video format conversion or bloated Processing code.
    • This is making me feel old

      If your youthful recollections are about memory measured in MEGAbytes, then you are not old. Back in the 1970s, I worked on a controller board with a Z80, 256 bytes of ROM and ZERO RAM. All the state information had to be kept in registers (but, fortunately, a Z80 has two register banks). No RAM means no stack, so to call a subroutine, you had to save the return address in a register, so subroutines couldn't nest. As I recall, it just had to monitor a voltage and dial a phone number if it dropped too lo

  • not on die (Score:5, Informative)

    by Gravis Zero (934156) on Saturday November 23, 2013 @10:00AM (#45500467)

    128MB L4 cache. [...] on-package (not on-die) pool of memory

    what this means is the memory is not on the same piece of silicon as the CPU, just stuffed in the same chip package. this means they have to be connected by a lot of tiny wires instead of being integrated directly. the downside to this is that the bandwidth between the L4 memory and the CPU is very limited and it uses more power. like AMD's first APUs, which were just two ICs on the same chip, i don't think this will result in a drastic performance improvement, but i'm unsure of the power savings. if AMD gets wise, they will beat Intel to the punch. then again, if AMD is really smart, they would put out ARMv8 chips not just for servers(/desktops?) but for smartphones/tablets and laptops.

    • Re:not on die (Score:5, Informative)

      by lenski (96498) on Saturday November 23, 2013 @11:20AM (#45500783)

      what this means is the memory is not on the same piece of silicon as the CPU, just stuffed in the same chip package.

      Which allows the designers to count on carefully controlled impedances, timings, seriously optimized bus widths and state machines, and all the other goodies that come with access to internal structures not otherwise available.

      Such a resource could, if used properly, be a significant contributor to performance competitiveness.

    • by Salgat (1098063)
      Not on die means they have more control over quality and costs, as you don't need to scrap both the L4 cache and CPU if either die is bad. I personally love SoC and want to see more of it. One day we may see much of the motherboard all internalized on the same package as the CPU; this L4 cache could be just the first step to eventually internalizing RAM.
  • by Anonymous Coward

    All my algorithm development so far assumes small local caches.
    Now I can start over again.
    Aaaahhh!!!

  • I may not get the speed out of the caches, but when you consider how much RAM is utilized in your laptop, smartphone, etc., this is actually a smart move. More room means a better way to utilize the RAM, allowing other opportunities to exist.
  • At 1.6GHz, L4 latencies are 50-60ns which is significantly higher than the L3 but just half the speed of main memory.

    Don't you mean "but less than half the latency of main memory?"

  • by fisted (2295862)
    So as soon as i get one of these, i won't need any DRAM anymore, since 128MB is way more than my typical memory footprint (including kernel and X11)

    I do look forward to this.
  • Eh, it's been done. (Score:2, Informative)

    by Anonymous Coward

    POWER8, anyone? With actual SMT instead of flakey HT, and lots more threads, and so on, and so forth.

    Too bad they're unobtainium, and if not, cost too much. But otherwise... anything Intel does has basically been done better before. Except process. That is the only thing they really lead with. The rest isn't half as interesting as most of the world makes it out to be.

  • With Intel's 14nm so close, and 10nm production in another year or so, they need to use all that chip area for something that doesn't necessarily generate a ton of heat. RAM is the perfect thing. Not only is the power consumption relatively disconnected from the size and density of the cache, but not having to go off-chip for a majority of memory operations means that the external dynamic ram can probably go into power savings mode for most of its life, reducing the overall power consumption of the device

  • No one ever needs more than 640KB. :P
