AMD Details Next-Gen Kaveri APU's Shared Memory Architecture

crookedvulture writes "AMD has revealed more details about the unified memory architecture of its next-generation Kaveri APU. The chip's CPU and GPU components will have a shared address space and will also share both physical and virtual memory. GPU compute applications should be able to share data between the processor's CPU cores and graphics ALUs, and the caches on those components will be fully coherent. This so-called heterogeneous uniform memory access, or hUMA, supports configurations with either DDR3 or GDDR5 memory. It's also based entirely in hardware and should work with any operating system. Kaveri is due later this year and will also have updated Steamroller CPU cores and a GPU based on the current Graphics Core Next architecture." bigwophh links to the Hot Hardware take on the story and adds, "AMD claims that programming for hUMA-enabled platforms should ease software development and potentially lower development costs as well. The technology is supported by mainstream programming languages like Python, C++, and Java, and should allow developers to more simply code for a particular compute resource with no need for special APIs."
  • The PS4 (Score:4, Interesting)

    by MXPS ( 1091249 ) on Tuesday April 30, 2013 @12:38PM (#43592549)
    will feature this technology. It will be interesting to see how it stacks up.
  • I'm curious how long it will be before these optimizations are found in the compilers themselves.

  • As usual, AMD is leaving out some key information. What will be the TDP of such chips? I've always rooted for AMD, and all my systems were built with them. You can't beat an Ivy Bridge chip for performance per watt though. With the i7-3770K, AMD doesn't offer anything compelling to compete. I like the idea that they're using the GCN architecture to assist with processing, but have they done anything to the lithography or power consumption? Intel's Haswell chips come out soon, and those are even better.
    • Power is key in the mobile space where a lot of chips are going. -Joe

      I hope that your i7-3770K is serving you well in your cell phone.

        • I guess I need to provide more information to help get my point across. Intel has 4th gen chips that run on a 7 watt TDP. The performance per watt is pretty remarkable. Intel's i7-3770K has a 77 watt TDP. AMD's FX-8350 has a 125 watt TDP, gets spanked by Intel in most benchmarks, and doesn't have any graphics chip on die to drive a monitor. Translating that down, Intel has an advantage. I would love to be proven wrong though.
        • Intel's i7-3770K has a 77 watt TDP. AMD's FX-8350 has a 125 watt TDP, gets spanked by Intel in most benchmarks, and doesn't have any graphics chip on die to drive a monitor.

          You know, that might be exactly the problem here. This is something completely different. If the GPU is any good, chances are that a combination of a high-end-GPU-equipped APU with a lot of GDDR5 memory would make many HPC people much happier than Haswell ever could. In some application areas, it's all about bandwidth. Today, if you're trying to do HPC on, say, a 20GB dataset in memory, on a single machine, you're screwed.

        • Translating that down, Intel has an advantage.

          i7 3770k: £250
          FX 8350: £160

          Yes. Advantage Intel. Also take into account that quality motherboards are usually cheaper for AMD and that one can also upgrade more easily.

          The more apt comparison is to some i5. At that point, the 8350 beats it in a large number of benchmarks (and does actually beat the much more expensive i7). Basically, in multi-threaded code the FX-8350 wins. In single-threaded code the i5 wins.

          • I do agree with you. I'm simply referring to the simple tasks the general public does. Web surfing, iTunes, emails, etc. These are not heavily threaded tasks. Granted the difference is marginal because any modern processor can handle this with ease. Sure in highly threaded workloads the AMDs offer a better bang for your buck, but the general public does not do this on a day to day basis.
        • Re: (Score:2, Interesting)

          by skids ( 119237 )

          Speaking as someone currently considering buying slightly behind the curve, I was all set to jump on an Intel-based fanless system because of the TDP figures. However, with the PowerVR versions of the Intel GPU c**k-blocking Linux graphics, and with AMD finally open-sourcing UVD, I'm now back to considering a Brazos. Fewer choices for fanless pre-built systems, though. May have to skip on the pay-a-younger-geek-because-I-don't-enjoy-playing-Legos-anymore part.

          So no, for some markets, Intel has not yet real

    • by serviscope_minor ( 664417 ) on Tuesday April 30, 2013 @12:56PM (#43592737) Journal

      You can't beat an Ivy Bridge chip for performance per watt though.

      Ehugh. Yes, no, kind of.

      For "general" workloads IVB chips are the best in performance per Watt.

      In some specific workloads, the high-core-count Piledrivers beat IVB, but that's rare. For almost all x86 work, IVB wins.

      For highly parallel, churny work that GPUs excel at, they beat all x86 processors by a very wide margin. This is not surprising. They replace all the expensive silicon that makes general-purpose processors go fast with MOAR ALUs. So much like the long line of accelerators, co-processors, DSPs and so on, they make certain kinds of work go very fast and are useless at others.

      But for quite a few classes of work, GPUs trounce IVB at performance per Watt.

      The trouble is that GPUs suck. They have teeny amounts of local memory and a slow interconnect to main memory. They also suck at certain things, and batting data between the fast (for some things) GPU and fast (for other things) CPU is a real drag because of the latency. This limits the applicability of GPUs.

      With the new architecture, which I (and presumably many others) hoped was AMD's long-term goal, a number of these problems disappear, since the link is very low latency and the memory is fully shared.

      This means the GPU's much better performance per Watt (for some things) can be brought to bear on a wider range of tasks.

      So yes, this should do a lot for power consumption for a number of tasks.

      • The trouble is that GPUs suck. They have teeny amounts of local memory and a slow interconnect to main memory. They also suck at certain things, and batting data between the fast (for some things) GPU and fast (for other things) CPU is a real drag because of the latency. This limits the applicability of GPUs.

        The "slow interconnect" you're talking about to main memory, PCI Express v3.0 has an effective bandwidth of 32GB/s which actually exceeds the best main memory bandwidth you'd get out of an Ivy Bridge CPU with very fast memory, so no, that's not a bottleneck for bandwidth, though yes, there is some latency there.

        I don't know why everyone seems to forget that GPUs aren't just fast because they have a lot of ALUs (TFA included), they are fast because of the highly specialized GDDR memory they are attached t

        • by bored ( 40072 ) on Tuesday April 30, 2013 @01:52PM (#43593315)

          The "slow interconnect" you're talking about to main memory, PCI Express v3.0 has an effective bandwidth of 32GB/s which actually exceeds the best main memory bandwidth you'd get out of an Ivy Bridge CPU with very fast memory, so no, that's not a bottleneck for bandwidth, though yes, there is some latency there.

          It's both. For my application, the GPU is roughly 3x-5x as fast as a high-end CPU. This is fairly common on a lot of GPGPU workloads. The GPU provides a decent but not huge performance advantage.

          But we don't use the GPU! Why not? Because copying the data over the PCIe link, waiting for the GPU to complete the task, and then copying the data back over the PCIe bus yields a net performance loss over just doing it on the CPU.

          In theory, a GPU sharing the memory subsystem with the CPU avoids this copy latency. Nor does it preclude still having a parallel memory subsystem dedicated to local accesses on the GPU. That is the "nice" thing about OpenCL/CUDA: the programmer can control the memory subsystems at a very fine level.

          Whether or not AMD's solution helps our application remains to be seen. Even if it doesn't, it's possible it helps some portion of the GPGPU community.

          BTW:
          In our situation it's a server system, so it has more memory bandwidth than your average desktop. On the other hand, I've never seen a GPU pull more than a small percentage of the memory bandwidth over the PCIe links doing copies. Nvidia ships a raw copy benchmark with the CUDA SDK; try it on your machines, the results (theoretical vs. reality) might surprise you.
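
          A back-of-the-envelope sketch of that trade-off, in plain C++. The numbers are illustrative assumptions rather than measurements from any particular system; the point is just that the two PCIe copies can erase the kernel's speedup once the data set is large relative to the work done on it.

          ```cpp
          #include <cstdio>

          // Toy model of GPU offload over PCIe: copy in, run the kernel, copy out.
          // All numbers are illustrative assumptions, not measurements.
          int main() {
              const double data_gb     = 4.0;   // working set copied to and from the GPU
              const double pcie_gb_s   = 10.0;  // sustained copy rate, below the ~16 GB/s per-direction peak
              const double cpu_time_s  = 2.0;   // time to do the whole job on the CPU
              const double gpu_speedup = 4.0;   // kernel speedup once data is resident on the GPU

              const double copy_time_s = 2.0 * data_gb / pcie_gb_s;           // copy in + copy out
              const double gpu_total_s = copy_time_s + cpu_time_s / gpu_speedup;

              std::printf("CPU only : %.2f s\n", cpu_time_s);
              std::printf("GPU total: %.2f s (%.2f s of that is PCIe copies)\n",
                          gpu_total_s, copy_time_s);
              // Shrink cpu_time_s or grow data_gb and the copies dominate, which is the
              // situation described above and the one a shared-memory APU removes.
              return 0;
          }
          ```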

        • The "slow interconnect" you're talking about to main memory, PCI Express v3.0 has an effective bandwidth of 32GB/s

          32GB/s doesn't sound like a lot when you divide it amongst the 400 stream processors that an upper-end AMD APU has, and that's as favorable a light as I can shine on your inane bullshit. There is a reason that discrete graphics cards have their own memory, and it isn't because they have more stream processors (these days they do, but they didn't always)... it's because PCI Express isn't anywhere near fast enough to feed any modern GPU.

          Llano APUs have been witnessed pulling 500 GFLOPS. Does 32GB/s still sound

          • Sigh. Here I go feeding the trolls.

            I'm not sure what point you're trying to make here, since MY main point in the rest of this topic was that modern GPUs are mostly limited by memory bandwidth, which makes the development in TFA pretty pointless. You're right! 32GB/s isn't enough to make the most of the computing resources available on a modern GPU! That was my point; How exactly would the GPU accessing main memory directly help? The fastest system RAM currently available in consumer markets in the fastest

            • by cynyr ( 703126 )

              Because you wouldn't need to transfer it between the CPU and GPU? You could just point the GPU at main system RAM and let it have at it.

      • by Kjella ( 173770 )

        Assuming you're willing to write special software that'll only see benefit on AMD's APUs, not on Intel's nor anything with a discrete GPU. I suppose it's different for the PS4 or Xbox720, where you can assume that everyone that'll use the software will have it, but for most PC software the advantages would have to be very big indeed. If you need tons of shading power it's better to run on discrete GPUs; even with unified memory, switching between shaders and cores isn't entirely free, so it might not do that much

    • by Luckyo ( 1726890 )

      In terms of APUs, they have Intel not just beaten but utterly demolished. Intel has absolutely nothing on AMD when it comes to the combination of a slowish low-TDP CPU and a built-in GPU with the performance of a low-end discrete GPU.

      And while they lack CPU power for the high end, wouldn't you want a discrete CPU with a discrete GPU in that segment in the first place?

  • by Anonymous Coward

    This should really help round-trip times through the GPU. With most existing setups, doing a render to texture and getting the results back CPU-side is quite expensive, but this should help a lot. It should also work great for procedurally editing/generating/swapping the geometry that you are rendering. Getting all those high-poly LODs onto the GPU will no longer be an issue with systems like this.

    Interestingly enough, this is somewhat similar to what Intel has now for their integrated graphics, except it looks
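
    For reference, the expensive round trip described above looks something like the sketch below (C++ with OpenGL; it assumes a current GL context, a framebuffer object created elsewhere, and an extension loader such as GLEW — none of that is from the comment, it's just an illustration of where the synchronous readback stalls).

    ```cpp
    #include <GL/glew.h>
    #include <cstdint>
    #include <vector>

    // Synchronous readback of a render-to-texture result. glReadPixels makes the
    // CPU wait until the GPU has finished rendering and the data has crossed back
    // over the bus -- the round trip that shared memory would largely remove.
    // 'fbo', 'width' and 'height' are assumed to be set up by the caller.
    std::vector<std::uint8_t> read_back(GLuint fbo, int width, int height) {
        std::vector<std::uint8_t> pixels(static_cast<std::size_t>(width) * height * 4);
        glBindFramebuffer(GL_READ_FRAMEBUFFER, fbo);
        glReadPixels(0, 0, width, height, GL_RGBA, GL_UNSIGNED_BYTE, pixels.data());
        glBindFramebuffer(GL_READ_FRAMEBUFFER, 0);
        return pixels;
    }
    ```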

  • One question they never seem to answer is why bother unifying the memory architecture at all? CPU and GPU memory architectures have always been different for the same reasons that CPUs and GPUs themselves are different; one is designed for fast execution of serial instructions with corresponding random smaller reads and writes to memory, and the other is designed for fast execution of parallel instructions with corresponding contiguous reads and writes that are much larger in size. It seems like you're just
    • Re:Why compromise? (Score:5, Informative)

      by SenatorPerry ( 46227 ) on Tuesday April 30, 2013 @01:13PM (#43592891)

      In OpenCL you need to copy items from the system memory to the GPU's memory and then load the kernel on the GPU to start execution. Then you must copy the data back from the GPU's memory at the end after execution. AMD is saying that you can instead pass a pointer to the data in the main memory instead of actually making copies of the data.

      This should reduce some of the memory shifting on the system and speed up OpenCL execution. It will also eliminate some of the memory constraints on OpenCL regarding what you can do on the GPU. On a larger scale it will open up some opportunities for optimizing work.
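
      As a rough sketch of the difference: the first function below is the classic copy-in/copy-out OpenCL pattern described above; the second uses OpenCL 2.0 fine-grained shared virtual memory as a stand-in for the kind of pointer-passing model hUMA is meant to enable. Whether Kaveri is exposed exactly through this API is an assumption here, and error checking is omitted for brevity.

      ```cpp
      #define CL_TARGET_OPENCL_VERSION 200
      #include <CL/cl.h>
      #include <vector>

      // Copy-based path: explicit staging buffer and two transfers per dispatch.
      void run_with_copies(cl_context ctx, cl_command_queue q, cl_kernel k,
                           std::vector<float>& data) {
          const size_t bytes = data.size() * sizeof(float);
          cl_mem buf = clCreateBuffer(ctx, CL_MEM_READ_WRITE, bytes, nullptr, nullptr);

          clEnqueueWriteBuffer(q, buf, CL_TRUE, 0, bytes, data.data(), 0, nullptr, nullptr);
          clSetKernelArg(k, 0, sizeof(cl_mem), &buf);

          const size_t global = data.size();
          clEnqueueNDRangeKernel(q, k, 1, nullptr, &global, nullptr, 0, nullptr, nullptr);
          clEnqueueReadBuffer(q, buf, CL_TRUE, 0, bytes, data.data(), 0, nullptr, nullptr);

          clReleaseMemObject(buf);
      }

      // Pointer-based path (OpenCL 2.0 fine-grained SVM): the kernel sees the same
      // allocation the CPU does, so no explicit copies are enqueued at all.
      void run_with_shared_pointer(cl_context ctx, cl_command_queue q, cl_kernel k,
                                   size_t n) {
          float* shared = static_cast<float*>(
              clSVMAlloc(ctx, CL_MEM_READ_WRITE | CL_MEM_SVM_FINE_GRAIN_BUFFER,
                         n * sizeof(float), 0));

          for (size_t i = 0; i < n; ++i) shared[i] = float(i);   // CPU writes directly

          clSetKernelArgSVMPointer(k, 0, shared);                 // pass the pointer itself
          clEnqueueNDRangeKernel(q, k, 1, nullptr, &n, nullptr, 0, nullptr, nullptr);
          clFinish(q);                                            // results visible to the CPU in place

          clSVMFree(ctx, shared);
      }
      ```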

    • by dgatwood ( 11270 )

      I can see the benefit of being able to allocate a GPU/CPU-shared memory region in VRAM for fast passing of information to the GPU without a copy, but apart from making the above concept slightly cheaper to implement, the only benefit I could come up with for allowing the GPU access to main memory is making password theft easier. That and letting their driver developers write sloppier code that doesn't have to distinguish between two types of addresses....

      The most hilarious part of this is that while they'

    • Re:Why compromise? (Score:5, Insightful)

      by forkazoo ( 138186 ) <wrosecrans AT gmail DOT com> on Tuesday April 30, 2013 @01:13PM (#43592895) Homepage

      Because when you are doing stuff like OpenCL, dispatching from CPU space to GPU space has a huge overhead. The GPU may be 100x better at doing a problem than the CPU, but it takes so long to transfer data over to the GPU and set things up that it may still be faster to do it on the CPU. It's basically the same argument that led to the FPU being moved onto the same chip as the CPU a generation ago. There was a time when the FPU was a completely separate chip, and there were valid reasons why it ought to be. But moving it on-chip was ultimately a huge performance win. The idea behind AMD's strategy is basically to move the GPU so close to the CPU that you use it as freely as we currently use the FPU.

      • Wrong! The GPU is only 100x faster at doing certain problems because of the fast GDDR memory it is attached to, which is optimized for very large sequential reads and writes. There are a tiny number of applications that require huge numbers of FLOPs on very small amounts of data (Bitcoin mining and password-hashing attacks come to mind, but that's about it.)
    • In low-cost systems the CPU and GPU are combined on a single chip with a single (slow) memory controller. Given that constraint, AMD is trying to at least wring as much efficiency as they can from that single cheap chip. I salute them for trying to give customers more for their money, but let's admit that this hUMA thing is not about breaking performance records.

    • Nah. Providing wider and faster memory will help even purely CPU codes, even those that are often quite cache-friendly. The main issue is that people want to do more GPU-ish stuff - it's not enough to serially recalculate your Excel spreadsheet. You want to run 10k MC sims driven from that spreadsheet, and that's a GPU-like load.

      But really it's not up to anyone to choose. Add-in GPU cards are dying fast, and CPUs almost all have GPUs. So this is really about treating APUs honestly, rather than trying to

  • They talk about passing pointers back and forth as though the GPU and CPU effectively share an MMU. The problem is, GPUs and CPUs don't work the same way. GPUs need to access shared resources that are per-system, whereas CPUs need to limit access to resources on a per-process basis. It would be devastating if a GPU could, for example, allow an arbitrary user-space process to overwrite parts of the kernel and inject virus code that runs with greater-than-root privilege. It would similarly be devastating

    • My understanding is that there will indeed be something like RWX control. Not just for security, but also for performance. If both sides can freely write to a chunk of memory, you can get into difficulties accounting for caches in a fast way.

      That said, if the CPU and the GPU are basically sharing an MMU, then the GPU may be restricted from accessing pages that belong to processes that aren't being rendered/computed. There's no reason why two different applications should be able to clobber each other's tex

    • GPUs need to access shared resources that are per-system, whereas CPUs need to limit access to resources on a per-process basis.

      If you plan to make the GPU easy to use as a general computing resource (which, according to the writeup, seems to be what they're aiming at), the GPU needs to also be working on a per-process basis and linked to the main system memory so that results are easily available to the main system for I/O, etc.

      Of course, even if this is their goal, one question still remains... Will thi

  • Name is a pun (Score:2, Informative)

    by Anonymous Coward

    Apparently not too many Finnish speakers here yet. Kaveri => partner/pal/mate, APU => help.

    HTH,

    ac

    • by Anonymous Coward

      Apparently not too many Finnish speakers here yet. Kaveri => partner/pal/mate, APU => help.

      HTH,

      ac

      "Kaveri" is actually the name of a major river in Karnataka, a state in India. AMD names its cores on major rivers all around the world.

      HTH.

    • by Radak ( 126696 )

      It's the night before Vappu. We're way too busy getting drunk in Finland.

    • It's not just Finnish. Hebrew chaver [wiktionary.org] may be the common etymology for both this and the Finnish word. It is also the origin of the Dutch word gabber [wikipedia.org].

      OTOH, the pun with APU is harder to explain without Finnish.

  • With a GPU next to the CPU, the latency between them is reduced, and this is awesome for OpenCL applications. Imagine you wanted to work a Markov model into your AI and you needed to do a large number of matrix calculations to get it to run properly, and you wanted it in real time; I think this might solve that problem. I'm imagining game AI improving with adoption of this style of processor. Anyone see this differently?

    • by godrik ( 1287354 )

      I don't know... This heterogeneous computing with low latency seems interesting if it does not harm raw performance. The main advantage would be to transport data back and forth between the two. If the computation on one side is long, then the decrease in latency is not very useful. If both of them are really fast, then there is not too much to gain to begin with.

      It really helps when you need fast turnaround, so for small and very synchronous computations. I am waiting to see one good use case.

    • by gmueckl ( 950314 )

      Learning AIs in games have been problematic in the past. Mostly it is about control over the experience that gets delivered to the customer: as a designer your job is to get it just right. You can do this easily with current more or less heuristic AI algorithms. The ability to learn opens the scope of possible behaviours so much that it's not possible anymore to deliver a guaranteed experience.

      Short version: the designer can't stop the game from going nuts in unpredictable ways because of stupid player inpu

  • I think AMD overrates heterogeneous computing. The assumption is that all applications can take advantage of GPGPU. This is simply not true. Only certain types of application are suitable, such as multimedia and simulation - where it's very obvious what part of the code can be parallelised.

  • by juancn ( 596002 ) on Tuesday April 30, 2013 @02:11PM (#43593527) Homepage
    Today I read an article in Gamasutra [gamasutra.com] that details some of the internals of the PlayStation 4, and the architecture looks a lot like what's described here.

    With GDDR5 memory this could be very interesting.

    • Holy crap -- has hell frozen over? Sony is actually thinking about developers for once!? Using (mostly) off-the-shelf commodity parts is definitely going to help win back some developers. Time will tell if "they are less evil than Microsoft"

      Thanks for the great read.

  • by Shinobi ( 19308 ) on Tuesday April 30, 2013 @03:11PM (#43594197)

    OK, so the SGI O2's UMA has now been reinvented for a new generation, just with more words tacked on....

  • I'm interested to see what the software model for this will be. Sure they could use OpenCL, but it seems like a lot of the pain in using OpenCL derives from the underlying memory architecture. With a shared virtual address space and fully coherent caches all in hardware, it should be possible to have a much simpler software model than OpenCL. I guess it doesn't really matter what the software model is though since now that everything is in main memory, GPU functions can be called just like regular functi
    • Indeed it should be easier. There will still be some cost, since the processors are still in thread bundles and still trade speed for throughput, but the cost will be much lower. I expect the break-even point will be pretty small, though, and it won't have the huge disadvantage of limited memory for very large things.

      I wonder what the low level locking primitives between the GPU and CPU will be. Those will have some effect on the speed.

      I also wonder what/how the stream processors will be dealt with by the OS an

  • Why would a graphics card want to use virtual memory? Also, what motherboard takes GDDR5? Who the heck wrote this nonsense?
    • Why would a graphics card want to use virtual memory?

      Shared physical memory avoids the cost of copying data to and from the GPU but without shared virtual memory the data will end up at different addresses on the CPU and GPU. This means that you cannot use pointers to link parts of the data together and must rely on indexes of some sort. This makes it harder to port existing code and data structures to use GPU computation.

      Also, with shared physical memory you have to tell the device which memory you want to use (so that it can tell you which address to use).
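
      To illustrate the pointer point: with a shared virtual address space, a CPU-built, pointer-linked structure can be handed to the GPU as-is, while without it the links have to be flattened into position-independent indices first. This is a hypothetical sketch of the data layout, not any vendor's API.

      ```cpp
      #include <cstdint>
      #include <vector>

      // With a shared virtual address space, the CPU can hand this structure to the
      // GPU as-is: the 'next' pointers mean the same thing on both sides.
      struct Node {
          float value;
          Node* next;   // only meaningful on the GPU if virtual addresses are shared
      };

      // Without shared virtual memory, the same list has to be flattened into
      // position-independent links, e.g. indices into a contiguous array that gets
      // copied to the device.
      struct FlatNode {
          float        value;
          std::int32_t next;  // index of the next element, -1 for end of list
      };

      std::vector<FlatNode> flatten(const Node* head) {
          std::vector<FlatNode> out;
          for (const Node* n = head; n != nullptr; n = n->next) {
              // Node i lands in slot i, so its successor (if any) is slot i + 1.
              out.push_back({n->value, n->next ? std::int32_t(out.size() + 1) : -1});
          }
          return out;
      }
      ```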

  • In my experience, the GPU and especially GPGPU bottleneck is not the amount of memory but memory access bandwidth. A 256-512 bit bus is not adequate for existing apps. Before the amount of memory becomes important, manufacturers should move to at least a 2048-bit memory bus and also increase the number of registers per core several times. (A quick sanity check on those numbers follows below.)
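
    Memory-interface bandwidth scales with bus width times per-pin data rate; the figures in this sketch are illustrative, not the specs of any particular part.

    ```cpp
    #include <cstdio>

    // Rough bandwidth arithmetic for a GDDR5-style memory interface.
    // Illustrative numbers, not the specs of any particular part.
    int main() {
        const double bus_bits     = 256.0;  // interface width
        const double gbps_per_pin = 6.0;    // effective GDDR5 data rate per pin
        const double gb_per_s     = bus_bits / 8.0 * gbps_per_pin;

        std::printf("%g-bit bus at %g Gbps/pin -> %g GB/s\n",
                    bus_bits, gbps_per_pin, gb_per_s);
        // 256-bit -> 192 GB/s here; the 2048-bit bus asked for above would land
        // around 1.5 TB/s at the same per-pin rate.
        return 0;
    }
    ```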
  • I haven't seen this magical word in the presentation. Moreover, I do not see the CPU/GPU convergence that is often talked about; it sounds more like marketing hype. The ecosystem could also be enriched with DSP or network-processor cores, all uniformly offering their resources to software, but I did not see that either.
  • The technology is supported by mainstream programming languages like Python, C++, and Java, and should allow developers to more simply code for a particular compute resource with no need for special APIs.

    So how do you do this in Java or Python? Did nobody ask? I did a search for "java huma uniform memory access" and this page came up first, with nothing from java.com or Oracle in sight.

    OK, more searching says to use OpenCL, and there are lots of Stack Overflow questions... but they're not new... and OpenCL is not Java. W

"If value corrupts then absolute value corrupts absolutely."

Working...