AMD Details Next-Gen Kaveri APU's Shared Memory Architecture
crookedvulture writes "AMD has revealed more details about the unified memory architecture of its next-generation Kaveri APU. The chip's CPU and GPU components will have a shared address space and will also share both physical and virtual memory. GPU compute applications should be able to share data between the processor's CPU cores and graphics ALUs, and the caches on those components will be fully coherent. This so-called heterogeneous uniform memory access, or hUMA, supports configurations with either DDR3 or GDDR5 memory. It's also based entirely in hardware and should work with any operating system. Kaveri is due later this year and will also have updated Steamroller CPU cores and a GPU based on the current Graphics Core Next architecture."
bigwophh links to the Hot Hardware take on the story, and writes "AMD claims that programming for hUMA-enabled platforms should ease software development and potentially lower development costs as well. The technology is supported by mainstream programming languages like Python, C++, and Java, and should allow developers to more simply code for a particular compute resource with no need for special APIs."
The PS4 (Score:4, Interesting)
Re: (Score:3)
One of the problems with the PS3 is that it didn't have shared memory. Maybe you're thinking of the 360.
Re: (Score:2)
Re: (Score:2)
Interesting (Score:2)
I'm curious how long it will be before these optimizations are found in the compilers themselves.
Where's the fine print? (Score:1)
Re: (Score:2)
Power is key in the mobile space where a lot of chips are going. -Joe
I hope that your i7-3770K is serving you well in your cell phone.
Re: (Score:1)
Re: (Score:2)
Intel's i7-3770K has a 77 watt TDP. AMD's FX-8350 has a 125 watt TDP, gets spanked by Intel in most benchmarks, and doesn't have any graphics chip on die to drive a monitor.
You know, that might be exactly the problem here. This is something completely different. If the GPU is any decent, chances are that a combination of a high-end-GPU-equipped APU with a lot of GDDR5 memory would make many HPC people much happier than Haswell ever could. In some application areas, it's all about bandwidth. Today, if you're trying to do HPC on, say, a 20GB dataset in memory, on a single machine, you're screwed.
Re: (Score:2)
Translating that down, Intel has an advantage.
i7 3770k: £250
FX 8350: £160
Yes. Advantage Intel. Also take into account that quality motherboards are usually cheaper for AMD and that one can also upgrade more easily.
The more apt comparison is to some i5. At that point, the 8350 beats it in a large number of benchmarks (and does actually beat the much more expensive i7). Basically, in multi-threaded code the FX-8350 wins. In single-threaded code the i5 wins.
Re: (Score:1)
Re: (Score:2)
The only way you can state that "each tab in the browser is another thread" is if you're not using Firefox. Even on 20.1, they still don't have that, and a single bad tab can and does take the entire browser down. Hell, even IE 9 finally got it straight with tabs. It doesn't handle too many tabs at once, but it runs each one in a separate thread, just like Chrome does. Opera does the same AFAIK (don't use it). You're right about background tasks and such though, as the multithread performance of an AMD CPU is far b
Re: (Score:2, Interesting)
Speaking as someone currently considering buying slightly behind the curve, I was all set to jump on an Intel-based fanless system because of the TDP figures. However, with the PowerVR versions of the Intel GPU c**k-blocking Linux graphics, and with AMD finally open-sourcing UVD, I'm now back to considering a Brazos. Fewer choices for fanless pre-built systems, though. May have to skip on the pay-a-younger-geek-because-I-dont-enjoy-playing-legos-anymore part.
So no, for some markets, Intel has not yet real
Re: (Score:2)
You're welcome : ) http://4changboard.wikia.com/wiki/Falcon_Guide [wikia.com]
Re: (Score:2)
The sad thing is that the PowerVR parts are actually pretty decent. The drivers (made by Intel) are to blame.
Re:Where's the fine print? (Score:5, Insightful)
You can't beat an Ivy Bridge chip for performance per watt though.
Ehugh. Yes, no, kind of.
For "general" workloads IVB chips are the best in performance per Watt.
In some specific workloads, the high core count piledrivers beat IVB, but that's rare. For almost all x86 work IVB wins.
For highly parallel, churny work that GPUs excel at, they beat all x86 processors by a very wide margin. This is not surprising. They replace all the expensive silicon that makes general-purpose processors go fast and put in MOAR ALUs. So, much like the long line of accelerators, co-processors, DSPs and so on, they make certain kinds of work go very fast and are useless at others.
But for quite a few classes of work, GPUs trounce IVB at performance per Watt.
The trouble is that GPUs suck. They have teeny amounts of local memory and a slow interconnect to main memory. They also suck at certain things, and batting data between the fast (for some things) GPU and fast (for other things) CPU is a real drag because of the latency. This limits the applicability of GPUs.
Only with the new architecture, which I (and presumably many others) hoped was AMD's long-term goal, have a number of these problems disappeared, since the link is very low latency and the memory is fully shared.
This means the GPU, with its much superior performance per Watt (for some things), can be used for a wider range of tasks.
So yes, this should do a lot for power consumption for a number of tasks.
Re: (Score:2)
The trouble is that GPUs suck. They have teeny amounts of local memory and a slow interconnect to main memory. They also suck at certain things, and batting data between the fast (for some things) GPU and fast (for other things) CPU is a real drag because of the latency. This limits the applicability of GPUs.
The "slow interconnect" to main memory you're talking about: PCI Express 3.0 has an effective bandwidth of 32GB/s, which actually exceeds the best main-memory bandwidth you'd get out of an Ivy Bridge CPU with very fast memory. So no, that's not a bottleneck for bandwidth, though yes, there is some latency there.
I don't know why everyone (TFA included) seems to forget that GPUs aren't just fast because they have a lot of ALUs; they are fast because of the highly specialized GDDR memory they are attached t
Re:Where's the fine print? (Score:4, Informative)
The "slow interconnect" you're talking about to main memory, PCI Express v3.0 has an effective bandwidth of 32GB/s which actually exceeds the best main memory bandwidth you'd get out of an Ivy Bridge CPU with very fast memory, so no, that's not a bottleneck for bandwidth, though yes, there is some latency there.
Its both, for my application, the GPU is roughly 3x-5x as fast as a high end CPU. This is fairly common on a lot of GPGPU workloads. The GPU provides a decent but not huge performance advantage.
But, we don't use the GPU! Why not? Because copying the data over the PCIe link, waiting for the GPU to complete the task, and then copying the data back over the PCI bus yields a net performance loss over just doing it on the CPU.
In theory, a GPU sharing the memory subsystem with the CPU avoids this copy latency. Nor does it preclude still having a parallel memory subsystem dedicated for local accesses on the GPU. That is the "nice" thing about opencl/CUDA the programmer can control the memory subsystems at a very fine level.
Whether or not AMD's solution helps our application remains to be seen. Even if it doesn't its possible it helps some portion of the GPGPU community.
BTW:
In our situation, it's a server system, so it has more memory bandwidth than your average desktop. On the other hand, I've never seen a GPU pull more than a small percentage of the memory bandwidth over the PCIe links doing copies. Nvidia ships a raw copy benchmark with the CUDA SDK; try it on your machines, and the results (theoretical vs. reality) might surprise you.
Re: (Score:3)
The "slow interconnect" you're talking about to main memory, PCI Express v3.0 has an effective bandwidth of 32GB/s
32GB/s doesnt sounds like a lot when you divide it amongst the 400 stream processors that an upper end AMD APU has, and thats as favorable a light as I can shine on your inane bullshit. There is a reason that discrete graphics cards have their own memory, and it isnt because they have more stream processors (these days they do, but they didnt always) .. its because PCI Express isnt anywhere near fast enough to feed any modern GPU.
Llano APU's have been witnessed pulling 500 GFLOPS. Does 32GB/s still sound
Re: (Score:2)
Sigh. Here I go feeding the trolls.
I'm not sure what point you're trying to make here, since MY main point in the rest of this topic was that modern GPUs are mostly limited by memory bandwidth, which makes the development in TFA pretty pointless. You're right: 32GB/s isn't enough to make the most of the computing resources available on a modern GPU! That was my point. How exactly would the GPU accessing main memory directly help? The fastest system RAM currently available in consumer markets in the fastest
Re: (Score:2)
Because you wouldn't need to transfer it between the CPU and GPU? You could just point the GPU at main system RAM and let it have at it.
Re: (Score:2)
Assuming you're willing to write special software that'll only see benefit on AMD's APUs, not on Intel nor anything with discrete GPUs. I suppose it's different for the PS4 or Xbox 720, where you can assume that everyone who'll use the software will have it, but for most PC software the advantages would have to be very big indeed. If you need tons of shading power it's better to run on discrete GPUs; even with unified memory, switching between shaders and cores isn't entirely free, so it might not do that much
Re: (Score:2)
Re: (Score:2)
In terms of APUs, they have Intel not just beaten but utterly demolished. Intel has absolutely nothing on AMD when it comes to the combination of a slowish low-TDP CPU and a built-in GPU with the performance of a low-end discrete GPU.
And while they lack CPU power for the high end, wouldn't you want a discrete CPU with a discrete GPU in that segment in the first place?
CPU - GPU - CPU latency (Score:2, Interesting)
This should really help round-trip times through the GPU. With most existing setups, doing a render to texture and getting the results back CPU-side is quite expensive, but this should help a lot. It should also work great for procedurally editing/generating/swapping geometry that you are rendering. Getting all those high-poly LODs onto the GPU will no longer be an issue with systems like this.
Interestingly enough, this is somewhat similar to what Intel has now for their integrated graphics, except it looks
Re: (Score:3)
/Oblg.
http://www.dvhardware.net/news/nvidia_intel_insides_gpu_santa.jpg [dvhardware.net]
or
http://media.bestofmicro.com/V/6/233106/original/feature_image09.jpg [bestofmicro.com]
Why compromise? (Score:2)
Re: (Score:1)
Having unified memory is a nice thing, but I expect it will only make a difference in something like the PS4, where you can target a specific architecture which has GDDR5 as main memory and doesn't have a discrete GPU. These two points are relevant: if you have "normal" DDR3 you lose a lot more than you gain by having UMA, and this will not change a thing for discrete GPUs because the PCIe bus is always going to be in the way of the GPU accessing main memory.
I think it is more a "ni
Re:Why compromise? (Score:5, Informative)
In OpenCL you need to copy items from system memory to the GPU's memory and then load the kernel on the GPU to start execution. Then, after execution, you must copy the data back from the GPU's memory. AMD is saying that you can instead pass a pointer to the data in main memory rather than actually making copies of the data.
This should reduce some of the memory shifting on the system and speed up OpenCL execution. It will also eliminate some of the memory constraints on OpenCL regarding what you can do on the GPU. On a larger scale it will open up some opportunities for optimizing work.
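To make that concrete, here is a minimal host-side sketch using the standard OpenCL 1.x C API (nothing AMD-specific; buffer sizes are made up for illustration and error checking is omitted). The first path is the classic explicit copy to a device buffer; the second hands the runtime a host pointer via CL_MEM_USE_HOST_PTR, which is the closest existing analogue to the pointer-passing hUMA is supposed to make routine.

    // Sketch: explicit copy vs. host-pointer buffer in OpenCL (error checks omitted).
    #include <CL/cl.h>
    #include <vector>

    int main() {
        cl_platform_id platform;  clGetPlatformIDs(1, &platform, nullptr);
        cl_device_id   device;    clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, nullptr);
        cl_context       ctx = clCreateContext(nullptr, 1, &device, nullptr, nullptr, nullptr);
        cl_command_queue q   = clCreateCommandQueue(ctx, device, 0, nullptr);

        const size_t n = 1 << 20;                      // illustrative data size
        std::vector<float> data(n, 1.0f);

        // Path 1: classic discrete-GPU style -- allocate device memory and copy into it.
        cl_mem devBuf = clCreateBuffer(ctx, CL_MEM_READ_WRITE, n * sizeof(float), nullptr, nullptr);
        clEnqueueWriteBuffer(q, devBuf, CL_TRUE, 0, n * sizeof(float), data.data(), 0, nullptr, nullptr);
        // ... run a kernel on devBuf, then clEnqueueReadBuffer to copy results back ...

        // Path 2: hand the runtime a pointer to the host allocation instead of copying.
        // On a shared-memory APU the driver can let the GPU work on this memory in place.
        cl_mem hostBuf = clCreateBuffer(ctx, CL_MEM_READ_WRITE | CL_MEM_USE_HOST_PTR,
                                        n * sizeof(float), data.data(), nullptr);
        // ... run the kernel on hostBuf; no explicit transfer is enqueued ...

        clReleaseMemObject(devBuf);
        clReleaseMemObject(hostBuf);
        clReleaseCommandQueue(q);
        clReleaseContext(ctx);
        return 0;
    }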
Re: (Score:2)
I'm sure that they've thought about that already. The question is whether they've done the work necessary to deal with the problem.
Re: (Score:1)
Re: (Score:2)
I can see the benefit of being able to allocate a GPU/CPU-shared memory region in VRAM for fast passing of information to the GPU without a copy, but apart from making the above concept slightly cheaper to implement, the only benefit I could come up with for allowing the GPU access to main memory is making password theft easier. That and letting their driver developers write sloppier code that doesn't have to distinguish between two types of addresses....
The most hilarious part of this is that while they'
Re:Why compromise? (Score:5, Insightful)
Because when you are doing stuff like OpenCL, dispatching from CPU space to GPU space has a huge overhead. The GPU may be 100x better at doing a problem than the CPU, but it takes so long to transfer data over to the GPU and set things up that it may still be faster to do it on the CPU. It's basically the same argument that led to the FPU being moved onto the same chip as the CPU a generation ago. There was a time when the FPU was a completely separate chip, and there were valid reasons why it ought to be. But moving it on-chip was ultimately a huge performance win. The idea behind AMD's strategy is basically to move the GPU so close to the CPU that you use it as freely as we currently use the FPU.
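A back-of-the-envelope sketch of that break-even argument, with purely illustrative numbers (the data size, link throughput, CPU time and GPU speedup below are assumptions, not measurements): the GPU only wins once the compute time it saves exceeds the time spent shuttling data across the link.

    // Back-of-envelope dispatch-overhead model (all numbers are illustrative assumptions).
    #include <cstdio>

    int main() {
        const double bytes       = 256.0 * 1024 * 1024;  // 256 MB working set
        const double link_bps    = 12.0e9;               // realistic PCIe 3.0 x16 throughput, bytes/s
        const double cpu_time_s  = 0.05;                 // hypothetical CPU time for the job
        const double gpu_speedup = 4.0;                  // hypothetical kernel speedup on the GPU

        const double transfer_s  = 2.0 * bytes / link_bps;        // copy in + copy out
        const double gpu_total_s = cpu_time_s / gpu_speedup + transfer_s;

        // With these numbers the GPU loses despite a 4x faster kernel, purely
        // because of the copies -- the term a shared, coherent address space removes.
        std::printf("CPU: %.3f s, GPU incl. transfers: %.3f s (%.3f s of copying)\n",
                    cpu_time_s, gpu_total_s, transfer_s);
        return 0;
    }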
Re: (Score:1)
Re: (Score:2)
Re: (Score:2)
They sell this stuff under the brand name Xeon Phi now. It's something like 60 simplified x86-like units on a die. Looks like they only cater to big orders from supercomputer builders right now.
The real reason is cost (Score:2)
In low-cost systems the CPU and GPU are combined on a single chip with a single (slow) memory controller. Given that constraint, AMD is trying to at least wring as much efficiency as they can from that single cheap chip. I salute them for trying to give customers more for their money, but let's admit that this hUMA thing is not about breaking performance records.
Re: (Score:2)
nah. providing wider and faster memory will help even purely CPU codes, even those that are often quite cache-friendly. the main issue is that people want to do more GPUish stuff - it's not enough to serially recalculate your excel spreadsheet. you want to run 10k MC sims driven from that spreadsheet, and that's a GPU-like load.
but really it's not up to anyone to choose. add-in GPU cards are dying fast, and CPUs almost all have GPUs. so this is really about treating APUs honestly, rather than trying to
Security model? (Score:2)
They talk about passing pointers back and forth as though the GPU and CPU effectively share an MMU. The problem is, GPUs and CPUs don't work the same way. GPUs need to access shared resources that are per-system, whereas CPUs need to limit access to resources on a per-process basis. It would be devastating if a GPU could, for example, allow an arbitrary user-space process to overwrite parts of the kernel and inject virus code that runs with greater-than-root privilege. It would similarly be devastating
Re: (Score:2)
My understanding is that there will indeed be something like RWX control. Not just for security, but also for performance. If both sides can freely write to a chunk of memory, you can get into difficulties accounting for caches in a fast way.
That said, if the CPU and the GPU are basically sharing an MMU, then the GPU may be restricted from accessing pages that belong to processes that aren't being rendered/computed. There's no reason why two different applications should be able to clobber each other's tex
Re: (Score:2)
GPUs need to access shared resources that are per-system, whereas CPUs need to limit access to resources on a per-process basis.
If you plan to make the GPU easy to use as a general computing resource (which, according to the writeup, seems to be what they're aiming at), the GPU needs to also be working on a per-process basis and linked to the main system memory so that results are easily available to the main system for I/O, etc.
Of course, even if this is their goal, one question still remains... Will thi
Name is a pun (Score:2, Informative)
Apparently not too many Finnish speakers here yet. Kaveri => partner/pal/mate, APU => help.
HTH,
ac
Re: (Score:1)
Apparently not too many Finnish speakers here yet. Kaveri => partner/pal/mate, APU => help.
HTH,
ac
"Kaveri" is actually the name of a major river in Karnataka, a state in India. AMD names its cores on major rivers all around the world.
HTH.
Re: (Score:2)
Yes, buddy would be spot on. :)
But all in all, the chip's name really sounds sympathetic to the Finnish ear!
Re: (Score:2)
It's the night before Vappu. We're way too busy getting drunk in Finland.
Re: (Score:2)
It's not just Finnish. Hebrew chaver [wiktionary.org] may be the common etymology for both this and the Finnish word. It is also the origin of the Dutch word gabber [wikipedia.org].
OTOH, the pun with APU is harder to explain without Finnish.
Here's to better AI! (Score:2)
With a GPU next to the CPU, the latency between them is reduced, which is awesome for OpenCL applications. Imagine you wanted to work a Markov model into your AI, you needed to do a large number of matrix calculations to get it to run properly, and you wanted it in real time; I think this might solve that problem. I'm imagining game AI improving with the adoption of this style of processor. Anyone see this differently?
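For reference, the workload being described is essentially a repeated matrix-vector product. A toy CPU-side version (the state count and transition matrix below are made up for illustration) looks like this; the appeal of hUMA is that the same buffers could be handed to a GPU kernel each frame without a round-trip copy.

    // Toy Markov-chain update: next_state = transition * state (illustrative sizes only).
    #include <vector>
    #include <cstddef>

    std::vector<double> markov_step(const std::vector<double>& transition,  // n*n, row-major
                                    const std::vector<double>& state,       // n entries
                                    std::size_t n) {
        std::vector<double> next(n, 0.0);
        for (std::size_t i = 0; i < n; ++i)
            for (std::size_t j = 0; j < n; ++j)
                next[i] += transition[i * n + j] * state[j];  // P(i <- j) * current probability
        return next;
    }

    int main() {
        const std::size_t n = 512;                    // hypothetical number of AI states
        std::vector<double> T(n * n, 1.0 / n);        // uniform transition matrix as a placeholder
        std::vector<double> s(n, 1.0 / n);
        for (int frame = 0; frame < 60; ++frame)      // e.g. one update per game frame
            s = markov_step(T, s, n);
        return 0;
    }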
Re: (Score:2)
I don't know... This heterogeneous computing with low latency seems interesting if it does not harm raw performance. The main advantage would be moving data back and forth between the two more cheaply. If the computation on one side is long, then the decrease in latency is not very useful. If both of them are really fast, then there is not much to gain to begin with.
It really helps when you need fast turnaround, i.e. for small and very synchronous computations. I am waiting to see one good use case.
Re: (Score:2)
Learning AIs in games have been problematic in the past. Mostly it is about control over the experience that gets delivered to the customer: as a designer your job is to get it just right. You can do this easily with current more or less heuristic AI algorithms. The ability to learn opens the scope of possible behaviours so much that it's not possible anymore to deliver a guaranteed experience.
Short version: the designer can't stop the game from going nuts in unpredictable ways because of stupid player inpu
Re: (Score:2)
Re: (Score:2)
Kaveri is also Finnish and means "a friend"; that seems to describe the system decently.
Heterogeneous computing overrated (Score:1)
I think AMD overrates heterogeneous computing. The assumption is that all applications can take advantage of GPGPU. This is simply not true. Only certain types of application are suitable, such as multimedia and simulation, where it's very obvious what part of the code can be parallelised.
This looks an awful lot like the PS4 (Score:3)
With GDDR5 memory this could be very interesting.
Re: (Score:2)
Holy crap -- has hell frozen over? Sony is actually thinking about developers for once!? Using (mostly) off-the-shelf commodity parts is definitely going to help win back some developers. Time will tell if they are "less evil than Microsoft."
Thanks for the great read.
SGI O2 reinvented (Score:3)
OK, so the SGI O2's UMA has now been reinvented for a new generation, just with more words tacked on....
What about the software model (Score:2)
Re: (Score:2)
Indeed it should be easier. There will still be some cost, since the processors are still in thread bundles and still trade speed for throughput, but the cost will be much lower. I expect the break-even point will be pretty small, though, and it won't have the huge disadvantage of limited memory for very large things.
I wonder what the low level locking primitives between the GPU and CPU will be. Those will have some effect on the speed.
I also wonder what/how the stream processors will be dealt with by the OS an
whaaaaaaat? (Score:1)
One address is better than two (Score:1)
Why would a graphics card want to use virtual memory?
Shared physical memory avoids the cost of copying data to and from the GPU but without shared virtual memory the data will end up at different addresses on the CPU and GPU. This means that you cannot use pointers to link parts of the data together and must rely on indexes of some sort. This makes it harder to port existing code and data structures to use GPU computation.
Also, with shared physical memory you have to tell the device which memory you want to use (so that it can tell you which address to use).
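A small C++ illustration of that pointer-versus-index point (the structs are hypothetical): with shared virtual memory the first layout can be traversed by either processor as-is, while without it the links have to be flattened into indices before the data is handed to the GPU.

    #include <vector>

    // Linked data as the CPU naturally builds it: traversable by the GPU only if
    // both sides resolve the same virtual addresses (which is what shared VM provides).
    struct TreeNode {
        float     value;
        TreeNode* left;   // raw pointer: meaningless to a device with its own address space
        TreeNode* right;
    };

    // The same data flattened for a non-shared address space: links become indices
    // into one array, so they survive being copied to a different set of addresses.
    struct FlatNode {
        float value;
        int   left;       // index into the node array, or -1 for "none"
        int   right;
    };

    int main() {
        TreeNode leaf{2.0f, nullptr, nullptr};
        TreeNode root{1.0f, &leaf, nullptr};     // natural, pointer-linked form

        std::vector<FlatNode> flat = {           // hand-flattened equivalent
            {1.0f, 1, -1},                       // node 0: left child is node 1
            {2.0f, -1, -1}                       // node 1: leaf
        };
        (void)root; (void)flat;
        return 0;
    }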
GPU/GPGPU bottleneck (Score:2)
IOMMU (Score:1)
OK so how do I (or someone) use this? (Score:1)
So how do you do this in Java, Python? Did nobody ask? I did a search for "java huma uniform memory access" and this page came up first with nothing from java.com or oracle in sight.
Ok more searching says to use OpenCL and lots of stackoverflow questions... but they're not new... and OpenCL is not Java. W
Spam Advocates (Score:3)
Re: (Score:1, Insightful)
Re: No match for Haswell (Score:3)
Re: (Score:2)
Radeon 5670 with 512MB onboard (cost from Newegg when bought: $90) plays GW, SC and all the other games I've thrown at it quite handily. It will have problems if a game is heavily tessellated, but that's the only time it's a problem, and I run 1920x1080 (monitor native rez). The funniest thing is that the new Radeon drivers support the damn thing, while my 7300GT is no longer supported by either Nvidia or Linux, even with the goddamn nouveau and nv drivers. The reason I still have the old GeForce 7300GT - it's fanless s
Re: (Score:1)
Ok. Noted. Either will do fine CPU-wise.
Ah. Great. So AMD is the better buy then.
Not only that, but it will save ~$100 on the CPU and ~$50 more on the motherboard. That's GREAT advice.
But no... Then we hear this:
Re: No match for Haswell (Score:2)
Re: (Score:1)
Eg: If the user intends to get a discrete GPU, as you say, s/he will have approx $150 more to spend on the GPU if s/he picks the AMD solution. A $250 GPU vs. a $100 GPU is a pretty significant difference. Thus if graphics matter, the user should pick the AMD solution.
(*) of which there is
Re: (Score:3)
AMD beats Intel on the price point however.
And that isn't even counting that with Intel you also need to buy an extra $100 card.
If you *need* top notch performance, go Intel. Otherwise AMD will be lighter on your wallet and do the same job very well.
Re: (Score:2)
Re: (Score:2)
I think AMD's target for this architecture is a typical Walmart shopper
Partly that, and partly it's way more interesting. The unified memory trades performance for flexibility (as always), but puts it in a very interesting space. It's less performant than a discrete GPU with a crazy memory architecture, but it puts tons more FPU grunt under the flexible memory subsystem of easy-to-use CPUs.
It will make acceleration more applicable to a much wider range of tasks at the cost of being slower on some.
Due to the close c