First 16-Core Opteron Chips Arrive From AMD
angry tapir writes "After a brief delay and more than a year of chatter, Advanced Micro Devices has announced the availability of its first 16-core Opteron server chips, which pack the largest number of cores available on x86 chips today. The new Opteron 6200 chips, code-named Interlagos, are 25 per cent to 30 per cent faster than their predecessors, the 12-core Opteron 6100 chips, according to AMD."
Wish List (Score:4, Informative)
sandy bridge ep 95W (Score:4, Informative)
There will be server versions as well...I've seen specs (publicly available) for an 8-core (16-thread) sandy bridge EP with a 95W TDP. I suspect it's clocked a bit lower and maybe binned for efficiency.
Re:Compared to Intel? (Score:5, Informative)
Re:Only 16? (Score:2, Informative)
Pffft, it's only 8 cores anyway: 8 cores, each with 2 integer units. It's no more 16-core than Intel's 8 cores with hyperthreading.
Re:Bulldozer Cores are not that Great (Score:5, Informative)
Your description is inaccurate, but that's not surprising, since most Slashdot readers don't know much about CPU architecture.
Bulldozer's cores are essentially full-fledged cores; the two cores in each module are mostly independent. There are two completely independent integer pipelines, yet people seem to want to harp on the fact that the FPU is "shared". It's really a single split FPU, where each half can execute independent instructions as long as the data width is 128 bits or less. Only when it is executing 256-bit AVX instructions is there any competition for resources. This is a very sensible design decision, since there isn't enough AVX software right now to justify completely dedicated AVX logic. (Plus, IIRC, Sandy Bridge's FPU is only 128 bits wide and issues AVX instructions over two cycles, so what's the difference?) Moreover, even with AVX-heavy workloads, most software won't issue AVX instructions every cycle, and two AVX-heavy tasks on the same module won't really run into much contention. Assuming my memory of Sandy Bridge's FPU is correct, Bulldozer even has the advantage of lower FPU latency on isolated AVX instructions.
The PROBLEM with Bulldozer is that they just have not done some of the really aggressive and costly things that Intel has done in their design. Bulldozer is still a 3-issue design. While going to 4-issue doesn't help that much that often, it still gives Sandy Bridge a slight edge. But where SB REALLY gets its advantage is the huge instruction window. Intel found clever ways to shrink the logic for various components so that they could make room for a much larger physical register file and reorder buffer. As a result, SB can have many more decoded instructions in flight, which exposes more instruction-level parallelism and, critically, absorbs more memory access latency.
A Sun engineer (discussing Rock, among other things) once described modern CPU execution as a race between last-level cache misses. A miss in the L3 cache can cost hundreds of cycles, upwards of 1000. During that miss, the CPU fills up its reservation stations with other instructions and then stalls, waiting for something to retire, which won't happen until the missing load's data comes back. Because of the disparity in speed (and latency) between compute and memory access, this is typically the most significant bottleneck. By enlarging the instruction window, SB can absorb more of that latency and achieve much higher throughput, and it shows in the benchmarks.
This is Bulldozer's Achilles' heel. I know there are a few benchmarks where Bulldozer is faster than SB, but they're not typical workloads with typical memory footprints. Anyhow, if you're going to rag on Bulldozer, rag on it for the right reasons. Bulldozer's "shared" FPU is a red herring.
Re:Only 16? (Score:4, Informative)
No: 8 integer cores per chip, but only 4 real cores, for a total of 8 real cores across the 2 chips.
Re:Bulldozer Cores are not that Great (Score:5, Informative)
The OP is right, and seems to understand the issues far better than you. It isn't that the FPU is shared, it's that nearly _everything_ is shared: instruction cache, fetch and decode, FPU, L2 data cache. The only things that aren't shared are the L1 data cache and the integer units (scheduler and ALUs).
Instruction issue and cache misses are big performance factors, but these are precisely the resources the cores share! You're running two threads off the same caches and instruction fetch (with the exception of the L1 data cache). So, in reality, the second core in Bulldozer is much more like ultra-hyperthreading than it is a second core. I think the fact that they're even listed as cores is a marketing strategy that has backfired pretty hard.
P.S. L3 cache has proven to be quite useless in many workloads... It helps a bit in servers, IIRC, but that's about it. So it's more a race to L2 cache, which, again, is a shared resource. AMD, in fact, has indicated that it may drop the L3 from desktop parts.
Re:When are multiple cores going to help me? (Score:4, Informative)
You're doing it wrong.
make -j8
Re:how do they compare ? (Score:5, Informative)
-mainconcept http://www.lostcircuits.com/mambo//i...&limitstart=17 [lostcircuits.com]
-mediashow http://www.guru3d.com/article/amd-fx...ssor-review/14 [guru3d.com]
-h.264 http://www.guru3d.com/article/amd-fx...ssor-review/14 [guru3d.com]
-vp8 http://www.guru3d.com/article/amd-fx...ssor-review/17 [guru3d.com]
-sha1 http://www.guru3d.com/article/amd-fx...ssor-review/17 [guru3d.com]
-photoshop cs5 http://www.lostcircuits.com/mambo//i...&limitstart=14 [lostcircuits.com]
-photoshop cs5 http://www.tomshardware.com/reviews/...x,3043-15.html [tomshardware.com]
-winrar, faster than 2600k http://www.techspot.com/review/452-a...pus/page7.html [techspot.com]
-winrar, improves over x6 http://www.tomshardware.com/reviews/...x,3043-16.html [tomshardware.com]
-7-zip better than 2600k here: http://images.anandtech.com/graphs/graph4955/41698.png [anandtech.com] http://www.anandtech.com/show/4955/t...x8150-tested/7 [anandtech.com]
-7-zip same perf as 2600k http://www.tomshardware.com/reviews/...x,3043-16.html [tomshardware.com]
-POV-ray, faster than 2600k http://www.legitreviews.com/article/1741/10/ [legitreviews.com]
-POV-ray http://www.nordichardware.se/test-la...art=15#content [nordichardware.se]
-x264(2nd pass AVX enabled) http://www.anandtech.com/show/4955/t...x8150-tested/7 [anandtech.com]
-x264 (2nd pass, better overall than 2600k) http://www.bjorn3d.com/read.php?cID=2125&pageID=11108 [bjorn3d.com]
-x264 (2nd pass +.3 than SB2600k) http://www.legitreviews.com/article/1741/7/ [legitreviews.com]
-handbrake; http://www.legitreviews.com/article/1741/9/ [legitreviews.com]
-truecrypt; http://www.bjorn3d.com/read.php?cID=2125&pageID=11111 [bjorn3d.com]
-solidworks; faster than 2600k http://www.techspot.com/review/452-a...pus/page7.html [techspot.com]
-abbyy filereader http://www.tomshardware.com/reviews/...x,3043-16.html [tomshardware.com]
-C-Ray, as fast as $1k i7-990X, http://i664.photobucket.com/albums/v.../c-rayir38.png [photobucket.com]
Re:When are multiple cores going to help me? (Score:4, Informative)
What do you mean by "only one of my compilers actually takes advantage of the multiple cores when it is compiling"?
Are you on Windows? Because any compiling done in Linux with a "make"-based (or similar) build system can use as many cores as you can throw at the machine (regardless of the actual compiler it's running). It should be the same in Windows...
Don't look to your compiler to be multithreaded... look at the build system (i.e. in Visual Studio there should be an option somewhere to tell it how many processors to use while compiling). For make you just do "make -j8" to use 8 "jobs" total for compiling (i.e. up to 8 instances of the compiler running at once).
Here is a test for one of my software projects doing "make -j#" where # is 1,4,8,12,16,24:
1 : 15m9.614s
4 : 3m57.947s
8 : 2m6.354s
12 : 1m33.426s
16 : 1m25.559s
24 : 1m17.345s
That is on my dual 6-core hyperthreaded Mac workstation (so it had 12 "real" cores and 12 "hyperthreads"). You can see that hyperthreads definitely aren't as good as real cores... but do provide some speedup. That said, I thank God every time I compile (which is all day long) for the cores he has bestowed upon me...
Good to hear that you are already on SSD... because parallel compiling does need a speedy disk to keep the processors humming. The timings above are for two 256GB SSDs in RAID0.
Re:Only 16? (Score:4, Informative)