AMD Previews New Processor Extensions
An anonymous reader writes "It has been all over the news today: AMD announced the first of its Extensions for Software Parallelism, a series of x86 extensions to make parallel programming easier. The first are the so-called 'lightweight profiling extensions.' They would give software access to information about cache misses and retired instructions so data structures can be optimized for better performance. The specification is here (PDF). These extensions have a much wider applicability than just parallel programming — they could be used to accelerate Java, .Net, and dynamic optimizers." AMD gave no timeframe for when these proposed extensions would show up in silicon.
Just performance counters? (Score:3, Informative)
Re: (Score:2)
Re:Just performance counters? (Score:4, Informative)
The application can then gather the information, which is stored in its address space, and do with it what it will (optimize on-the-fly).
Of particular interest is that the OS can allow the profile information to be dumped to the address space of other threads/processes as well as the one that the data is collected on. The OS controls the switching of the cached profile information during a context switch.
This is both cool (in that a secondary core/thread can help optimize the first) and scary (one thread getting access to another's instruction address information). I predict exactly 42 Windows patches will be released 3.734 days after the service pack that lets Windows take advantage of this feature, for security reasons.
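To make that concrete, here is a rough Python-level sketch of the "secondary thread helps optimize the first" idea. It is only an analogy: the proposed LWP hardware would deposit records of cache misses and retired instructions into a buffer in the program's own address space, whereas this toy uses a timing-based counter, and the Worker/optimizer names and thresholds are invented purely for illustration.

    import threading
    import time

    class Worker:
        def __init__(self):
            self.chunk_size = 64           # the knob the optimizer thread tunes
            self.slow_events = 0           # stand-in for a cache-miss counter
            self.done = False

        def process_chunk(self, n):
            sum(i * i for i in range(n))   # placeholder workload

        def run(self):
            for _ in range(1000):
                start = time.perf_counter()
                self.process_chunk(self.chunk_size)
                if time.perf_counter() - start > 0.001:
                    self.slow_events += 1  # profile event, recorded in our own state
            self.done = True

    def optimizer(worker):
        # Secondary thread: read the worker's profile data and tune it on the fly.
        while not worker.done:
            time.sleep(0.01)
            if worker.slow_events > 10:
                worker.chunk_size = max(8, worker.chunk_size // 2)
                worker.slow_events = 0

    w = Worker()
    threading.Thread(target=optimizer, args=(w,), daemon=True).start()
    w.run()
    print("final chunk size:", w.chunk_size)

The point is just the shape of the loop: one thread does the work and accumulates profile events in its own state, another reads them and adjusts a knob while the work is still running.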
reminds me of the PS2's PA (Score:2)
Sony had a $10k PS2 called the PA that recorded exactly what happened on every cycle of the CPU, GPU, etc. without changing the way the game ran. It was the most incredible thing, like you had been sitting in the dark for years and then suddenly someone turned on the lights.
Is it cache misses, DMA contention, background threads, branch stalls, or actual work? Optimizing on the PC
Re: (Score:2)
I wish AMD and Intel teamed up for once (Score:3, Funny)
Yeah, I know, Intel tried with Itanium.
Re: (Score:3, Insightful)
And you want them to try *again*? As far as I'm concerned, the most amazing achievement of IA64 was that they got to start over from scratch and still ended up with an ISA whose manual is even bigger than the IA32 manual! Goes to show that the only thing worse than an ISA developed through 20 years of engineering hackery is one developed by committee.
Re: (Score:2)
Re: (Score:2)
It's a good thing Linus leveraged his considerable processor architecture experience while at Transmeta. Where would they be now had he not provided useful advice like that?
Re: (Score:2)
Re: (Score:2)
Re: (Score:2)
not good...
hell, if I didn't know better, I would suspect that Intel was government-owned. Why? Because they seem to overengineer to a degree that only NASA tops.
Re:I wish AMD and Intel teamed up for once (Score:4, Insightful)
The thing is, at this stage in processor design, the actual instruction set isn't all that important.
But *compilers* are more important than ever, and writing a good compiler is hard work. x86 compilers have been tweaked and improved for nearly 30 years. A new instruction set could NEVER achieve that kind of optimization.
Interestingly, the Itanium and the EPIC architecture were designed to move all the hard work of "parallel processing" to the compiler. Unfortunately, they could never get the compiler to work all that well on most kinds of code. The compiler could never really "extract" the parallelism that Itanium CPUs needed to run at full speed.
Which is *exactly* the problem we have now with our multi-core CPUs. Compilers don't know how to extract parallelism very well. It's an *incredibly* difficult problem that Intel has already thrown untold billions of dollars at. Essentially, even though Itanium/EPIC never caught on, we're having to deal with all the same problems it had, anyway.
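A tiny example of why that extraction is so hard: two loops that look almost identical to a compiler can differ completely in whether their iterations are independent. This is just an illustrative sketch, not a claim about what EPIC compilers did internally.

    data = list(range(1_000_000))

    # Independent iterations: each result depends only on data[i].
    # In principle this could be split across cores, if the compiler can
    # prove there are no hidden dependencies or side effects.
    squares = [x * x for x in data]

    # Loop-carried dependency: each iteration needs the previous total.
    # No compiler cleverness can simply farm these iterations out as written.
    running = []
    total = 0
    for x in data:
        total += x               # depends on the result of the previous iteration
        running.append(total)

Proving which case you are in, across function calls, pointers, and shared memory, is the part that has swallowed those billions of dollars.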
Re: (Score:2, Interesting)
Re: (Score:3, Insightful)
Re:I wish AMD and Intel teamed up for once (Score:4, Informative)
Oh who am I kidding, that could never happen.
Re: (Score:2)
I am not kidding, that would never happen.
Map and reduce? (Score:4, Interesting)
Re: (Score:2, Interesting)
And would be terrible for performance. Why on earth does everybody assume that fine-grained parallelism will ever work? You need a very highly specialized processor to make it work, and those failed a decade ago as the "standard CPUs" just blew them away. Remember the Connection Machine? That was a box with exactly that fine grain of paralleliz
Then again, schools are partly to blame (Score:2)
Re: (Score:2)
An ideal language would provide a way to express that a function has no side effects, allowing map() to farm out different slices of the array to different CPUs.
I wrote something like that [honeypot.net] for Python. The idea is that you'd use a "decorator" to indicate that a method is parallelizable (doesn't have any side effects) and roughly how many processes to spread it across (because you don't want to hit your database with 10,000 simultaneous queries just because your client could theoretically do so, for instance). For example:
would tell the multipr
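The poster's actual snippet isn't reproduced here, but a minimal sketch of that kind of decorator, built on multiprocessing.Pool, might look like the following. The parallelizable/double names and the attach-a-.parallel-method design are invented for illustration and are not the honeypot.net code.

    from multiprocessing import Pool

    def parallelizable(max_processes=4):
        def decorate(func):
            def parallel(items):
                # Cap the worker count so a "parallelizable" call can't stampede
                # a database with thousands of simultaneous queries.
                with Pool(processes=max_processes) as pool:
                    return pool.map(func, items)
            func.parallel = parallel   # attach rather than replace, so func itself
            return func                # stays picklable for the worker processes
        return decorate

    @parallelizable(max_processes=4)   # declares: no side effects, at most 4 workers
    def double(record_id):
        return record_id * 2           # stand-in for a pure, per-record computation

    if __name__ == "__main__":
        print(double(21))                  # ordinary serial call: 42
        print(double.parallel(range(8)))   # fanned out: [0, 2, 4, 6, 8, 10, 12, 14]

The caller still gets the plain function for one-off use; the declared upper bound only applies when the work is fanned out.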
Re: (Score:2)
Compilers have gotten better, but mostly at CPU-independent optimization. Compilers for x86 aren't better than compilers for other architectures; it's just that x86 CPUs are extraordinarily insensitive to mediocre code generation. The reason is twofold. First, they kind of have to be, because x86 doesn't really have enough registers to make fancy schedulin
Re: (Score:2, Interesting)
Yeah, I know, Intel tried with Itanium.
They already did. I believe the 486 was the last CPU to run x86 instructions natively. Everything since the Pentium has decoded them into a RISC-like internal ISA, which can be changed every generation if desired. The only drawback is that a relatively small area of the chip needs to be dedicated to decoding x86 instructions into whatever the internal ISA is.
And guess what? One of the things that people d
Re: (Score:3, Informative)
Close: it was the original Pentium. The Pentium Pro (which, despite a name that made it sound like a minor business/server refresh of the Pentium, was actually a completely new architecture) is where they introduced the CISC-to-RISC conversion. This was in part to make it feasible to have out-of-order execution, which many said CISC processors would never have. Turns out they were both right and wrong.
So let's stick wi
Re: (Score:3, Informative)
and did away with the aging x86 instruction set and came up with something new.
They did, at least with the FP (floating point) instructions. FP instructions were based on that awful x87 stack architecture, and it has gone away with the SSE and 64-bit extensions.
The x86 instruction set has evolved greatly over time, and will continue to evolve. Why replace it entirely from scratch? Who's to say that an entirely new instruction set won't have a whole new host of problems?
Re:I wish AMD and Intel teamed up for once (Score:5, Insightful)
We have Power and PowerPC, which have a much better instruction set than x86.
We have ARM, which is a much better instruction set than x86.
We have MIPS, which is pretty nice.
And we had the Alpha, and still do for a little while longer.
The problem with all of them is that they didn't run x86 code. Intel and AMD both made so much money from selling billions of CPUs that they could plow a lot of money into making x86 the fastest pig with lipstick the world has ever seen.
What made IA-64 such a disaster was that it was slow at running x86 code.
Re: (Score:2)
The number one requirement for a new instruction set is that it runs Windows and most Win32 programs at speeds comparable to existing processors. Given the size and scope of Windows, Microsoft probably can't easily port Windows, Win32, and Visual Studio's compiler over to another instruction set.
This means that we need either hardware or software emulation of x86 (and possibly x86-64) on whatever new instr
Re: (Score:2)
Whatever the cause is, it isn't size and scope. Practically any piece of free software compiles on a dozen architectures. For example, Debian GNU/Linux ships around thirteen gigabytes of software for each of eleven architectures ...
Re:I wish AMD and Intel teamed up for once (Score:5, Insightful)
Why is this nonsense still perpetuated? The instruction set is irrelevant; it's just an interface to tell the processor what to do. Internally, Barcelona is a very nice RISC core capable of doing so many things at once it's insane. The only thing that performs better is a GPU, and that's only because GPUs are thrown at embarrassingly parallel problems. The fastest general-purpose CPUs come from Intel and AMD, and it has nothing to do with the instruction set.
AMD64 and the new Core 2 and Barcelona chips are very nice: 16 64-bit registers, 16 128-bit registers, complete IEEE-754 floating point support, integer and floating-point SIMD instructions, out-of-order execution, streaming stores, and hardware prefetch. Add to that multiple cores with very fast busses, massive caches (with multichip cache coherency), and the ability to run any code compiled in the last 25 years. What's not to like?
Re: (Score:3, Insightful)
As for "the fastest general purpose CPUs come from Intel and AMD", have you ever looked at a Power5? It's stupid fast. Stupid expensive, too.
Re:I wish AMD and Intel teamed up for once (Score:5, Interesting)
Sure, now it is, since the decoding of CISC instructions into micro-ops has largely decoupled the ISA from the microarchitecture, allowing many of those neat-o performance features you mention, like out-of-order execution. However, in the past this wasn't the case, and a lot of x86's odd behaviors that seemed like good ideas at the time were serious performance limiters. Like a global EFLAGS register that is only partially written by various instructions (and they always write it even if the result isn't needed).
Even today, I would say that all those RISC ISAs are better than x86, simply from the standpoint that they are cleaner, easier to decode, have fewer tricky modes to deal with, fewer odd dependencies, and all the other things that make building an actual x86 chip a pain in the arse. No, in the end it makes no difference in performance. Yet, if you had it to do all over again, building the One ISA to Rule Them All without concern for software compatibility, and you decided to make something that was more like x86 than Alpha, I'd slap the taste out of your mouth.
But we do have to be concerned with software compatibility, and that I think was the GP's main point. All of those other ISAs failed to dominate -- even when there were actual performance implications! -- simply because they were not x86 and hence didn't run the majority of software. IA64 failed not because it was itself all that bad, but because it couldn't run x86 software well. So when AMD came out with 64-bit backward-compatible x86, everyone stopped caring about IA64. Because it wasn't x86, and AMD64 was.
So ultimately I agree with you both, and I don't think the GP was talking nonsense at all. It's a very valid point -- backward compatibility is king, so x86 wins by default no matter what. Your point -- that x86 isn't actually hurting us anymore -- is just the silver lining on that cloud.
Re: (Score:2)
The people who really suffer from this are Intel and AMD. They're the ones that have to design the nasty decoders for x86. They obviously find the advantages of decades of expertise in x86 ISA throughout the industry is worth th
Re: (Score:3, Interesting)
This is true, they're the ones who have to make it actually work. I think who it -really- hurts is anyone who isn't Intel or AMD trying to make an x86 chip. Unfortunately there's a lot of x86 behavior that isn't actually documented -anywhere- except inside the heads of Int
Re: (Score:3, Informative)
Why is this nonsense still perpetuated? The instruction set is irrelevant - it's just an interface to tell the processor what to do...
What's not to like?
To start with, the complexity makes it a total pain in the ass to write kernels, compilers, runtime systems, analyses, debuggers and verifiers for x86. On top of that, it costs lots of engineering time, silicon and power to implement all those microcode crackers and fancy superscalar optimizations; this is why x86 can't hold a candle to ARM in the embedded world.
But maybe you meant missing instructions? No load-linked/store conditional or bus snooping. No double (or even 1.5) compare-and-swap. No hardw
Re: (Score:2)
Re: (Score:2, Insightful)
All the other features the GP mentioned, except for the last one if you mean COMPILED code, are also available on most RISC chips
Re: (Score:3)
Isn't this no longer true on modern processors, at least up to a point? With some space per TLB entry set aside for a task ID, when you switch to a different process the CPU won't use TLB entries tagged with a different task ID. Of course the OS has to support this (it has to tell the processor, when task switching, which memory space it's switching to), and I'm not sure how big the space in the TLB is for this (it may be on
Re: (Score:3, Insightful)
Re:ARM CPUS outnumber x86 by a huge factor -probab (Score:2)
Re: (Score:2)
Re: (Score:2)
Re: (Score:2)
Re: (Score:3, Insightful)
I don't know how anything gets done using a set of words cobbled together over hundreds of years with all sorts of special rules and idioms.
Re: (Score:2)
Re: (Score:2)
if/when they get tired of sending money to Britain to pay for translators.
if/when they get tired of returning a small amount of Britain's contribution back to it to pay for translators.
There, fixed that for you...
Re: (Score:2)
Especially interesting is the Transmeta Crusoe CPU, which can load different instruction sets and translate them into its native code.
But the thing is, as far as I remember, back when the Transmeta Crusoe was close to release, Linus said something like "I compiled the Linux kernel to native Crusoe VLIW instructions and it was actually slower than the x86 code."
Re: (Score:3, Insightful)
I don't think that's a good idea. The internal micro-ops are machine-dependent, and they will change as the microarchitecture changes. By designing the micro-ops specifically for the microarchitecture, they can try to make each x86 instruction translate into an optimal sequence of micro-ops. As hardware functionality changes, existing x86 instructions can have the underlying ops changed to suit without you having
I think this is great (Score:3, Funny)
think this
is good
news.
Re: (Score:3, Informative)
As for 16-bit vs. 32-bit modes: the instructions are mostly the same. A code segment is specified as being either 16-bit or 32-bit, and that size is the default data size used by instructions within that segment. There
Re: (Score:2)
Re: (Score:3, Interesting)
You can get the x86/EMT64 documentation from intel (Score:3, Informative)
Also, I know from asm on SPARC that many op codes are really just variations of other ops (and/or pseudo ops). For in
Re: (Score:2)
Re: (Score:2)
Re: (Score:2)
Another fun thing is that a lot of games broke in Vista because the game had the "MY DOCUMENTS" folder location hard-coded.
Forward-looking programmers...
Re: (Score:2)
Re:Will Intel Adopt These Instructions? (Score:4, Informative)
EM64T [wikipedia.org]?
Re: (Score:2)
Re: (Score:2)
You can't access any memory without pointers.
You're probably thinking of Physical Address Extension (PAE), which lets page-table entries point to physical memory above 4 GB. That's existed since the Pentium Pro or so. EM64T is just the damage-control name Intel's marketing department came up with for their implementation of
Re: (Score:2)
By the way, you do not need pointers to address memory, and what I had stated was that in order to address higher than 4GB of RAM, the EM64T chips have
Re: (Score:2)
Let's start with some basic facts that you can verify for yourself by hitting the long mode specs in the AMD and Intel manuals:
1) You need PAE enabled (in CR4). Long mode uses a 4-level page-table scheme (PML4 - PDPT - PD - PT), although you can get away with using only the first three levels if you are fine with 2 MB granularity.
2) The linear address space is 64 bits.
3) The physical address space, ATM AFAIK, is 52 bits, with the other bits reserved for
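As a quick illustration of that 4-level walk: standard x86-64 long-mode paging with 4 KB pages uses bits 47-39, 38-30, 29-21, and 20-12 of the virtual address as the PML4, PDPT, PD, and PT indices, with bits 11-0 as the page offset; 2 MB pages skip the PT level and use bits 20-0 as the offset. A small sketch (the function name is invented for illustration):

    def split_virtual_address(va, large_pages=False):
        pml4 = (va >> 39) & 0x1FF          # 9 bits per level, 512 entries per table
        pdpt = (va >> 30) & 0x1FF
        pd   = (va >> 21) & 0x1FF
        if large_pages:
            return pml4, pdpt, pd, va & 0x1FFFFF    # 21-bit offset into a 2 MB page
        pt = (va >> 12) & 0x1FF
        return pml4, pdpt, pd, pt, va & 0xFFF       # 12-bit offset into a 4 KB page

    print(split_virtual_address(0x00007F1234567ABC))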
Re: (Score:2)
https://www.redhat.com/docs/manuals/enterprise/RHEL-3-Manual/release-notes/as-amd64/RELEASE-NOTES-U2-x86_64-en.html [redhat.com]
From the reference itself
" Software IOTLB -- Intel® EM64T does not support an IOMMU in hardware while AMD64 processors do. This means that physical addresses above 4GB (32 bits) cannot reliably be the source or destination of DMA operations. Therefore, the Red Hat Enterprise Linux 3 Update 2 kernel "bounces" all DMA operatio
Re: (Score:2)
It simply means that when the kernel allocates buffers for data transfer to/from hardware, it has to be a little careful about where it does it. This doesn't have any impact whatsoever on userspace code.
Also, at least in the earl
Re: (Score:2)
You didn't simplify the explanation - you did not understand it, and you STI
Re: (Score:2)
Could you clarify that at all? I'm not the end-all, be-all expert on these things, but I do know enough to be sure that what you wrote is so not-correct as to not even be wrong...
Pointers really only matter from a relatively high-level software perspective. From a low level hardware perspective, you can either say that pointers don't
Re: (Score:2)
Apparently you're a Mac person, so it's understandable.
woops (Score:2)
Re: (Score:2)
Prior to that, the closest thing was when NexGen (just before AMD bought them) developed an MMX-like extension for the Nx686 (released by AMD as the K6) and cut a deal for Cyrix to use it, which is what provoked Intel into creating MMX with cross-licensing to AMD and Cyrix.
Re: (Score:2)
(Or at least that's how I remember it working)
Re: (Score:2)
(*mostly)
Re: (Score:2)
Adopt: x86-64 (AMD created it, Intel adopted it when the Itanium sank)
Co-existing features: SIMD: MMX/SSE and 3DNow! (SSE eventually won out, but they co-existed for a long time).
Virtualization: Intel VT and AMD-V co-exist today, and both are used by virtualization projects like Xen.
Re: (Score:3, Insightful)
What happened is that the P4 architecture was more of a marketing scheme to push MHz rather than performance. AMD came out with an architecture aimed at high performance. Intel then came out with the Core 2 products, which also focused on performance instead of clock speed. Intel has a lead on the manufacturing-process side with respect to node size, which helps them produce a lot at a lower cost. And if you look at Intel's and AMD's financials, you'll see how much each has to spend on R&D. Intel has a lot
Re: (Score:3)
Re: (Score:2)
As I see it... the memory bandwidth limitations of Intel's FSB are so restrictive that for many applications it matters little how many cores or threads their CPUs can push. The reality is that Intel's chips cannot push memory around fast enough for those processors to be worthwhile. Rather than a dual quad-core sy
Re: (Score:2)
The question I used to have was: when both were using the same manufacturing process, why was AMD kicking the P4 line's teeth in, and why did it take so long for Intel to catch up? It goes back and forth.
Logical reasons to buy AMD (Score:4, Insightful)
There are three reasons to buy AMD right now.
1. Price, price, and price. AMD knows Intel has the better fab, but AMD is selling super cheap. You can get a dual-core processor for half what Intel charges, and for the average user, it is more than enough. I'm running Oblivion at 30 FPS with a $59 processor, and I've barely overclocked it. The cheapest Intel dual-core proc was $120 when I bought my $59 proc. Most people have no idea that their proc these days often underclocks itself, and you rarely touch its full potential. Intel is faster, and no one doubts that today, but if you never see the speed benefit, why spend the extra dollars? On a performance-per-dollar basis, AMD wins hands down.
2. There is a mountain of evidence against Intel for anti-trust violations, and I try not to financially support evil. The EU is also coming down on Intel for the same behavior.
3. Even if both anti-trust suits come through, AMD is near bankruptcy, and I prefer choice in the marketplace. I am terrified of the day when Intel has no competition pushing them and they can sell whatever they want at whatever price they want.
Re: (Score:3, Interesting)
Gosh, maybe you should go tell AMD that they aren't having any trouble with leakage, the yield of their 65nm parts is optimal and they can start volume production right now! The time AMD has spent not shipping Barcelona has been costing them dearly. Did you see the loss they posted la
Re: (Score:2)
Intel demanded that people not carry or display AMD products, or they'd refuse to ship product they already purchased. That is pretty clearly evil.
Intel doesn't have to buy AMD's IP. If AMD goes belly up, then Intel will have an unchallenged monopoly, and no one has suggested trying to compete with them.
Barcelona is late, and Intel does have a better manufacturing process. No one is contesting either of these points, but cheap AMD processors are reaching the 3
Re: (Score:2)
No, I'm stating some inconvenient truths.
Intel demanded that people not carry or display AMD products, or they'd refuse to ship product they already purchased. That is pretty clearly evil.
It's an alleged evil. Since the only major Intel-only brand in the US was Dell, I don't find it a particularly compelling case of evil. In fact it is a pretty short walk from saying Intel was manipulating Dell to saying Dell was manipulating Intel (hey Intel, I hear AMD has some p
Re: (Score:2)
Re: (Score:2, Funny)
Re: (Score:2)
Re: (Score:2)
Re: (Score:3, Funny)
Good times. I guess I'll have to start wearing pants now, though.
Re: (Score:2)
Re: (Score:2)
Re: (Score:2, Funny)
Re: (Score:2)
Oddly enough, the same code can often be compiled cross-architecture and cross-platform quite easily with GCC, which produces a nice, fast executable native to each platform and architecture, using a fraction of the start-up time and resources of Java.
I'm a crappy programmer, and even that is transparent to me.
Re: (Score:2)
AND well-written software. What, you think you could write code that's just as fast without all the "hardware acceleration" being done for you, without using any instruction set extensions that have been added over the years? You are on crack.
Instead of devoting transistors to speed up the latest toy programming languages ('managed' code), why can't we just train programmers better?
And better profiling tools are contrary to this goal how?
Re: (Score:2)
There's a matter of degree, to be sure, but even so, you're most likely wasting your time "optimizing" individual lines of C code, since the compiler can probably do a better job; that's been the case for quite a while.
Terrible. If people give up on optimizing their code (and on understanding why it works), the net result will always be a noticeable decrease in programming quality (an all-too-common situation).
I know that you are aiming at premature optimization, and you are really right on this one, b
Re: (Score:2)
Re:Nothing special for Java or .NET (Score:5, Insightful)
Re: (Score:2)
Ok, sorry, wrong, and yes, wrong again...
The notes about
The reason it would benefit these environments is that they are processed on the fly, so the environment could make the 'adjustments' to the code at runtime instead of it being 'locked' in as natively compiled
Re: (Score:2)
The advantag
Re: (Score:2)
Sure there are. A profiler could quickly pick up on a function that's being called many times from within a loop and decide to speed things up by inlining it. Or a bit of inline code that isn't being used often could be moved out of line, so the rest of the loop fits into a single cache line.
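As a toy illustration of that hot-path idea (only a sketch; a real JIT re-generates machine code rather than swapping Python callables, and the adaptive/dist2_3d names are invented here): count calls, and once a function is clearly hot, swap in a specialized version.

    def adaptive(specialized, threshold=1000):
        def decorate(generic):
            state = {"calls": 0, "impl": generic}
            def call(*args):
                state["calls"] += 1
                if state["calls"] == threshold:
                    state["impl"] = specialized   # hot: swap in the fast path
                return state["impl"](*args)
            return call
        return decorate

    def dist2_3d(p, q):                            # hand-specialized 3-D version
        return (p[0]-q[0])**2 + (p[1]-q[1])**2 + (p[2]-q[2])**2

    @adaptive(specialized=dist2_3d, threshold=1000)
    def distance_squared(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q))   # generic, any dimension

    for _ in range(2000):
        distance_squared((1, 2, 3), (4, 5, 6))    # implementation switches at call 1000

Hardware profile data like LWP's would let a managed runtime make that "is it hot, and why is it slow" decision from real cache-miss and retired-instruction counts instead of a bare call counter.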
Re: (Score:2)
I don't disagree with the notion that any natively compiled language could be scaled to take advantage of this. A good solution would be an OS-level scheduling mechanism for natively compiled applications that could make decisions based on the information the AMD instructions would be offering.
However, the reference you cite is more about basic instruction changing and not the dynamics of testing to see what threads are bus
Re: (Score:2)
I think you didn't read the spec. All that information is only available to the thread that is profiled; everything is context-switched so it can't leak out to other threads and definitely not to other processes.
Re: (Score:2, Funny)
I see all fuss about programming. easy. don't what the is parallel It's
Re: (Score:2)
Re: (Score:2)
They've been increasing bandwidth while adding cores, and those cores also happen to have things like L1 and L2 caches, and so forth.
C//