AMD Hardware

AMD Previews New Processor Extensions 198

Posted by kdawson
from the parallel-universes dept.
An anonymous reader writes "It has been all over the news today: AMD announced the first of its Extensions for Software Parallelism, a series of x86 extensions to make parallel programming easier. The first are the so-called 'lightweight profiling extensions.' They would give software access to information about cache misses and retired instructions so data structures can be optimized for better performance. The specification is here (PDF). These extensions have a much wider applicability than just parallel programming — they could be used to accelerate Java, .Net, and dynamic optimizers." AMD gave no timeframe for when these proposed extensions would show up in silicon.
  • by Erich (151) on Wednesday August 15, 2007 @03:42PM (#20241243) Homepage Journal
    Looks like there isn't a whole lot there that you couldn't get using existing performance counters and a tool like oprofile....
    • by pipatron (966506)
      But this could probably do it dynamically, in real time, which might be nice. Dunno, didn't RTFA of course.
    • by imgod2u (812837) on Wednesday August 15, 2007 @05:34PM (#20242447) Homepage
      Looking at the PDF, it supposedly gathers profile data in the background (in local caches on the chip itself) and dumps periodically depending on the OS/application settings. This allows it to profile on-the-fly with very little impact on application performance.

      The application can then gather the information, which is stored in its address space, and do with it what it will (optimize on-the-fly).

      Of particular interest is that the OS can allow the profile information to be dumped to the address space of other threads/processes as well as the one that the data is collected on. The OS controls the switching of the cached profile information during a context switch.

      This is both cool (in that a secondary core/thread can help optimize the first) and scary (one thread getting access to another's instruction address information). I predict there will be exactly 42 Windows patches released 3.734 days after the service pack that allows Windows to take advantage of this feature because of security reasons.
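To make the buffering idea in the comment above concrete, here is a toy software model of it: events pile up in a buffer owned by the application, and a callback drains them once a threshold is reached. This is only a sketch. The names `EventRecord`, `ProfileBuffer`, and `summarize`, the event label, and the capacity and threshold values are all hypothetical and are not taken from AMD's specification.

```python
from collections import Counter, namedtuple

# Hypothetical record layout: an event label plus the instruction address
# that triggered it. The real record format is defined by the spec, not here.
EventRecord = namedtuple("EventRecord", ["kind", "address"])


class ProfileBuffer:
    """Toy stand-in for a profiling buffer in the application's address space."""

    def __init__(self, capacity, threshold, on_threshold):
        self.capacity = capacity          # maximum records held at once
        self.threshold = threshold        # fill level that triggers the callback
        self.on_threshold = on_threshold  # plays the role of the periodic dump
        self.records = []
        self.dropped = 0

    def record(self, event):
        if len(self.records) >= self.capacity:
            self.dropped += 1             # buffer full: count the loss, do not block
            return
        self.records.append(event)
        if len(self.records) >= self.threshold:
            self.on_threshold(self)

    def drain(self):
        drained, self.records = self.records, []
        return drained


hot_spots = Counter()


def summarize(buffer):
    """Application-side consumer: count cache misses per instruction address."""
    for event in buffer.drain():
        if event.kind == "cache_miss":
            hot_spots[event.address] += 1


buf = ProfileBuffer(capacity=8, threshold=4, on_threshold=summarize)
for addr in [0x400, 0x404, 0x400, 0x408, 0x400, 0x40C, 0x404, 0x400]:
    buf.record(EventRecord("cache_miss", addr))

print(hot_spots.most_common(1))
```

In this model the program being profiled only pays for an append; the aggregation runs in batches, which is the property the comment is pointing at.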
    • Looks like there isn't a whole lot there that you couldn't get using existing performance counters and a tool like oprofile....

      Sony had a $10k PS2 called the PA that recorded exactly what happened to every cycle on the cpu, gpu etc. without changing the way the game ran. It was the most incredible thing, like you had been sitting in the dark for years and then suddenly someone turned on the lights.

      Is it cache misses, dma contention, background threads, branch stalls or actual work? Optimizing on the PC

    • I wonder if this isn't part of the series of changes announced at MS TechEd, where it was said the Ring 0 (Kernel) instructions would be emulated to provide a bit of a speed-up for the VS Hypervisor. It was said that both Intel and AMD were preparing designs to support virtualisation in silicon. That would put it out somewhere near the end of 2007 I think.
  • by rolfwind (528248) on Wednesday August 15, 2007 @03:43PM (#20241261)
    and did away with the aging x86 instruction set and came up with something new.

    Yeah, I know, Intel tried with Itanium.
    • Re: (Score:3, Insightful)

      by Chris Burke (6130)
      Yeah, I know, Intel tried with Itanium.

      And you want them to try *again*? As far as I'm concerned the most amazing achievement of IA64 was that they got to start over from scratch, and ended up with an ISA with a manual even bigger than the IA32 manual! Going to prove that the only thing worse than an ISA developed through 20 years of engineering hackery is one developed by committee.
      • by gilesjuk (604902)
        Indeed, devices at the lowest level don't always look that pretty. As Linus said, with Itanium Intel threw away all the good bits.
        • by dfghjk (711126)
          "As Linus said, with Itanium Intel threw away all the good bits."

          It's a good thing Linus leveraged his considerable processor architecture experience while at Transmeta. Where would they be now had he not provided useful advice like that?
          • by Chris Burke (6130)
            They'd have been even worse off even sooner than what actually happened. Any other questions?
      • It's like the saying goes: None of us is as dumb as all of us....
      • by hitmark (640295)
        and this is the same corp that came up with ACPI and EFI, iirc.
        not good...

        hell, if i didn't know better, i would suspect that intel was government owned, why? because they seem to overengineer to a degree that only nasa tops.
    • by realmolo (574068) on Wednesday August 15, 2007 @03:58PM (#20241381)
      Yup. They tried it with Itanium, and it didn't work.

      The thing is, at this stage in processor design, the actual instruction set isn't all that important.

      But *compilers* are more important than ever, and writing a good compiler is hard work. x86 compilers have been tweaked and improved for nearly 30 years. A new instruction set could NEVER achieve that kind of optimization.

      Interestingly, the Itanium and the EPIC architecture were designed to move all the hard work of "parallel processing" to the compiler. Unfortunately, they could never get the compiler to work all that well on most kinds of code. The compiler could never really "extract" the parallelism that Itanium CPUs needed to work at full speed.

      Which is *exactly* the problem we have now with our multi-core CPUs. Compilers don't know how to extract parallelism very well. It's an *incredibly* difficult problem that Intel has already thrown untold billions of dollars at. Essentially, even though Itanium/EPIC never caught on, we're having to deal with all the same problems it had, anyway.
      • Re: (Score:2, Interesting)

        by Anonymous Coward
        IBM's PPC compiler kicked the shit out of every x86 compiler. (Apples and oranges, but the quality was much better). Same for ARM's compiler and Sun's (SPARC) compiler. Fact is, x86 is the ugly girl at the party, but it gets more attention from GCC, MS, Intel, etc. Native compilers on other architectures beat the shit out of it.
        • Re: (Score:3, Insightful)

          by jguthrie (57467)
          Okay, I'll feed the troll. Tell me where I can buy an ATX (or smaller) PPC motherboard and CPU new for, oh, say $200, and I'll look at PPC again. The reason that x86 gets all the software is because it's the cheapest, it's the cheapest because all the motherboard manufacturers make motherboards for it, and all the motherboard manufacturers make motherboards for it because it gets all the software.
        • by x2A (858210) on Wednesday August 15, 2007 @08:01PM (#20243815)
          So what we need really is a "native" x86 compiler, say, from Intel, that would maybe outperform the multi-platform GCC compiler... an Intel C/C++ Compiler, or 'ICC' we could call it... maybe...

          Oh who am I kidding, that could never happen.

          • by Carewolf (581105)
            And one that doesn't artificially limit the performance of other (let's say non-Intel) x86 CPUs.

            I am not kidding, that would never happen.
      • Map and reduce? (Score:4, Interesting)

        by tepples (727027) <tepples@[ ]il.com ['gma' in gap]> on Wednesday August 15, 2007 @06:40PM (#20243117) Homepage Journal

        Compilers don't know how to extract parallelism very well. It's an *incredibly* difficult problem
        It's not that compilers can't extract parallelism. It's that the C and C++ language standards lack a way to express parallelism. Often, you want to compute a function for each element in an array, resulting in a new array. In some languages, this is called map(). In Python, this is [expression_involving(el) for el in some_list]. An ideal language would provide a way to express that a function has no side effects, allowing map() to farm out different slices of the array to different CPUs. However, iterators in C++ and many other popular languages assume that the computation may have side effects, and provide no way inside the standard language to ask the compiler to break the computation into slices.
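As a rough illustration of the slicing idea described above, the sketch below maps a side-effect-free function over contiguous slices of a list and stitches the results back together in order. The helper `parallel_map` and the function `square` are hypothetical names made up for this example. It uses threads from the standard library's `concurrent.futures` for simplicity; in CPython, CPU-bound work would need a process pool (and a top-level function instead of the lambda) to actually occupy several CPUs.

```python
from concurrent.futures import ThreadPoolExecutor


def square(x):
    """Pure function: the result depends only on the argument."""
    return x * x


def parallel_map(fn, items, workers=4):
    """Split items into contiguous slices and map each slice on its own worker.

    This is only safe because fn is assumed to have no side effects, so the
    slices can be processed in any order without changing the result.
    """
    items = list(items)
    if not items:
        return []
    size = -(-len(items) // workers)  # ceiling division
    slices = [items[i:i + size] for i in range(0, len(items), size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map yields results in the order the slices were submitted
        mapped = pool.map(lambda chunk: [fn(el) for el in chunk], slices)
        return [y for chunk in mapped for y in chunk]


print(parallel_map(square, range(10)))
```

Nothing in the language checks that `square` is pure; the caller has to promise it, which is exactly the gap the comment complains about.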
        • Re: (Score:2, Interesting)

          An ideal language would provide a way to express that a function has no side effects, allowing map() to farm out different slices of the array to different CPUs.

          And would be terrible for performance. Why on earth does everybody assume that fine grained parallelism will ever work? You need a very highly specialized processor to make it work and those have failed a decade ago as the "standard CPUs" just blew them away. Remember the Connection Machine, that was a box with exactly that fine grain of paralleliz

          • You will have to learn to handle the parallelism. It takes different algorithms and a different way to structure programs.
            Why are these parallel algorithms not taught in university computer science classes from day 1?

            Those languages have been around forever; functional programming languages can be parallelized automatically. So if they make it so much easier, why aren't they used?
            Educational inertia probably makes up a large part of it.
        • An ideal language would provide a way to express that a function has no side effects, allowing map() to farm out different slices of the array to different CPUs.

          I wrote something like that [honeypot.net] for Python. The idea is that you'd use a "decorator" to indicate that a method is parallelizable (doesn't have any side effects) and roughly how many processes to spread it across (because you don't want to hit your database with 10,000 simultaneous queries just because your client could theoretically do so, for instance). For example:

          @parallelizable(10, perproc=4)
          def timestwo(x, y): return (x + y) * 2

          print map(timestwo, [1, 2, 3, 4], [7, 8, 9, 10])

          would tell the multipr

      • by be-fan (61476)
        But *compilers* are more important than ever, and writing a good compiler is hard work. x86 compilers have been tweaked and improved for nearly 30 years.

        Compilers have gotten better, but mostly at CPU-independent optimization. Compilers for x86 aren't better than compilers for other architectures, it's just that x86 CPUs are extraordinarily insensitive to mediocre code generation. The reason is two-fold. First, they kind of have to be, because x86 doesn't really have enough registers to make fancy schedulin
    • Re: (Score:2, Interesting)

      by Slashcrap (869349)
      and did away with the aging x86 instruction set and came up with something new.

      Yeah, I know, Intel tried with Itanium.


      They already did. I believe the 486 was the last CPU to run x86 instructions natively. Everything since the Pentium has decoded them to a RISC like ISA which can be changed every generation if desired. The only drawback is that a relatively small area of the chip needs to be dedicated to decoding x86 instructions to whatever the internal ISA is.

      And guess what? One of the things that people d
      • Re: (Score:3, Informative)

        by Chris Burke (6130)
        I believe the 486 was the last CPU to run x86 instructions natively.

        Close, it was the original Pentium. The Pentium Pro -- which, despite a name that made it sound like a minor improvement to the Pentium for business/servers, was actually a completely new architecture -- is where they introduced the CISC->RISC conversion. This was in part to make it feasible to have out-of-order execution, which many said CISC processors would never have. Turns out they were both right and wrong.
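A toy picture of the CISC->RISC conversion mentioned above: one instruction with a memory operand gets split into simpler load/compute/store steps. Real micro-op formats are internal to each chip and are not public, so the tuple encoding, the `decode` function, and the `tmp0`/`tmp1` temporaries below are purely hypothetical.

```python
def decode(instruction):
    """Split one x86-style (op, dst, src) instruction into simpler micro-ops.

    Operands written as "[reg]" are treated as memory references; everything
    else is treated as a register. This is an illustration, not a real decoder.
    """
    op, dst, src = instruction
    uops = []
    if src.startswith("["):
        # Memory source: load it into a temporary first.
        uops.append(("load", "tmp0", src))
        src = "tmp0"
    if dst.startswith("["):
        # Memory destination: read, modify in a temporary, write back.
        uops.append(("load", "tmp1", dst))
        uops.append((op, "tmp1", src))
        uops.append(("store", dst, "tmp1"))
    else:
        uops.append((op, dst, src))
    return uops


for uop in decode(("add", "[rbx]", "rax")):
    print(uop)
```

Once the work is in this form, the load, the add, and the store can be scheduled separately, which is roughly why the split helps out-of-order execution.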

        So let's stick wi
    • Re: (Score:3, Informative)

      by Vellmont (569020)

      and did away with the aging x86 instruction set and came up with something new.

      They did, at least with the FP (floating point) instructions. FP instructions were based on this awful stack architecture, and it's gone away with all the SSE and 64 bit extensions.

      The x86 instruction set has evolved greatly over time, and will continue to evolve. Why replace it entirely from scratch? Who's to say that an entirely new instruction set won't have a whole new host of problems?
    • by LWATCDR (28044) on Wednesday August 15, 2007 @04:03PM (#20241437) Homepage Journal
      Well we had the 68000 family which had a much better instruction set than the X86.
      We have the Power and PowerPC which had a much better instruction set than the X86.
      We have the ARM which is a much better instruction set than the X86.
      We have the MIPS which is pretty nice.
      And we had the Alpha and still do for a little while longer.
      The problem with all of them is that they didn't run X86 code. Intel and AMD both made so much money from selling billions of CPUs that they could plow a lot of money into making the X86 the fastest pig with lipstick that the world has ever seen.
      What made the IA-64 such a disaster was that it was slow running X86 code.

      • I don't know why you aren't modded +5 (at the moment anyway), but you're precisely correct.

        The number one requirement for a new instruction set is that it runs Windows and most Win32 programs at speeds comparable to existing processors. Given the size and scope of Windows, Microsoft probably can't easily port Windows and Win32 and Visual Studio's compiler over to another instruction set.

        This means that we either need hardware or software emulation of x86 (and possibly x86-64) on whatever new instr
        • by jgrahn (181062)

          Given the size and scope of Windows, Microsoft probably can't easily port Windows and Win32 and Visual Studio's compiler over to another instruction set.

          Whatever the cause is, it isn't size and scope. Practically any piece of free software compiles on a dozen architectures. For example, Debian GNU/Linux ships around thirteen gigabytes of software for each of eleven architectures ...

      • by Criffer (842645) on Wednesday August 15, 2007 @05:00PM (#20242079)
        Not again.

        Why is this nonsense still perpetuated? The instruction set is irrelevant - it's just an interface to tell the processor what to do. Internally, Barcelona is a very nice RISC core capable of doing so many things at once it's insane. The only thing that performs better is a GPU, and that's only because they're thrown at embarrassingly parallel problems. The fastest general purpose CPUs come from Intel and AMD, and it has nothing to do with instruction set.

        AMD64, and the new Core2 and Barcelona chips are very nice chips. 16 64-bit registers, 16 128-bit registers, complete IEEE-754 floating point support, integer and floating-point SIMD instructions, out-of-order execution, streaming stores and hardware prefetch. Add to that multiple cores with very fast busses, massive caches - with multichip cache coherency - and the ability to run any code compiled in the last 25 years. What's not to like?
        • Re: (Score:3, Insightful)

          by Chirs (87576)
          The instruction set *is* relevant to low-level designers. Working with the PowerPC instruction set is much nicer than x86...for me at least.

          As for "the fastest general purpose CPUs come from Intel and AMD", have you ever looked at a Power5? It's stupid fast. Stupid expensive, too.
        • by Chris Burke (6130) on Wednesday August 15, 2007 @06:18PM (#20242897) Homepage
          Why is this nonsense still perpetuated? The instruction set is irrelevant - it's just an interface to tell the processor what to do.

          Sure, now it is, since the decoding of CISC instructions into micro-ops has largely decoupled ISA from the microarchitecture, allowing many of those neat-o performance features you mention like out-of-order execution. However in the past this wasn't the case and a lot of x86's odd behaviors that seemed like good ideas when they were made were serious performance limiters. Like a global eflags register that is only partially written by various instructions (and they always write even if the result isn't needed).

          Even today, I would say that all those RISC ISAs are better than x86, simply from the standpoint that they are cleaner, easier to decode, have fewer tricky modes to deal with, fewer odd dependencies, and all the other things that make building an actual x86 chip a pain in the arse. No, in the end it makes no difference in performance. Yet, if you had it to do all over again, building the One ISA to Rule Them All without concern for software compatibility, and you decided to make something that was more like x86 than Alpha, I'd slap the taste out of your mouth.

          But we do have to be concerned with software compatibility, and that I think was the GP's main point. All of those other ISAs failed to dominate -- even when there were actual performance implications! -- simply because they were not x86 and hence didn't run the majority of software. IA64 failed not because it was itself all that bad, but because it couldn't run x86 software well. So when AMD came out with 64-bit backward-compatible x86, everyone stopped caring about IA64. Because it wasn't x86, and AMD64 was.

          So ultimately I agree with you both, and I don't think the GP was nonsense at all. It's a very valid point -- backward compatibility is king, so x86 wins by default no matter what. Your point -- that x86 isn't actually hurting us anymore -- is just the silver lining on that cloud.
          • by truesaer (135079)
            Even today, I would say that all those RISC ISAs are better than x86, simply from the standpoint that they are cleaner, easier to decode, have fewer tricky modes to deal with, fewer odd dependencies, and all the other things that make building an actual x86 chip a pain in the arse.

            The people who really suffer from this are Intel and AMD. They're the ones that have to design the nasty decoders for x86. They obviously find the advantages of decades of expertise in x86 ISA throughout the industry is worth th

            • Re: (Score:3, Interesting)

              by Chris Burke (6130)
              The people who really suffer from this are Intel and AMD. They're the ones that have to design the nasty decoders for x86. They obviously find the advantages of decades of expertise in x86 ISA throughout the industry is worth the effort.

              This is true, they're the ones who have to make it actually work. I think who it -really- hurts is anyone who isn't Intel or AMD trying to make an x86 chip. Unfortunately there's a lot of x86 behavior that isn't actually documented -anywhere- except inside the heads of Int
        • Re: (Score:3, Informative)

          Why is this nonsense still perpetuated? The instruction set is irrelevant - it's just an interface to tell the processor what to do...

          What's not to like?

          To start with, the complexity makes it a total pain in the ass to write kernels, compilers, runtime systems, analyses, debuggers and verifiers for x86. On top of that, it costs lots of engineering time, silicon and power to implement all those microcode crackers and fancy superscalar optimizations; this is why x86 can't hold a candle to ARM in the embedded world.

          But maybe you meant missing instructions? No load-linked/store conditional or bus snooping. No double (or even 1.5) compare-and-swap. No hardw

        • by LWATCDR (28044)
          I don't believe that ISA doesn't matter. If for no other reason than the X86 has a real shortage of GP registers. To gain the extra registers you must run in 64-bit mode, so you must live with 64-bit addressing even if you really don't need it. As you said the X86 is fast which is also what I said. The ISA is very messy and a real pain to write code for. There will always be some people that must write assembly. Yes the x86 is really fast even without a good ISA. It has also been updated over the years to
      • Re: (Score:3, Insightful)

        by wonkavader (605434)
        No, the problem with the IA-64 was not that it was slow running x86 code. The problem was that it was slow running x86 code and not that great at running non-x86 code. Spectacular performance on non-x86 would have made it a much greater success, but it was lackluster from the start. After so long spent on designing a new chip, you'd expect some real results -- it was not much better than the alternatives. "Why bother?", the world said, and says even now.
    • by nbert (785663)
      Not that it would make much of a difference - in the end most of the instruction set won't be used by programmers and especially compilers (CISC vs. RISC anyone?). But to get back to the topic: The overhead caused by upwards compatibility isn't that big after all. Problems a normal user experiences are not caused by bad hardware design nowadays.
    • by Ant P. (974313)
      The thing is, what would they replace it with that they can sell? The only choices are emulation or translating code on the fly, both of which have sunk already.
    • Re: (Score:3, Insightful)

      by servognome (738846)

      and did away with the aging x86 instruction set and came up with something new.
      I wish they'd do away with English and come up with something new - a language based on consistent & logical rules.
      I don't know how anything gets done using a set of words cobbled together over hundreds of years with all sorts of special rules and idioms.
      • by rolfwind (528248)
        Yes, it's called German:) (Actually, English and German both descend from a common Germanic ancestor.)
  • by P3NIS_CLEAVER (860022) on Wednesday August 15, 2007 @04:25PM (#20241711) Journal
    I for one
    think this
    is good
    news.
