Ars Dissects POWER5, UltraSparc IV, and Efficeon
Burton Max writes "There's an interesting article here at Ars about the POWER5, UltraSparc IV, and Efficeon CPUs. It's a self-styled 'overview of three specific upcoming processors: IBM's POWER5, Sun's UltraSparc IV, and Transmeta's Efficeon.' I found the insights into Efficeon (successor to Crusoe) to be particularly good (although it paints a sad picture of Transmeta, methinks)."
Good article (Score:5, Interesting)
Re:Good article (Score:5, Interesting)
Re:Good article (Score:3, Informative)
In fact, the current Intel processor roadmap [intel.com] shows the same Itanium 2 processor for the first half of 2004 as it did for the second half of 2003.
Re:Good article (Score:3, Informative)
"To get the "hyperthreading" effect of two processors on one chip, Sun stuck two full-blown UltraSparc III cores on a single chip, which is chip-pin compatible with the UltraSparc III."
He assumes the interested reader will already know something about the UltraSparc III. Sun didn't fundamentally change the chip architecture. Also, the Itanium architecture is already discussed ad nauseam in other articles. It wasn't meant to be a ba
Re:Good article (Score:4, Informative)
Probably the most significant outcome of the USIV will be 212-CPU Sun Fire 15K servers. That seems to imply something like 5 or 6 CPUs per rack-unit (although it appears the 15K is somewhat bigger than a standard rack).
Re:Good article (Score:2)
As for rack sizes, the 15K racks are about the same size as normal racks, but are slightly deeper. The system is not like a standard rack +
brain fart while reading the article (Score:2, Funny)
"This is why the advances that have the most striking impact on the nature and function of the computer are the ones that move data closer to the functional units. A list of such advances might look something like: DRAM, PCI, on-die caches, DDR signaling, and even the Internet"
For a second there, I thought that the list of advances started with DRM, not DRAM, and I almost had a heart attack.
Re:brain fart while reading the article (Score:1)
Re:brain fart while reading the article (Score:1)
Re:brain fart while reading the article (Score:1)
Transmeta is a joke (Score:3, Insightful)
Just like that article yesterday on their new chip. Did they ever cite a single benchmark? NO.
The basic performance of your CPU product, as measured by industry standard benchmarks, is essential knowledge.
I was under NDA on the previous gen Transmeta stuff. It was amusing how the other OEMs reacted - it was crap, but nobody could say anything in public.
Re:Transmeta is a joke (Score:2)
That's because they aren't going for speed. They are going for low power consumption. To compare Transmeta to Intel based purely on speed would be missing the point entirely.
Sun? (Score:4, Interesting)
Re:Sun? (Score:1)
I am sorry to break this to you, but the 12" PB has a Motorola chip in it...
it just had to be said....
Re:Sun? (Score:2)
Re:Sun? (Score:5, Insightful)
I don't mean to burst your bubble, but your 12" PowerBook uses a Motorola processor, not an IBM one. I own a 15" PowerBook though and I love it.
That having been said, the IBM PPC 970 or G5 is breathing new life into the PowerMac line and Apple is doing really well because of it. I can't wait until they get it stuffed into a PowerBook.
Re:Sun? (Score:2)
Re:Sun? (Score:3, Insightful)
"I don't like or use it so one else does"
Real smart.
Any idea how many Sun systems are out there? People who use Sun hardware and software and, *gasp*, like it?! Should we only evaluate chips that currently do OK in the Slashdot market?
Re:Sun? (Score:2)
rTransmeta? (Score:1)
What's this?
Re:rTransmeta? (Score:1)
Re:rTransmeta? (Score:1)
One Power 5... (Score:5, Interesting)
This means that in a (say) 512-processor box the OS will have to handle 2048 processors efficiently. That's placing a lot of control in the hands of the software designers, and a lot of money in the hands of the companies that license per processor.
On the other hand, UNIX is getting pretty efficient at scaling to large systems; perhaps it (and by extension Linux, thanks to SGI and IBM) will be able to handle it with no problems. One thread per processor on a desktop system might prove to be quite efficient.
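As a back-of-the-envelope sketch of the scaling and licensing arithmetic above (Python; the box size and the per-processor fee are invented for illustration, and whether a vendor counts logical or physical CPUs varies):

```python
# How SMT inflates the logical CPU count the OS (and per-CPU software
# licensing) has to deal with. All numbers are illustrative assumptions.
physical_cpus = 512        # hypothetical big POWER5 box
threads_per_cpu = 2        # POWER5 runs two SMT threads per core
logical_cpus = physical_cpus * threads_per_cpu
print(f"{physical_cpus} physical CPUs -> {logical_cpus} logical CPUs")

license_per_cpu = 10_000   # made-up per-processor license fee
print(f"Licensed per logical CPU:  ${logical_cpus * license_per_cpu:,}")
print(f"Licensed per physical CPU: ${physical_cpus * license_per_cpu:,}")
```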
Re:One Power 5... (Score:4, Interesting)
Re:One Power 5... (Score:2)
Re:One Power 5... (Score:3, Informative)
Re:One Power 5... (Score:3, Insightful)
Re:One Power 5... (Score:2)
Well (Score:2)
Re:Well (Score:2)
Actually, that's what a support contract is for. The bigger problem is availability. Each brick requires four processors, plus the various work to mold all the interconnects into place. The yield on a process like that can't be very high. Not to mention all the custom parts that would be needed to fit a chip like this.
In other words, if my processor fails, there's
Re:Well (Score:1)
Re:Well (Score:1)
IBM won't be making these things to order; the minute your RS/6000 (p-series) loses a processor in the brick, a CE will be out that day to fix it.
Re:Well (Score:1)
Re:Well (Score:2)
Re:One Power 5... (Score:2, Interesting)
Re:One Power 5... (Score:3, Interesting)
So, IBM is taking away the ability to hot swap individual chips in exchange for... what? That's the big question. If there's some major improvement in the design, say so! Inquiring minds want to know!
Re:One Power 5... (Just a matter of scale...) (Score:3, Insightful)
One of the first computers I built had individual TTL parts (74xx type things) to make the CPU. If I fried one of those, I would just replace that single part and be going again. No need to replace the whole CPU.
I, for one, would never go back to that. Not just the size but the performance and the cost.
It used to be that I would buy 4K-bit RAM chips. Buy 8 of those to make an 8x4K RAM array (4K bytes) and then add a simple address decoder
Re:One Power 5... (Score:4, Informative)
What is gained is full-speed interconnect between processors within the same module. No "multipliers" - the bus between the cores within the module runs at chip speed. The timings are so tight at 2+ GHz that this is simply impossible to do with individual chips.
-Isaac
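A rough back-of-the-envelope makes the timing point concrete (Python; assumes signals propagate at about half the speed of light on copper traces, which is a ballpark, not a spec):

```python
# Why on-module buses can run "at chip speed" while off-chip ones can't:
# at 2+ GHz there's barely time for a signal to cross a few centimeters.
c = 3.0e8                  # speed of light, m/s
signal_speed = 0.5 * c     # rough propagation speed in copper traces (assumed)
for ghz in (1.0, 2.0, 4.0):
    cycle_time = 1.0 / (ghz * 1e9)           # seconds per clock
    distance_cm = signal_speed * cycle_time * 100
    print(f"{ghz} GHz: one cycle = {cycle_time * 1e12:.0f} ps, "
          f"signal travels ~{distance_cm:.1f} cm")
```

At 2 GHz a signal covers only about 7.5 cm per cycle before you even account for driver, pad, and settling delays, which is why leaving the module means dropping to a slower bus.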
Re:One Power 5... (Score:1)
Re:One Power 5... (Score:2)
Now if you'll excuse me, I need to see how useful these new chips are as boat anchors...
Re:One Power 5... (Score:2)
If I have a 64-processor machine chugging along for years on end, I have a reasonably good chance of seeing a failure. (Particularly when chips come from a bad batch.)
So source bricks from the same batch and source multi-brick systems from different batches. If you have to toss the whole brick at once, it's best to keep the stuff that's more likely to fail on that brick.
Re:One Power 5... (Score:1)
Re:One Power 5... (Score:2)
Re:One Power 5... (Score:2)
Re:One Power 5... (Score:1, Informative)
Re:One Power 5... (Score:3, Insightful)
Fortunately for IBM, they are both the hardware designers and, frequently, the software designers. They can ensure that their big iron will be supported by software.
Re:One Power 5... (Score:2)
Performance != marketshare (Score:5, Insightful)
The "hyperthreading" thing. (Score:4, Interesting)
It's amusing seeing this. It reflects mostly that Microsoft has finally managed to ship in volume OSs that can do more than one thing at a time. (Bear in mind that most of Microsoft's installed base is still Windows 95/98/ME. Transitioning the customer base to NT/Win2K/XP has gone much more slowly than planned.)
But Microsoft takes the position that if you have multiple CPUs, you have to pay more to run their software. So these strange beasts with multiple decoders sharing ALU resources emerge.
Re:The "hyperthreading" thing. (Score:2)
Microsoft will eventually provide XP Home in an SMP flavor, it's only a matter of time. Perhaps they will have an HT edition before that happens. But SMP for free is just another selling point for Linux, so they won't let it be a sticking point forever.
Re:The "hyperthreading" thing. (Score:1)
Re:The "hyperthreading" thing. (Score:2)
If these were x86 chips I think the licensing question would be valid, but since they're not...
Re:The "hyperthreading" thing. (Score:1)
Those heady days before they broke the whole point of NT (dividing the kernel from the hardware layer), but also before they could make a stable OS.
Re:The "hyperthreading" thing. (Score:2)
Re:The "hyperthreading" thing. (Score:2)
Alex
Bullshit (Score:1, Informative)
Win95/98/ME are not multiprocessor but are preemptive multitasking and multithreading. They can certainly do "more than one thing at a time". Unlike Apple, who first shipped this capability only recently, MS first shipped this in Windows/386 back in the late '80s.
Re:The "hyperthreading" thing. (Score:2)
Is that really true? Judging by the web logs from my employer's site, it looks like about 65% of our users are on NT/2K/XP. Our customers are all in the construction industry, not the tech industry, so they aren't likely to be early adopters.
If you're talking MS's home users, then that's pretty plausible, but home users aren't the majority of Microsoft's installed base.
I'd be interested to see some numbers, though, if you
power consumption (Score:5, Interesting)
The main selling point of transmeta was always power consumption, so have they lost their edge in that area? If so, then that would be serious for them, but the article doesn't answer that question.
Re:power consumption (Score:2)
Re:power consumption (Score:1)
The article didn't answer the questi
Re:power consumption (Score:2)
No, they're still great for power consumption. The problem is that the CPU isn't the only thing in most devices sucking power, and they built up expectations that their chips would be able to perform much better than they have turned out to do. I still think they are good choices for a lot of devices that don't really need any more power - they're basically like ARM with x86 compatibility built in, and there are plenty of cases where something like that makes sense - but they definitely haven't lived up to the
Re:power consumption (Score:2)
When the project failed to do that (quite badly), the marketeers refocused the company message to start talking about "low-power" and efficiency. This deflects the critics who do not understand computer architecture and things like power efficiency.
Yeah, it's a neat research project and having Linus work there didn't hurt PR at all, but the performance just isn'
quick... (Score:1)
Why only two threads per core? (Score:3, Interesting)
I mean, the MTA supercomputer, which pioneered the entire SMT concept, was able to run 128 threads per CPU. OK, so they had different design constraints as well. Basically, the idea was that the CPUs didn't have any cache at all, making them simpler and cheaper. To avoid the performance hit usually associated with this, they simply switched to another thread when one thread became blocked waiting for memory access.
Anyway, is there any specific reason why IBM didn't put more than two threads per CPU on the POWER5, say 8 or 16?
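As a rough illustration of the switch-on-stall idea described above, here's a toy simulation (Python; the latency, load probability, and thread counts are made-up numbers, not MTA or POWER5 specs):

```python
# Toy model of hiding memory latency with many hardware contexts:
# the core switches to another ready thread whenever one blocks on a load.
import random
random.seed(1)

MEM_LATENCY = 100   # cycles a load stalls its thread (assumed)
LOAD_PROB = 0.2     # chance an instruction is a blocking load (assumed)

def utilization(n_threads, cycles=50_000):
    ready_at = [0] * n_threads   # cycle at which each thread can run again
    busy = 0
    for cycle in range(cycles):
        for t in range(n_threads):        # run the first ready thread
            if ready_at[t] <= cycle:
                busy += 1
                if random.random() < LOAD_PROB:   # issues a load, blocks
                    ready_at[t] = cycle + MEM_LATENCY
                break                     # no ready thread -> idle cycle
    return busy / cycles

for n in (1, 4, 32, 128):
    print(f"{n:3d} contexts -> {utilization(n):.0%} busy")
```

With one context the core idles through almost every memory stall; with dozens, utilization saturates, which is the MTA's whole bet.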
contexts != threads (Score:5, Informative)
First, you don't just automatically get a linear increase with the width of the multiple-threading capabilities. It's not like it's free to increase the RF size and/or FUs, etc.
You're also confusing contexts with active threads. The Tera^WCray MTA had 128 contexts available -- so that thread switching is more lightweight, more or less -- but only one could be active at a time.
SMT in its various forms has more than one active thread, which introduces the problem(s) of competing for resources in the issue and retire stages, and so on.
Re:Why only two threads per core? (Score:3, Informative)
Subsequently, I don't know how much you've pla
Re:Why only two threads per core? (Score:3, Interesting)
1) Being able to FIND parallelism
2) Being able to take advantage of it:
a) Issuing multiple instructions (limited fetch bandwidth)
b) Executing them in parallel (limited FUs)
c) Committing them to memory / retiring
20% is generous, but that's a limitation of the simplicity of HT with respect to the EV8 / UltraSparc-V scale of SMT implementation, which leans towards a more full-issue design.
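To make points 1 and 2 above concrete, here's a toy dataflow-scheduling sketch (Python; the instruction trace and widths are invented for illustration): it issues each instruction as early as its dependencies and an issue-width cap allow, and reports the resulting IPC.

```python
# Toy limit study: achieved IPC is capped both by the machine's issue
# width and by how much independent work the code actually exposes.
from collections import defaultdict

# (dest, [sources]) -- a short made-up trace with a serial chain in it
trace = [("a", []), ("b", []), ("c", ["a", "b"]),
         ("d", ["c"]), ("e", ["c"]), ("f", ["d", "e"]),
         ("g", []), ("h", ["g"])]

def ipc(issue_width):
    ready_cycle = {}              # cycle each result becomes available
    issued = defaultdict(int)     # instructions issued per cycle
    for dest, srcs in trace:
        earliest = max((ready_cycle[s] + 1 for s in srcs), default=0)
        while issued[earliest] >= issue_width:   # issue slots exhausted
            earliest += 1
        issued[earliest] += 1
        ready_cycle[dest] = earliest
    return len(trace) / (max(ready_cycle.values()) + 1)

for width in (1, 2, 4, 8):
    print(f"issue width {width}: IPC = {ipc(width):.2f}")
```

For this trace the IPC tops out at 2 no matter how wide the machine gets: that's the "finding parallelism" wall, and exactly the gap SMT tries to fill with a second thread's instructions.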
Re:Why only two threads per core? (Score:3, Insightful)
Re:Why only two threads per core? (Score:2)
Limited resources run out, hence four (independent) threads running in parallel cannot write to the RF or fetch from memory concurrently. If your parallelism involves many different types of operations, it's much easier.
I suppose my original comment was worded badly -- being *able to* HARNESS the inherent (independent) parallelism with the resources at hand is the key, you are correct.
Re:Why only two threads per core? (Score:2)
SunRay servers comes to mind, where there are lots of single-threaded users sharing a system.
In Solaris, for example, every process gets a kernel thread, and every process thread gets a kernel thread. On my workstation, right now, just running CDE and a few apps gets reported as 189 light-weight processes (essentially threads). Having a system shared by 1000 users could result in over 100,000 threads with approximately 100
Re:Why only two threads per core? (Score:3, Insightful)
The way this worked on the aforementioned MTA machine is th
Re:Why only two threads per core? (Score:2)
Re:Why only two threads per core? (Score:2)
Re:Why only two threads per core? (Score:2)
It is an older concept (20 years or maybe 30!); look up barrel processors sometime. I'm pretty sure the MTA executed one thread per CPU per cycle with no penalty for switching between threads on different cycles. It would switch threads any time a load was issued, any time the store buffer was full and a store was issued, and after X cycles. The resources you need for an MTA thread would be more
So, despite being lower voltage/MIPS... (Score:5, Interesting)
Re:So, despite being lower voltage/MIPS... (Score:3, Insightful)
It's a package of Intel wireless, an Intel CPU, and some other stuff.
I *know* "Centrino is not a chip" (Score:1)
My point is, this low-voltage thing was a non-issue before Transmeta came along. Intel just told everyone to "put bigger fans" in their laptops and shut up. I've got this Dell with seriously huge fans, and it gets HOT (but it's pretty durn fast, has a big screen and built-in DVD/CD-RW). I don't need lo
memory and processor watts not the same (Score:5, Interesting)
Also in the article, the author suggests that processors spend most of their time waiting on loads, and then argues that since the code-morphing approach means more instruction fetches, the Efficeon processor will be spending disproportionately more time on loads. Then, after this assertion, he admits that he does not know *where* the translated Efficeon code is held. Might it be in one-cycle-accessible L1 cache? That point is conveniently sidestepped. He does not understand under what circumstances the profiling takes place, although he regurgitates the sales pitch nicely. He argues that transistors hold the translated code (trying to argue against the transistors-for-software tradeoff) but then does not realize that transistors in memory do not equate to transistors in logic (neither in power, as they are not cycled as frequently, nor in speed characteristics).
In all, I find the author's treatment of the Transmeta architecture sophomoric, and, after finding that section lacking, I left the rest of the article unread. Your mileage may vary.
A very Good point (Score:2)
Re:memory and processor watts not the same (Score:5, Informative)
I neither suggest nor imply anything this simplistic. In fact, I go to great pains to show how complicated the whole power picture is for Efficeon.
"This belies a deep misunderstanding of power consumption in digital systems, as readily evidences by the fact that modern non-Transmeta processers dissipate multiple tens of Watts of power (often nearly 100W) and a full complement of memory (4G, in modern machines) dissipates a few Watts at most."
Er... you do realize, don't you, that comparing Efficeon to a 100W processor is not only unfair but stupid, and I didn't do it anywhere in the article. A more appropriate comparison is Centrino, which approaches Efficeon in MIPS/Watt without any help at all from any kind of CMS software. I think that you might be the one who needs to learn a bit more about digital systems.
"Also in the article, the author suggests that processors spend most of their time wating on loads, and then argues that since the code-morphing approach means more instruction fetches, the Efficion processor will be spending disproportionatly more time on loads. Then, after this assertion, he admits that he does not know *where* the translated Efficion code is held. Might it be in one-cycle-accessible L1 cache? "
No, it is most certainly not all stored in L1. TM claimed that the original CMS software that came with Crusoe took up about 16MB of RAM, and that this was paged in from a flash module on boot. What I'm not 100% certain of are the exact specs for Efficeon, but I've assumed in this article that they're similar. This is a reasonable assumption, especially given the fact that the new version of CMS contains significant enhancements and is unlikely to be smaller. In fact, it's much more likely to be larger than the original 16MB CMS footprint, especially given that DRAM modules have increased in speed and decreased in cost/MB, which gives TM more headroom and flexibility to increase the code size a bit.
"That point is conveniently sidestepped. He does not understand under what circumstances the profiling takes place, although he regurgitates the sales pitch nicely. He argues that transistors hold the translated code (trying to argue against the transistors-for-software tradeoff) but then does not realize that transistors in memory do not equate transistors in logic (neither in power, as they are not cycled as frequently, nor in speed characteristics)."
Of course I know that transistors in memory are not the same as transistors on the CPU. My point, though, is that they're still not "free" in terms of power draw, and that it also costs power both to page CMS into RAM and to move it from RAM to the L1. And even having pointed that out, I still don't claim that this cancels out all the power-saving advantages of TM's approach.
As far as relying on the sales pitch for info on CMS's profiling, well, TM doesn't exactly release the source for CMS, nor do they make a detailed user manual for it available to the public. As their core technology, details about CMS are highly guarded, and the only information that either you or I will likely ever have access to about it is whatever they put in the sales pitch. So I, like everyone else, must draw inferences from their presentations and do the best I can.
Anyway, if you don't like the article, that's fine. But being a hater about it just makes you look lame.
Re:memory and processor watts not the same (Score:4, Insightful)
It is true that CMS has a cost in terms of RAM usage, but this does not necessarily translate into extra load latency. As I understand it, the trick is to exploit the fact that common code spends most of its time in a very small portion of the code (90%/10% or whatever). Much should be gained by heavily optimising these "inner loops", which translates into reduced load latency, as fewer instructions are executed in total. The cost of the four optimisation passes, or JIT compilation, should drown in the millions of times these inner loops are executed.
You could say that it is a complete waste of transistors and power to have many transistors performing the same optimisation over and over again in conventional processors. These hardware-based optimisations will also never be as efficient, since their scope is limited.
There are some interesting perspectives to the Transmeta approach as well. You state that the POWER5, UltraSparc IV, and Prescott tackle the load-latency problem by using SMT to fill pipeline bubbles from data stalls and thereby increase utilisation of the execution units. This should be possible for Transmeta as well, by upgrading their CMS to emulate two logical processors instead of one.
But you are right! A complete theoretical comparison is impossible - only real world experience will show...
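A minimal sketch of the 90/10 argument above (Python; the threshold, block names, and bookkeeping are invented for illustration and are not how CMS actually works internally):

```python
# A dynamic translator can afford expensive optimisation because it
# only pays for it on the few blocks that run hot.
from collections import Counter

HOT_THRESHOLD = 1000   # executions before a block is re-optimised (assumed)

exec_count = Counter()
optimised = set()

def execute_block(block_id):
    exec_count[block_id] += 1
    if block_id not in optimised and exec_count[block_id] >= HOT_THRESHOLD:
        optimised.add(block_id)   # pay the one-time optimisation cost here

# Toy workload: one inner loop dominates, many blocks run once.
for _ in range(1_000_000):
    execute_block("inner_loop")
for i in range(500):
    execute_block(f"cold_block_{i}")

hot = sum(c for b, c in exec_count.items() if b in optimised)
print(f"optimised {len(optimised)} of {len(exec_count)} blocks, "
      f"covering {hot / sum(exec_count.values()):.1%} of executions")
```

The one optimised block covers essentially all executed instructions, which is why the JIT cost can amortise to near zero on loop-heavy code.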
Re:memory and processor watts not the same (Score:2)
CMS and its translation buffer takes a small fraction of the available RAM, and all of RAM takes a small fraction of the power the CPU does, so we're talking about a fraction of a fraction. Translations live in RAM, btw, and are cached like any other executable code, when needed.
Re:as does using the word "hater" (Score:1, Funny)
Re:memory and processor watts not the same (Score:2)
The Pentium 4 has upwards of 55 million transistors on the die. SDRAM needs 1 transistor and 1 capacitor per bit; for 8x1024x1024x1024 bits
Re:memory and processor watts not the same (Score:3, Insightful)
DDR SDRAM does not "run" at around 400MHz - the frequency of the data bus is 400MHz. As you state yourself, the power usage is very dependent on the usage pattern, and only very few memory cells actually change state during each write (up to 8 for an 8-bit RAM). I would guess that leakage and discharge of the capacitor cells is a significant factor, which you totally ignore.
In a processor on the other hand, a lot of transistors ch
Re:memory and processor watts not the same (Score:2)
Yes, except that the fraction of transistors switching in the two at any given moment is vastly different: in the P4 it will be reasonably high, in memory chips, it will be vanishingly low. Thus your analysis is inaccurate at best and potentially misleading at worst.
Think of the following empirical observations: a modern processor cannot run without a heatsink without going into thermal failure.
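The activity-factor point can be put in rough numbers with the usual dynamic-power formula P = a*C*V^2*f (Python; every value below is an illustrative assumption, not a measured figure):

```python
# Dynamic power scales with the fraction of transistors actually
# switching each cycle (a). DRAM has far more transistors than a CPU,
# but almost none of them switch on any given cycle.
def dynamic_power(transistors, activity, cap_per_t, volts, freq_hz):
    return activity * transistors * cap_per_t * volts**2 * freq_hz

CAP = 1e-15   # ~1 fF effective switched capacitance per transistor (assumed)

# ~55M transistors, ~10% switching, 1.5V, 2GHz (all assumed)
cpu = dynamic_power(55e6, 0.10, CAP, 1.5, 2e9)
# ~8.6 billion transistors in 1GB of DRAM, tiny activity factor (assumed)
ram = dynamic_power(8.6e9, 1e-4, CAP, 2.5, 200e6)
print(f"CPU-ish:  {cpu:.1f} W")   # tens of watts
print(f"DRAM-ish: {ram:.1f} W")   # ~a watt, despite ~150x the transistors
```

The transistor counts differ by two orders of magnitude in DRAM's favor, yet the activity factor differs by three or four in the CPU's, which is why the processor needs a heatsink and the DIMMs don't.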
Trying to read the FA (Score:1)
So, I gave up. I have no clue what the advert was for, it had a sort of minimalist man icon in it, and lots of flashing colours - that's all I know. I do however know a lot more about advertising than the idiots who thought that one up.
Simon.
I need some explanations (Score:1, Interesting)
If the number of decoded instructions is higher, then - the CPU being superscalar - the probability of having all pipelines working grows, which means that the ILP is also going up.
Of course the ILP depends on the compiler quality and the program code itself, but having a good parallelism capacity in the CPU is also a key factor.
Who is this Arse ... (Score:2)
Efficeon has integrated northbridge (Score:2, Redundant)
Efficeon allows for a low chip count design. That could mean a smaller and more reliable laptop design.
Guess what??? (Score:5, Funny)
All my questions were answered so I have nothing to say.
Why is Transmeta still in the picture? (Score:1, Insightful)
Until Transmeta becomes a real contender, let's just keep out of the Linux biases and concentrate on the real contenders.
My prediction is that if they don't produce a real h
Perhaps because (Score:3, Insightful)
Code has to be loaded anyway (Score:3, Insightful)
You make a rather big deal about Transmeta needing to run all x86 code through a "code morpher" (dynamic recompiler, actually), and come up with a decently large set of conclusions based on it.
What's the big deal? No processor executes raw x86 anymore. Everything translates into an internal microcode that bears little resemblance to the original asm. Of course, normal chips have hardware-accelerated microcode translators, whereas Transmeta must recode in software -- but Transmeta's entire architecture was designed from day one to do that, and conceivably they have more context available to do recoding by involving main memory in the process.
And what is it with you neglecting the equivalence of main memory? Yes, transistors are necessary to store the translated program. They're also necessary to store the original one -- the Mozilla client I'm presently tapping away inside sure as hell doesn't fit in L1 on my C3! Outside of a small static penalty on load, and a smaller dynamic penalty from ongoing profiling, you can't blame performance on the fact that software needs to be in RAM. Software always needs to be in RAM.
Don't get me wrong -- Transmeta's a performance dog, and everyone's known that since day one. But I think it's reasonable to say the cause is mostly one of attention -- every man-hour they threw into allowing the system to emulate x86 took away from adding pipelines, increasing clock rates, tweaking caches, etc. In other words, yes, it's a feat that they got the code to work, but you don't need to blame the feat for the quality of the work -- they simply did a lot of work nobody else had to waste time on, and fell behind because of it.
Much easier explanation. Might even be true.
Yours Truly,
Dan Kaminsky
DoxPara Research
http://www.doxpara.com
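A toy sketch of the translation-cache idea from the post above (Python; the structure is entirely invented, since CMS's real internals aren't public): the expensive morphing step is paid once per block, then amortised over every re-execution.

```python
# A dynamic recompiler keeps translated blocks in a cache in main
# memory, so translation is a one-time-ish cost per block, not a
# per-execution one.
translation_cache = {}   # x86 block address -> native code (stand-in)
translate_calls = 0

def translate(block_addr):
    global translate_calls
    translate_calls += 1              # the expensive software step
    return f"native({block_addr})"    # stand-in for emitted VLIW code

def run_block(block_addr):
    if block_addr not in translation_cache:       # miss: morph the block
        translation_cache[block_addr] = translate(block_addr)
    return translation_cache[block_addr]          # hit: just execute

for _ in range(1_000_000):   # a hot loop re-runs the same block
    run_block(0x401000)
print(f"1,000,000 executions, {translate_calls} translation(s)")
```

This is the sense in which the translation penalty is "lost in the backwash" for loop-heavy code: the ongoing cost is a cache lookup, not a re-translation.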
Re:Code has to be loaded anyway (Score:4, Informative)
While Archie is undoubtedly an ugly, drunk screw-up, he's really a droplet in the ocean of effort that goes into a competitive CPU implementation. Yeah, we've got lots of code to deal with him, and he's an ongoing source of work, but not all that much code, nor that much work. If Archie were really such a terrible guy, it wouldn't be possible for Intel and AMD to be eating so many RISC vendors' lunches.
Mike Johnson, the lead x86 designer at AMD, probably put it most succinctly when he said, "The x86 isn't all that complex -- it just doesn't make a lot of sense." It's peculiar all right, but not so peculiar that it can explain Transmeta's failure to be performance competitive. From speaking with Transmetans, I get the strong impression that they got bogged down because making a high performance dynamic translation system is ridiculously hard, rather than, say, because they just couldn't get the growdown segment descriptors right.
Re:Code has to be loaded anyway (Score:2)
Except for really tight inner loops, you're always flying off to system RAM for one thing or another. While there's a static penalty because of code morphing, I'd wager it's a "lost in the backwash" effect -- oh, so a given stream of ops took a few extra million cycles to start cranking. BFD; we've got half a billion of 'em per second. The real question is why we don't have a f
Sweeping generalizations (Score:2, Interesting)
I like this assessment. Forget about Moore's Law as a measure of our progress; latency and throughput are far more important than processing power.
Computers used to be for processing information; these days, most people use
We need an Ars Technica logo here. (Score:2)
I wonder what would be better... (Score:2)
By the way, I would like to have a computer that has SRAM only and a bandwidth of 100 GB/sec... Is that possible with current technology?
Re:Great Innovation (Score:1)
Is that not the saddest form of life you've ever heard of?