Intel Says to Prepare For "Thousands of Cores"
Impy the Impiuos Imp writes to tell us that in a recent statement Intel has revealed their plans for the future, and it goes well beyond the traditional processor model. Suggesting developers start thinking about tens, hundreds, or even thousands of cores, it seems Intel is pushing for a massive evolution in the way processing is handled. "Now, however, Intel is increasingly 'discussing how to scale performance to core counts that we aren't yet shipping...Dozens, hundreds, and even thousands of cores are not unusual design points around which the conversations meander,' [Anwar Ghuloum, a principal engineer with Intel's Microprocessor Technology Lab] said. He says that the more radical programming path to tap into many processing cores 'presents the "opportunity" for a major refactoring of their code base, including changes in languages, libraries, and engineering methodologies and conventions they've adhered to for (often) most of their software's existence.'"
Not Sure I'm Getting It (Score:5, Insightful)
Good idea (Score:5, Insightful)
It's a good idea... Somewhat the same idea the Cell chip has going for it (and, well, the Phenom X3). You make a product with lots of redundant units, so that when some are bound to fail, the overall impact of failure is much lower.
If there are 1000 cores on a chip and 100 go bad, you're still only losing a *maximum* of 10% of performance, versus when you have 2 or 4 cores and 1 or 2 go bad, where the performance impact is essentially 50%. It brings costs down because yields go up dramatically.
Re:Memory bandwidth? (Score:2, Insightful)
I would assume that if you have enough transistors for thousands of cores, you will be able to put on a lot of SRAM cache as well - just drop a few hundred or thousand cores. You won't be able to integrate DRAM, since it requires a different process, but SRAM should integrate easily enough.
Re:Useless (Score:5, Insightful)
Re:Not Sure I'm Getting It (Score:3, Insightful)
Re:Not Sure I'm Getting It (Score:5, Insightful)
As a software engineer, I wonder the same thing.
Put simply, the majority of code simply doesn't parallelize well. You can break out a few major portions of it to run as their own threads, but for the most part, programs either sit around and wait for the user, or sit around and wait for hardware resources.
Within that, only those programs that wait for a particular hardware resource - CPU time - even have the potential to benefit from more cores... And while a lot of those might split well into a few threads, most will not scale (without a complete rewrite to choose entirely different algorithms - if they even exist to accomplish the intended purpose) to more than a handful of cores.
Re:We all saw it coming anyway (Score:5, Insightful)
"So whether programmers find this move acceptable or not is irrelevant because this path is probably the only way to design faster CPU:s once we've hit the nanometer wall."
I guess you should put "faster" in quotes.
In any case, it is absolutely relevant what programmers think since any performance improvements that customers actually experience is dependent on our code.
Historically a primary reason to buy a new computer is because a faster system makes legacy applications run faster. To a large extent this won't be true with a new multicore PC. So why would people buy them?
That's why Intel wants us to redesign our software - so that in the future their customers will still have a reason to buy a new PC with Intel Inside.
Re:Memory bandwidth? (Score:3, Insightful)
Re:Not Sure I'm Getting It (Score:4, Insightful)
Re:Not Sure I'm Getting It (Score:5, Insightful)
That is what most current processors do and use branch prediction for. Even if you have a thousand cores, that's only 10 binary decisions ahead. You need to guess really well very often to keep your cores busy instead of syncing. Also, the further you're executing ahead, the more ultimately useless calculations are made, which is what drives power consumption up in long pipeline cores (which you're essentially proposing).
In reality, parallelism is more likely to be found by better compilers. Programmers will have to be more specific about the kind of loops they want: do you just need something performed on every item in an array, or is order important? No more mindless for-loops for processes that aren't inherently sequential.
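The loop distinction above can be sketched in Python - a toy illustration of the idea, not the compiler machinery the comment envisions; the `brighten` function and the pixel list are invented for the example:

```python
from concurrent.futures import ThreadPoolExecutor

def brighten(pixel):
    # Pure, order-independent work: each result depends only on its own input.
    return min(pixel + 40, 255)

pixels = [10, 200, 128, 255, 0]

# The "mindless for-loop": implicitly ordered, even though order is irrelevant.
sequential = [brighten(p) for p in pixels]

# The same loop expressed as a map: the runtime may fan it out over cores.
# (In CPython, CPU-bound work would need a process pool because of the GIL;
# a thread pool keeps this sketch simple and portable.)
with ThreadPoolExecutor(max_workers=4) as pool:
    parallel = list(pool.map(brighten, pixels))

assert parallel == sequential  # map() preserves result order
```

Because each iteration is independent, the runtime is free to schedule them across cores; an ordered for-loop promises more sequencing than the algorithm actually needs.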
Re:Disagreement about this trend (Score:5, Insightful)
My guess is 4 cores in 2008, 4 cores in 2009, moving to 8 cores through 2010. We may move to a new uber-core model once the software catches up, more like 6-8 years than 2-4. I'm positive we won't "max out" at 64 cores, because we're going to hit a per-core speed limit much more quickly than we hit a number-of-cores limit.
It's all changing too fast (Score:3, Insightful)
I've only been programming professionally for 3 years now, but already I'm shaking in my boots over having to rethink and relearn the way I've done things to accommodate these massively parallel architectures. I can't imagine how scared the old-timers of 20, 30, or more years must be. Or maybe the good ones who are still hacking decades later have already had to deal with paradigm shifts and aren't scared at all?
Re:Not Sure I'm Getting It (Score:5, Insightful)
From a practical standpoint, Intel is right that we need vastly better developer tools, and that most things that require ridiculous amounts of compute time can be parallelized if you put some effort into it.
Re:Not Sure I'm Getting It (Score:5, Insightful)
I concur, furthermore I'd like to see one core per pixel, that would certainly solve your high-end gaming issues.
Re:Memory bandwidth? (Score:5, Insightful)
Memory would have to be completely redefined. Currently, you have one memory bank that is effectively accessed serially.
Yes, in Intel land. AMD has this thing called NUMA. What do you think "HyperTransport" means?
Re:Disagreement about this trend (Score:5, Insightful)
Web applications are becoming more AJAX'y all the time, and they are not sequential at all. Watching a video while another tab checks my Gmail is a parallel task. All indications are that people want to consume more and more media on their computers. Things like the MLB mosaic allow you to watch four games at once.
Have you ever listened to a song through your computer while coding, running an email program, and running an instant messaging program? There are four highly parallelizable tasks right there. Not compute intensive enough for you? Imagine the song compressed with a new codec that is twice as efficient in terms of size but twice as compute intensive. Imagine the email program indexing your email for efficient search, running algorithms to assess the email's importance to you, and virus checking new deliveries. Imagine your code editor doing on the fly analysis of what you are coding, and making suggestions.
"Normal" users are doing more and more with computers as well. Now that fast computers are cheap, people who never edited video or photos are doing it. If you want a significant market besides gamers who need more cores, it is people making videos, especially HD videos. Sure, my Grandmother isn't going to be doing this, but I do, and I'm sure my children will do it even more.
And don't forget about virus writers. They need a few cores to run on as well!
Computer power keeps its steady progress higher, and we keep finding interesting things to do with it all. I don't see that stopping, so I don't see a limit to the number of cores people will need.
Re:Memory bandwidth? (Score:2, Insightful)
You need a basic course in TTL. No, they haven't figured this out, and putting address decoding on the chip makes very little difference when you scale. They also haven't figured out communication between cores. We had thousands of CPUs rigged up with transputers back in the 80s. It was a nightmare, and near useless for just about everything. We had to use serial data to keep things sane.
The more logic you have, the longer the signal path. The longer the signal path, the harder it is to sync on the clock pulse. And the higher the clock frequency, the less the signal looks like a square wave; it starts to look like a ramp.
There are huge problems with scaling, whether it's speed or cores. If Intel wants us to have all these cores, their engineers are going to have to overcome the same problems parallel programming has had for 30 years or more.
Re:Not Sure I'm Getting It (Score:5, Insightful)
Re:Disagreement about this trend (Score:3, Insightful)
I KNOW it is so very often cited, but if ever there was a time to mention the "5 computers in the whole world" quote, it is this. In fact, I would dare say that is the whole point of this push by Intel: trying to get people (programmers) used to the thought of having so many parallel CPUs in a home computer.
Sure, from where we stand now, 64 seems like a lot but maybe a core for nearly each pixel on my screen makes sense, has real value to add. Or how about just flat-out smarter computers, something which might happen by simulating 100 neurons per core. As far as I understand it, speech recognition can always use more power. Let me put it differently:
Games requiring a lot of computing power makes sense to you in the future, but not elsewhere. The same would have been said about a high-end gaming rig just a handful of years ago, and yet a low-end PC today has amazing graphics, amazing everything, compared to what things were just 10 years ago. And it gets used, much of the time. If we have the power, we will use it. Games just push the envelope further, sooner, but they don't go anywhere that we all wouldn't like to go anyway.
I cannot think of a single task in a game that I would not want to be able to do in real life. Games are about living an idealized life, of some sort, inside your computer. The next step is bringing it out here, to the rest of the world.
Re:Not Sure I'm Getting It (Score:4, Insightful)
Obviously just adding more cores does little to speed up individual sequential processes, but it does help with multitasking, which is what I really think is the "killer app" for multi-core processors.
Back in the late 90's (it doesn't feel like "back in..." yet, but I'm willing to admit it was about a decade ago) I decided to build a computer with an Abit BP6 motherboard, two Celeron processors, and lots of RAM instead of a single higher-end processor, because I wanted to be able to multitask properly. My gamer friends mocked me for choosing Celeron processors, but for the price of a single-processor system I got one capable of running several "normal" apps plus one with heavy CPU usage without slowing down, and the extra RAM helped too. (I saw lots of people back then go for 128 MB of RAM and a faster CPU instead of "wasting" their money on RAM, and then curse their computers for being slow once they started swapping.) There was also the upside of having Windows 2000 run as fast on my computer as Windows 98 did on my friends' computers...
/Mikael
Re:Not Sure I'm Getting It (Score:4, Insightful)
Are you crazy? Context switches are the slowdown in multitasking OSes.
Unfortunately, multitasking OSes are not the slowdown in most tasks, exceptions noted of course.
Re:Not Sure I'm Getting It (Score:5, Insightful)
At the moment, I'm looking at Slashdot in Firefox, while listening to an mp3. I'm only using two out of my four cores, and I have 3% CPU usage.
Maybe when I post this, I might use a third core for a little while, but how many cores can I actually usefully use?
I can break a password protected Excel file in 30 hours max with this computer, and a 10000 core chip might reduce this to 43 seconds, but other than that, what difference is it going to make?
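For what it's worth, the cracking estimate scales the way it does because the search is embarrassingly parallel: the keyspace splits into slices that never need to talk to each other. A toy sketch (the numeric "passwords" and the `check` function are stand-ins for the real Excel key test):

```python
from concurrent.futures import ThreadPoolExecutor

SECRET = 7345  # stand-in for the unknown key; a real attack wouldn't know this

def check(candidate):
    # Stand-in for the expensive "try this password against the file" test.
    return candidate == SECRET

def search(bounds):
    # Each worker scans its own slice of the keyspace, fully independently.
    lo, hi = bounds
    for c in range(lo, hi):
        if check(c):
            return c
    return None

def crack(keyspace, workers=4):
    # Partition [0, keyspace) into one contiguous slice per worker.
    step = keyspace // workers
    slices = [(i * step, keyspace if i == workers - 1 else (i + 1) * step)
              for i in range(workers)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for hit in pool.map(search, slices):
            if hit is not None:
                return hit
    return None
```

Double the workers (cores) and each slice halves, which is exactly why 30 hours on 4 cores becomes under a minute on 10,000.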
you mean SGI (Score:5, Insightful)
SGI and/or Cray were using NUMA a decade ago.
Re:It's all changing too fast (Score:5, Insightful)
My dad's been programming for decades, and he's much more used to paradigm shifts than I am. His first programming job was translating assembly from one architecture to another, and now he's a proficient web developer. He understands concurrency and keeps up to date on new developments.
I'm reminded of an anecdote told to me during a presentation. The presenter had been introducing a new technology, and one man had a concern: "I've just worked hard to learn the previous technology. Can you promise me that, if I learn this one, it will be the last one I ever have to learn?" The presenter replied, "I can't promise you that, but I can promise you that you're in the wrong profession."
Re:Not Sure I'm Getting It (Score:5, Insightful)
Before, having 1 core was enough, and having 512MB of RAM was enough for most consumers. Computing power grows, and software developers make use of that additional power. However, this will mainly affect the gaming industry.
Re:Not Sure I'm Getting It (Score:4, Insightful)
Uh, last time I checked, Python had a single global interpreter lock per process, which made it unsuitable for heavily multithreaded programs. Java would be a better example of a scalable and multithread-aware language.
Difference (Score:3, Insightful)
What's different this time may be that nobody else has anything better. Last time, AMD64 was the easier solution, and it clobbered Itanium. Can AMD (or anybody) simply choose to keep making single cores faster, or is multi-core the way CPUs really must go from here?
Re:Not Sure I'm Getting It (Score:3, Insightful)
I disagree. Having the compiler analyze loops to find out if they are trivially parallelizable is easy, there's little need to change the language.
On the other hand, a language that was really designed for kilocores or megacores would be radically different from most modern languages, adding a few extra (un)loop-statements wouldn't do. Functional languages are a good bet. When everything is side-effect-free, there's no good reason why all of it can't be executed in parallel.
But maybe we need even more abstraction. And more time. It took quite a while after the invention of the programmable computer for someone to invent FORTRAN. And we still program in something resembling FORTRAN. Maybe what we really need are actual many-core computers so that someone really smart will use them, and finally figure out a way to program them that's practical. That's where I'll put my money. Wait and see!
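The side-effect-free point can be illustrated with a small sketch (Python standing in here for a real functional language): because each call depends only on its input and touches no shared state, results can be harvested in whatever order the cores happen to finish, and the answer still matches the sequential one.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def square(n):
    # Side-effect-free: the result depends only on the argument and no
    # shared state is touched, so any core may run any call at any time.
    return n * n

data = list(range(8))

with ThreadPoolExecutor(max_workers=4) as pool:
    futures = {pool.submit(square, n): n for n in data}
    # Harvest results in whatever order the cores happen to finish...
    unordered = {futures[f]: f.result() for f in as_completed(futures)}

# ...and the answer still matches a plain sequential loop exactly.
ordered = [unordered[n] for n in data]
assert ordered == [square(n) for n in data]
```

With side effects in `square`, no scheduler could make that guarantee; without them, the parallelism comes for free.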
Re:Not Sure I'm Getting It (Score:5, Insightful)
This is, IMHO, the wrong question to be asking. Asking how current tasks will be optimized to take advantage of future hardware makes the fundamentally flawed assumption that the current tasks will be what's considered important once we have this kind of hardware.
But the history of computers has shown that the "if you build it, they will come" philosophy applies to the tasks that people end up wanting to accomplish. It's been seen time and again that new ways of using computers wait until we've hit a certain performance threshold, whether it's CPU, memory, bandwidth, disk space, video resolution, or whatever, and then become the things we need our computers to do.
Take, for instance, the huge success of mp3's. There was a time not so long ago when people were limited to playing music off a physical CD. This wasn't because there was no desire amongst computer users to listen to digital files that could be stored locally or streamed off the internet. It was because computer users did not know yet that they had the desire to do it. But technology advanced to the point where a) processors became fast enough to decode mp3's in real time without using the whole CPU and b) hard drives grew to the point where we had the capacity to store files that are 10% of the size of the files on the CD.
Similarly, it's likely that when we reach the point where we have hundreds or thousands of cores, new tasks will emerge that take advantage of the new capabilities of the hardware. It may be that those tasks are limited in some other way by one of the other components we use or by the as yet non-existent status of some new component, but it's only important that multiple cores play a part in enabling the new task.
In the near term, you can imagine a whole host of applications that would become possible when you get to the point where the average computer can do real-time H.264 encoding without affecting overall system performance. I won't guess at what might be popular further down the road, but there will be people who will think of something to do with those extra cores. And, in hindsight, we'll see the proliferation of cores as enabling our current computer-using behavior.
Re:Not Sure I'm Getting It (Score:5, Insightful)
Why wouldn't each core have its own cache? It only needs to cache what it needs for its job.
so, Intel made risc passé... (Score:3, Insightful)
and now they're bringing it back?
we all learned how 1000 cores don't matter if each core can only process a simplified instruction set, compared to 2 cores that can handle more data per thread.
this is basic computer design here people.
Re:Not Sure I'm Getting It (Score:3, Insightful)
"Unfortunately all this is going to lead to bus and memory bandwidth contention, "
Good. Current bus needs to be redone.
Re:Not Sure I'm Getting It (Score:3, Insightful)
except when running an algorithm on 1 core, you can have 900 cores running different outputs based on the probability of a different outcome of the previous part of the process.
When it is actually determined, kill the 899 that were incorrect. In fact, what would probably happen is they would all branch differently, so you might kill 400, then after running for a bit, 200, and so on. This would dramatically decrease the time it takes to solve it.
In fact, for some application getting 'close enough' will do.
Example:
Chess. I make my first pawn move. 18 processes start up on separate cores, each one calculating the next 5 steps that are possible. When the next move is made, it kills the processes that didn't calculate 5 steps from that move.
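A minimal Python sketch of that scheme - everything here (the move names, the `evaluate_line` stand-in) is hypothetical; a real engine would run actual search on each core:

```python
from concurrent.futures import ThreadPoolExecutor

def evaluate_line(move):
    # Stand-in for "search five plies deep from this reply".
    return f"best line after {move}"

legal_replies = [f"reply-{i}" for i in range(18)]

pool = ThreadPoolExecutor(max_workers=18)
# Speculate: start analysing every possible reply before the opponent moves.
futures = {move: pool.submit(evaluate_line, move) for move in legal_replies}

def opponent_moved(actual):
    # The opponent chose: kill the 17 losing speculations (cancel() is a
    # no-op for work that already finished, which is fine here) and harvest
    # the branch that was computed "for free" while we were waiting.
    for move, fut in futures.items():
        if move != actual:
            fut.cancel()
    return futures[actual].result()

line = opponent_moved("reply-7")
pool.shutdown(wait=False)
```

The win is that the analysis of the chosen branch was already underway (or done) before the opponent's clock stopped; the cost is the 17 discarded computations, which is exactly the power-consumption trade-off mentioned above.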
Re:Hey remember the 1980's and the Amiga? (Score:2, Insightful)
How is that back to the Amiga?
The PC platform hit Amiga levels well over a decade and a half ago, with dedicated graphics hardware, dedicated audio hardware, dedicated network hardware, a numerical coprocessor, and so on. People need to stop claiming every new change finally brings things back to the Amiga. That argument is terribly old.
And yeah I was into the Amiga and Atari ST and Mac Classic back in those days, but then I moved on.
Re:It's all changing too fast (Score:3, Insightful)
We're not scared. All the good ones spit into their hands, brace themselves, and say "Bring it on."
Any old-timers actually scared need to leave, and don't let your beard get caught in the door on the way out, wuss.
Don't worry about relearning; by the time this hits the market, tools will have been written, and there will have been a lot of documentation.
It's going to be a great step in computing... Or it will get killed because the tools weren't developed fast enough.
Re:Not Sure I'm Getting It (Score:3, Insightful)
You speed it up by rewriting sequential algorithms to run in parallel. It is surprising how many algorithms you would swear are inherently sequential can be rewritten to operate in parallel. Beyond that, you can have cores engaged in speculative execution, where the results may or may not be used. I could imagine a spell checker where multiple words and sentence fragments are dispatched to numerous cores for spelling/grammar checking. A compiler could devote a separate core to compiling/linking/optimizing each individual module or function.
Programmers don't think massively parallel and most programming languages (excluding hardware design languages such as Verilog/VHDL) are sequential in nature.
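The spell-checker idea above is easy to sketch: each word is an independent unit of work, so a pool of workers can check them in parallel (the five-word dictionary and the `check_word` helper are invented for the example):

```python
from concurrent.futures import ThreadPoolExecutor

DICTIONARY = {"the", "quick", "brown", "fox", "jumps"}  # toy word list

def check_word(word):
    # Each word is an independent unit of work, so any core can take it.
    return (word, word.lower() in DICTIONARY)

text = "The quikc brown fox jmups"
with ThreadPoolExecutor(max_workers=4) as pool:
    # Dispatch every word to the pool; results come back in input order.
    results = dict(pool.map(check_word, text.split()))

misspelled = [w for w, ok in results.items() if not ok]
```

The same shape fits the compiler example: swap words for modules and `check_word` for a compile step, and each core gets its own translation unit.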
Re:Not Sure I'm Getting It (Score:3, Insightful)
You still choke on the Memory Wall [wikipedia.org]; you have to feed all those cores data, and you're going a few orders of magnitude slower than the CPU cores. Increasing bandwidth on the front side bus doesn't help, as you have to increase bandwidth and decrease latency. You compound this when you have many cores/sockets doing backward cache flushes to RAM.
Even if you've got a hypertransport link (as Intel doesn't, they push bits on the front side bus between sockets, IIRC) to the north bridge for each socket, you've still only got a single north bridge. You're bottlenecked again. OK, use two front side buses with an interlink. Now we're back to coherency problems, but at two points. At some point, you have to either give each socket its own RAM bank (NUMA) and isolate data (and make CPU migration for tasks take an extra hit) or figure out how to perfectly isolate and stripe your data over multiple paths to a single backing store.
Re:Yeah, right. (Score:3, Insightful)
The notion that some revolutionary compiler or IDE is going to solve this problem is just wrong. Tell it to Itanic, which was based on exactly these assumptions and failed miserably because of them.
With Itanium, they were trying to say compiler improvements could handle it invisibly, with no work from the application programmers. Taking advantage of more than two cores (since one can take care of other programs that would have slowed down your app) is going to take conscious thought about what can and can't be parallel. Taking advantage of more than a handful is going to take more fundamental shifts in how we program. They're asking a lot more this time.
On the other hand, you could easily opt out of Itanium. Now, this is the only way your programs are going to get much future processing improvement. Ever. No matter who you're buying CPUs from.
Re:Not Sure I'm Getting It (Score:5, Insightful)
As a software engineer, I wonder the same thing.
Put simply, the majority of code simply doesn't parallelize well. You can break out a few major portions of it to run as their own threads, but for the most part, programs either sit around and wait for the user, or sit around and wait for hardware resources.
Within that, only those programs that wait for a particular hardware resource - CPU time - even have the potential to benefit from more cores... And while a lot of those might split well into a few threads, most will not scale (without a complete rewrite to choose entirely different algorithms - if they even exist to accomplish the intended purpose) to more than a handful of cores.
As a software engineer you should know that "most code doesn't parallelize" is very different from "most of the code's runtime can't parallelize", as code size and code runtime are substantially different things.
Look at most CPU intensive tasks today and you'll notice they all parallelize very well: archiving/extracting, encoding/decoding (video, audio), 2D and 3D GUI/graphics/animations rendering (not just for games anymore!), indexing and searching indexes, databases in general, and last but not least, image/video and voice recognition.
So, while your very high-level task is sequential, the *services* it calls or implicitly uses (like GUI rendering), and the smaller tasks it performs, actually would make a pretty good use of as many cores as you can throw at them.
This is good news for software engineers like you and me, as we can write mostly serial code and isolate slow tasks into isolated routines that we write once and reuse many times.
Re:Not Sure I'm Getting It (Score:3, Insightful)
Why "before"? I think 512MB RAM / 1 or 2 GHz + a decently speedy hard drive IS enough for most consumers: playing (moderately recent) games (maybe upgrading to a newer $50 video card), playing (moderate) HD, MP3s, browsing sites, any office work using lots of Ajax on FF3.
You know what? You could even (gasp) code on it (maybe not compile Eclipse every 5 minutes, OK), run a small server on it, or transcode videos (maybe 4x more slowly, so you'll end up letting it run overnight instead of 2 hours, from time to time - big deal).
Of course, SOME people might need more. For most of us, 512MB / 1x2GHz is perfectly enough (see the eeePC).
Missing the point (Score:3, Insightful)
CPU clock speeds ran into the brick wall a few years ago. Here is a chart showing CPU clocks from 1993 to 2005. [tomshardware.com]
There have been no major performance improvements from that direction for the last few years, and probably won't be any more without a major breakthrough in semiconductors.
Moore's law is about transistor counts, and shows no real signs of stopping. Every 18 to 24 months, we double the number of transistors on a given wafer/die. The transition to 64-bit CPUs used a generation or two of those extra transistors, but we aren't likely to move to 128 bits soon. We are already pretty deep into the diminishing-returns curve for on-die cache.
What is left to consume those transistors?
More cores. Lots more cores. If you replace your CPU every 2 years, you can pretty much bet that each one you buy for the next decade or so will have twice as many cores as the one it is replacing.
And if developers and compilers get good at managing parallel code (and they have no choice in this), you can expect core counts to go up even faster than doubling every couple of years.
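The doubling claim is easy to make concrete (assuming, purely for illustration, a 4-core starting point in 2008):

```python
# If core counts double every two-year upgrade cycle, a decade looks like:
cores = 4  # hypothetical 2008 starting point
history = []
for year in range(2008, 2019, 2):
    history.append((year, cores))
    cores *= 2
```

Five upgrade cycles later you're at 128 cores per socket - which is exactly why Intel wants the software side thinking about large core counts now, not when the chips ship.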