Intel Says to Prepare For "Thousands of Cores"
Posted by
ScuttleMonkey
on Wed Jul 02, 2008 03:42 PM
from the viva-la-coding-revolucion dept.
Impy the Impiuos Imp writes to tell us that in a recent statement Intel has revealed their plans for the future and it goes well beyond the traditional processor model. Suggesting developers start thinking about tens, hundreds, or even thousands of cores, it seems Intel is pushing for a massive evolution in the way processing is handled. "Now, however, Intel is increasingly 'discussing how to scale performance to core counts that we aren't yet shipping...Dozens, hundreds, and even thousands of cores are not unusual design points around which the conversations meander,' [Anwar Ghuloum, a principal engineer with Intel's Microprocessor Technology Lab] said. He says that the more radical programming path to tap into many processing cores 'presents the "opportunity" for a major refactoring of their code base, including changes in languages, libraries, and engineering methodologies and conventions they've adhered to for (often) most of their software's existence.'"
This discussion has been archived.
No new comments can be posted.
The thing's hollow - it goes on forever (Score:5, Funny)
- and - oh my God - it's full of cores!
Imagine a Beowulf cluster.... (Score:5, Funny)
oh nevermind, what's the point?
Parent
it's.... (Score:5, Funny)
OVER 9000!!!!!!11111one
Parent
Re:The thing's hollow - it goes on forever (Score:5, Funny)
Don't give up! Stay the cores!
Parent
Re:The thing's hollow - it goes on forever (Score:5, Funny)
No, not quite. It's CORES all the way down!
Parent
Re:The thing's hollow - it goes on forever (Score:5, Funny)
Parent
Not Sure I'm Getting It (Score:5, Insightful)
Re:Not Sure I'm Getting It (Score:5, Informative)
Then you take the tasks that can be broken up over multiple cores (Ray Tracing anyone?) and fill the rest of your cores with that.
Parent
Re:Not Sure I'm Getting It (Score:5, Insightful)
From a practical standpoint, Intel is right that we need vastly better developer tools and that most things that require ridiculous amounts of compute time can be parallelized if you put some effort into it.
Parent
Re:Not Sure I'm Getting It (Score:5, Informative)
True but misleading. The major cost of task switching is a hardware-derived one. It's the cost of blowing caches. The swapping of CPU state and such is fairly small by comparison, and the cost of blowing caches is only going up.
Parent
Re:Not Sure I'm Getting It (Score:5, Interesting)
Now that 64-bit processors are so common, perhaps operating systems can spare some virtual address space for performance benefits.
The OPAL operating system [washington.edu] was a University of Washington research project from the 1990s. OPAL uses a single address space for all processes. Unlike Windows 3.1, OPAL still has memory protection and every process (or "protection domain") has its own pages. The benefit of sharing a single address space is that you don't need to flush the cache (because the virtual-to-physical address mappings do not change when you context switch). Also, pointers can be shared between processes because their addresses are globally unique.
Parent
Re:Not Sure I'm Getting It (Score:5, Insightful)
Why wouldn't each core have its own cache? It only needs to cache what it needs for its job.
Parent
Re:Not Sure I'm Getting It (Score:5, Interesting)
yes, but if you have 1000 cores each with 64k of cache, then you start to run into problems with memory throughput when computing massively parallel data.
memory throughput has been the Achilles heel of graphics processing for years now. and as we all know, splitting up a graphics screen into smaller segments is simple. so GPUs went massively parallel long before CPUs, in fact you will soon be able to get over 1000 stream processing units in a single desktop graphics card.
so, the real problem is memory technology, how can a single memory module consistently feed 1000 cores, for instance if you want to do real-time n-pass encoding of a hd video stream... while playing a FPS online, and running IM software, and a strong anti-virus suite...
I have a horrible horrible ugly feeling that you'll never be able to get a system that can reliably do all that at the same time, just because they'll skimp on memory tech or interconnects, so you'll have most of the capabilities of a 1,000 core system wasted.
Parent
Re:Not Sure I'm Getting It (Score:5, Informative)
No. I/O is the slowdown in multitasking OSes.
Parent
Re:Not Sure I'm Getting It (Score:5, Insightful)
I concur, furthermore I'd like to see one core per pixel, that would certainly solve your high-end gaming issues.
Parent
Re:Not Sure I'm Getting It (Score:5, Insightful)
At the moment, I'm looking at Slashdot in Firefox, while listening to an mp3. I'm only using two out of my four cores, and I have 3% CPU usage.
Maybe when I post this, I might use a third core for a little while, but how many cores can I actually usefully use?
I can break a password protected Excel file in 30 hours max with this computer, and a 10000 core chip might reduce this to 43 seconds, but other than that, what difference is it going to make?
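The 43-second figure can be reproduced with a quick back-of-the-envelope calculation. This is only a sketch under two assumptions the post does not state: that the 30-hour estimate already uses all four cores, and that the search scales linearly with core count. The function and constant names are made up for illustration.

```python
# Assumptions (not stated in the post): the 30-hour figure uses all 4 cores,
# and the password search scales linearly with the number of cores.
CURRENT_CORES = 4
CURRENT_HOURS = 30
FUTURE_CORES = 10_000


def scaled_seconds(hours: float, cores_now: int, cores_future: int) -> float:
    """Wall-clock seconds for the same work on more cores, assuming perfect scaling."""
    core_seconds = hours * 3600 * cores_now  # total work, in core-seconds
    return core_seconds / cores_future


seconds = scaled_seconds(CURRENT_HOURS, CURRENT_CORES, FUTURE_CORES)
print(f"{seconds:.1f} s")  # 43.2 s
```

Brute-force key search is about as close to perfectly parallel as workloads get, which is why linear scaling is a tolerable assumption here and not for most desktop software.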
Parent
Re:Not Sure I'm Getting It (Score:5, Funny)
I can break a password protected Excel file in 30 hours max with this computer, and a 10000 core chip might reduce this to 43 seconds, but other than that, what difference is it going to make?
29 hours 59 minutes 17 seconds?
Parent
Re:Not Sure I'm Getting It (Score:5, Insightful)
This is, IMHO, the wrong question to be asking. Asking how current tasks will be optimized to take advantage of future hardware makes the fundamentally flawed assumption that the current tasks will be what's considered important once we have this kind of hardware.
But the history of computers has shown that the "if you build it, they will come" philosophy applies to the tasks that people end up wanting to accomplish. It's been seen time and again that new abilities for using computers wait until we've hit a certain performance threshold, whether it's CPU, memory, bandwidth, disk space, video resolution or whatever, and then become the things we need our computers to do.
Take, for instance, the huge success of mp3's. There was a time not so long ago when people were limited to playing music off a physical CD. This wasn't because there was no desire amongst computer users to listen to digital files that could be stored locally or streamed off the internet. It was because computer users did not know yet that they had the desire to do it. But technology advanced to the point where a) processors became fast enough to decode mp3's in real time without using the whole CPU and b) hard drives grew to the point where we had the capacity to store files that are 10% of the size of the files on the CD.
Similarly, it's likely that when we reach the point where we have hundreds or thousands of cores, new tasks will emerge that take advantage of the new capabilities of the hardware. It may be that those tasks are limited in some other way by one of the other components we use or by the as yet non-existent status of some new component, but it's only important that multiple cores play a part in enabling the new task.
In the near term, you can imagine a whole host of applications that would become possible when you get to the point where the average computer can do real-time H.264 encoding without affecting overall system performance. I won't guess at what might be popular further down the road, but there will be people who will think of something to do with those extra cores. And, in hindsight, we'll see the proliferation of cores as enabling our current computer-using behavior.
Parent
Re:Not Sure I'm Getting It (Score:5, Insightful)
Before, having 1 core was enough, and having 512 MB of RAM was enough for most consumers. Computing power grows, and software developers make use of that additional power. However, this will mainly affect the gaming industry.
Parent
Bill gates was just mis-quoted (Score:5, Funny)
Parent
Re:Not Sure I'm Getting It (Score:5, Informative)
"Because each core is no longer task switching. Once you have more cores than tasks you can remove all the context switching logic and optimize the cores to run single processes as fast as possible.
Then you take the tasks that can be broken up over multiple cores (Ray Tracing anyone?) and fill the rest of your cores with that."
Unfortunately all this is going to lead to bus and memory bandwidth contention; you're just shifting the burden from one point to another. Although there is a 'penalty' for task switching, there is an even greater bottleneck at the bus and memory bandwidth level.
IMHO Intel would have to release a CPU on a card with specialized RAM chips and segment the RAM like GPUs do to get anything out of multicore over the long term. RAM is not keeping up, and the current architecture for PC RAM is awful for multicore. CPU speed is far outstripping bus and memory bandwidth. I am quite dubious of multi-core architecture; there are fundamental limits to the geometry of circuits. I'd be sinking my money into materials research, not gluing cores together and praying CS and math guys come up with solutions that take advantage of it.
The whole history of human engineering and tool use is to take something extremely complicated, offload the complexity, and compartmentalize it so that it's manageable. I see the opposite happening with multi-core.
Parent
Re:Not Sure I'm Getting It (Score:5, Informative)
Because each core is no longer task switching. Once you have more cores than tasks you can remove all the context switching logic and optimize the cores to run single processes as fast as possible.
OK, so now the piece that's running on each core runs really really fast . . . until it needs to wait for or communicate with the piece running on some other core. If you can do your piece in ten instructions but you have to wait 1000 for the next input to come in, whether it's because your neighbor is slow or because the pipe between you is, then you'll be sitting and spinning 99% of the time. Unfortunately, the set of programs that decompose nicely into arbitrarily many pieces that each take the same time (for any input) doesn't extend all that far beyond graphics and a few kinds of simulation. Many, many more programs hardly decompose at all, or still have severe imbalances and bottlenecks, so the "slow neighbor" problem is very real.
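To put a number on the "sitting and spinning" claim, here is a minimal sketch. It assumes the core does nothing useful while it waits and that instruction counts stand in for time; `idle_fraction` is a name invented for this example.

```python
def idle_fraction(work_instructions: int, wait_instructions: int) -> float:
    """Fraction of each cycle a core spends waiting rather than computing."""
    return wait_instructions / (work_instructions + wait_instructions)


# Ten instructions of real work, then a 1000-instruction wait for the next input.
frac = idle_fraction(10, 1000)
print(f"{frac:.1%}")  # 99.0%
```

Making the core itself faster only shrinks the 10, so the idle fraction gets worse, not better, unless the wait shrinks too.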
Many people's answer to the "slow pipe" problem, on the other hand, is to do away with the pipes altogether and have the cores communicate via shared memory. Well, guess what? The industry has already been there and done that. Multiple processing units sharing a single memory space used to be called SMP, and it was implemented with multiple physical processors on separate boards. Now it's all on one die, but the fundamental problem remains the same. Cache-line thrashing and memory-bandwidth contention are already rearing their ugly heads again even at N=4. They'll become totally unmanageable somewhere around N=64, just like the old days and for the same reasons. People who lived through the last round learned from the experience, which is why all of the biggest systems nowadays are massively parallel non-shared-memory cluster architectures.
If you want to harness the power of 1000 processors, you have to keep them from killing each other, and they'll kill each other without even meaning to if they're all tossed in one big pool. Giving each processor (or at least each small group of processors) its own memory with its own path to it, and fast but explicit communication with its neighbors, has so far worked a lot better except in a very few specialized and constrained cases. Then you need multi-processing on the nodes, to deal with the processing imbalances. Whether the nodes are connected via InfiniBand or an integrated interconnect or a common die, the architectural principles are likely to remain the same.
Disclosure: I work for a company that makes the sort of systems I've just described (at the "integrated interconnect" design point). I don't say what I do because I work there; I work there because of what I believe.
Parent
Re:Not Sure I'm Getting It (Score:5, Funny)
Parent
Re:Not Sure I'm Getting It (Score:5, Funny)
My friends and I have lots of conversations about girls, how to get girls, how to please girls.
What, haven't you guys heard of simulation?
Parent
Re:Not Sure I'm Getting It (Score:5, Funny)
Parent
Lookahead/predictive branching is one option... (Score:5, Interesting)
I do see this move by Intel as a direct follow up to their plans to negate the processing advantages of today's video cards. Intel wants people running general purpose code to run it on their general purpose CPU's, not on their video cards using CUDA or the like. If the future of video game rendering is indeed ray-tracing (an embarrassingly parallel algorithm if ever there was one) then this move will also position Intel to compete directly with Nvidia for the raw processing power market.
One thing is for sure, there's a lot of coding to do. Very few programs currently make effective use of even 2 cores. Parallelization of code can be quite tricky, so hopefully tools will evolve that will make it easier for the typical code-monkey who's never written a parallel algorithm in his life.
Parent
Re:Not Sure I'm Getting It (Score:5, Insightful)
As a software engineer, I wonder the same thing.
Put simply, the majority of code simply doesn't parallelize well. You can break out a few major portions of it to run as their own threads, but for the most part, programs either sit around and wait for the user, or sit around and wait for hardware resources.
Within that, only those programs that wait for a particular hardware resource - CPU time - even have the potential to benefit from more cores... And while a lot of those might split well into a few threads, most will not scale (without a complete rewrite to choose entirely different algorithms - if they even exist to accomplish the intended purpose) to more than a handful of cores.
Parent
Re:Not Sure I'm Getting It (Score:5, Insightful)
Parent
Re:Not Sure I'm Getting It (Score:5, Interesting)
While prefetching data can be done using a single core, your post in this context gives me a cool idea.
Who needs branch prediction when you could just have 2 cores running a thread? Send each one executing instructions without a break in the pipeline and sync the wrong core to the correct one once you know the result. You'd still have to wait for results before any store operations, but you should probably know the branch result by then anyway.
Parent
Re:Not Sure I'm Getting It (Score:5, Insightful)
That is what most current processors do and use branch prediction for. Even if you have a thousand cores, that's only 10 binary decisions ahead. You need to guess really well very often to keep your cores busy instead of syncing. Also, the further you're executing ahead, the more ultimately useless calculations are made, which is what drives power consumption up in long pipeline cores (which you're essentially proposing).
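The "10 binary decisions" figure follows from the fact that running every possible path d branches ahead needs 2^d cores. A rough sketch, assuming one core per speculative path and ignoring all synchronization cost (the function name is hypothetical):

```python
def full_speculation_depth(cores: int) -> int:
    """Deepest branch level at which every possible path still gets its own core."""
    # Largest d with 2**d <= cores, computed without floating point.
    return cores.bit_length() - 1


print(full_speculation_depth(1000))  # 9
print(full_speculation_depth(1024))  # 10
```

Strictly, 1000 cores cover 9 levels completely (512 paths) and fall just short of the 1024 paths needed for 10, so "10 decisions ahead" is a slight round-up.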
In reality parallelism is more likely going to be found by better compilers. Programmers will have to be more specific about the type of loops they want. Do you just need something to be performed on every item in an array or is order important? No more mindless for-loops for not inherently sequential processes.
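The distinction between the two kinds of loop can be sketched in Python. This is an illustration only, not what any particular compiler does: the standard-library thread pool stands in for "a runtime that is free to reorder the work", and in CPython threads will not actually speed up CPU-bound code.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import accumulate


def square(x: int) -> int:
    return x * x


items = [1, 2, 3, 4, 5]

# Order-independent: each element is handled on its own, so the calls can be
# handed to a pool and run in any order. map() still returns results in
# input order.
with ThreadPoolExecutor(max_workers=4) as pool:
    squares = list(pool.map(square, items))

# Order-dependent as written: each step needs the previous running total,
# so this loop cannot simply be split across workers.
running_totals = list(accumulate(items))

print(squares)         # [1, 4, 9, 16, 25]
print(running_totals)  # [1, 3, 6, 10, 15]
```

Writing the first case as a map rather than a for-loop is exactly the extra information the programmer would be giving: no iteration depends on another.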
Parent
Great... (Score:5, Funny)
As if Oracle licensing wasn't complicated enough already...
Memory bandwidth? (Score:5, Interesting)
Re:Memory bandwidth? (Score:5, Insightful)
Memory would have to be completely redefined. Currently, you have one memory bank that is effectively accessed serially.
Yes, in Intel land. AMD has this thing called NUMA. What do you think "HyperTransport" means?
Parent
you mean SGI (Score:5, Insightful)
SGI and/or Cray were using NUMA a decade ago.
Parent
Disagreement about this trend (Score:5, Interesting)
At Supercomputing 2006, they had a wonderful panel [supercomputing.org] where they discussed the future of computing in general, and tried to predict what computers (especially Supercomputers) would look like in 2020. Tom Sterling made what I thought was one of the most insightful observations of the panel -- most of the code out there is sequential (or nearly so) and I/O bound. So your home user checking his email, running a web browser, etc is not going to benefit much from having all that compute power. (Gamers are obviously not included in this) Thus, he predicted, processors would max out at a "relatively" low number of cores - 64 was his prediction.
Re:Disagreement about this trend (Score:5, Funny)
Parent
Re:Disagreement about this trend (Score:5, Insightful)
My guess is 4 cores in 2008, 4 cores in 2009, moving to 8 cores through 2010. We may move to a new uber-core model once the software catches up, more like 6-8 years than 2-4. I'm positive we won't "max out" at 64 cores, because we're going to hit a per-core speed limit much more quickly than we hit a number-of-cores limit.
Parent
Re:Disagreement about this trend (Score:5, Interesting)
Architectural changes and other improvements allow a single core of a current 3.2 to easily outperform the old 3.8's, but then why don't we still see new 3.8's?
The Pentium 4 is, well, it's scary. It actually has "drive" stages because it takes too long for signals to propagate between functional blocks of the processor. This is just wait time, for the signals to get where they're going.
The P4 needed a super-deep pipeline to hit those kinds of speeds as a result, and so the penalty for branch misprediction was too high.
What MAY bring us higher clock rates again, though, is processors with very high numbers of cores. You can make a processor broad, cheap, or fast, but not all three. Making the processors narrow and simple will allow them to run at high clock rates and making them highly parallel will make up for their lack of individual complexity. The benefit lies in single-tasking performance; one very non-parallelizable thread which doesn't even particularly benefit from superscalar processing could run much faster on an architecture like this than anything we have today, while more parallelizable tasks can still run faster than they do today in spite of the reduced per-core complexity due to the number of cores - if you can figure out how to do more parallelization. Of course, that is not impossible [slashdot.org].
Parent
Re:Disagreement about this trend (Score:5, Insightful)
Web applications are becoming more AJAX'y all the time, and they are not sequential at all. Watching a video while another tab checks my Gmail is a parallel task. All indications are that people want to consume more and more media on their computers. Things like the MLB mosaic allow you to watch four games at once.
Have you ever listened to a song through your computer while coding, running an email program, and running an instant messaging program? There are four highly parallelizable tasks right there. Not compute intensive enough for you? Imagine the song compressed with a new codec that is twice as efficient in terms of size but twice as compute intensive. Imagine the email program indexing your email for efficient search, running algorithms to assess the email's importance to you, and virus checking new deliveries. Imagine your code editor doing on the fly analysis of what you are coding, and making suggestions.
"Normal" users are doing more and more with computers as well. Now that fast computers are cheap, people who never edited video or photos are doing it. If you want a significant market besides gamers who need more cores, it is people making videos, especially HD videos. Sure, my Grandmother isn't going to be doing this, but I do, and I'm sure my children will do it even more.
And don't forget about virus writers. They need a few cores to run on as well!
Computer power keeps its steady progress higher, and we keep finding interesting things to do with it all. I don't see that stopping, so I don't see a limit to the number of cores people will need.
Parent
been there, done that (Score:5, Funny)
Good idea (Score:5, Insightful)
It's a good idea.. Somewhat of the same idea that the Cell chip has going for it (and well, Phenom X3s). You make a product with lots of redundant objects so that when some are bound to fail, the percentage of failure is much lower..
If there are 1000 cores on a chip, and 100 go bad... You're still only losing a *maximum* of 10% of performance versus when you have 2 or 4 cores and 1 or 2 go bad, you have a performance impact of 50% essentially.. Brings costs down because yields go up dramatically.
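The comparison is simple arithmetic. A sketch, assuming every core contributes equally to performance and a defective core is just disabled (`worst_case_loss` is a made-up name):

```python
def worst_case_loss(total_cores: int, bad_cores: int) -> float:
    """Fraction of peak performance lost when bad cores are disabled."""
    return bad_cores / total_cores


print(worst_case_loss(1000, 100))  # 0.1
print(worst_case_loss(4, 2))       # 0.5
print(worst_case_loss(2, 1))       # 0.5
```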
Profit!!! (Score:5, Funny)
Databases and implementation-neutrality (Score:5, Interesting)
Databases provide a wonderful opportunity to apply multi-core processing. The nice thing about a (good) database is that queries describe what you want, not how to go about getting it. Thus, the database can potentially split the load up to many processes and the query writer (app) does not have to change a thing in his/her code. Whether a serial or parallel process carries it out is in theory out of the app developer's hair (although dealing with transaction management may sometimes come into play for certain uses.)
However, query languages may need to become more general-purpose in order to have our apps depend on them more, not just business data. For example, built-in graph (network) and tree traversal may need to be added and/or standardized in query languages. And, we may need to clean up the weak-points of SQL and create more dynamic DB's to better match dynamic languages and scripting.
Being a DB-head, I've discovered that a lot of processing can potentially be converted into DB queries. That way one is not writing explicit pointer-based linked lists etc., locking one into a difficult-to-parallel-ize implementation.
Relational engines used to be considered too bulky for many desktop applications. This is partly because they make processing go through a DB abstraction layer and thus are not using direct RAM pointers. However, the flip-side of this extra layer is that they are well-suited to parallelization.
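A small illustration of the "what, not how" point, using the `sqlite3` module from Python's standard library. The table and data are invented for the example, and SQLite is used only because it is at hand; whether a query is actually spread over many cores is up to the engine, and the application code would look the same either way.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("ann", 10.0), ("bob", 5.0), ("ann", 2.5)],
)

# The query states what is wanted (a total per customer). It says nothing
# about loops, pointers, or how many workers carry it out.
rows = conn.execute(
    "SELECT customer, SUM(amount) FROM orders GROUP BY customer ORDER BY customer"
).fetchall()
conn.close()

print(rows)  # [('ann', 12.5), ('bob', 5.0)]
```

The hand-written equivalent would be a dictionary and a for-loop, which fixes the traversal order in the application and leaves nothing for the engine to reorganize.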
Re:Databases and implementation-neutrality (Score:5, Informative)
By "a lot of processing can potentially be converted into DB queries", what you discovered is functional programming :) LINQ in .NET 3.5/C# 3.0 is an example of functional programming that is made to look like DB queries, but it isn't the only way. It is a LOT easier to convert that stuff and optimize it to the environment (like how SQL is processed), since it describes the "what" more than the "how". It is already done, and one (out of many examples) is Parallel LINQ, which smartly executes LINQ queries in parallel, optimized for the number of cores, etc. (And I'm talking about LINQ in the context of in-memory processing, not LINQ to SQL, which simply converts LINQ queries into SQL ones.)
Functional programming, tied with the concept of transactional memory to handle concurrency, is a nice medium term solution to the multi-core problem.
Parent
Re:Useless (Score:5, Insightful)
Parent
Re:Ok.. so how do I do that? (Score:5, Informative)
A year or so ago, I saw a presentation on Threading Building Blocks [threadingb...blocks.org], which is basically an API thingie that Intel created to help with this issue. Their big announcement last year was that they've released it open-source and have committed to making it cross-platform. (It's in Intel's best interest to get people using TBB on Athlon, PPC, and other architectures, because the more software is multi-core aware, the more demand there will be for multi-core CPUs in general, which Intel seems pretty excited about.)
Parent
Re:Generic jokes (Score:5, Funny)
In the Soviet Union
Oh wait... the Soviet Union already broke into smaller cores.
Parent
Re:We all saw it coming anyway (Score:5, Insightful)
"So whether programmers find this move acceptable or not is irrelevant because this path is probably the only way to design faster CPU:s once we've hit the nanometer wall."
I guess you should put "faster" in quotes.
In any case, it is absolutely relevant what programmers think since any performance improvements that customers actually experience are dependent on our code.
Historically a primary reason to buy a new computer is because a faster system makes legacy applications run faster. To a large extent this won't be true with a new multicore PC. So why would people buy them?
That's why Intel wants us to redesign our software - so that in the future their customers will still have a reason to buy a new PC with Intel Inside.
Parent
Re:It's all changing too fast (Score:5, Insightful)
My dad's been programming for decades, and he's much more used to paradigm shifts than I am. His first programming job was translating assembly from one architecture to another, and now he's a proficient web developer. He understands concurrency and keeps up to date on new developments.
I'm reminded of an anecdote told to me during a presentation. The presenter had been introducing a new technology, and one man had a concern: "I've just worked hard to learn the previous technology. Can you promise me that, if I learn this one, it will be the last one I ever have to learn?" The presenter replied, "I can't promise you that, but I can promise you that you're in the wrong profession."
Parent
Re:It's all changing too fast (Score:5, Interesting)
For example, "CPU expensive, memory expensive, programmer cheap" is now "CPU cheap, memory cheap, programmer expensive" -- hence Java et al. (I am sometimes amazed when I casually allocate/free chunks of memory larger than all the combined memory of all the computers at my university - both in the labs and the administration/operational side - but what amazes me is that it doesn't amaze me!)
Actually some of the "old timers" may be more comfortable with some issues of highly parallel programming than some of the "kids" (term used with respect, we were all kids once!) who have mostly had them masked from them by high level languages. Comparing "old timers" to "kids" doing enterprise server software, the kids seem much less likely to understand issues like memory coherence models of specific architectures, cache contention issues of specific implementations, etc.
Also, too often, the kids make assumptions about the source of performance/timing problems rather than gathering empirical evidence and acting on that evidence. This trait is particularly problematic because when dealing with concurrency and varying load conditions, intuition can be quite unreliable.
Really, it's not all that scary - the first paradigm shift is the hardest!
Parent