Intel Hardware

Intel Says to Prepare For "Thousands of Cores" 638

Impy the Impiuos Imp writes to tell us that in a recent statement Intel has revealed its plans for the future, and they go well beyond the traditional processor model. Suggesting developers start thinking about tens, hundreds, or even thousands of cores, it seems Intel is pushing for a massive evolution in the way processing is handled. "Now, however, Intel is increasingly 'discussing how to scale performance to core counts that we aren't yet shipping...Dozens, hundreds, and even thousands of cores are not unusual design points around which the conversations meander,' [Anwar Ghuloum, a principal engineer with Intel's Microprocessor Technology Lab] said. He says that the more radical programming path to tap into many processing cores 'presents the "opportunity" for a major refactoring of their code base, including changes in languages, libraries, and engineering methodologies and conventions they've adhered to for (often) most of their software's existence.'"
  • by Delwin ( 599872 ) * on Wednesday July 02, 2008 @04:46PM (#24036011)
    Because each core is no longer task switching. Once you have more cores than tasks you can remove all the context switching logic and optimize the cores to run single processes as fast as possible.

    Then you take the tasks that can be broken up over multiple cores (Ray Tracing anyone?) and fill the rest of your cores with that.
  • It's already here. (Score:2, Informative)

    by GreatBunzinni ( 642500 ) on Wednesday July 02, 2008 @04:51PM (#24036077)

    We already have systems with tens and hundreds of cores. Those processors already go by the name of "graphics card", and those changes in languages and libraries go by the names of CUDA, CTM, Brook+ and the like.

    The only thing new that Intel brought to the table with this press release is the attempt to fool us into believing that there is nothing of the kind available and that Intel is somehow innovating in some aspect or another.

    Face it: the age of the "CPU is the computing muscle" is long gone.

  • Already Happening (Score:3, Informative)

    by sheepweevil ( 1036936 ) on Wednesday July 02, 2008 @04:55PM (#24036129) Homepage
    Supercomputers already have many more than thousands of cores. The IBM Blue Gene/P can have up to 1,048,576 cores [ibm.com]. What Intel is probably talking about is bringing that level of parallel computing to smaller computers.
  • by Phroggy ( 441 ) <slashdot3@ p h roggy.com> on Wednesday July 02, 2008 @05:02PM (#24036221) Homepage

    A year or so ago, I saw a presentation on Threading Building Blocks [threadingb...blocks.org], which is basically an API thingie that Intel created to help with this issue. Their big announcement last year was that they've released it open source and have committed to making it cross-platform. (It's in Intel's best interest to get people using TBB on Athlon, PPC, and other architectures, because the more software is multi-core aware, the more demand there will be for multi-core CPUs in general, which Intel seems pretty excited about.)
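
    To make the idea concrete, here is a minimal sketch of the classic TBB parallel_for idiom the parent describes; the array, body class, and scaling operation are made up for illustration, not taken from the comment:

        #include "tbb/parallel_for.h"
        #include "tbb/blocked_range.h"

        // Body object applied to sub-ranges of the loop; TBB decides how to
        // split the iteration space across however many cores are available.
        struct ScaleBody {
            float* data;
            float factor;
            ScaleBody(float* d, float f) : data(d), factor(f) {}
            void operator()(const tbb::blocked_range<size_t>& r) const {
                for (size_t i = r.begin(); i != r.end(); ++i)
                    data[i] *= factor;
            }
        };

        void scale_all(float* data, size_t n, float factor) {
            // The same call works on 2 cores or 2000; the scheduler adapts.
            tbb::parallel_for(tbb::blocked_range<size_t>(0, n),
                              ScaleBody(data, factor));
        }

    The point of the idiom is that the code states the range and the work per element; how that range is carved up across cores is left to the library.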

  • by zarr ( 724629 ) on Wednesday July 02, 2008 @05:04PM (#24036253)
    How do those get sped up if you're opting for more cores instead of more cycles?

    Algorithms that can't be parallelized will not benefit from a parallel architecture. It's really that simple. :( Also, many algorithms that are parallelizable will not benefit from an "infinite" number of cores. The limited bandwidth for communication between cores will usually become a bottleneck at some point.

  • by pimpimpim ( 811140 ) on Wednesday July 02, 2008 @05:26PM (#24036565)
    Bingo, that's exactly the problem. I've taken an introductory course on parallel programming (not saying I'm an expert, though), and while the idea of multi-processor programming is fairly simple, the implementation is amazingly difficult and painful.

    Example: the "race condition". Say processor one is trying to find the optimal value of variable A, while processor two is doing something different but calls some subfunction that changes variable A; processor one might then keep running forever.

    The other main problem is the deadlock: processor one needs the final result of variable B to calculate variable A, but processor two needs the final result of variable A to calculate B. Both processors come to a standstill, and the program hangs forever.

    For simple programs, these things are relatively easy to troubleshoot. But for your huge program package with hundreds of modules, it is almost impossible to know what is happening.

    Actually, it's up to Intel and co. to find a way to prevent these situations, but even then, what kind of genius is able to program an automated debugger that can find deadlocks and race conditions?
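
    To make the deadlock above concrete, here is a toy sketch of the classic lock-ordering version of the problem (the mutex and variable names are invented for illustration): one thread locks A and then waits for B, the other locks B and then waits for A, and with unlucky timing neither ever proceeds.

        #include <mutex>
        #include <thread>

        std::mutex mutex_a, mutex_b;   // protect shared variables A and B
        int A = 0, B = 0;

        void worker_one() {
            std::lock_guard<std::mutex> la(mutex_a);   // holds A...
            std::lock_guard<std::mutex> lb(mutex_b);   // ...while waiting for B
            A = B + 1;
        }

        void worker_two() {
            std::lock_guard<std::mutex> lb(mutex_b);   // holds B...
            std::lock_guard<std::mutex> la(mutex_a);   // ...while waiting for A: deadlock
            B = A + 1;
        }

        int main() {
            std::thread t1(worker_one), t2(worker_two);
            t1.join();   // with unlucky interleaving, neither join ever returns
            t2.join();
        }

    Locking the two mutexes in the same order in both threads (or acquiring both atomically with std::lock) removes the deadlock, and that is exactly the kind of discipline that is easy in a ten-line example and very hard to enforce across hundreds of modules.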

  • by Brian Gordon ( 987471 ) on Wednesday July 02, 2008 @05:43PM (#24036755)
    Are you crazy? Context switches are the slowdown in multitasking OSes.
  • Re:Memory bandwidth? (Score:1, Informative)

    by AnyoneEB ( 574727 ) on Wednesday July 02, 2008 @05:45PM (#24036787) Homepage
    Intel is finally catching up to AMD on that front with Nehalem [wikipedia.org].
  • by k8to ( 9046 ) on Wednesday July 02, 2008 @05:56PM (#24036917) Homepage

    True but misleading. The major cost of task switching is a hardware-derived one. It's the cost of blowing caches. The swapping of CPU state and such is fairly small by comparison, and the cost of blowing caches is only going up.

  • Re:I'm not bitter. (Score:2, Informative)

    by GatesDA ( 1260600 ) on Wednesday July 02, 2008 @06:07PM (#24037069)

    They'll have an excuse if we have 3D monitors at that point

    3D monitors already exist and are available for purchase; there are even some that don't need glasses. To go with those, nVidia has stereo drivers up on their website that will work on all their cards and with most games. (Last I checked, ATI's stereo drivers only work on their workstation cards).

    To make a game work in 3D, the graphics card just renders two images -- one for each eye; that's not enough work to be used as an excuse for poor performance. Of course, you can always increase the size of armies and such if you WANT to lower performance. They'll find a way.

    http://en.wikipedia.org/wiki/Autostereoscopy [wikipedia.org]

  • by rrohbeck ( 944847 ) on Wednesday July 02, 2008 @06:17PM (#24037201)

    Yup. It's Amdahl's law [wikipedia.org].

    This whole many-core hype looks a lot like the gigahertz craze from a few years ago. Obviously they're afraid that there will be no reason to upgrade. 2 or 4 cores, OK -- you often (sometimes?) have that many tasks active. But significantly more will only buy you throughput for games, simulations and similar heavy computations. Unless we (IAACS too) rewrite all of our apps under new paradigms like functional programming (e.g. in Erlang [wikipedia.org]), which will only be done if there's a good reason for it.
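
    For reference, Amdahl's law says that if only a fraction p of a program can be parallelized, the speedup on n cores is capped at 1 / ((1 - p) + p/n). A quick back-of-the-envelope sketch (the 90% figure is just an example, not a measurement):

        #include <cstdio>

        // Amdahl's law: speedup on n cores when a fraction p of the work is parallel.
        double amdahl(double p, double n) {
            return 1.0 / ((1.0 - p) + p / n);
        }

        int main() {
            // Even with 90% of the work parallelized, 1000 cores give less than 10x.
            std::printf("4 cores:    %.2fx\n", amdahl(0.9, 4));     // ~3.08x
            std::printf("64 cores:   %.2fx\n", amdahl(0.9, 64));    // ~8.77x
            std::printf("1000 cores: %.2fx\n", amdahl(0.9, 1000));  // ~9.91x
        }

    The serial 10% dominates long before the core count gets interesting, which is the poster's point about needing new paradigms rather than more cores.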

  • by kdemetter ( 965669 ) on Wednesday July 02, 2008 @06:26PM (#24037295)

    2001: A Space Odyssey, by Arthur C. Clarke.
    Great book.

  • Re:Microsoft's reply (Score:2, Informative)

    by David Greene ( 463 ) on Wednesday July 02, 2008 @06:29PM (#24037323)
    That's no joke. It's not at all unusual to have to wait hours for tens of thousands of core files to be produced on large HPC machines. Debugging at scale is a really, really hard problem.
  • by jsebrech ( 525647 ) on Wednesday July 02, 2008 @06:29PM (#24037329)

    Architectures have changed and other stuff allow a current single core of a 3.2 to easily outperform the old 3.8's but then still why don't we see new 3.8's?

    Clock rate is meaningless. They could build a 10 ghz cpu, but it wouldn't outperform the current 3 ghz cpu's.

    A modern cpu uses pipelining. This means that each instruction is spread out across a series of phases (e.g. fetch data, perform calculation 1, perform calculation 2, store data). Each phase is basically a layer of transistors the logic has to go through. The clock rate simply is how often data is transferred to the next phase. The higher you push the clock, the faster instructions move through their phases towards completion. The problem is that the transistors in each phase take a while after every clock tick to stabilize. So, if you push the clock rate too high, the end result of your current phase won't have been reached yet, and you'll push garbage to the next phase. This is why a cpu that is overclocked too far will cause crashes. It simply doesn't do reliable calculation anymore.

    Now, the reason you had higher clock rates on the P4 architecture is that intel "solved" the clock rate problem by having more phases and making each phase shorter. Overall the cpu was less efficient, but they could put a bigger ghz number on the package, so marketing was happy. They've come back from that because they couldn't compete on cost/performance with someone who didn't do that (amd), and their current architecture has appropriate-length phases again, with a lower clock rate to match.

    Like you've observed however, overall the speed has gone up.

  • by blahplusplus ( 757119 ) on Wednesday July 02, 2008 @06:50PM (#24037555)

    "Because each core is no longer task switching. Once you have more cores than tasks you can remove all the context switching logic and optimize the cores to run single processes as fast as possible.

    Then you take the tasks that can be broken up over multiple cores (Ray Tracing anyone?) and fill the rest of your cores with that."

    Unfortunately all this is going to lead to bus and memory bandwidth contention; you're just shifting the burden from one point to another. Although there is a 'penalty' for task switching, there is an even greater bottleneck at the bus and memory bandwidth level.

    IMHO Intel would have to release a CPU on a card with specialized RAM chips and segment the RAM like GPUs do to get anything out of multicore over the long term; RAM is not keeping up, and the current architecture for PC RAM is awful for multicore. CPU speed is far outstripping bus and memory bandwidth. I am quite dubious of multi-core architecture; there are fundamental limits to the geometry of circuits. I'd be sinking my money into materials research, not gluing cores together and praying the CS and math guys come up with solutions that take advantage of it.

    The whole history of human engineering and tool use is to take something extremely complicated, offload the complexity, and compartmentalize it so that it's manageable. I see the opposite happening with multi-core.

  • by Shados ( 741919 ) on Wednesday July 02, 2008 @06:54PM (#24037619)

    By "a lot of processing can potentially be converted into DB queries", what you discovered is functional programming :) LINQ in .NET 3.5/C# 3.0 is an example of functional programming that is made to look like DB queries, but it isn't the only way. It is a LOT easier to convert that stuff and optimize it to the environment (like how SQL is processed), since it describes the "what" more than the "how". It is already done, and one (out of many examples) is Parallel LINQ, which smartly execute LINQ queries in parallel, optimized for the amount of cores, etc. (And I'm talking about LINQ in the context of in memory process, not LINQ to SQL, which simply convert LINQ queries into SQL ones).

    Functional programming, tied with the concept of transactional memory to handle concurrency, is a nice medium-term solution to the multi-core problem.
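
    A rough C++ analogue of the same "describe the what, not the how" idea, using the C++17 parallel algorithms rather than LINQ (the vector and lambda here are purely illustrative): the call states what transformation is wanted and leaves the scheduling across cores to the library.

        #include <algorithm>
        #include <execution>
        #include <vector>

        std::vector<double> squares(const std::vector<double>& xs) {
            std::vector<double> out(xs.size());
            // std::execution::par lets the library split the work across cores;
            // the caller only says *what* to compute, not *how* to schedule it.
            std::transform(std::execution::par, xs.begin(), xs.end(), out.begin(),
                           [](double x) { return x * x; });
            return out;
        }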

  • by skulgnome ( 1114401 ) on Wednesday July 02, 2008 @06:57PM (#24037641)

    No. I/O is the slowdown in multitasking OSes.

  • by Anonymous Coward on Wednesday July 02, 2008 @07:38PM (#24038029)

    In the CM-1 and CM-2 machines, the parallelism you saw tended to be pure SIMD-level parallelism (what TMC promoted as 'data level' parallelism). This probably changed in the CM-5, a machine I know less about...

    Pure SIMD-level parallelism is /somewhat/ akin to what you see in GPGPU-style projects, although even those (in certain GPU programming idioms) tend to 'see' things as collections of threads as opposed to large arrays of ALU/CPU sets working in lock-step on an array.

    I would suggest that TMC had quite a bit going for it in terms of novel thinking about how one might use such a beast as the CM[125]. This is the problem in front of Intel with so many cores to tame: how do you think about these problems and match the model to the problem(s) at hand? Clearly, at very large N cores, the idiom will have to change, because you can't just 're-write it in C with threads' or some other larger-granularity adaptation. This is the prize: taming widespread parallelism, for systems programming (at the OS level) and for compute-intensive applications. (It REALLY SUCKS to do MPI in Fortran... things have to change.)

    There are some very significant and very interesting changes afoot over the next 10 years in computer science due to the wide spread of parallelism. You used to need exotic machines for this. Now it will be in everything from laptops through supercomputers...

    Very neat.

  • by Joren ( 312641 ) on Wednesday July 02, 2008 @07:53PM (#24038121) Homepage
    The "Control" meme is from Get Smart, which came out a week or two ago. So yes, it is pretty recent...unless you happen to have watched the series from the 60s.
  • by kahanamoku ( 470295 ) on Wednesday July 02, 2008 @08:03PM (#24038207)

    By definition, isn't a core just the middle/root of something? If you have more than one core, shouldn't the term be changed to reflect something closer to what it actually represents?

  • by kramerd ( 1227006 ) on Wednesday July 02, 2008 @08:23PM (#24038357)
    Girls like it when you buy them things. Or when you pretend to listen. And when you shower.
  • by kesuki ( 321456 ) on Wednesday July 02, 2008 @08:46PM (#24038539) Journal

    "Take, for instance, the huge success of mp3's. There was a time not so long ago when people were limited to playing music off a physical CD. This wasn't because there was no desire amongst computer users to listen to digital files that could be stored locally or streamed off the internet. It was because computer users did not know yet that they had the desire to do it. But technology advanced to the point where a) processors became fast enough to decode mp3's in real time without using the whole CPU"

    I started making MP3s with a 486 DX at 75 MHz.

    I could decode in real time on that 486 DX 75; as I recall, encoding took a bit of time, and I only had a 3 GB HDD that had been an upgrade to the system...

    MP3 uses an asymmetric codec: it takes more CPU to encode than to decode. If your MP3 player doesn't run correctly on a 486, it's because they designed in features not strictly needed to decode an MP3 stream.

    Oh hey, I have an RCA Lyra MP3 player that isn't even as fast as a 486, but its decoder was designed specifically for MP3 decoding.

    Ogg needs a beefier decoder; that's half the problem with getting Ogg support into devices not made for decoding video streams.

  • by Salamander ( 33735 ) <jeff@ p l . a t y p.us> on Wednesday July 02, 2008 @09:23PM (#24038823) Homepage Journal

    Because each core is no longer task switching. Once you have more cores than tasks you can remove all the context switching logic and optimize the cores to run single processes as fast as possible.

    OK, so now the piece that's running on each core runs really really fast . . . until it needs to wait for or communicate with the piece running on some other core. If you can do your piece in ten instructions but you have to wait 1000 for the next input to come in, whether it's because your neighbor is slow or because the pipe between you is, then you'll be sitting and spinning 99% of the time. Unfortunately, the set of programs that decompose nicely into arbitrarily many pieces that each take the same time (for any input) doesn't extend all that far beyond graphics and a few kinds of simulation. Many, many more programs hardly decompose at all, or still have severe imbalances and bottlenecks, so the "slow neighbor" problem is very real.

    Many people's answer to the "slow pipe" problem, on the other hand, is to do away with the pipes altogether and have the cores communicate via shared memory. Well, guess what? The industry has already been there and done that. Multiple processing units sharing a single memory space used to be called SMP, and it was implemented with multiple physical processors on separate boards. Now it's all on one die, but the fundamental problem remains the same. Cache-line thrashing and memory-bandwidth contention are already rearing their ugly heads again even at N=4. They'll become totally unmanageable somewhere around N=64, just like the old days and for the same reasons. People who lived through the last round learned from the experience, which is why all of the biggest systems nowadays are massively parallel non-shared-memory cluster architectures.

    If you want to harness the power of 1000 processors, you have to keep them from killing each other, and they'll kill each other without even meaning to if they're all tossed in one big pool. Giving each processor (or at least each small group of processors) its own memory with its own path to it, and fast but explicit communication with its neighbors, has so far worked a lot better except in a very few specialized and constrained cases. Then you need multi-processing on the nodes, to deal with the processing imbalances. Whether the nodes are connected via InfiniBand or an integrated interconnect or a common die, the architectural principles are likely to remain the same.

    Disclosure: I work for a company that makes the sort of systems I've just described (at the "integrated interconnect" design point). I don't say what I do because I work there; I work there because of what I believe.
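
    To put a face on the cache-line thrashing mentioned above, here is a minimal sketch of false sharing; the struct names, iteration counts, and 64-byte line size are assumptions for illustration. Two threads increment counters that are logically independent but sit on the same cache line, so the line ping-pongs between cores; padding each counter onto its own line is the shared-memory analogue of giving each core its own memory.

        #include <atomic>
        #include <thread>

        // Both counters share one cache line: every increment by one core
        // invalidates the line in the other core's cache (false sharing).
        struct SharedLine {
            std::atomic<long> a{0};
            std::atomic<long> b{0};
        };

        // Padding pushes each counter onto its own cache line (64 bytes assumed),
        // so the two threads no longer fight over the same line.
        struct PaddedLines {
            alignas(64) std::atomic<long> a{0};
            alignas(64) std::atomic<long> b{0};
        };

        template <typename Counters>
        void hammer(Counters& c) {
            std::thread t1([&] { for (long i = 0; i < 100000000; ++i) c.a++; });
            std::thread t2([&] { for (long i = 0; i < 100000000; ++i) c.b++; });
            t1.join();
            t2.join();
        }

        int main() {
            SharedLine shared;    // typically noticeably slower
            PaddedLines padded;   // each core keeps its own line in cache
            hammer(shared);
            hammer(padded);
        }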

  • by mrchaotica ( 681592 ) * on Wednesday July 02, 2008 @10:14PM (#24039117)

    but can we PLEASE work on getting apps to run on more than just ONE core/processor for now?

    Why?

    The kind of parallelism needed for a few cores (coarse-grained task parallelism) is entirely different than the kind of parallelism needed for hundreds or thousands of cores (fine-grained data parallelism). Designing for a few cores won't do us a damn bit of good when we have hundreds or thousands.
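
    As a concrete contrast (the task names and thread counts here are made up): coarse-grained task parallelism hands a few unrelated jobs to a few threads, while fine-grained data parallelism splits one big piece of work across however many cores exist. Code written the first way does not automatically become the second.

        #include <numeric>
        #include <thread>
        #include <vector>

        // Coarse-grained task parallelism: two unrelated jobs, two threads.
        // A third core gains nothing unless there is a third job.
        void run_tasks() {
            std::thread audio([] { /* decode the soundtrack */ });
            std::thread network([] { /* talk to the game server */ });
            audio.join();
            network.join();
        }

        // Fine-grained data parallelism: one job split into n_threads chunks
        // (assumes n_threads >= 1). The same code scales with the core count.
        long parallel_sum(const std::vector<int>& data, unsigned n_threads) {
            std::vector<long> partial(n_threads, 0);
            std::vector<std::thread> workers;
            size_t chunk = data.size() / n_threads;
            for (unsigned t = 0; t < n_threads; ++t) {
                size_t begin = t * chunk;
                size_t end = (t + 1 == n_threads) ? data.size() : begin + chunk;
                workers.emplace_back([&, begin, end, t] {
                    partial[t] = std::accumulate(data.begin() + begin,
                                                 data.begin() + end, 0L);
                });
            }
            for (auto& w : workers) w.join();
            return std::accumulate(partial.begin(), partial.end(), 0L);
        }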

  • by tanadeau ( 518507 ) on Wednesday July 02, 2008 @11:10PM (#24039397)
    Declarative languages are ones like Prolog. You're talking about functional programming (Lisp, Haskell, Erlang, OCaml, etc.) which is a wholly different (and easier to understand) beast.
  • by Erich ( 151 ) on Wednesday July 02, 2008 @11:17PM (#24039425) Homepage Journal
    Single Address Space is horrible.

    It's a huge kludge for idiotic processors (like the ARM9) that don't have physically tagged caches. All non-incredibly-sucky processors have physically tagged caches, so giving every app its own address space, or having multiple apps share physical pages at different virtual addresses, is perfectly fine.

    Problems with SAS:

    • Everything has to be compiled Position-independent, or pre-linked for a specific location
    • Virtual memory fragmentation as applications are loaded and unloaded
    • Where is the heap? Is there one? Or one per process?
    • COW and paging get harder
    • People start using it and think it's a good idea.

    Most people... even people using ARM... are using processors with physically-tagged caches. Please, Please, Please, don't further the madness of single-address-space environments. There are still people encouraging this crime against humanity.

    Maybe I'm a bit bitter, because some folks in my company have drunk the SAS kool-aid. But believe me, unless you have ARM9, it's not worth it!

  • by dryeo ( 100693 ) on Thursday July 03, 2008 @01:00AM (#24039929)

    And before they made it into a movie, it was an interesting short story: http://en.wikipedia.org/wiki/The_Sentinel_(short_story) [wikipedia.org]
    If you'd like to read it, it seems to be available as this PDF: http://econtent.typepad.com/TheSentinel.pdf [typepad.com]

  • Gaming? (Score:3, Informative)

    by phorm ( 591458 ) on Thursday July 03, 2008 @01:31AM (#24040101) Journal
    I'd say that it could have a rather hefty impact on the graphics industry (though to be fair, both tend to share tech fairly regularly as it is) as well as many others.

    How about servers? If you have 1000 cores, and 1000 clients connecting through the network, then each core could service a client (though depending on what they're doing, IO and other issues also rear their heads). Another nice aspect would be that if you could fix a process to a certain # of cores, you could always be sure that it wouldn't max out your entire CPU capacity.
  • Re:Gaming? (Score:2, Informative)

    by walshy007 ( 906710 ) on Thursday July 03, 2008 @04:08AM (#24040601)
    "Another nice aspect would be that if you could fix a process to a certain # of cores" already can in linux, schedtool lets you set hard cpu affinities per process, you can let it only go on certain cores if you like
