
Linux May Need a Rewrite Beyond 48 Cores

An anonymous reader writes "There is interesting new research coming out of MIT which suggests current operating systems are struggling with the addition of more cores to the CPU. It appears that the problem, which affects the available memory in a chip when multiple cores are working on the same chunks of data, is getting worse and may be hitting a peak somewhere in the neighborhood of 48 cores, when entirely new operating systems will be needed, the report says. Luckily, we aren't anywhere near 48 cores and there is some time left to come up with a new Linux (Windows?)."


  • by eldavojohn ( 898314 ) * <eldavojohn@gma[ ]com ['il.' in gap]> on Thursday September 30, 2010 @12:48PM (#33748820) Journal

    It appears that the problem, which affects the available memory in a chip when multiple cores are working on the same chunks of data, is getting worse and may be hitting a peak somewhere in the neighborhood of 48 cores, when entirely new operating systems will be needed, the report says.

    Seriously? You picked that over my submission?

    I submitted this earlier this morning; I guess my submission was lacking [slashdot.org]. But if you're interested in the original MIT article [mit.edu] and the actual paper [mit.edu] (PDF):

    eldavojohn writes "Multicore (think tens or hundreds of cores) will come at a price for current operating systems. A team at MIT found that as they approached 48 cores their operating system slowed down [mit.edu]. After activating more and more cores in their simulation, a sort of memory leak occurred whereby data had to remain in memory as long as a core might need it in its calculations. But the good news is that in their paper [mit.edu] (PDF), they showed that for at least several years Linux should be able to keep up with chip enhancements in the multicore realm. To handle multiple cores, Linux keeps a counter of which cores are working on the data. As a core starts to work on a piece of data, Linux increments the number. When the core is done, Linux decrements the number. As the core count approached 48, the amount of actual work decreased and Linux spent more time managing counters. But the team found that 'Slightly rewriting the Linux code so that each core kept a local count, which was only occasionally synchronized with those of the other cores, greatly improved the system's overall performance.' The researchers caution that as the number of cores skyrockets [slashdot.org], operating systems will have to be completely redesigned [slashdot.org] to handle managing these cores and SMP [wikipedia.org]. After reviewing the paper, one researcher is confident Linux will remain viable for five to eight years without need for a major redesign."

    I don't know, guess I picked a bad title or something?

    Luckily we aren't anywhere near 48 cores and there is some time left to come up with a new Linux (Windows?).

    Again, seriously? What does "(Windows?)" even mean? As you pass a certain number of cores, modern operating systems will need to be redesigned to handle extreme SMP. It's going to differ from OS to OS but we won't know about Windows until somebody takes the time to test it.
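
    Purely as an illustration of the "local count, occasionally synchronized" fix described in the submission quoted above, here is a minimal userland sketch in C. It is not the kernel's code: it uses one padded per-thread slot rather than true per-core counters, and the thread count, names, and build command are made up for the example.

        /* sloppy_counter.c - illustrative sketch only, not the kernel implementation.
         * Each thread increments a private, cache-line-padded count; a reader sums
         * the slots on demand instead of every update contending on one shared
         * counter and bouncing its cache line between cores.
         * Build (assumed): gcc -O2 -pthread sloppy_counter.c -o sloppy_counter
         */
        #include <pthread.h>
        #include <stdatomic.h>
        #include <stdio.h>

        #define NTHREADS   8
        #define PER_THREAD 1000000
        #define CACHELINE  64

        struct slot {
            atomic_long count;
            char pad[CACHELINE - sizeof(atomic_long)];  /* avoid false sharing */
        };

        static struct slot slots[NTHREADS];

        static void *worker(void *arg)
        {
            struct slot *my = &slots[(long)arg];
            for (int i = 0; i < PER_THREAD; i++)
                /* relaxed ordering is enough: only this thread writes this slot */
                atomic_fetch_add_explicit(&my->count, 1, memory_order_relaxed);
            return NULL;
        }

        static long total(void)   /* the "occasional synchronization" step */
        {
            long sum = 0;
            for (int i = 0; i < NTHREADS; i++)
                sum += atomic_load(&slots[i].count);
            return sum;
        }

        int main(void)
        {
            pthread_t tid[NTHREADS];
            for (long i = 0; i < NTHREADS; i++)
                pthread_create(&tid[i], NULL, worker, (void *)i);
            for (int i = 0; i < NTHREADS; i++)
                pthread_join(tid[i], NULL);
            printf("total = %ld (expected %d)\n", total(), NTHREADS * PER_THREAD);
            return 0;
        }

    The only point of the sketch is that the hot path touches memory owned by a single thread (core), which is the property the MIT patch restores for the kernel's reference counts.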

  • Barrelfish (Score:1, Informative)

    by Anonymous Coward on Thursday September 30, 2010 @12:50PM (#33748858)

    This is exactly why people are doing research on Barrelfish (http://www.barrelfish.org/).

  • by Anonymous Coward on Thursday September 30, 2010 @12:53PM (#33748904)

    I’d just like to interject for a moment. What you’re referring to as Linux is, in fact, GNU/Linux, or as I’ve recently taken to calling it, GNU plus Linux. Linux is not an operating system unto itself, but rather another free component of a fully functioning GNU system made useful by the GNU corelibs, shell utilities and vital system components comprising a full OS as defined by POSIX.

    Many computer users run a modified version of the GNU system every day, without realizing it. Through a peculiar turn of events, the version of GNU which is widely used today is often called “Linux”, and many of its users are not aware that it is basically the GNU system, developed by the GNU Project.

    There really is a Linux, and these people are using it, but it is just a part of the system they use. Linux is the kernel: the program in the system that allocates the machine’s resources to the other programs that you run. The kernel is an essential part of an operating system, but useless by itself; it can only function in the context of a complete operating system. Linux is normally used in combination with the GNU operating system: the whole system is basically GNU with Linux added, or GNU/Linux. All the so-called “Linux” distributions are really distributions of GNU/Linux.

    So this blog needs to be renamed to the GNU/Linux Hater's Blog. Have a nice day.

  • by eudaemon ( 320983 ) on Thursday September 30, 2010 @12:56PM (#33748964)

    I just laughed at the "we aren't anywhere near 48 cores" comment - there are already commercial products with more than 48 cores now. I mean, even a crappy old T5220 pretends to have 64 CPUs due to its 8-core, 8-threads-per-core design.

  • by Anonymous Coward on Thursday September 30, 2010 @12:56PM (#33748966)

    I don't know, guess I picked a bad title or something?

    No. Your summary was too long.

    Seriously, the purpose of a summary is not to include every last fact and detail mentioned in the article; it's to give the reader enough information to decide whether reading the full article is worth it. Don't try to put everything in there.

  • by WinterSolstice ( 223271 ) on Thursday September 30, 2010 @01:06PM (#33749116)

    Got a pile of AIX servers here like that:
    http://www-03.ibm.com/systems/power/hardware/780/index.html [ibm.com]

    I was kind of wondering about the "modern operating systems" comment... I think he meant "desktop operating systems".
    Many of the big OS vendors (IBM, DEC (now HP), CRAY, etc) are well beyond this point. Even OS/2 could scale to 1024 processors if I recall correctly.

  • by Skal Tura ( 595728 ) on Thursday September 30, 2010 @01:07PM (#33749144) Homepage

    Never mind that a fairly standard server, a dual Xeon with 6 cores and HT per socket, already reports 24 CPUs in total, and it's widely used and nothing special.

  • by Anonymous Coward on Thursday September 30, 2010 @01:10PM (#33749214)

    The effect is otherwise known as "Amdahl's Law", well documented by Gene Amdahl in 1967. Is this news at all?

  • by DrgnDancer ( 137700 ) on Thursday September 30, 2010 @01:15PM (#33749274) Homepage

    I thought this as well, but after more carefully reading the article, I *think* I see what the problem is. It's not really a problem with large numbers of cores in a system, so much as a problem with large numbers of cores on a chip. Since multicore chips share caches (level 2 cache is shared, level 1 cache isn't, IIRC, but I could be wrong), it's actually cache memory where the issue lies. I've worked on single-system-image SGI systems with 512 cores, but those systems were actually 256 dual-core chips. That works fine, and, assuming well-written SMP code, performance scales as you'd expect with the number of cores.

  • by Dahamma ( 304068 ) on Thursday September 30, 2010 @01:16PM (#33749282)

    the purpose of a summary is not to include every last fact and detail mentioned in the article; it's to give the reader enough information to decide whether reading the full article is worth it.

    If you think a summary can actually help get a /. reader to RTFA, you must be new here...

  • by BeardedChimp ( 1416531 ) on Thursday September 30, 2010 @01:16PM (#33749300)
    The purpose of an editor is to edit any submissions to make them ready for print.

    If the summary was too long, the editor should have got off his arse rather than wait for the summary that fits the word count to come along.
  • 48 Cores in 1U (Score:2, Informative)

    by kybur ( 1002682 ) on Thursday September 30, 2010 @01:20PM (#33749350)

    I'm not affiliated with Supermicro in any way, but they have four 1U serverboards designed for the 12-core Opterons, so that's 48 cores in a 1U server. I'm guessing that Supermicro is not the only vendor of quad Opteron boards supporting the latest chips. There are most likely quite a few of these in use by real people. Anyone want to speak up?

    I know from personal experience that the Socket F Opterons performed very poorly in an 8-way configuration compared to the previous generation (Socket 940). I ran multiple tests on dual-core chips (885s, I think) back in 2006 or 2007, where I'd get nearly double the performance going from a quad configuration to an 8-way configuration, but with the Socket F breed of chips there was no performance boost at all; it was as if the clock speed was being cut in half and all the threads took twice as long to complete. I saw this behavior again and again, and the motherboard manufacturer I was testing the chips with told me it was an issue with the chips themselves. I think this is the reason why 8-way Opteron systems are very rare now.

  • by Anonymous Coward on Thursday September 30, 2010 @01:22PM (#33749376)

    OS/2's SMP support is a joke. I'm sure that somewhere in that tangle is a comment like "up to 1024 processors". But it's as relevant as a sticker on a Ford Cortina warning not to exceed the speed of sound.

    Officially the SMP version of OS/2 "Warp Server" supported 64 processors. In practice anything other than an embarrassingly parallel task would see rapidly diminishing returns after just a couple of CPUs. The stuff that this article is moaning about, that Linux doesn't do well enough on 48 CPUs? OS/2 doesn't even attempt it, the official docs just say to "avoid" such things. This test case on 48 CPUs on OS/2 would just leave the OS constantly thrashing trying to move pages from one CPU to another, and no work being done.

    Now maybe if OS/2 had been a huge success, and IBM were now the dominant OS vendor on the desktop, there'd be a 1024 CPU version of OS/2 today. But in our reality, where OS/2 support was gradually abandoned and handed over to an underfunded little independent outfit, it sucks on SMP.

  • by X0563511 ( 793323 ) on Thursday September 30, 2010 @01:25PM (#33749422) Homepage Journal

    I've seen longer stories about lamer things get published...

  • by Anonymous Coward on Thursday September 30, 2010 @01:27PM (#33749462)

    I'm trying to understand the point of this article. Do we really need a new paper to say that centralized memory bandwidth is at some point a limiting factor in an SMP environment? Isn't this why we have NUMA?

    If you want to go after Linux internals like the BKL, more power to you, but that horse left the stable a long, long time ago as well.

    You could talk about the software problems of dealing with decentralized memory access, synchronization, scalable algorithms, etc., but this is all likely something that needs to be addressed in application space rather than in the kernel, which is where this paper focuses.

    There is no shortage of huge single-system-image Linux systems with thousands of processor cores, and not a single one of them uses an SMP architecture. They are all NUMA-based (decentralized memory access).

  • Re:Only Linux? (Score:3, Informative)

    by wastedlife ( 1319259 ) on Thursday September 30, 2010 @01:27PM (#33749464) Homepage Journal

    They did not "rewrite the kernel" for 7. They updated the code, just like every other piece of software normally does when it moves from version to version. Rewriting the kernel implies that they tore it down and started over, which is most certainly not true. Vista/2008 is NT version 6.0, 7/2008 R2 is NT version 6.1, not a rewrite.

  • Patches available (Score:4, Informative)

    by diegocg ( 1680514 ) on Thursday September 30, 2010 @01:30PM (#33749512)

    So, they found scalability problems in some microbenchmarks. Well, some of the scalability problems cited in the paper [mit.edu] will be fixed when Nick Piggin's VFS scalability patchset gets merged. But it's not like you need to rewrite every operating system to scale beyond 48 cores; it's just the typical scalability stuff, and the kind of scalability issues found these days are mostly corner cases (Piggin's VFS work being an exception).

  • by aywwts4 ( 610966 ) on Thursday September 30, 2010 @01:30PM (#33749516)

    If it is any consolation, this straw is the one that broke the RSS feed's back.

    I have unsubscribed from Slashdot today due to the trend typified by your article vs. the one published. (No, this is not a new trend, but I'm fed up and finished with it.) See you on Reddit's Science/Linux/everything else.

  • by Anonymous Coward on Thursday September 30, 2010 @01:31PM (#33749526)

    Memory pages are reference counted. Some of the pages are shared and the cores spend a lot of time reference counting. There is a point where the reference counting overhead dominates. It is hypothesized that this could be fixed by enhancing the OS to isolate pages to physical cores thereby removing the need to reference count; this is a fundamental change to the structure of traditional virtual memory management.

  • by Todd Knarr ( 15451 ) on Thursday September 30, 2010 @01:31PM (#33749528) Homepage

    What they're saying is basically two things:

    First, there's a bottleneck in the on-chip caches. When a core's working on data it needs to have it in its cache. And if two cores are working on the same block of memory (block size being determined by cache line size), they need to keep their copies of the cache synchronized. When you get a lot of cores working on the same block of memory, the overhead of keeping the caches in sync starts to exceed the performance gains from the additional cores. That's not new; we've known that in multi-threaded programming for decades: when you've got a lot of threads dependent on the same data items, the locking overhead is going to be the killer. And we've known the solution for just as long: code to avoid lock contention. The easiest way is to make it so you don't have multiple threads (cores) working on the same (non-read-only) memory at the same time; that just requires some thinking on the part of the developers.

    Second, you only gain from additional cores if there's workload to spread to them usefully. If you've got 8 threads of execution actually running at any given time, you won't gain from having more than 8 cores. And on modern computers we often don't have more than a few threads actually using CPU time at any given moment. The rest are waiting on something, don't need the CPU and, as long as we aren't thrashing execution contexts too badly, can be ignored from a performance standpoint. To take advantage of truly large numbers of cores, we need to change the applications themselves to parallelize things more.

    But often applications aren't inherently multi-threaded. Games, yes. Computation, yes. But your average word processor or spreadsheet? It's 99% waiting on the human at the keyboard. You can do a few things in the background, file auto-save and such, but not enough to take advantage of a large number of cores. The things that really take advantage of lots of cores are things like Web servers, where you can assign each request to its own core. And no, browsers don't benefit the same way. On the client side there are (relatively) so few requests, and network I/O is so slow relative to CPU speed, that you can handle dozens of requests on a single core and still have cycles free, assuming you use an efficient I/O model. But it all boils down to the developers actually thinking about parallel programming, and I've noticed a lot of courses of study these days don't go into the brain-bending skull-sweat details of juggling large numbers of threads in parallel.
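
    To make the "don't have multiple cores writing the same memory" advice concrete, here is a minimal sketch in C (the array size, thread count, and file name are arbitrary choices for the example, and this is only one way to do it): each thread owns a disjoint slice of the data, so no lock is needed and no writable cache line is shared between cores.

        /* partition.c - illustrative sketch only: each thread writes a disjoint
         * slice of one array, so there is no lock contention and (apart from the
         * slice boundaries) no cache line is written by more than one core.
         * Build (assumed): gcc -O2 -pthread partition.c -o partition
         */
        #include <pthread.h>
        #include <stdio.h>

        #define N        (1 << 22)
        #define NTHREADS 8

        static double data[N];

        struct range { size_t begin, end; };

        static void *worker(void *arg)
        {
            struct range *r = arg;
            for (size_t i = r->begin; i < r->end; i++)
                data[i] = data[i] * data[i];   /* writes stay inside this slice */
            return NULL;
        }

        int main(void)
        {
            pthread_t tid[NTHREADS];
            struct range ranges[NTHREADS];
            size_t chunk = N / NTHREADS;

            for (size_t i = 0; i < N; i++)
                data[i] = (double)i;

            for (int t = 0; t < NTHREADS; t++) {
                ranges[t].begin = t * chunk;
                ranges[t].end   = (t == NTHREADS - 1) ? N : (t + 1) * chunk;
                pthread_create(&tid[t], NULL, worker, &ranges[t]);
            }
            for (int t = 0; t < NTHREADS; t++)
                pthread_join(tid[t], NULL);

            printf("data[123] = %.0f\n", data[123]);   /* 123 squared = 15129 */
            return 0;
        }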

  • by jd ( 1658 ) <imipak@yahoGINSBERGo.com minus poet> on Thursday September 30, 2010 @01:32PM (#33749556) Homepage Journal

    What they are talking about really reduces to a variant of Amdahl's Law, but simply put, scaling is always non-linear. There will be overheads per core for communication (which is why SMP over 16 CPUs is such a headache) and overheads per core within the OS for housekeeping (knowing what core a specific thread is running on, whether it is bound to that core, etc., and trying to schedule all threads to make the best use of the cores available).

    The more cores you have, the more state information is needed for a thread and the more possible permutations the scheduler must consider in order to be efficient. Which, in turn, means the scheduler is going to be bulkier.

    (Scheduling is a variant of the bin-packing problem, which is NP-complete, but with the added catch that you only get a very short time to do the packing, and scheduling policies - such as realtime and core-binding - must also be satisfied in addition to fitting all the threads in.)

    The more of this extra data you need, the slower task-switching becomes and the more of the cache you are hogging with stuff not actually tied to whatever the threads are actually doing. At some point, the degradation in performance will exactly equal the increase in performance from the extra cores. The claim is that this happens at 48 cores for modern OSes. This is plausible, but it is unclear if it is an actual problem. Those same OSes are used on supercomputers of 64+ cores by segregating the activities in each node. MOSIX, Kerrighed and other such mechanisms have allowed Linux kernels to migrate tasks from one node to another transparently. (i.e., you don't know or care where the code runs; the I/O doesn't change at all.) The only reason Linux doesn't have clustering as standard is that Linus is waiting for cluster developers to produce a standard mechanism for process migration that also fits within the architectural standards already in use.

    If you clustered a couple of hundred nodes, each with 48 cores, you're looking at having around 10,000 cores on the system. It wouldn't take a "rewrite" per se, merely a few hooks and a standard protocol. To support a single physical node with more than 48 cores, you might need to split it into virtual nodes with 48 or fewer cores in each, but Linux already has support for virtualization, so that's no big deal either.
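
    For a rough feel of that non-linearity, here is a tiny C program (an illustration only, not taken from the paper; it also ignores the per-core housekeeping overhead described above, which only makes things worse) that evaluates Amdahl's Law, S(n) = 1 / ((1 - p) + p/n). Even with 95% of the work parallelizable, 48 cores give only about a 14x speedup.

        /* amdahl.c - illustration only: Amdahl's Law speedup S(n) = 1/((1-p) + p/n)
         * for a workload whose fraction p can be run in parallel on n cores.
         */
        #include <stdio.h>

        static double speedup(double p, int n)
        {
            return 1.0 / ((1.0 - p) + p / n);
        }

        int main(void)
        {
            const double fractions[] = { 0.50, 0.90, 0.95, 0.99 };
            const int    cores[]     = { 2, 8, 16, 48, 1024 };

            for (size_t i = 0; i < sizeof fractions / sizeof fractions[0]; i++) {
                printf("p = %.2f:", fractions[i]);
                for (size_t j = 0; j < sizeof cores / sizeof cores[0]; j++)
                    printf("  %4d cores -> %6.2fx", cores[j], speedup(fractions[i], cores[j]));
                printf("\n");
            }
            return 0;
        }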

  • by jedidiah ( 1196 ) on Thursday September 30, 2010 @01:33PM (#33749572) Homepage

    An E10K is a glorified network computing cluster.

    It's not what's being discussed at all.

  • by diegocg ( 1680514 ) on Thursday September 30, 2010 @01:34PM (#33749578)

    Yes, you can play smooth full-screen video in Linux with the "Square" preview release [adobe.com] (which includes 64 bit support). Full-screen 720p video only uses 30-40% of the CPU on my crappy Intel graphics chip, and it's completely smooth.

  • by compudj ( 127499 ) on Thursday September 30, 2010 @01:36PM (#33749616) Homepage

    The K42 project [ibm.com] at IBM Research investigated the benefit of a complete OS rewrite with scalability to very large SMP systems in mind. It is an open source operating system supporting a Linux-compatible API and ABI.

    Their target systems back in 2003, "next-generation SMP systems", seem to have become the current generation of SMP/multi-core systems in the meantime.

  • Re:Only Linux? (Score:3, Informative)

    by tibman ( 623933 ) on Thursday September 30, 2010 @01:38PM (#33749654) Homepage

    The problem isn't scaling to that number of cores but the overhead in doing so. That's what I took from it.

  • by Unequivocal ( 155957 ) on Thursday September 30, 2010 @01:41PM (#33749718)

    I think specifically they are talking about having 48 cores behind an L2 cache, or 48 cores on a single die. Multi-CPU boxes generally communicate between CPU dies via the bus, and from what little I can gather, that helps reduce or eliminate the issue they're describing.

  • Re:OpenIndiana?? (Score:3, Informative)

    by h4rr4r ( 612664 ) on Thursday September 30, 2010 @01:44PM (#33749760)

    OpenSolaris is dead. Solaris sucks to use without a GNU userland anyway, and being sued by Oracle is no fun. Besides, you troll, this would not need a new Linux, just some small changes to the current one.

  • Tilera? (Score:3, Informative)

    by Anonymous Coward on Thursday September 30, 2010 @01:44PM (#33749766)

    Tilera Corp. already has a CPU architecture with 16-100 cores per chip.
    TILE-Gx family [tilera.com]

    Support for these is already being included in the mainline kernel.

  • by monkeySauce ( 562927 ) on Thursday September 30, 2010 @01:51PM (#33749892) Journal
    The article is about cores per chip, not cores per system.

    You're trying to compare a 48-cylinder engine with a bunch of 4-cylinder engines working together.
  • by Anonymous Coward on Thursday September 30, 2010 @01:54PM (#33749958)

    What the article is referring to is the number of cores PER SOCKET. Yes, you have some big computers with 48 cores across multiple sockets right now, but you do not have 48 cores in a single socket. I think the article is referring to a counter that Linux maintains per socket.

    The other trend to keep in mind is non-uniform memory access (NUMA). There is memory associated with each socket of a machine. It is more expensive to access memory on a different socket. To help with this, you try to keep memory accesses local. This is most likely why Linux would maintain a counter PER SOCKET, because that would keep all the memory accesses to the counter local.
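
    As a small illustration of "keep memory accesses local", here is a sketch using libnuma (this assumes libnuma is installed; the buffer size is arbitrary, and this is user-space placement policy, not what the kernel does internally for its own counters): it allocates a buffer on the NUMA node of the CPU the thread is currently running on.

        /* numa_local.c - illustrative sketch: allocate memory on the NUMA node
         * of the CPU we are currently running on, so accesses stay socket-local.
         * Build (assumed): gcc -O2 numa_local.c -lnuma -o numa_local
         */
        #define _GNU_SOURCE
        #include <numa.h>      /* libnuma: numa_available, numa_alloc_onnode, ... */
        #include <sched.h>     /* sched_getcpu */
        #include <stdio.h>
        #include <string.h>

        int main(void)
        {
            if (numa_available() < 0) {
                fprintf(stderr, "NUMA is not supported on this system\n");
                return 1;
            }

            int cpu  = sched_getcpu();           /* CPU this thread is on right now */
            int node = numa_node_of_cpu(cpu);    /* the socket/node that CPU lives on */
            if (node < 0)
                node = 0;                        /* fall back if the lookup fails */

            size_t sz  = 64UL * 1024 * 1024;
            void *buf = numa_alloc_onnode(sz, node);  /* memory local to that node */
            if (!buf) {
                fprintf(stderr, "numa_alloc_onnode failed\n");
                return 1;
            }

            memset(buf, 0, sz);    /* touch the pages so they are actually placed */
            printf("CPU %d, node %d: %zu bytes allocated node-locally\n", cpu, node, sz);

            numa_free(buf, sz);
            return 0;
        }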

  • by bberens ( 965711 ) on Thursday September 30, 2010 @01:56PM (#33750016)
    A CPU can contain multiple cores which share Level 2 cache. Conversely a multi-CPU system has multiple complete CPUs which do not share their L2 cache.
  • by Surt ( 22457 ) on Thursday September 30, 2010 @02:02PM (#33750114) Homepage Journal

    This is not Amdahl's law, this is the dispatcher being inefficient.

  • by jgagnon ( 1663075 ) on Thursday September 30, 2010 @02:09PM (#33750216)

    To elaborate slightly further... If you had two CPUs on your motherboard with 8 cores each and four threads of execution per core, you'd have a total of: 2 CPUs, 16 cores, and 64 threads of execution.

  • by Fallen Kell ( 165468 ) on Thursday September 30, 2010 @02:41PM (#33750702)
    I have 34 systems in the server room that already have 48 cores each. These are quad-socket systems with four AMD 12-core CPUs. So I call BS on the guys who think we have plenty of time, because there are plenty of people deploying these things already.
  • by bn557 ( 183935 ) on Thursday September 30, 2010 @02:44PM (#33750734) Homepage Journal

    Cores often share cache. Separate CPUs rarely do. The problem in this case is, when you approach 48 Cores in 1 CPU, the accounting task for the cache users starts growing out of proportion to the performance gain from adding cores.

  • by mitgib ( 1156957 ) on Thursday September 30, 2010 @02:58PM (#33750926) Homepage Journal
    You can have 48 cores today with a Quad G34 [supermicro.com] motherboard.
  • by dAzED1 ( 33635 ) on Thursday September 30, 2010 @03:01PM (#33750974) Journal

    And yet... that's irrelevant, because, as many people have pointed out, the problem is cores that share L2 cache. There have been large systems with many, many processors for a long time, some of which run Linux. The problem being described is 48 cores on a single die, sharing the same cache. Sun's die-to-die tech isn't relevant to this problem, nor is putting more than six 8-core CPUs in a single system.

  • by mlts ( 1038732 ) * on Thursday September 30, 2010 @03:04PM (#33751010)

    I saw earlier today on another news site a post about something similar saying that no OS commercially made can support more than 32 cores.

    One of the followup postings was someone with an IBM 780 doing a prtconf|grep proc and showing 64 virtual processors on an LPAR. AIX supports up to 256 CPUs (physical or virtual.) I'm sure Solaris can do similar without breaking a sweat.

  • by dgatwood ( 11270 ) on Thursday September 30, 2010 @03:16PM (#33751174) Homepage Journal

    Well, gcc is likely to keep being the world's de facto C compiler (though even this was mainly because of the egcs fork way back when).

    Actually, I doubt that is true. At this point, the commercial UNIX vendors and the BSDs seem to be putting their weight behind Clang/LLVM/LLDB, in large part due to GCC going GPLv3. In addition to being a cleaner architecture that's easier to enhance than GCC, it is also faster, and it often produces much better code as well. The GNU toolchain's days as the de facto standard are numbered, IMHO.

    Back on topic, it occurs to me that large clusters with hundreds of cores start to inherently behave a lot more like NUMA and really need to be treated that way. Note that lots of modern OSes, including Linux, have supported NUMA for a long time, so suggesting that this requires a completely rewritten OS is a preposterous assertion. That's not at all what this article is saying. What this article is saying is that tasks often are not easily divisible into pieces small enough to take advantage of multiple cores, and that managing processor affinity to ensure that threads working on the same data run on cores within the same physical die starts to become an unmanageable problem past a certain point.

    In effect, what it is saying is that, barring interconnect improvements, for many classes of problems the performance penalty caused by multiple cores needing to access the same data exceeds the performance gain from adding additional cores at or around 48 cores. No OS change will help this, and in many cases no software changes can help this, either. Most computing tasks are simply not massively parallelizable. This conclusion should be entirely expected by anybody who has ever tried to parallelize software to any real degree, but it's always good to see studies that bear it out.

    Put another way, once you exceed about 48 cores, the cores start to act more like clusters than cores. You start to see more and more accesses in which one CPU has to force data out of another CPU's cache. The nonuniformity of memory accesses starts to dominate the access times. Thus, past about that point (and probably much lower for most problems), adding more cores no longer improves performance. Even for massively parallelizable problems like video compression, once you exceed a certain number of nodes doing the work, the time spent assembling the final data actually exceeds the performance win achieved by adding additional processing nodes. This is completely straightforward, completely understood by real-world computer programmers, and shouldn't really be a surprise to anyone.

    I'm not convinced an OS change can fix this, nor even an architectural change, though both can help to some degree by making parallelization easier (e.g. by providing APIs for supporting work units arranged in a dependency graph like GCD as an alternative to raw thread-based APIs). At some point, though, you're bounded by the number of distinct pieces that a problem can be divided into that don't depend on the output of any other piece, and once you hit that limit, adding additional computational units can only hinder performance, not help it. Your only real choices, then, are to find new and interesting ways to refactor the problem so that this is no longer the case, to change the structure of the input data to remove dependencies, to increase the speed of the individual CPU cores, or to turn the machines loose processing more than one problem at any given time to keep the remaining cores occupied.

    Oh, yeah, and there's one other change that helps a lot: keep your read-only data in read-only pages, and write your code so that results go somewhere else. Read-only pages can be cached in every CPU without any real cache coherency overhead, at least in theory (I'm assuming that most modern CPUs do this), which means that sharing of input data between CPUs doesn't matter. This design, combined with lockless work-unit APIs, can make a huge difference in how many CPU cores you can usefully keep busy.
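
    To make the processor-affinity point concrete, here is a minimal sketch in C (the choice of cores 0-5 as "the same die" is invented for the example; real code would read the topology from /sys or a library rather than hard-coding it):

        /* affinity.c - illustrative sketch only: pin the calling thread to cores
         * 0-5, on the made-up assumption that those six cores share a die/L2,
         * so threads working on the same data keep hitting the same caches.
         * Build (assumed): gcc -O2 -pthread affinity.c -o affinity
         */
        #define _GNU_SOURCE
        #include <pthread.h>
        #include <sched.h>
        #include <stdio.h>
        #include <string.h>

        int main(void)
        {
            cpu_set_t set;
            CPU_ZERO(&set);
            for (int core = 0; core < 6; core++)   /* hypothetical same-die cores */
                CPU_SET(core, &set);

            int err = pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
            if (err != 0) {
                fprintf(stderr, "pthread_setaffinity_np: %s\n", strerror(err));
                return 1;
            }

            printf("now restricted to cores 0-5, currently on CPU %d\n", sched_getcpu());
            return 0;
        }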

  • by DrgnDancer ( 137700 ) on Thursday September 30, 2010 @03:24PM (#33751298) Homepage

    SGI runs single-system-image Linux systems with over 1000 cores; that's not the problem. If you read the article, it seems they aren't talking about the number of cores in the system, they're talking about the number of cores on a chip. Multicore chips use shared caches, and the problem is that the algorithms used to handle CPU caching don't scale to really huge numbers of cores sharing the cache in a single chip. Having 4x16-core chips will work fine; having a single 64-core chip will present difficulties. At least that's how I understand the article.

  • by joib ( 70841 ) on Thursday September 30, 2010 @04:35PM (#33752388)

    Unfortunately, the summary as well as the short articles on the web were more or less completely missing the point. The actual paper ( http://pdos.csail.mit.edu/papers/linux:osdi10.pdf [mit.edu] ) explains what was done.

    Essentially they benchmarked a number of applications, figured out where the bottlenecks were, and fixed them. Some of the things they fixed were done by introducing "sloppy counters" in order to avoid updating a global counter. Others were switching to more fine-grained locking, switching to per-CPU data structures, and so forth. In other words, pretty standard kernel scalability work. As an aside, a lot of the VFS scalability work seems to clash with the VFS scalability patches by Nick Piggin that are in the process of being integrated into the mainline kernel.

    And yes, as the PDF article explains, the Linux cpu scheduler mostly works per-core, with only occasional communication with schedulers on other cores.
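
    As an illustration of the "more fine-grained locking" part (a user-space sketch with made-up names, not the kernel's VFS code): instead of one global lock around a whole hash table, each bucket gets its own lock, so threads touching different buckets never contend.

        /* buckets.c - illustrative sketch of fine-grained (per-bucket) locking.
         * Build (assumed): gcc -O2 -pthread buckets.c -o buckets
         */
        #include <pthread.h>
        #include <stdio.h>
        #include <stdlib.h>

        #define NBUCKETS 256

        struct node {
            long key, value;
            struct node *next;
        };

        struct bucket {
            pthread_mutex_t lock;   /* one lock per bucket, not one per table */
            struct node *head;
        };

        static struct bucket table[NBUCKETS];

        static void table_init(void)
        {
            for (int i = 0; i < NBUCKETS; i++) {
                pthread_mutex_init(&table[i].lock, NULL);
                table[i].head = NULL;
            }
        }

        static void table_put(long key, long value)
        {
            struct bucket *b = &table[(unsigned long)key % NBUCKETS];
            struct node *n = malloc(sizeof(*n));
            if (!n)
                abort();
            n->key = key;
            n->value = value;

            pthread_mutex_lock(&b->lock);   /* contention is limited to this bucket */
            n->next = b->head;
            b->head = n;
            pthread_mutex_unlock(&b->lock);
        }

        static int table_get(long key, long *value)
        {
            struct bucket *b = &table[(unsigned long)key % NBUCKETS];
            int found = 0;

            pthread_mutex_lock(&b->lock);
            for (struct node *n = b->head; n; n = n->next) {
                if (n->key == key) {
                    *value = n->value;
                    found = 1;
                    break;
                }
            }
            pthread_mutex_unlock(&b->lock);
            return found;
        }

        int main(void)
        {
            table_init();
            table_put(42, 4242);
            long v;
            if (table_get(42, &v))
                printf("42 -> %ld\n", v);
            return 0;
        }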

  • by im_thatoneguy ( 819432 ) on Thursday September 30, 2010 @05:17PM (#33753028)

    The original summary was lacking but the alternative proposed summary was WAY too long.

    It's just supposed to pique my interest enough to read the article, not run several pages.

  • by walshy007 ( 906710 ) on Friday October 01, 2010 @03:00AM (#33756354)

    The point is that the article is dealing with a simulated, theoretical CPU with 48+ cores on a single die with a shared L2 cache.

    The changes needed are incremental, and I imagine they will be dealt with long before this actually becomes an issue, when (or if) we get CPUs with that many cores on a single die.

    Multi-socket systems are already immune to this the way it is set up; you could have an 8-socket system with each CPU having 8 cores and it would not show the problems described in the article.

    In other words, business as usual: the kernel gets optimized for hardware that actually exists or will exist in the near future. 48-core single CPUs are a few years away, and the changes to accommodate them don't require anything significant, so I'm sure it will be dealt with at the time.

  • by wastedlife ( 1319259 ) on Friday October 01, 2010 @03:15PM (#33763656) Homepage Journal

    While NT was originally supposed to be called OS/2 3.0, it was a new OS developed by Cutler and some other devs from DEC, not continued development of the OS/2 code.
