Windows and Linux Not Well Prepared For Multicore Chips

Posted by timothy
from the until-that-invisible-hand-flexes dept.
Mike Chapman points out this InfoWorld article, according to which you shouldn't immediately expect much in the way of performance gains from Windows 7 (or Linux) from eight-core chips that come out from Intel this year. "For systems going beyond quad-core chips, the performance may actually drop beyond quad-core chips. Why? Windows and Linux aren't designed for PCs beyond quad-core chips, and programmers are to blame for that. Developers still write programs for single-core chips and need the tools necessary to break up tasks over multiple cores. Problem? The development tools aren't available and research is only starting."

  • by davecb (6526) * <davec-b@rogers.com> on Sunday March 22, 2009 @02:40PM (#27290145) Homepage Journal

    Firstly, it's false on the face of it: Ubuntu is certified on the Sun T2000, a 32-thread machine, and Canonical is supporting it.

    Secondly, it's the same FUD as we heard from uniprocessor manufacturers when multiprocessors first came out: this new "symmetrical multiprocessing" stuff will never work, it'll bottleneck on locks.

    The real problem is that some programs are indeed badly written. In most cases, you just run lots of individual instances of them. Others, for grid, are well-written, and scale wonderfully.

    The ones in the middle are the problem, as they need to coordinate to some degree, and don't do that well. It's a research area in computer science, and one of the interesting areas is in transactional memory.

    That's what the folks at the Multicore Expo are worried about: Linux itself is fine, and has been for a while.

    --dave

  • Grand Central (Score:4, Informative)

    by tepples (727027) <tepples AT gmail DOT com> on Sunday March 22, 2009 @02:42PM (#27290197) Homepage Journal
    Anonymous Coward wrote:

    get a mac..

    I assume you're talking about Mac OS X 10.6 (Snow Leopard), whose Grand Central framework [wikipedia.org] is supposed to add some tools to make Mac-exclusive multithreaded apps easier to program.

  • by tepples (727027) <tepples AT gmail DOT com> on Sunday March 22, 2009 @02:45PM (#27290237) Homepage Journal

    imagine software being developed for imaginary or speculatory hardware.

    I think Sun called it "Java". It was run on emulators [wikipedia.org] long before ARM and others came out with hardware-assisted JVMs such as Jazelle [wikipedia.org].

  • by Troy Baer (1395) on Sunday March 22, 2009 @02:57PM (#27290411) Homepage

    The /. summary of TFA is almost exquisitely bad. It's not Windows or Linux that's not ready for multicore (as both have supported multi-processor machines for on the order of a decade or more), but rather the userspace applications that aren't ready. The reason is simple: Parallel programming is rather hard, and historically most ISVs haven't wanted to invest in it because they could rely on the processors getting faster every year or two... but no longer.

    One area where I disagree with TFA is the claimed paucity of programming models and tools. Virtually every OS out there supports some kind of concurrent programming model, and often more than one depending on what language is used -- pthreads [wikipedia.org], Win32 threads, Java threads, OpenMP [openmp.org], MPI [mpi-forum.org] or Global Arrays [pnl.gov] on the high end, etc. Most debuggers (even gdb) also support debugging threaded programs, and if those don't have enough heft, there's always Totalview [totalview.com]. The problem is that most ISVs have studiously avoided using any of these except when given no other choice.

    --t

  • by tepples (727027) <tepples AT gmail DOT com> on Sunday March 22, 2009 @03:00PM (#27290439) Homepage Journal

    Most programs barely use any computational power; in fact, there are very few programs that require all that computing power to operate, and those are certainly well designed.

    Home users do use some apps that could benefit from multiple cores. Video encoding is one of them, but that one is embarrassingly parallel because the encoder could just split the video into quadrants and have each of four cores work on one quadrant.
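The quadrant idea in the comment above can be sketched in a few lines. This is only a toy under stated assumptions: `split_into_quadrants`, `process_quadrant`, and `process_frame` are hypothetical names, the "frame" is a plain list of pixel rows, and the per-pixel halving stands in for real encoding work. It shows the case where regions truly are independent; as a later reply in this thread argues, real codecs have dependencies that cross region and frame boundaries.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_quadrants(frame):
    """Return the four quadrants of a frame given as a list of pixel rows."""
    mid_row = len(frame) // 2
    mid_col = len(frame[0]) // 2
    top, bottom = frame[:mid_row], frame[mid_row:]
    return [
        [row[:mid_col] for row in top],     # top-left
        [row[mid_col:] for row in top],     # top-right
        [row[:mid_col] for row in bottom],  # bottom-left
        [row[mid_col:] for row in bottom],  # bottom-right
    ]


def process_quadrant(quadrant):
    """Stand-in for per-region work (an assumption): halve every pixel value."""
    return [[pixel // 2 for pixel in row] for row in quadrant]


def process_frame(frame, workers=4):
    """Process each quadrant in its own worker, then stitch the frame back."""
    quadrants = split_into_quadrants(frame)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        tl, tr, bl, br = pool.map(process_quadrant, quadrants)
    top = [left + right for left, right in zip(tl, tr)]
    bottom = [left + right for left, right in zip(bl, br)]
    return top + bottom
```

Note that CPython threads do not run CPU-bound Python code on several cores at once; for real speedup the pool would likely be swapped for `ProcessPoolExecutor` from the same module. The structure of the split and merge stays the same either way.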

  • by 0123456 (636235) on Sunday March 22, 2009 @03:03PM (#27290467)

    Did you ever follow the Occam language? It seemed to have parallelization intrinsic, but it never went anywhere.

    Occam was heavily tied into the Transputer, and without the transputer's hardware support for message-passing, it's a bit of a non-starter.

    It also wasn't easy to write if you couldn't break your application down into a series of simple processes passing messages to each other. I suspect it would go down better today now people are used to writing object-oriented code, which is a much better match to the message-passing idea than the C code that was more common at the time.

  • Re:BeOS (Score:3, Informative)

    by yakumo.unr (833476) on Sunday March 22, 2009 @03:07PM (#27290515) Homepage
    So you missed Zeta then ? http://www.zeta-os.com/cms/news.php [zeta-os.com] (change to English via the dropdown on the left)
  • Re:Adapt (Score:5, Informative)

    by Dolda2000 (759023) <fredrik@dolPASCA ... m minus language> on Sunday March 22, 2009 @03:28PM (#27290731) Homepage

    Since the normal OoO parallelization mechanisms don't scale well enough

    It hit me that this probably wasn't obvious to everyone, so just to clarify: "OoO", here, stands not for Object-Oriented Something, but for Out-of-Order [wikipedia.org], as in how current, superscalar CPUs work. See also Dataflow architecture [wikipedia.org].

  • Re:That's a big leap (Score:3, Informative)

    by phantomfive (622387) on Sunday March 22, 2009 @03:45PM (#27290917) Journal

    So there's still a big need for understanding how to take advantage of more cores - because chips aren't really getting faster these days so much as more cores are being added.

    OK, so we can go into more detail. For most programs, parallelization will do essentially nothing. There are a few programs that can benefit from it, as you've mentioned. But those programs are already taking advantage of them: not only do video encoding programs use multiple cores, some can even farm the process out over multiple systems. So it isn't a matter of programmers being lazy, or tools not being available; it's that in most cases, multiple cores won't make a difference. If you run Windows, open the task manager and check how often the CPU is completely occupied. Rarely.

    JavaScript is an interesting example, because in the last few months we've had something of a competition between browser makers to see who could get the fastest JavaScript. Now, I'm not going to go read through the changelogs, but I'm willing to bet that the biggest speed ups haven't been from making it multi-threaded, but rather from standard optimization techniques. Basically they went through with a profiler, found what the bottlenecks were, and tried to remove them. This is the normal way to optimize your program. If it happens to turn out that the bottleneck is a bunch of things waiting to use the processor while there is another one available, then you start thinking about making it multi-threaded. If not, then making it multi-threaded will gain you nothing as far as performance.

  • Re:Adapt (Score:5, Informative)

    by Cassini2 (956052) on Sunday March 22, 2009 @03:53PM (#27291005)

    HP's/Intel's EPIC idea (which is now Itanium) wasn't stupid, but it has a hard limitation on how far it scales (currently four instructions simultaneously). I don't have a final solution quite yet (though I am working on it as a thought project), but the problem we need to solve is getting a new instruction set which is inherently capable of parallel operation, not on adding more cores and pushing the responsibility onto the programmers for multi-threading their programs.

    The problem with very long instruction word (VLIW) architectures like the EPIC and the Itanium is that the main speed limitations in today's computers are bandwidth and latency. Memory bandwidth and latency can be the dominant performance driver in a modern processor. At a system level, network, I/O (particularly for the video), and hard drive bandwidth and latency can dramatically affect system performance.

    With a VLIW processor, you are taking many small instruction words, and gathering them together into a smaller number of much larger instruction words. This never pays off. Essentially, it is impossible to always use all of the larger instruction words. Even with a normal super-scalar processor, it is almost impossible to get every functional unit on the chip to do something simultaneously. The same problem applies with VLIW processors. Most of the time, a program is only exercising a specific area of the chip. With VLIW, this means that many bits in the instruction word will go unused much of the time.

    In and of itself, wasting bits in an instruction word isn't a big deal. Modern processors can move large amounts of memory simultaneously, and it is handy to be able to link different sections of the instruction word to independent functional blocks inside the processor. The problem is the longer instruction words use memory bandwidth every time they are read. Worse, the longer instruction words take up more space in the processor's cache memory. This either requires a larger cache, increasing the processor cost, or it increases latency, as it translates into fewer cache hits. It is no accident the Itanium is both expensive and has an unusually large on-chip cache.

    The other major downfall of the VLIW architecture is that it cannot emulate a short instruction word processor quickly. This is a problem both for interpreters and for 80x86 emulation. Interpreters are a very popular application paradigm. Many applications contain them. Certain languages, like .NET and Java, use pseudo-interpreters/compilers. 80x86 emulation is a big deal, as the majority of the world's software is written for an 80x86 platform, which features a complex variable length instruction word. The long VLIW instructions are unable to decode either the short 80x86 instructions, or the Java JIT instruction set, quickly. Realistically, a VLIW instruction processor will be no quicker, on a per instruction basis, than an 80x86 processor, despite the fact the VLIW architecture is designed to execute 4 instructions simultaneously.

    The memory bandwidth problem, and the fact that VLIW processors don't lend themselves to interpreters, really slows down the usefulness of the platform.

  • by evilviper (135110) on Sunday March 22, 2009 @04:28PM (#27291387) Journal

    Video encoding is one of them, but that one is embarrassingly parallel

    This is most certainly not true. While many video codecs have been multi-threading enabled, they always do so at a significant quality reduction.

    because the encoder could just split the video into quadrants and have each of four cores work on one quadrant.

    Many features of H.264 (like GMC) require a whole frame, not a quadrant. In practically all lossy video codecs, motion vectors have to be computed as the differential from the previous. And there are endless other examples. Of course there's little point in going into it, because the next time video encoding comes up on /., dozens of other people will make the exact same uninformed statements...

    Just go visit the x264 mailing list and ask the developers why they stopped using slice-based encoding for multithreaded encoding...

    I used to recommend splitting a 2-hour video into four 30-minute parts and feeding each to a single-threaded encoder.

    That would only make ANY sense with fixed bitrate encoding. It can possibly be used in the second-pass of multipass encoding, but that's not trivial to do by any stretch.

  • Re:BeOS (Score:1, Informative)

    by Anonymous Coward on Sunday March 22, 2009 @04:29PM (#27291389)

    or Haiku?
    http://www.haiku-os.org/

  • Re:BeOS (Score:3, Informative)

    by b4dc0d3r (1268512) on Sunday March 22, 2009 @04:35PM (#27291457)

    Looks dead to me, a year ago they posted this:

    With immediate effect, magnussoft Deutschland GmbH has stopped the distribution of magnussoft Zeta 1.21 and magnussoft Zeta 1.5. According to the statement of Access Co. Ltd., neither yellowTAB GmbH nor magnussoft Deutschland GmbH are authorized to distribute Zeta.

    http://www.bitsofnews.com/content/view/5498/44/ [bitsofnews.com]

  • Re:That's a big leap (Score:5, Informative)

    by davecb (6526) * <davec-b@rogers.com> on Sunday March 22, 2009 @04:44PM (#27291551) Homepage Journal

    And if you look at a level lower than the profiler, you find your programs are memory-bound, and getting worse. That's a big part of the push toward multithreaded processors.

    To paraphrase another commentator, they make process switches infinitely fast, so one can keep on using the ALU while your old thread is twiddling its thumbs waiting for a cache-line fill.

    --dave

  • Say what ? (Score:4, Informative)

    by Space cowboy (13680) * on Sunday March 22, 2009 @04:52PM (#27291639) Journal

    Apple have no 2 core intel systems. Period.

    Even the lowly Mac mini is a dual-core system. Every laptop is a dual-core system. The Mac Pro is either 4-core (with hyperthreading for a virtual 8-core) or 8-core (with hyperthreading for a virtual 16-core) system.

    "Better to keep silent and look the fool, rather than speak and remove all doubt"

    Simon.

  • by johannesg (664142) on Sunday March 22, 2009 @04:55PM (#27291661)

    There's not even a way in the C or C++ core language to start a new thread. And with many different third party libraries, there'll never be a reliable standard way to do it.

    Never? A standard, reliable way to do it will be part of C++0x - so that's hardly "never"...

  • Re:Adapt (Score:5, Informative)

    by TheRaven64 (641858) on Sunday March 22, 2009 @05:56PM (#27292287) Journal

    This is simply not true. Assuming both cores are fully loaded, which is the best possible case for dual core, then they will still be performing context switches at the same rate as a single chip if you are running more than one process per core. Even if you had the perfect theoretical case for two cores, where you have two independent processes and never context switch, you could run them much faster on the single-core machine. A single-core 5GHz CPU would have to waste 20% of its time on context switching to be slower than a dual-core 2GHz CPU, while a real CPU will spend less than 1% (and even on the dual-core CPU, most of the time your kernel will be preempting the process every 10ms, checking if anything else needs to run, and then scheduling it again, so you don't save much).

    The only way the dual core processor would be faster in your example would be if it had more cache than the 5GHz CPU and the working set for your programs fitted into the cache on the dual-core 2GHz chip but not on the 5GHz one, but that's completely independent of the number of cores.
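The 20% figure in the comment above follows from simple arithmetic, sketched here under a loud assumption: throughput is taken to scale linearly with clock speed and core count, ignoring cache size, memory bandwidth, and instructions per cycle. The function name `break_even_overhead` is hypothetical.

```python
def break_even_overhead(single_ghz, multi_ghz, cores):
    """Fraction of time a single fast core could waste and still match
    the aggregate clock of several slower cores.

    Assumes (simplistically) that useful work is proportional to GHz.
    """
    aggregate_ghz = multi_ghz * cores
    return 1 - aggregate_ghz / single_ghz


# A 5 GHz single core versus two 2 GHz cores (4 GHz aggregate):
overhead = break_even_overhead(single_ghz=5.0, multi_ghz=2.0, cores=2)
print(f"break-even overhead: {overhead:.0%}")
```

With the comment's numbers, the single core must lose one fifth of its cycles before the dual-core chip pulls ahead, which is far more than the sub-1% context-switch cost the comment cites.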

  • Re:Adapt (Score:3, Informative)

    by TheRaven64 (641858) on Sunday March 22, 2009 @06:07PM (#27292403) Journal

    Erlang, as mentioned elsewhere, is a great example of a high level functional language which parallelizes much better than C/C++,

    No it isn't. Erlang gains absolutely no benefit in terms of parallelism from being a functional language. All of the concurrency of Erlang comes from the CSP [wikipedia.org] model, while functional languages get theirs via an extension to the lambda calculus.

    The one relevant feature of Erlang when talking about functional languages is that it does not allow mutable data other than the process dictionary. If you want to write parallel code in any language, there is one golden rule you should follow:

    No data shall be both mutable and aliased.

    In Erlang, this is enforced for you; the only mutable data structure is the process dictionary. In functional languages, this is typically handled via something like monads. There is nothing stopping you from enforcing this constraint in an imperative language, however, and if you follow this simple rule then concurrent programming is easy.
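The rule above can be followed by convention in an imperative language. The sketch below is one possible reading, not Erlang's actual mechanism: each worker receives its own immutable tuple, and the only shared object is a thread-safe queue used as a message channel. The names `worker` and `parallel_sum_of_squares` are hypothetical.

```python
import queue
import threading


def worker(chunk, outbox):
    """Compute on a private, immutable chunk and send back one message."""
    # chunk is a tuple: it cannot be mutated, so sharing it would be harmless.
    outbox.put(sum(x * x for x in chunk))


def parallel_sum_of_squares(values, workers=4):
    """Sum the squares of values using workers that share no mutable data."""
    data = tuple(values)                      # freeze the input
    outbox = queue.Queue()                    # synchronized message channel
    chunks = [data[i::workers] for i in range(workers)]
    threads = [
        threading.Thread(target=worker, args=(chunk, outbox))
        for chunk in chunks
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    # One result message per worker; arrival order does not matter for a sum.
    return sum(outbox.get() for _ in threads)
```

Because no worker can observe another worker's writes, there is nothing to lock; the queue is the single point of coordination, playing roughly the role that message passing plays in the CSP model mentioned above.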

  • Re:This is incorrect (Score:4, Informative)

    by Todd Knarr (15451) on Sunday March 22, 2009 @06:22PM (#27292541) Homepage

    Unix didn't for a long time have lightweight preemptive threads because it had, from the very beginning, lightweight preemptive processes. I spent a lot of time wondering why Windows programmers were harping on the need for threads to do what I'd been doing for a decade with a simple fork() call. And in fact if you look at the Linux implementation, there are no threads. A thread is simply a process that happens to share memory, file descriptors and such with its parent, and that has some games played with the process ID so it appears to have the same PID as its parent. Nothing new there, I was doing that on BSD Unix back in '85 or so (minus the PID games).

    That was, in fact, one of the things that distinguished Unix from VAX/VMS (which was in a real sense the predecessor to Windows NT, the principal architect of VMS had a big hand in the architecture and internals of NT): On VMS process creation was a massive, time-consuming thing you didn't want to do often, while on Unix process creation was fast and fairly trivial. Unix people scratched their heads at the amount of work VMS people put into keeping everything in a single process, while VMS people boggled at the idea of a program forking off 20 processes to handle things in parallel.
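The "fork off a handful of processes" style described above might look like the following minimal sketch. It assumes a Unix-like system, since `os.fork` is Python's wrapper over the fork() call and is not available on Windows. The function name `run_in_children` and the toy job (summing a range) are hypothetical stand-ins for real work.

```python
import os


def run_in_children(jobs):
    """Fork one child per job; each child reports its result over a pipe."""
    children = []
    for job in jobs:
        read_fd, write_fd = os.pipe()
        pid = os.fork()
        if pid == 0:
            # Child process: do the work, write the answer, exit immediately.
            try:
                os.close(read_fd)
                result = sum(range(job))  # toy workload (an assumption)
                os.write(write_fd, str(result).encode())
                os.close(write_fd)
            finally:
                os._exit(0)  # never fall through into the parent's code
        # Parent process: keep only the read end of this child's pipe.
        os.close(write_fd)
        children.append((pid, read_fd))

    results = []
    for pid, read_fd in children:
        with os.fdopen(read_fd) as pipe:
            results.append(int(pipe.read()))  # read until the child closes
        os.waitpid(pid, 0)  # reap the child so it does not linger as a zombie
    return results
```

All children are started before any result is read, so they run concurrently and the kernel is free to schedule them on separate cores. The pipes are the only communication path, which keeps each child's memory private.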

  • by Anonymous Coward on Sunday March 22, 2009 @07:13PM (#27292951)

    So what the author is blathering and foaming about are problems found and solved 20+ years ago. Instead of programmers studying anything, the author should study some. NUMA has been in Linux for close to 10 years. It solves the memory bus problem. Multi-threaded applications solve the problem of using more than 1 core. I do it all the time. Did it yesterday, will likely do it tomorrow. Not every program takes advantage of multiple cores. Quite a few do. Those that scream the need for parallel computing use all of the cores (on my Nehalem system it shows up as 8 cores). I do wish authors would do the tiniest squeak of research before describing how the world is going to end. Oh well.

  • Re:Say what ? (Score:3, Informative)

    by Space cowboy (13680) * on Sunday March 22, 2009 @07:17PM (#27292997) Journal

    Gaah - the < was swallowed in the statement "Apple have no <2 core intel systems. Period."

    Probably obvious, but to save people nit-picking

  • Re:Adapt (Score:3, Informative)

    by TapeCutter (624760) * on Sunday March 22, 2009 @09:26PM (#27293939) Journal
    "a game renders one map at a time because it's pointless to render other maps until the player made his gameplay decisions and arrived there"

    Rendering is perfect for parallel processing, sure you only want one map at a time but each core can render part of the map independently from other parts of the map.
  • Re:Adapt (Score:3, Informative)

    by SL Baur (19540) <steve@xemacs.org> on Sunday March 22, 2009 @11:21PM (#27294555) Homepage Journal

    Short answer: only one thing I mentioned involved disk I/O, RAM is cheap.

    Not in modern architectures and it depends. Registers are faster than L1 caches. L1 caches are faster than L2 caches, etc.

    See: http://lwn.net/Articles/250967/ [lwn.net] for an excellent discussion about how one can dramatically speed up applications by optimizing memory access.

    And I disagree with the title of this thread - Linux (the kernel at least) is quite well prepared for multicore chips.

  • by Anonymous Coward on Monday March 23, 2009 @04:27AM (#27295709)

    XMOS have been experimenting in this area already. Their language which is an extension to C supports code for parallel processing on multi core chips. See http://www.xmos.com/

  • by Anonymous Coward on Monday March 23, 2009 @04:49AM (#27295787)

    If we are talking about technology... The Linux operating system (the monolithic kernel is the operating system) works great on CPUs that have more than 4 cores. In case the article's writer did not know, the Linux OS powers almost all supercomputers etc. The problem is that applications ain't developed to use so many threads etc. The OS works just fine, but if the applications cannot use multiple threads, you do not gain anything unless you run multiple instances of them.

    If we are talking about marketing lies and misinformation, the "operating system" (actually a _software system_) does not work at all, because usually this "operating system" cannot use the multicore CPUs well. Who should we blame?

    Seriously, Linux just works on multicore CPUs, but that is just an operating system. The software systems like Ubuntu, Fedora and Mandriva just ain't working so well.

  • Processing Power?? (Score:1, Informative)

    by Anonymous Coward on Monday March 23, 2009 @08:50AM (#27297377)

    Is this really a concern? How many people are tapping out their CPU? Honestly, 95% of people will never actively use more than 75% of their dual core 2.0 GHz CPU. RAM has been and will be the limiting factor on most PCs.

    Also, on the redhat servers I admin, we don't seem to have much trouble with 4x4 CPU's. Are people really saying there is a difference between 16 procs and 4 quad cores? As the OS sees them...

    Odd to even be concerned...
