
Windows and Linux Not Well Prepared For Multicore Chips 626

Posted by timothy
from the until-that-invisible-hand-flexes dept.
Mike Chapman points out this InfoWorld article, according to which you shouldn't immediately expect much in the way of performance gains in Windows 7 (or Linux) from the eight-core chips due from Intel this year. "For systems going beyond quad-core chips, performance may actually drop. Why? Windows and Linux aren't designed for PCs beyond quad-core chips, and programmers are to blame for that. Developers still write programs for single-core chips and need tools to break up tasks over multiple cores. Problem? The development tools aren't available, and research is only starting."
  • Adapt (Score:3, Funny)

    by Dyinobal (1427207) on Sunday March 22, 2009 @02:33PM (#27290067)
    Give us a year maybe two.
    • Re:Adapt (Score:5, Interesting)

      by Dolda2000 (759023) <fredrik@dolPASCA ... m minus language> on Sunday March 22, 2009 @03:05PM (#27290487) Homepage

      No, it's not about adaptation. The whole approach currently taken is completely, outright on-its-head wrong.

      To begin with, I don't believe the article about the systems being badly prepared. I can't speak for Windows, but I know for sure that Linux is capable of far heavier SMP operation than 4 CPUs.

      But more importantly, many programming tasks simply aren't meaningful to break up into units of such coarse granularity as OS-level threads. Many programs would benefit from being able to run just some small operations (like iterations of a loop) in parallel, but the synchronization work required to wake even a pooled thread to do such a thing would greatly exceed the benefit of it.

      People just think about this the wrong way. Let me re-present the problem for you: CPU manufacturers have been finding it harder to scale the clock frequencies of CPUs higher, and therefore they start adding more functional units to CPUs to do more work per cycle instead. Since the normal OoO parallelization mechanisms don't scale well enough (probably for the same reasons people couldn't get data-flow architectures working at large scales back in the 80's), they add more cores instead.

      The problem this gives rise to, as I stated above, is that the unit of parallelism gained by more CPUs is too large for the very small units of work that actually exist to be divided among them. What is needed, I would argue, is a way to parallelize instructions in the instruction set itself. HP's/Intel's EPIC idea (which is now Itanium) wasn't stupid, but it has a hard limit on how far it scales (currently four instructions simultaneously).

      I don't have a final solution quite yet (though I am working on it as a thought project), but the problem we need to solve is getting a new instruction set that is inherently capable of parallel operation, not adding more cores and pushing the responsibility for multi-threading onto the programmers. This is the kind of thing the compiler could do just fine (even the compilers that exist currently -- GCC's SSA representation of programs, for example, is excellent for these kinds of things), by isolating parts of the code in which there are no dependencies in the data-flow, and which could therefore run in parallel, but compilers need support in the instruction set to be able to specify such things.
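      The kind of data-flow analysis described above can be sketched in miniature. The toy Python script below is purely illustrative -- this is not GCC's SSA pass, and `schedule_waves` is a made-up helper -- but it groups straight-line pseudo-instructions into "waves" with no data-flow dependencies on one another, which an instruction set with explicit parallelism could issue together:

```python
# Toy dependence analysis: group straight-line "instructions" into waves
# whose members have no data-flow dependencies on one another, so a
# sufficiently parallel instruction encoding could issue each wave at once.
# (Hypothetical three-address code, not GCC SSA -- just an illustration.)

def schedule_waves(instructions):
    """instructions: list of (dest, srcs) tuples in program order."""
    wave_of = {}   # register name -> wave in which it becomes available
    waves = []
    for dest, srcs in instructions:
        # An instruction can issue one wave after its last input is produced;
        # inputs never written by the program (x, y) are available at wave 0.
        w = max((wave_of[s] + 1 for s in srcs if s in wave_of), default=0)
        while len(waves) <= w:
            waves.append([])
        waves[w].append(dest)
        wave_of[dest] = w
    return waves

# a = x+y; b = x*2; c = a+b; d = y-1; e = c*d
program = [("a", ["x", "y"]), ("b", ["x"]), ("c", ["a", "b"]),
           ("d", ["y"]), ("e", ["c", "d"])]
print(schedule_waves(program))  # [['a', 'b', 'd'], ['c'], ['e']]
```

      Each wave is exactly the set of operations that could run in parallel without any OS-level thread synchronization.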

      • Re:Adapt (Score:5, Informative)

        by Dolda2000 (759023) <fredrik@dolPASCA ... m minus language> on Sunday March 22, 2009 @03:28PM (#27290731) Homepage

        Since the normal OoO parallelization mechanisms don't scale well enough

        It hit me that this probably wasn't obvious to everyone, so just to clarify: "OoO", here, stands not for Object-Oriented Something, but for Out-of-Order [wikipedia.org], as in how current, superscalar CPUs work. See also Dataflow architecture [wikipedia.org].

      • Re:Adapt (Score:5, Interesting)

        by Yaa 101 (664725) on Sunday March 22, 2009 @03:36PM (#27290817) Journal

        The final solution is that the processor measures and decides which parts of which program must run in parallel and which are better off left alone.
        What else do we have computers for?

      • Re:Adapt (Score:5, Insightful)

        by tftp (111690) on Sunday March 22, 2009 @03:41PM (#27290875) Homepage

        To dumb your message down, CPU manufacturers act like book publishers who want you to read one book in two different places at the same time just because you happen to have two eyes. But a story can't be read this way, and for the same reason most programs don't benefit from several CPU cores. Books are read page by page because each little bit of story depends on the previous story; buildings are constructed one floor at a time because each new floor sits on top of the lower floors; a game renders one map at a time because it's pointless to render other maps until the player has made his gameplay decisions and arrived there.

        In this particular case CPU manufacturers do what they do simply because that's the only thing they know how to do. For most tasks, we as users would prefer a single 1 THz CPU core, but we can't have that yet.

        There are engineering and scientific tasks that can be easily subdivided - this [wikipedia.org] comes to mind - and these are very CPU-intensive tasks. They will benefit from as many cores as you can scare up. But most computing in the world is done using single-threaded processes which start somewhere and go ahead step by step, without much gain from multiple cores.

        • Re:Adapt (Score:5, Insightful)

          by Anonymous Coward on Sunday March 22, 2009 @04:48PM (#27291593)

          You're thinking too simply. A single-core system at 5GHz would be less-responsive for most users than a dual-core 2GHz. Here's why:

          While you're playing a game, more programs are running in the background - anti-virus, defrag, email, Google Desktop, etc. Also, any proper, modern game splits its tasks, e.g. game AI, physics, etc.

          So dual-core is definitely a huge step up from single. So, no, users don't want single-core; they want a faster, more responsive PC, which NOW means dual-core. In a few years it will be quad-core. Most users now hardly benefit from quad-core.

          • Re:Adapt (Score:5, Funny)

            by David Gerard (12369) <{ku.oc.draregdivad} {ta} {todhsals}> on Sunday March 22, 2009 @05:11PM (#27291843) Homepage
            Three cores to run GNOME, one core to run Firefox.
            • Re:Adapt (Score:5, Funny)

              by jd (1658) <imipak@nOSPam.yahoo.com> on Sunday March 22, 2009 @09:21PM (#27293909) Homepage Journal

              Three Cores for the Gnome kings under the Gtk,
              Seven for the KDE lords in their halls of X,
              Nine for Emacs Men doomed to spawn,

              • Re:Adapt (Score:5, Funny)

                by Draek (916851) on Sunday March 22, 2009 @11:33PM (#27294613)

                Three Cores for the Mozilla-kings under the GUI,
                Seven for the Gnome-lords in their halls of X,
                Nine for KDE Men doomed to be flamed,
                One for the Free Scheduler on his free kernel
                In the Land of Linux where the SMP lie.
                One Core to rule them all, One Core to find them,
                One Core to bring them all and in the scheduler bind them
                In the Land of Linux where the SMP lie.

                Which is, of course, what will eventually happen if the number of cores keeps increasing: we'll need one dedicated exclusively to managing what goes where and when. Which is pretty cool when you think about it ;)

          • Re:Adapt (Score:5, Informative)

            by TheRaven64 (641858) on Sunday March 22, 2009 @05:56PM (#27292287) Journal

            This is simply not true. Assuming both cores are fully loaded, which is the best possible case for dual core, then they will still be performing context switches at the same rate as a single chip if you are running more than one process per core. Even if you had the perfect theoretical case for two cores, where you have two independent processes and never context switch, you could run them much faster on the single-core machine. A single-core 5GHz CPU would have to waste 20% of its time on context switching to be slower than a dual-core 2GHz CPU, while a real CPU will spend less than 1% (and even on the dual-core CPU, most of the time your kernel will be preempting the process every 10ms, checking if anything else needs to run, and then scheduling it again, so you don't save much).

            The only way the dual core processor would be faster in your example would be if it had more cache than the 5GHz CPU and the working set for your programs fitted into the cache on the dual-core 2GHz chip but not on the 5GHz one, but that's completely independent of the number of cores.
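            The 20% break-even figure above follows from a little arithmetic. A minimal sketch, under an idealized model (throughput proportional to clock, both cores fully loaded, cache and memory effects ignored):

```python
# Idealized throughput model: work done is proportional to clock speed,
# minus whatever fraction the single core loses to context switching.
# (Ignores cache, memory bandwidth, and scheduling details entirely.)

def single_core_throughput(ghz, switch_overhead):
    # Effective work rate after losing a fraction of time to switches.
    return ghz * (1.0 - switch_overhead)

def dual_core_throughput(ghz_per_core):
    # Best case for dual core: both cores fully loaded, no overhead.
    return 2 * ghz_per_core

# The 5 GHz single core only drops to dual-2GHz level at 20% overhead...
assert single_core_throughput(5.0, 0.20) == dual_core_throughput(2.0)
# ...while a realistic ~1% overhead leaves it comfortably ahead.
assert single_core_throughput(5.0, 0.01) > dual_core_throughput(2.0)
print(single_core_throughput(5.0, 0.01))  # effective GHz at 1% overhead
```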

          • Re: (Score:3, Insightful)

            by TheNinjaroach (878876)

            A single-core system at 5GHz would be less-responsive for most users than a dual-core 2GHz. Here's why:

            Because you're going to claim it takes more than 20% CPU time for the faster core to switch tasks? That's doubtful, I'll take the 5GHz chip any day.

        • Re:Adapt (Score:5, Insightful)

          by try_anything (880404) on Sunday March 22, 2009 @05:06PM (#27291789)

          But most computing in the world is done using single-threaded processes which start somewhere and go ahead step by step, without much gain from multiple cores.

          Yeah, I agree. There are a few rare types of software that are naturally parallel or deal with concurrency out of necessity, such as GUI applications, server applications, data-crunching jobs, and device drivers, but basically every other kind of software is naturally single-threaded.

          Wait....

          Sarcasm aside, few computations are naturally parallelizable, but desktop and server applications carry out many computations that can be run concurrently. For a long time it was normal (and usually harmless) to serialize them, but these days it's a waste of hardware. In a complex GUI application, for example, it's probably fine to use single-threaded serial algorithms to sort tables, load graphics, parse data, and check for updates, but you had better make sure those jobs can run in parallel, or the user will be twiddling his thumbs waiting for a table to be sorted while his quad-core CPU is "pegged" at 25% crunching on a different dataset. Or worse: he sits waiting for a table to be sorted while his CPU is at 0% because the application is trying to download data from a server.
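          A minimal sketch of that last point, using Python's `concurrent.futures` (the jobs are hypothetical stand-ins; a simulated 0.2 s sleep plays the role of the server download):

```python
# Run independent application jobs concurrently so a slow download
# doesn't leave the CPU idle, and a sort doesn't block the download.
import time
from concurrent.futures import ThreadPoolExecutor

def sort_table(rows):
    # Stand-in for the CPU-side job (sorting a table).
    return sorted(rows)

def fetch_updates():
    # Stand-in for the server download; sleeping releases the GIL,
    # so the sort is free to run meanwhile.
    time.sleep(0.2)
    return "up to date"

with ThreadPoolExecutor(max_workers=2) as pool:
    sorted_rows = pool.submit(sort_table, [3, 1, 2])
    status = pool.submit(fetch_updates)
    assert sorted_rows.result() == [1, 2, 3]
    assert status.result() == "up to date"
```

          Overlapped like this, the elapsed time is roughly the 0.2 s network wait instead of the sum of both jobs.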

          Your example of building construction is actually a good example in favor of concurrency. Construction is like a complex computation made of smaller computations that have complicated interdependencies. A bunch of different teams (like cores) work on the building at the same time. While one set of workers is assembling steel into the frame, another set of workers is delivering more steel for them to use. Can you imagine how long it would take if these tasks weren't concurrent? Of course, you have to be very careful in coordinating them. You can't have the construction site filled up with raw materials that you don't need yet, and you don't want the delivery drivers sitting idle while the construction workers are waiting for girders. I'm sure the complete problem is complex beyond my imagination. By what point during construction do you need your gas, electric, and sewage permits? Will it cause a logistical clusterfuck (contention) if there are plumbers and electricians working on the same floor at the same time? And so on ad infinitum. Yet the complexity and inevitable waste (people showing up for work that can't be done yet, for example) is well worth having a building up in months instead of years.

          • Re:Adapt (Score:4, Interesting)

            by AmiMoJo (196126) <mojo@NOspAm.world3.net> on Sunday March 22, 2009 @06:42PM (#27292719) Homepage

            So, we can broadly say that there are three areas where we can parallelise.

            First you have the document level. Google Chrome is a good example of this: first we had the concept of multiple documents open in the same program, now we have a separate thread for each "document" (or tab, in this case). Games are also moving ahead in this area, using separate threads for graphics, AI, sound, physics and so on.

            Then you have the OS level. Say the user clicks to sort a table of data into a new order; the OS can take care of that. It's a standard part of the GUI system, and can be set off as a separate thread. Of course, some intelligence is required here, as it's only worth spawning another thread if the sort is going to take some appreciable amount of time.

            At the bottom you have the algorithm level, which is the hard one. So far this level has gotten a lot of attention, and the others relatively little. The first two are the low-hanging fruit, which is where people should be concentrating.

        • Re:Adapt (Score:5, Funny)

          by nmb3000 (741169) <nmb3000@that-google-mail-site.com> on Sunday March 22, 2009 @05:21PM (#27291945) Homepage Journal

          To dumb your message down, CPU manufacturers act like book publishers [...]

          What is this "books" crap? Pft, I remember when car analogies were good enough for everyone. Now you have to get all fancy. Let me try and explain it more clearly:

          CPUs are like cars. Intel and Friends haven't been able to keep increasing the velocity they can safely and reliably run, so instead of relying on increased speed to get more people from point A to point B, they are instead starting to look at parallelization as a means to achieve better performance.

          Now you are chopped up into 10 pieces and FedEx'd to your destination with 100 other people. Pieces may go by road, rail, air, or ship and thus overall capacity--"bandwidth" you might say--of the lanes of travel has been increased.

          The only problem is that the people who make use of this new technique ("programmers", that is) have a hard time chopping you up in such a way that you can be put back together again. Usually it's a bit of a mess and more trouble than it's worth, thus we just keep driving our old-fashioned cars at normal speeds while adding lanes to the roads.

          • Re: (Score:3, Insightful)

            by mgblst (80109)

            A better but less humorous analogy would be to consider that Intel and co. can't keep increasing the top speed of a car, so they are putting more seats into your car. This works OK when you have lots of people to transport, but when you only have one or two, it doesn't make the journey any faster. The problem is, most journeys only involve one or two people. What the article is suggesting is that we implement some sort of car-sharing initiative, so we stop taking so many cars to the same destination. Or a bus!

        • Re: (Score:3, Insightful)

          by gbjbaanb (229885)

          Yeah, I reckon you've got the reason things are "single-threaded" by design. So the solution is to start getting creative with sections of programs and not the whole.

            For example, if you're using OpenMP to introduce parallelisation, you can easily make loops run across multiple cores -- though note the compiler largely trusts your pragmas, so it's on you to make sure a loop's iterations really are independent before parallelising it.

          Like your building analogy - sure, you have to finish one floor before you can put the next one on, but once the floors are up, you

        • Re: (Score:3, Interesting)

          by giorgist (1208992)
          You haven't seen a building go up. You don't place a brick, render it, paint it, hang a picture frame and go on to the next one.

          A multi-story building has a myriad of things happening at the same time. If only computers were as good at parallel processing.
          If you have 100 or 1000 people working on a building, each is an independent process that shares resources.

          It is simple: 8-core CPUs are a solution that arrived before the problem. A good 10-year-old computer can do most of today's
          office work.
        • Re: (Score:3, Informative)

          by TapeCutter (624760) *
          "a game renders one map at a time because it's pointless to render other maps until the player made his gameplay decisions and arrived there"

          Rendering is perfect for parallel processing, sure you only want one map at a time but each core can render part of the map independently from other parts of the map.
        • by coryking (104614) * on Sunday March 22, 2009 @09:38PM (#27293987) Homepage Journal

          But most computing in the world is done using single-threaded processes which start somewhere and go ahead step by step, without much gain from multiple cores.

          The fact that all we do is sequential tasks on our computer means we are still pretty stupid when it comes to "computing". If you look outside your CPU, you'll see the rest of the computers on this planet are massively parallel and do tons and tons of very complex operations far quicker than the computer running on either one of our desks.

          Most of the computers on the planet are organic ones inside critters of all shapes and sizes. I don't see those guys running around with some context-switching, mega-fast CPU, do you?** All the critters I see are using parallel computers, with each "core" being a rather slow set of neurons.

          Basically, evolution of life on earth seems to suggest that the key to success is going parallel. Perhaps we should take the hint from nature.

          ** unless you count whatever the hell consciousness itself is... "thinking" seems to be single-threaded, but uses a bunch of interrupt hooks triggered by lord knows what running under the hood.

          • Re: (Score:3, Interesting)

            by tftp (111690)

            If you look outside your CPU, you'll see the rest of the computers on this planet are massively parallel

            You don't even need to look outside of your computer - it has many microcontrollers, each having a CPU, to do disk I/O, video, audio - even a keyboard has its own microcontroller. This is not far from a mouse being able to think about escape and run at the same time - most mechanical functions in critters are highly automated (a headless chicken is an example.) I don't call it multithreading because th

            • Re: (Score:3, Interesting)

              by coryking (104614) *

              Logically thinking, any single thought can't be easily parallelized, but why couldn't we think two thoughts at the same time?

              Yes, but there is increasing evidence (don't ask me to cite :-) that many of our thoughts are something that some background process has been "thinking about" long (i.e. seconds or minutes) before our actual conscious self does. There are many examples of this in Malcolm Gladwell's "Blink", though I don't feel much like citing them. Part of that book, I think, basically says that we

              • Re: (Score:3, Funny)

                by coryking (104614) *

                our train of though it single-threaded, but that doesn't mean our train of though isn't just a byproduct

                And sometimes, even, our background grammar checker misses things that our background finger-controller mis-types while on auto pilot. thought/though, thing/think are stroke-patterns that my hand-controller mixes up a lot and since this isn't something super-formal, the top-part of my brain never catches.

      • Re:Adapt (Score:5, Insightful)

        by Sentry21 (8183) on Sunday March 22, 2009 @03:42PM (#27290883) Journal

        This is the sort of thing I like about Apple's 'Grand Central'. The idea behind it is that instead of assigning a task to a processor, the system breaks up a task into discrete compute units that can be assigned wherever. When doing processing in a loop, for example, if each iteration is independent, you could make each iteration a separate 'unit', like a packet of computation.

        The end result is that the system can then more efficiently dole out these 'packets' without the programmer having to know about the target machine or vice-versa. For some computation, you could use all manner of different hardware - two dual-core CPUs and your programmable GPU, for example - because again, you don't need to know what it's running on. The system routes computation packets to wherever they can go, and then receives the results.

        Instead of looking at a program as a series of discrete threads, each representing a concurrent task, it breaks up a program's computation into discrete chunks, and manages them accordingly. Some might have a higher priority and thus get processed first (think QoS in networking), without having to prioritize or deprioritize an entire process. If a specific packet needs to wait on I/O, then it can be put on hold until the I/O is done, and the CPU can be put back to work on another packet in the meantime.

        What you get in the end is a far more granular, more practical way of thinking about computation that would scale far better as the number of processing units and tasks increases.
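        A rough analogy in Python -- this is not Apple's actual API, just the same shape of idea: each independent loop iteration becomes a "packet" handed to a pool that routes work to whatever cores are free:

```python
# "Packets of computation": each independent loop iteration is one work
# unit, and a pool of workers (one per core) routes packets to whichever
# core is free. (A Python analogy, not Apple's Grand Central API.)
from multiprocessing import Pool

def work_unit(i):
    # One independent iteration of the original loop -- one "packet".
    return i * i

if __name__ == "__main__":
    with Pool() as pool:                         # one worker per core
        results = pool.map(work_unit, range(8))  # dispatch the packets
    print(results)  # [0, 1, 4, 9, 16, 25, 36, 49]
```

        The programmer only describes the unit of work; how many workers exist, and where each packet runs, is the pool's business.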

        • Re: (Score:3, Interesting)

          by Trepidity (597)

          The problem is still the efficiency, though. There are lots of ways to mark units of computation as "this could be done separately, but depends on Y" -- OpenMP provides a bunch of them, for example, and there have been proposals dating back to the 80s [springer.com], probably earlier. The problem is figuring out how to implement that efficiently, so that the synchronization overhead doesn't dominate the parallelization gains. Does the system spawn new threads? Maintain a pool of worker threads and feed thunks to them

        • Re:Adapt (Score:5, Insightful)

          by fractoid (1076465) on Sunday March 22, 2009 @11:32PM (#27294607) Homepage

          This is the sort of thing I like about Apple's 'Grand Central'.

          What's this 'grand central' thing? From a few brief Google searches it appears to be a framework for using graphics shaders to offload number crunching to the video card. It'd be nice if they'd stick (at least for technical audiences) to slightly more descriptive and less grandiose labels.

          <rant>
          That's always been my main peeve with Apple, they give opaque, grandiloquent names to standard technologies, make ridiculous performance claims, then set their foaming fanboys loose to harass those of us who just want to get the job done. Remember "AltiVEC" (which my friend swore could burn a picture of Jesus's toenails onto a piece of toast on the far side of the moon with a laser beam comprised purely of blindingly fast array calculations) which turned out to just be a slightly better MMX-like SIMD addon?

          Or the G3/G4 processors, which led us to be breathlessly sprayed with superlatives for years until Apple ditched them for the next big thing - Intel processors! Us stupid, drone-like "windoze" users would never see the genius in using Intel proce... oh wait. No, no wait. We got the same "oooh the Intel Mac is 157 times faster than an Intel PC" for at least six months until 'homebrew' OSX finally proved that the hardware is exactly the friggin' same now. For a while, thank God, they've been reduced to lavishing praise on the case design and elegant headphone plug placement. It looks like that's coming to an end, though.
          </rant>

      • Re:Adapt (Score:5, Informative)

        by Cassini2 (956052) on Sunday March 22, 2009 @03:53PM (#27291005)

        HP's/Intel's EPIC idea (which is now Itanium) wasn't stupid, but it has a hard limitation on how far it scales (currently four instructions simultaneously). I don't have a final solution quite yet (though I am working on it as a thought project), but the problem we need to solve is getting a new instruction set which is inherently capable of parallel operation, not on adding more cores and pushing the responsibility onto the programmers for multi-threading their programs.

        The problem with very long instruction word (VLIW) architectures like EPIC and the Itanium is that the main speed limitations in today's computers are bandwidth and latency. Memory bandwidth and latency can be the dominant performance driver in a modern processor. At a system level, network, I/O (particularly for video), and hard drive bandwidth and latency can dramatically affect system performance.

        With a VLIW processor, you are taking many small instruction words, and gathering them together into a smaller number of much larger instruction words. This never pays off. Essentially, it is impossible to always use all of the larger instruction words. Even with a normal super-scalar processor, it is almost impossible to get every functional unit on the chip to do something simultaneously. The same problem applies with VLIW processors. Most of the time, a program is only exercising a specific area of the chip. With VLIW, this means that many bits in the instruction word will go unused much of the time.

        In and of itself, wasting bits in an instruction word isn't a big deal. Modern processors can move large amounts of memory simultaneously, and it is handy to be able to link different sections of the instruction word to independent functional blocks inside the processor. The problem is the longer instruction words use memory bandwidth every time they are read. Worse, the longer instruction words take up more space in the processor's cache memory. This either requires a larger cache, increasing the processor cost, or it increases latency, as it translates into fewer cache hits. It is no accident the Itanium is both expensive and has an unusually large on-chip cache.
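        A back-of-the-envelope sketch of that cache-pressure argument (approximate figures: an Itanium bundle is 128 bits holding three 41-bit instruction slots plus a 5-bit template; x86 instructions average very roughly 3.5 bytes, and real slot utilization varies by workload):

```python
# Rough code density: how many instructions fit in a 32 KiB instruction
# cache? (Approximate figures; real densities vary with workload and
# with how many bundle slots hold useful work rather than no-ops.)
CACHE_BITS = 32 * 1024 * 8           # a 32 KiB instruction cache

# Itanium: 128-bit bundle = 3 x 41-bit slots + 5-bit template.
bundles = CACHE_BITS // 128
useful_slots_per_bundle = 2          # assume one slot is often a no-op
ia64_insns = bundles * useful_slots_per_bundle

# x86: variable length; call it 3.5 bytes (28 bits) per instruction.
x86_insns = CACHE_BITS // 28

print(ia64_insns, x86_insns)  # 4096 9362: x86 packs roughly 2x as many
```

        Under these rough assumptions the same cache holds about twice as many x86 instructions, which is one way to see why a VLIW chip wants a large (and expensive) cache.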

        The other major downfall of the VLIW architecture is that it cannot emulate a short-instruction-word processor quickly. This is a problem both for interpreters and for 80x86 emulation. Interpreters are a very popular application paradigm; many applications contain them, and platforms like .NET and Java use bytecode with JIT compilers. 80x86 emulation is a big deal, as the majority of the world's software is written for the 80x86 platform, which features a complex variable-length instruction word. A VLIW processor cannot decode either the short 80x86 instructions or JIT-generated code quickly. Realistically, a VLIW processor will be no quicker, on a per-instruction basis, than an 80x86 processor, despite the fact that the VLIW architecture is designed to execute 4 instructions simultaneously.

        The memory bandwidth problem, and the fact that VLIW processors don't lend themselves to interpreters, really slows down the usefulness of the platform.

        • Re:Adapt (Score:4, Interesting)

          by Dolda2000 (759023) <fredrik@dolPASCA ... m minus language> on Sunday March 22, 2009 @04:38PM (#27291481) Homepage

          All that you say is certainly true, but I would still argue that EPIC's greatest problem is its hard parallelism limit. True, it's not as hard as I made it out to be, since an EPIC instruction bundle has its non-dependence flag, but you cannot, for instance, make an EPIC CPU break off and execute two sub-routines in parallel. Its parallelism lies only in a very small spatial window of instructions.

          What I'd like to see, rather, is the CPU implementing a kind of "micro-thread" function that would allow running two larger codepaths simultaneously -- larger than what EPIC could handle, but quite possibly still far smaller than what would be efficient to distribute over OS-level threads, with all the synchronization and scheduler overhead that would mean.

        • Re: (Score:3, Insightful)

          by bertok (226922)

          I think the consensus was that making compilers emit efficient VLIW code for a typical procedural language such as C is very hard. Intel spent many millions on compiler research, and it took them years to get anywhere. I heard of 40% improvements in the first year or two, which implies that they were very far from ideal when they started.

          To achieve automatic parallelism, we need a different architecture to classic "x86 style" procedural assembly. Programming languages have to change too, the current crop are too

      • Re:Adapt (Score:4, Insightful)

        by init100 (915886) on Sunday March 22, 2009 @04:09PM (#27291193)

        To begin with, I don't believe the article about the systems being badly prepared. I can't speak for Windows, but I know for sure that Linux is capable of far heavier SMP operation than 4 CPUs.

        My take on the article is that it is referring to applications provided with or at least available for the systems in question, and not actually the systems themselves. In other words, it takes the user view, where the operating system is so much more than just the kernel and the other core subsystems.

        But more importantly, many programming tasks simply aren't meaningful to break up into such units of granularity is OS-level threads.

        Actually, in Linux (and likely other *nix systems), command lines involving multiple pipelined commands execute those commands in parallel, so they are scheduled on different processors/cores if available. This is a simple way of using the multiple cores available on modern systems; advanced programming is not always necessary to take advantage of the power of multicore chips.
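        A minimal sketch of why a pipeline parallelizes: each stage is its own process, so the scheduler can place the stages on different cores. The same `producer | filter` shape can be mirrored with Python's `multiprocessing` (toy stages standing in for real shell commands):

```python
# A shell pipeline like `grep foo log | sort` runs each command as its
# own process, so the stages can land on different cores. Same shape
# here: two processes connected by a pipe.
from multiprocessing import Pipe, Process

def producer(conn):
    # First pipeline stage: emit the data, then an end-of-stream marker.
    for i in range(10):
        conn.send(i)
    conn.send(None)
    conn.close()

def consumer(conn, out):
    # Second stage: filter and accumulate, like a `grep | wc` tail.
    total = 0
    while True:
        item = conn.recv()
        if item is None:
            break
        if item % 2 == 0:
            total += item
    out.send(total)
    out.close()

if __name__ == "__main__":
    left, right = Pipe()
    res_recv, res_send = Pipe()
    stages = [Process(target=producer, args=(left,)),
              Process(target=consumer, args=(right, res_send))]
    for p in stages:
        p.start()
    print(res_recv.recv())  # 0 + 2 + 4 + 6 + 8 = 20
    for p in stages:
        p.join()
```

        Both stages run concurrently: the consumer filters early items while the producer is still emitting later ones.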

      • Re: (Score:3, Interesting)

        by erroneus (253617)

        Multi-core processing is one thing, but access to multiple chunks of memory and peripherals is also keeping computers slow. After playing with machines running from PXE boot and NFS-rooted machines, I was astounded at how fast those machines performed. Then I realized that the kernel and all wasn't being delayed waiting on local hardware for disk I/O.

        It seems to me, when NAS and SAN are used, things perform a bit better. I wonder what would happen if such control and I/O systems were applied into the sam

    • It's already there (Score:4, Insightful)

      by wurp (51446) on Sunday March 22, 2009 @03:14PM (#27290575) Homepage

      Seriously, no one has brought up functional programming, LISP, Scala or Erlang? When you use functional programming, no data changes and so each call can happen on another thread, with the main thread blocking when (& not before) it needs the return value. In particular, Erlang and Scala are specifically designed to make the most of multiple cores/processors/machines.

      See also map-reduce and multiprocessor database techniques like BSD and CouchDB (http://books.couchdb.org/relax/eventual-consistency).

    • Re:Adapt (Score:5, Insightful)

      by Cassini2 (956052) on Sunday March 22, 2009 @03:19PM (#27290637)

      Give us a year maybe two.

      I think this problem will take longer than a year or two to solve. Modern computers are really fast. They solve simple problems almost instantly. A side-effect of this is that if you underestimate the computational power required for the problem at hand, you are likely to be off by a large amount.

      If you implement an order n-squared algorithm, O(n^2), on a 6502 (Apple II), and n was larger than a few hundred, you were dead. Many programmers wouldn't even try implementing hard algorithms on the early Apple II computers. On the other hand, a modern processor might tolerate O(n^2) algorithms with n larger than 1000. Programmers can try solving much harder problems. However, the programmer's ability to estimate and deal with computational complexity has not changed since the early days of computing. Programmers use generalities. They use ranges: n will be between 5 and 100, or n will be between 1000 and 100,000. With modern problems, n=1000 might mean the problem can be solved on a netbook, and n=100,000 might require a small multi-core cluster.
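      The scale sensitivity of O(n^2) can be made concrete with a quick sketch (very rough figures: call it 10^6 simple steps per second for an Apple II-era 6502 and 10^9 for one modern core):

```python
# How the same O(n^2) algorithm feels at different scales, under a
# crude model of one simple step per pair. (Step rates are very rough:
# ~10^6/s for an Apple II-era 6502, ~10^9/s for one modern core.)
def seconds(n, steps_per_sec):
    return n * n / steps_per_sec

APPLE_II = 1e6
MODERN = 1e9

assert seconds(300, APPLE_II) < 0.1      # a few hundred: fine on a 6502
assert seconds(5000, APPLE_II) > 10      # a few thousand: hopeless then
assert seconds(1000, MODERN) < 0.01      # n=1000 today: trivial
assert seconds(100_000, MODERN) >= 10    # n=100,000: painful even today
```

      The jump from "trivial on a netbook" to "needs a cluster" is only two orders of magnitude in n, which is exactly the kind of range programmers routinely misestimate.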

      There aren't many programming platforms out there that scale smoothly from applications deployed on a desktop, to applications deployed on a multi-core desktop, and then to clusters of multi-core desktops. Perhaps most worrying is that the new programming languages that are coming out are not particularly useful for intense data analysis. The big examples of this for me are .NET and functional languages. .NET was deployed at about the same time multi-core chips showed up, and has minimal support for them. Functional languages may eventually be the solution, but for any numerically intensive application, tight loops of C code are much faster.

      The other issue with multi-core chips, is that as a programmer, I have two solutions to making my code go faster:
      1. Get out the assembly printouts and the profiler, and figure out why the processor is running slow. Doing this helps every user of the application, and works well with almost any of the serious compiled languages (C, C++). Sometimes, I can get a 10:1 speed improvement. (*) It doesn't work so well with Java, .NET, or many functional languages, because they use run-time compilers/interpreters and don't generate assembly code.
      2. I recode for a cluster. Why stop at a multi-core computer? If I can get a 2:1 to 10:1 speed up by writing better code, then why stop at a dual or quad core? The application might require a 100:1 speed up, and that means more computers. If I have a really nasty problem, chances are that 100 cores are required, not just 2 or 8. Multi-core processors are nice, because they reduce cluster size and cost, but a cluster will likely be required.

      The problem with both of the above approaches is that, from a tools perspective, they are the worst choice for multi-core optimizations. Approach 1 will force me into using C and C++, which don't handle threads really well. In particular, C and C++ lack easy implementations of Software Transactional Memory, NUMA, and clusters. This means that approach 2 may require a complete software redesign, and possibly either a language change or a major change in the compilation environment. Either way, my days of fun-loving Java and .NET code are coming to a sudden end.

      I just don't think there is any easy way around it. The tools aren't yet available for easy implementation of fast code that scales between the single-core assumption and the multi-core assumption in a smooth manner.

      Note: * - By default, many programmers don't take advantage of many features that may increase the speed of an algorithm. Built-in special purpose libraries, like MMX, can dramatically speed up certain loops. Sometimes loops contain a great deal of code that can be eliminated. Maybe a function call is present in a tight loop. Anti-virus software can dramatically affect system speed. Many little things can sometimes make big differences.

      • by Tiger4 (840741)

        2. I recode for a cluster. Why stop at a multi-core computer? If I can get a 2:1 to 10:1 speed up by writing better code, then why stop at a dual or quad core? The application might require a 100:1 speed up, and that means more computers. If I have a really nasty problem, chances are that 100 cores are required, not just 2 or 8. Multi-core processors are nice, because they reduce cluster size and cost, but a cluster will likely be required.

        I think I agree with you, BUT... don't fall into the old trap: If ten machines can do the job in 1 month, 1 machine can do the job in 10 months. But it doesn't necessarily follow that if one machine can do the job in 10 months, 10 machines can do the job in 1 month.
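
        The trap has a name - Amdahl's law - and it is easy to put numbers on (a quick illustrative sketch; the 90% figure is just an assumption):

```python
def amdahl_speedup(parallel_fraction, n_workers):
    # Amdahl's law: the serial fraction caps the overall speedup
    # no matter how many workers you add.
    serial = 1.0 - parallel_fraction
    return 1.0 / (serial + parallel_fraction / n_workers)

# A job that is 90% parallelizable:
print(round(amdahl_speedup(0.9, 10), 2))    # 5.26 -- not 10
print(round(amdahl_speedup(0.9, 1000), 2))  # 9.91 -- never reaches 10x
```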

        Also, the problem with runtime interpreters is not that they don't generate assembly code. The problem is that it is harder to get at the underlying code that is really executing. That code could be optimized if you could see it. But seeing i

    • Re:Adapt (Score:5, Funny)

      by camperslo (704715) on Sunday March 22, 2009 @04:26PM (#27291359)

      The programmers of Slashdot are ready for multiple cores and threads. There is no problem.

      When performing a number of operations in parallel the key is to simply ignore the results of each operation.
      For operations that would have used the result of another as input simply use what you think the result might be or what you wish it was.

      The programmers of Slashdot already have the needed skills for such programming as the mental processes are the same ones that enable discussion of TFAs without reading them.

  • by Microlith (54737) on Sunday March 22, 2009 @02:35PM (#27290099)

    So basically yet another tech writer finds out that a huge number of applications are still single threaded, and that it will be a while before we have applications that can take advantage of the cores that the OS isn't actively using at the moment. Well, assuming you're running a desktop and not a server.

    This isn't a performance issue with regards to Windows or Linux; they're quite adept at handling multiple cores. They just don't need that much themselves, and the applications run these days don't, individually, need much more than that either.

    So yes, applications need parallelization. The tools for it are rudimentary at best. We know this. Nothing to see here.

    • Re: (Score:3, Interesting)

      by thrillseeker (518224)
      Did you ever follow the Occam language? It seemed to have parallelization intrinsic, but it never went anywhere.
      • Re: (Score:3, Informative)

        by 0123456 (636235)

        Did you ever follow the Occam language? It seemed to have parallelization intrinsic, but it never went anywhere.

        Occam was heavily tied into the Transputer, and without the transputer's hardware support for message-passing, it's a bit of a non-starter.

        It also wasn't easy to write if you couldn't break your application down into a series of simple processes passing messages to each other. I suspect it would go down better today now people are used to writing object-oriented code, which is a much better match to the message-passing idea than the C code that was more common at the time.

    • by phantomfive (622387) on Sunday March 22, 2009 @02:59PM (#27290423) Journal
      From the article:

      The onus may ultimately lie with developers to bridge the gap between hardware and software to write better parallel programs......They should open up data sheets and study chip architectures to understand how their code can perform better, he said.

      Here's the problem: most programs spend 99% of their time waiting. MOST of that is waiting for user input. Part of it is waiting for disk access (as mentioned in the AnandTech story [slashdot.org], the best thing you can do to speed up your computer is get a faster hard drive/SSD). A minuscule part of it is spent in the processor. If you don't believe me, pull out a profiler and run it on one of your programs; it will show you where things can be easily sped up.

      Now, given that the performance of most programs is not processor bound, what is there to gain by parallelizing your program? If the performance gain were really that significant, I would already be writing my program with threads, even with the tools we have now. The fact of the matter is that in most cases, there is really no point to writing your program in a parallel manner. This is something a lot of the proponents of Haskell don't seem to understand: even if their program is easily parallelizable, the performance gain is not likely to be noticeable. Speeding up hard drives will make more of a difference to performance in most cases than adding cores.

      I for one am certainly not going to be reading chip data sheets unless there's some real performance benefit to be found. If there's enough benefit, I may even write parts in assembly, I can handle any ugliness. But only if there's a benefit from doing so.
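
      Pulling out the profiler is itself a one-minute job (a toy sketch with Python's cProfile; the function names are made up):

```python
import cProfile
import io
import pstats

def cpu_bound():
    # The only real computation in this toy "application".
    return sum(i * i for i in range(200_000))

def main():
    return cpu_bound()

profiler = cProfile.Profile()
profiler.enable()
main()
profiler.disable()

# Print the five most expensive calls; cpu_bound dominates.
out = io.StringIO()
pstats.Stats(profiler, stream=out).sort_stats("cumulative").print_stats(5)
report = out.getvalue()
print("cpu_bound" in report)  # True
```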

      • That's a big leap (Score:4, Insightful)

        by SuperKendall (25149) on Sunday March 22, 2009 @03:20PM (#27290641)

        If you don't believe me, pull out a profiler and run it on one of your programs, it will show you where things can be easily sped up.

        Now, given that the performance of most programs is not processor bound

        That's a pretty big leap, I think.

        Yes, a lot of today's apps are more user bound than anything. But there are plenty of real-world apps that people use that are still pretty processor bound - Photoshop, and image processing in general, is a big one. So can be video, which starts out disk bound but is heavily processor bound as you apply effects.

        Even Javascript apps are processor bound, hence Chrome...

        So there's still a big need for understanding how to take advantage of more cores - because chips aren't really getting faster these days so much as more cores are being added.

        • Re: (Score:3, Informative)

          by phantomfive (622387)

          So there's still a big need for understanding how to take advantage of more cores - because chips aren't really getting faster these days so much as more cores are being added.

          OK, so we can go into more detail. For most programs, parallelization will do essentially nothing. There are a few programs that can benefit from it, as you've mentioned. But those programs are already taking advantage of them, not only do video encoding programs use multiple cores, some can even farm the process out over multiple systems. So it isn't a matter of programmers being lazy, or tools not being available, it's a matter of in most cases, multiple cores won't make a difference. If you run wind

        • Re:That's a big leap (Score:5, Informative)

          by davecb (6526) * <davec-b@rogers.com> on Sunday March 22, 2009 @04:44PM (#27291551) Homepage Journal

          And if you look at a level lower than the profiler, you find your programs are memory-bound, and getting worse. That's a big part of the push toward multithreaded processors.

          To paraphrase another commentator, they make process switches infinitely fast, so one can keep on using the ALU while your old thread is twiddling its thumbs waiting for a cache-line fill.

          --dave

      • Re: (Score:3, Insightful)

        by caerwyn (38056)

        This is true to a point. The problem is that, in modern applications, when the user *does* do something there's generally a whole cascade of computation that happens in response - and the biggest concern for most applications is that the app appear to have short latency. That is, all of that computation happens as quickly as possible so the app can go back to waiting for user input.

        There's a lot of gain that can be gotten by threading user input responses in many scenarios. Who cares if the user often waits 5 mi

    • by ari wins (1016630) on Sunday March 22, 2009 @03:21PM (#27290653)
      I almost modded you Redundant to help get your point across.
  • by mysidia (191772) on Sunday March 22, 2009 @02:39PM (#27290139)

    Multiple virtual machines on the same piece of metal, with a workstation hypervisor, and intelligent balancing of apps between backends.

    Multiple OSes sharing the same cores. Multiple apps running on the different OSes, and working together.

    This can also be used to provide fault tolerance... if one of the worker apps fails, or even one of the OSes fails, your processing capability is reduced but a worker app in a different OS takes over; with checkpointing procedures and shared state, the apps don't even lose data.

    You should even be able to shut down a virtual OS for Windows updates without impact, if the apps are designed properly...

  • Huh? (Score:5, Funny)

    by Samschnooks (1415697) on Sunday March 22, 2009 @02:39PM (#27290141)

    ...programmers are to blame for that

    The development tools aren't available and research is only starting."

    Stupid programmers! Not able to develop software without the tools! In my day we wrote our own tools - in the snow, uphill, both ways! We didn't need no stink'n vendor to do it for us - and we liked it that way!

  • by davecb (6526) * <davec-b@rogers.com> on Sunday March 22, 2009 @02:40PM (#27290145) Homepage Journal

    Firstly, it's false on the face of it: Ubuntu is certified on the Sun T2000, a 32-thread machine, and Canonical is supporting it.

    Secondly, it's the same FUD as we heard from uniprocessor manufacturers when multiprocessors first came out: this new "symmetrical multiprocessing" stuff will never work, it'll bottleneck on locks.

    The real problem is that some programs are indeed badly written. In most cases, you just run lots of individual instances of them. Others, for grid, are well-written, and scale wonderfully.

    The ones in the middle are the problem, as they need to coordinate to some degree, and don't do that well. It's a research area in computer science, and one of the interesting areas is in transactional memory.

    That's what the folks at the Multicore Expo are worried about: Linux itself is fine, and has been for a while.

    --dave

  • by mcrbids (148650) on Sunday March 22, 2009 @02:41PM (#27290175) Journal

    Languages like PHP/Perl, as a rule, are not designed for threading - at ALL. This makes multi-core performance a non-starter. Sure, you can run more INSTANCES of the language with multiple cores, but you can't get any single instance of a script to run any faster than what a single core can do.

    I have, so, so, SOOOO many times wished I could split a PHP script into threads, but it's just not there. The closest you can get is with (heavy, slow, painful) forking and multiprocess communication through sockets or (worse) shared memory.

    Truth be told, there's a whole rash of security issues through race conditions that we'll soon have crawling out of nearly every pore as the development community slowly digests multi-threaded applications (for real!) in the newly commoditized multi-CPU environment.

  • by Anonymous Coward on Sunday March 22, 2009 @02:42PM (#27290185)

    "The development tools aren't available and research is only starting"

    Hardly. Erlang's been around 20 years. Newer languages like Scala, Clojure, and F# all have strong concurrency. Haskell has had a lot of recent effort in concurrency (www.haskell.org/~simonmar/papers/multicore-ghc.pdf).

    If you prefer books there's: Patterns for Parallel Programming, the Art of Multiprocessor Programming, and Java Concurrency in Practice, to name a few.

    All of these are available now, and some have been available for years.

    The problem isn't that tools aren't available, it's that the programmers aren't preparing themselves and haven't embraced the right tools.

  • BeOS (Score:5, Interesting)

    by Snowblindeye (1085701) on Sunday March 22, 2009 @02:42PM (#27290191)

    Too bad BeOS died. One of the axioms the developers had was 'the machine is a multi processor machine', and everything was built to support that.

    Seems like they were 15 years ahead of their time. But, on the other hand, too late to establish another OS in a saturated market. Pity, really.

    • Re: (Score:3, Informative)

      by yakumo.unr (833476)
      So you missed Zeta then ? http://www.zeta-os.com/cms/news.php [zeta-os.com] (change to English via the dropdown on the left)
      • Re: (Score:3, Informative)

        by b4dc0d3r (1268512)

        Looks dead to me, a year ago they posted this:

        With immediate effect, magnussoft Deutschland GmbH has stopped the distribution of magnussoft Zeta 1.21 and magnussoft Zeta 1.5. According to the statement of Access Co. Ltd., neither yellowTAB GmbH nor magnussoft Deutschland GmbH are authorized to distribute Zeta.

        http://www.bitsofnews.com/content/view/5498/44/ [bitsofnews.com]

    • Re: (Score:3, Interesting)

      It may have been an axiom, but really, what did BeOS do (or want to do) that Linux doesn't do now?

      The Linux OS has been scaled to thousands of CPUs. Sure, most applications don't benefit from multi-processors, but that'd be true in BeOS, too.

      I'd honestly like to know if there is some design paradigm that was lost with BeOS that isn't around today.

  • by Anonymous Coward on Sunday March 22, 2009 @02:46PM (#27290259)

    The quote presented in the summary is nowhere to be found in the linked article. To make matters worse, the summary claims that Linux and Windows aren't designed for multicore computers, but the linked article only claims that some applications are not designed to be multi-threaded or to run multiple processes. Well, who said that every application under the sun must be heavily multi-threaded or spawning multiple processes? Where's the need for an email client to spawn 8 or 16 threads? Will my address book be any better if it spawns a bunch of processes?

    The article is bad and timothy should feel bad. Why is he still responsible for any news being posted on slashdot?

  • by Troy Baer (1395) on Sunday March 22, 2009 @02:57PM (#27290411) Homepage

    The /. summary of TFA is almost exquisitely bad. It's not Windows or Linux that's not ready for multicore (both have supported multi-processor machines for on the order of a decade or more), but rather the userspace applications that aren't ready. The reason is simple: parallel programming is rather hard, and historically most ISVs haven't wanted to invest in it because they could rely on the processors getting faster every year or two... but no longer.

    One area where I disagree with TFA is the claimed paucity of programming models and tools. Virtually every OS out there supports some kind of concurrent programming model, and often more than one depending on what language is used -- pthreads [wikipedia.org], Win32 threads, Java threads, OpenMP [openmp.org], MPI [mpi-forum.org] or Global Arrays [pnl.gov] on the high end, etc. Most debuggers (even gdb) also support debugging threaded programs, and if those don't have enough heft, there's always Totalview [totalview.com]. The problem is that most ISVs have studiously avoided using any of these except when given no other choice.
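
    Even the most basic of those models - threads plus a mutex - has been a stable API for ages (sketched here in Python's threading module rather than raw pthreads, but the shape is identical):

```python
import threading

counter = 0
lock = threading.Lock()

def worker(iterations):
    global counter
    for _ in range(iterations):
        # The lock serializes the read-modify-write; without it,
        # increments from different threads would race.
        with lock:
            counter += 1

threads = [threading.Thread(target=worker, args=(10_000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(counter)  # 40000
```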

    --t

  • by Pascal Sartoretti (454385) on Sunday March 22, 2009 @02:59PM (#27290431)
    Developers still write programs for single-core chips and need the tools necessary to break up tasks over multiple cores.

    So what? If I had a 32 core system, at least each running process (even if single-threaded) could have a core just for itself. Only a few basic applications (such as a browser) really need to be designed for multiple threads.
  • by tyler_larson (558763) on Sunday March 22, 2009 @03:15PM (#27290597) Homepage
    If you spend more time assigning blame than you do describing the problem, then clearly you don't have anything insightful to say.
  • by Todd Knarr (15451) on Sunday March 22, 2009 @03:37PM (#27290827) Homepage

    Part of the problem is that tools do very little to help break programs down into parallelizable tasks. That has to be done by the programmer; they have to take a completely different view of the problem and the methods to be used to solve it. Tools can't help them select algorithms and data structures.

    One good book related to this was one called something like "Zen of Assembly-Language Optimization". One exercise in it went through a long, detailed process of optimizing a program, going all the way down to hand-coding highly-bummed inner loops in assembly. And it then proceeded to show how a simple program written in interpreted BASIC(!) could completely blow away that hand-optimized assembly-language just by using a more efficient algorithm.

    Something similar applies to multi-threaded programming: all the tools in the world can't help you much if you've selected an essentially single-threaded approach to the problem. They can help you squeeze out fractional improvements, but to really gain anything you need to put the tools down, step back and select a different approach, one that's inherently parallelizable. And by doing that, without using any tools at all, you'll make more gains than any tool could have given you. Then you can start applying the tools to squeeze even more out, but you have to do the hard skull-sweat first.
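
    The BASIC-beats-assembly lesson fits in a few lines (an illustrative sketch, not from the book: a naive O(n^2) duplicate check against the O(n) rewrite):

```python
import time

def has_duplicate_quadratic(items):
    # O(n^2): compare every pair.
    for i in range(len(items)):
        for j in range(i + 1, len(items)):
            if items[i] == items[j]:
                return True
    return False

def has_duplicate_linear(items):
    # O(n): a single pass with a hash set.
    seen = set()
    for x in items:
        if x in seen:
            return True
        seen.add(x)
    return False

data = list(range(5000))  # worst case: no duplicates at all
for fn in (has_duplicate_quadratic, has_duplicate_linear):
    start = time.perf_counter()
    result = fn(data)
    print(fn.__name__, result, f"{time.perf_counter() - start:.4f}s")
```

    No amount of micro-optimizing the inner loop of the first version catches the second; the win comes from the algorithm, not the tools.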

    And the basic problem is that schools don't teach how to parallelize problems. It's hard, and not everybody can wrap their brain around the concept, so teachers leave it as a 1-week "Oh, and you can theoretically do this, now let's move on to the next subject." thing.

    • Re: (Score:3, Insightful)

      by EnglishTim (9662)

      And the basic problem is that schools don't teach how to parallelize problems. It's hard, and not everybody can wrap their brain around the concept...

      And there's more to it than that; If a problem is hard, it's going to take longer to write and much longer to debug. Often it's just not worth investing the extra time, money and risk into doing something that's only going to make the program a bit faster. If we proceed to a future where desktop computers all have 256 cores, the speed advantage may be worth it but currently it's a lot of effort without a great deal of gain. There's probably better ways that you can spend your time.

  • by FlyingGuy (989135) <`flyingguy' `at' `gmail.com'> on Sunday March 22, 2009 @04:10PM (#27291211)

    it is the answer to the question that no one asked...

    In a real-world application, as others have mentioned, pretty much all of a program's time is spent in an idle loop waiting for something to happen, and in almost all circumstances it is input from the user in whatever form: mouse, keyboard, etc.

    So let's say it is something like Final Cut. To be sure, when someone kicks off a render, this is an operation that can be spun off on its own thread or its own process, freeing up the main process loop to respond to other things that the user might be doing. But user input is where the rubber really hits the road: the user could do something that affects the process that was just spun off, either as a separate thread or process on the same core or any other number of cores, so you have to keep track of what the user is doing in the context of things that have been farmed out into other cores/processes/threads.

    Enter the OS. Take your pick, since it really does not matter which OS we are talking about; they all do the same basic things, perhaps differently. How does an OS designer make sure any of, say, 16 cores (dual 8-core processors) are actually well and fairly utilized? Would it be designed to use a core to handle each of the main functions of the OS - let's say drive access, the comm stack (pick your protocol here), video processing, etc. - or should it just run a scheduler like those now in use, which farm out thread processing based on priority? Is there really any priority scheme for multiple cores that could each run, say, hundreds of threads/processes? And what about memory? A single truly 64-bit core can handle a very large amount of memory, and that single core controls and has access to all that RAM at its whim (DMA notwithstanding). But what do you do now that you have 16 cores all wanting to use that memory? Do we create a scheduler to arbitrate access among 16 demanding stand-alone processors, or do we simply give each core a finite memory space and then have to control the movement of data from each memory space to another, since a single process thread (handling the main UI thread for a program) has to be aware of when something is finished on one core and then get access to that memory to present results, either as data written to, say, a file or written into video memory for display?

    I submit that the current paradigm of SMP is inadequate for these tasks and must be rethought to take advantage of this new hardware. I think a more efficient approach is that each core detected would be fired up with its own monitor stack as a place to start so that the scheduling is based upon the feedback from each core. The monitor program would be able to ensure that the core it is responsible for is optimized for the kind of work that is presented. This concept while complicated could be implemented and serve as a basis for further development in this very complex space.

    In terms of "supercomputers" this has been dealt with, but in a very different methodology that I do not think lends itself to general computing. Deep Blue, Crays and the like aren't really relevant in this case, since those are mostly very custom designs built for a single purpose and optimized for things like chess, weather modeling, or nuclear weapons study, where the problems are already discretely chunked out with a known set of algorithms and processes. General-purpose computing, on the other hand, is like trying to herd cats from the OS point of view, since you never really know what is going to be demanded and how.

    OS designers and user-space software designers need to really break this down and think it all the way through before we get much further, or all this silicon is not going to be used well or efficiently.

  • by hazydave (96747) on Sunday March 22, 2009 @04:20PM (#27291309)

    The idea of an OS and/or support tools handling the SMP problem is nothing more than a crutch for bad programming.

    In fact, anyone who grew up with a real multithreaded, multitasking OS is already writing code that will scale just dandy to 8 cores and beyond. When you accept that a thread is nothing more or less than a typical programming construct, you simply write better code. This is no more or less an amazing thing than when regular programmers embraced subroutines or structures.

    This was S.O.P. back in the late 80s under the AmigaOS, and enhanced in the early/mid 90s under BeOS. This is not new, and not even remotely tied to the advent of multicore CPUs.

    The problem here is simple: UNIX and Windows. Windows had fake multitasking for so long, Windows programmers barely knew what you could do when you had "thread" in the same toolkit as "subroutine", rather than it being something exotic. UNIX, as a whole, didn't even have lightweight preemptive threads until fairly recently, and UNIX programmers are only slowly catching up.

    However, neither of these is even slightly an OS problem... it's an application-level problem. If programmers continue to code as if they had a 70s-vintage OS, they're going to think in single threads and suck on 8-core CPUs. If programmers update themselves to state-of-the-1980s thinking, they'll scale to 8-cores and well beyond.

    • Re:This is incorrect (Score:4, Informative)

      by Todd Knarr (15451) on Sunday March 22, 2009 @06:22PM (#27292541) Homepage

      Unix didn't for a long time have lightweight preemptive threads because it had, from the very beginning, lightweight preemptive processes. I spent a lot of time wondering why Windows programmers were harping on the need for threads to do what I'd been doing for a decade with a simple fork() call. And in fact, if you look at the Linux implementation, there are no threads. A thread is simply a process that happens to share memory, file descriptors and such with its parent, and that has some games played with the process ID so it appears to have the same PID as its parent. Nothing new there; I was doing that on BSD Unix back in '85 or so (minus the PID games).

      That was, in fact, one of the things that distinguished Unix from VAX/VMS (which was in a real sense the predecessor to Windows NT, the principal architect of VMS had a big hand in the architecture and internals of NT): On VMS process creation was a massive, time-consuming thing you didn't want to do often, while on Unix process creation was fast and fairly trivial. Unix people scratched their heads at the amount of work VMS people put into keeping everything in a single process, while VMS people boggled at the idea of a program forking off 20 processes to handle things in parallel.
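
      The fork-and-report pattern the grandparent describes is about this heavy (a POSIX-only illustrative sketch; the payload is made up):

```python
import os

r, w = os.pipe()
pid = os.fork()
if pid == 0:
    # Child: do the "work" and send the result back over the pipe.
    os.close(r)
    os.write(w, b"42")
    os._exit(0)
else:
    # Parent: reap the child, then read its answer.
    os.close(w)
    os.waitpid(pid, 0)
    result = os.read(r, 16)
    os.close(r)
    print(result.decode())  # 42
```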

  • by gooneybird (1184387) on Monday March 23, 2009 @07:45AM (#27296685)
    "The problem, my dear programmer, as you so eloquently put it, is one of choice..."

    Seriously. I have been involved with software development, from 8-bit PICs to clusters spanning WANs and everything in between, for the past 20 years or so.

    Multiprocessing involves coordination between the processes. It doesn't matter (too much) whether it's separate cores or separate silicon. On any given modern OS there are plenty of examples of multiprocessor execution: Hard drives each have a processor, video cards each have a processor, USB controllers have a processor. All of these work because there is a well-defined API between them and the OS - a.k.a device drivers. People that write good device drivers (and kernel code) understand how an OS works. This is not generally true of the broader developer population.

    Developers keep blaming the CPU manufacturers, but it's not their fault. What prevents parallel processing from becoming mainstream is the lack of a standard inter-process communication mechanism (at the language level) that abstracts away a lot of the dirty little details. Once the mechanism is in place, people will start using it. I am not referring to semaphores and mutexes; these are synchronization mechanisms, NOT (directly) communication mechanisms. I am not talking about queues either - too much leeway in their use. Sockets would be closer, but most people think of sockets for "network" applications. They should be thinking of them as "distributed applications" - as in distributed across cores. As an example, Microsoft just recently started to demonstrate that they "get it": the next release of VS will have a messaging library.

    Choice:

    At this time there are too many different ways to implement multi-threaded/multi-processor aware software. Each implementation has possible bugs: race conditions, lockups, priority inversion, etc. The choices need to be narrowed.

    Having a standard (language & OS) API is the key to providing a framework for developers to use, while still allowing them the freedom to customize for specific needs. So the OS needs an interface for setting CPU/core preferences, and the language needs to provide the API. Once there is an API, developers can "wrap their minds" around the concept and then things will "take off". As I stated previously, I prefer the "message box" mechanisms, simply because they port easily, are easy to understand and provide for very loosely coupled interaction - all good tenets of a multi-threaded/multi-processor implementation.
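
    A "message box" in this sense can be as small as a pair of thread-safe queues (an illustrative sketch; the names are made up):

```python
import queue
import threading

SENTINEL = None
inbox = queue.Queue()
outbox = queue.Queue()

def worker():
    # The worker shares no state with its peer; all coordination
    # happens through the two message boxes.
    while True:
        msg = inbox.get()
        if msg is SENTINEL:
            break
        outbox.put(msg * 2)

t = threading.Thread(target=worker)
t.start()
for n in (1, 2, 3):
    inbox.put(n)
inbox.put(SENTINEL)
t.join()
replies = [outbox.get() for _ in range(3)]
print(replies)  # [2, 4, 6]
```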

    Danger Will Robinson:

    One thing that I fear is that once the concept catches on, it will be overused or abused. People will start writing threads and processes that don't do enough work to justify the overhead. Everyone who starts writing programs will "advertise" that it's "multi-threaded", as if this somehow automatically indicates quality and/or "better" software...Not.

