MIT Randomizes Tasks To Speed Massive Multicore Processors

itwbennett writes Researchers at the Massachusetts Institute of Technology have created a data structure that they claim can help large multicore processors churn through their workloads more effectively. Their trick? Do away with the traditional first-come, first-served work queue and assign tasks more randomly. The SprayList algorithm allows processors with many cores to spread out their work so they don't stumble over one another, creating bottlenecks that hamper performance.
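The core idea can be sketched roughly like this (a simplification for illustration, not the published data structure -- the real SprayList is built on a lock-free skip list and chooses its window as a function of the core count, roughly O(p log^3 p); the function name and the plain core-count window here are made up): each core pops from a random slot in a window near the head instead of always taking the head.

```python
import random

def spray_pop(task_list, num_cores):
    """Pop a task from a random position in a window near the front of
    the list, so that many cores rarely collide on the same element.
    A plain num_cores-wide window stands in for the real spray width."""
    if not task_list:
        return None
    window = min(len(task_list), max(1, num_cores))
    return task_list.pop(random.randrange(window))
```

With one core the window shrinks to 1 and this degenerates to an exact priority queue; more cores widen the window, trading a little priority precision for much less head contention.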
This discussion has been archived. No new comments can be posted.


  • It always seemed wrong that out of order execution was more efficient than in order. I'll have to read up on it more. You'd think that the results coming in out of order would mess things up.

    • It always seemed wrong that out of order execution was more efficient than in order.

      The CPU is otherwise just sitting around waiting for stuff to happen. Sometimes the stuff it was waiting for means that what it did out-of-order has to be thrown away and done again, which means that it used that energy (and thus dissipated that heat) for nothing.

    • by mwvdlee ( 775178 ) on Monday February 02, 2015 @09:30AM (#48957597) Homepage

      Results don't come in out-of-order. Imagine two variables, A and B, each undergoes a number of calculation steps which don't refer to the other variable. I.e. A=A+5/2*13-29 and B=B*B*3+12/N, then finally adding them together as Z=A+B. Normal execution would first do all the calculations for A then all the calculations for B, then finally Z. Out-of-order execution would calculate both A and B simultaneously, wait for both to finish, then calculate Z. Out-of-order execution involves a lot of this type of waiting, but since it's waiting for just the slowest calculation instead of the sum of both the slowest and fastest calculation it ends up being done sooner. If things cannot be calculated like this, an out-of-order capable processor will simply do things in-order.

      At least that's how I understand it at a very abstract level.

      • What you say is pretty much correct.
        The reason out-of-order execution is faster is that in most cases the compiler doesn't know the full pipeline delay of each instruction.
        This helps with making binaries compatible with different processors implementing the same instruction set.
        For instance, the compiler might assume that +, -, *, and / all take the same number of cycles to calculate, when in reality of course some of those will be faster than others.

        A calculation like A = B*C; B = A+D; C = C+1 could then be reordered by the hardware so that C = C+1 executes while B = A+D waits for the result of A = B*C.
      • The article is about scheduling access to thread prioritization lists implemented as linked lists, not scheduling individual instructions. The issue involves multiple CPUs contending for the root (start) of a linked list, which causes cache-invalidation in the CPUs, thereby causing systemic inefficiencies.

        A very dumbed-down description of their solution is to have multiple "root" lists in order to reduce the frequency of cache-contention/invalidation.

    • If you want to read up on this stuff, google on "ibm patent multiple run queues" . . . they have done a lot of work in this area.

    • by gnupun ( 752725 ) on Monday February 02, 2015 @12:47PM (#48959351)

      It always seemed wrong that out of order execution was more efficient than in order.

      Dependency between instructions can temporarily delay execution of some instructions. This can be easily explained with an example:

      1) A = A + 1
      2) B = A + 2
      3) C = D + 3

      Above is pseudo code of 3 instructions being executed by a CPU. While instr #1 is being executed, the CPU will also try to execute instr #2, but since instr #2 needs the current value of register A, the CPU waits until instr #1 is complete and the value of A can be used to calculate B. This is why in-order CPUs can be simpler to build, but slightly slower.

      Out-of-order CPUs work around this problem by deferring instr #2. Instead, the CPU executes instr #3, which has no dependency on instr #1 or #2. Therefore an OOO CPU can execute two instructions in parallel in this case, whereas an in-order CPU may not be able to execute multiple instructions in some cases.
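A toy model of the difference (entirely hypothetical code, nothing like real issue logic: a 2-wide machine where a result written in one cycle is usable the next) shows the in-order version stalling behind instr #2 while the out-of-order version slots instr #3 into the gap:

```python
def schedule(instrs, out_of_order, width=2):
    """instrs: (name, dst_reg, src_regs) tuples in program order.
    Returns the list of issue groups, one group per cycle."""
    finished = set()                      # indices whose results are ready
    cycles = []
    while len(finished) < len(instrs):
        issue = []
        for i, (name, dst, srcs) in enumerate(instrs):
            if i in finished or len(issue) >= width:
                continue
            # blocked if an earlier, unfinished instruction writes one of our sources
            blocked = any(j not in finished and instrs[j][1] in srcs
                          for j in range(i))
            if blocked:
                if out_of_order:
                    continue              # skip it and look further down
                break                     # in-order: everything behind it waits
            issue.append(i)
        cycles.append([instrs[i][0] for i in issue])
        finished.update(issue)
    return cycles

program = [("A=A+1", "A", ["A"]),
           ("B=A+2", "B", ["A"]),
           ("C=D+3", "C", ["D"]),
           ("E=C+4", "E", ["C"])]
```

On this program the out-of-order run finishes in two cycles (A=A+1 paired with C=D+3, then B=A+2 with E=C+4), while the in-order run needs three, because B=A+2 blocks everything behind it in cycle one.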

  • by Anonymous Coward on Monday February 02, 2015 @09:19AM (#48957531)

    3D print it too, that always helps!

  • For tasks that need low level code optimization I can't see how this wouldn't hurt processing speed. Maybe it's time to add a tiny brain to each CPU capable of learning, in a primitive way, which execution orders work best?
    • by Bengie ( 1121981 )
      Now you just need to define what a "task" is. This is normally an OS or framework construct.
  • by OneSmartFellow ( 716217 ) on Monday February 02, 2015 @09:35AM (#48957609)
    ...many computing algorithms rely upon randomization for efficiency.
    • Interesting thought. It reminds me of quantum mechanics and probabilities. Perhaps the Big Bang was just another iteration of rnd(God).

  • Avoiding bottlenecks (Score:4, Interesting)

    by Zorpheus ( 857617 ) on Monday February 02, 2015 @09:38AM (#48957617)
    The article says that the SprayList algorithm is faster for many cores than a traditional priority queue, since there are collisions when several cores ask for the top priority task at once.
    Couldn't you just distribute the tasks ahead of time, giving every core a new task before its current task is finished?
    Also, the article says:

    Random assignment has traditionally been frowned upon by those who think a lot about computer processors, the researchers noted in a paper explaining the work. A random scheduling algorithm takes longer to jump around the queue than a conventional one does. Caches can't be used to store upcoming work items. And if a set of tasks needed to perform one job are executed out of order, then the computer needs additional time to reassemble the final results.

    I would think these problems are the same for the priority queue that they compare performance to. And I guess there are other ways which avoid these problems, which might produce faster results.

    • by Bengie ( 1121981 )
      They're not pushing or keeping a back-log of work to be done. Each core only accesses the task list once it's done with its current task, but it grabs a random task from a range at the beginning of the queue instead of just the first task.

      "Distributing" requires something to push. You don't want to push because you might send a task to an already overloaded core. You could keep track of which core has the least amount of work remaining, assuming that's even possible to know, but then you're back to serialising on shared state.
      • by Kjella ( 173770 )

        But assuming you have a known number of worker threads you could predistribute tasks in a round-robin fashion so that when thread #3 asks for a tasks it gets nextTask[3], off you go and we'll repopulate nextTask lazily once we've got lock on the main queue again, I assume if it's acceptable to pick from a range any one task will finish within an acceptable delay so that next task still gets done in time too. I wouldn't want to put a call to rand() anywhere in there.

        • by Bengie ( 1121981 )
          Round Robin is bad because it can have synchronization issues. What if a task takes hours or days to complete? Pre-populating can cause a task to never complete because you assigned it to the "wrong" core.
        • But assuming you have a known number of worker threads you could predistribute tasks in a round-robin fashion so that when thread #3 asks for a tasks it gets nextTask[3], off you go and we'll repopulate nextTask lazily once we've got lock on the main queue again, I assume if it's acceptable to pick from a range any one task will finish within an acceptable delay so that next task still gets done in time too. I wouldn't want to put a call to rand() anywhere in there.

          It's not about assigning tasks to threads, though, it's about assigning threads to cores. There is no authority above the core, and so the cores have to manage their own scheduling. If you assign each core an index, then you no longer have a true queue, and if core 1 hangs, the head of the list will just sit there indefinitely as core 2 continues to take task 2, etc. And what about dynamic systems that power down cores? Do you reassign core indexes every time a core shuts down? Now suddenly you're introducing a second scheduling problem on top of the one you started with.
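For what it's worth, the nextTask[i] scheme from the grandparent could be sketched like this (hypothetical names, single-threaded for clarity so none of the locking is shown) -- and the failure mode raised above is visible in it: a task parked in a stalled worker's slot is never stolen by anyone else.

```python
class SlotDispenser:
    """Round-robin predistribution: each worker has one pre-filled slot,
    refilled lazily from the shared backlog when the worker takes a task."""
    def __init__(self, tasks, num_workers):
        self.backlog = list(tasks)                 # head of queue = index 0
        self.slots = [self._refill() for _ in range(num_workers)]

    def _refill(self):
        return self.backlog.pop(0) if self.backlog else None

    def take(self, worker_id):
        """Grab the pre-assigned task, then refill the slot for next time."""
        task = self.slots[worker_id]
        self.slots[worker_id] = self._refill()
        return task
```

If worker 0 never calls take() again, whatever sits in slots[0] is stranded indefinitely, which is exactly the hung-core objection.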

    • Keeping in mind that I'm NOT a programmer that has spent a lot of time thinking about this...

      But pre-distributing tasks just sounds like a queue becoming many little queues without any particular advantage. Moreover, if you're not sure when a processor is going to be done its work, pre-assigning work may lead to bigger bottlenecks and starvation. If you try to pre-analyse the tasks so you know how much work is needed, you're effectively wasting cycles on a problem that doesn't need to be solved--it's almost certainly a net loss.

    • Yes . . . avoid bottlenecks . . . but do you really know what they are . . . ?

      Is the workload I/O bound? Are multiple processors hanging in spin locks? Can you detect lock contention on your system? How about local cache invalidation, because all the processors want to update the same memory address? Can you split this global address into several local per-processor addresses? Re-align variables so they are not in the same cache line?

      You really need to know your enemy, if you want to beat it.

      • Rule number one about computer performance . . . know your workload!

        I think of the algorithm in TFA as "Socratic computing performance" -- the only true knowledge about your workload is knowing that you know nothing about your workload. The more the computer has to introspect about individual tasks on the fly, the higher the computational load. Eventually you end up with the cost of optimisation being higher than the cost of the inefficiencies in the scheduling.
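The "split one global address into several local per-processor addresses" suggestion upthread, sketched structurally (the class name is made up, and in real code the point is to keep each shard on its own cache line, which Python can't control -- this only shows the shape of the idea):

```python
import threading

class ShardedCounter:
    """One slot per thread/core instead of one shared counter: writers
    touch only their own shard, and readers sum the shards on demand.
    In C you would additionally pad each shard to a cache-line boundary
    (commonly 64 bytes) to avoid false sharing."""
    def __init__(self, num_shards):
        self.shards = [0] * num_shards
        self.locks = [threading.Lock() for _ in range(num_shards)]

    def add(self, shard_id, n=1):
        with self.locks[shard_id]:     # per-shard lock: no global contention
            self.shards[shard_id] += n

    def total(self):
        return sum(self.shards)        # a reader's snapshot may be slightly stale
```

The cost is that total() is no longer a single atomic read -- the usual trade for getting writers out of each other's cache lines.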

  • RTFA, it was just an elaborate troll to get this surrealist theme song parody stuck in your head.

    Good for CPUs numbering up into the 80's. Sheesh.

  • by Anonymous Coward

    SprayList explained here.

  • by Anonymous Coward

    The 68000 had one approach to memory management that was to randomly delete a cell, and repeat. It was more efficient than all the "top-down" solutions of the day. Statisticians compare against a "random" approach when trying to determine if an effect is statistically meaningful. They ask "does the data support that approach x yields more than would be done by chance?" It would be nice if computer science of this century could catch up to or surpass the basic statistical ideas developed in the late 1800s.

  • Like a hash table?

  • Then why did it still beat SprayList at 32+ cores? The only time SprayList beats the competition is at 16 cores.

    • Then why did it still beat SprayList at 32+ cores? The only time SprayList beats the competition is at 16 cores.

      The graph shows total throughput, but that isn't a perfect measure, because it ignores the delay to high priority tasks. If you used random scheduling on a games console, you would risk rockets freezing in midair, erratic screen refresh rates, sound-effects that were out of sync with the action, server lag, etc etc etc.

      The researchers claim that SprayList is almost as good as non-prioritised scheduling, but that it respects priority enough that the worst side-effects of truly random scheduling are mitigated.

      • by Bengie ( 1121981 )
        Given enough throughput, you can statistically reduce your jitter to some low number. The transition could be painful.
        • Yes, but that would mean using a much more powerful processor. The researchers have basically loaded their dice to roll snake-eyes often enough to win big, but not so often as to get chucked out of the casino.
          • by Bengie ( 1121981 )
            We're reaching a point where we have more transistors than we can turn on without melting the chip. You can have a lot of spare cores that are turned off and can be turned back on for the occasional jitter-inducing burst of work. The extra cores act as a buffer to absorb unexpected work, while keeping average utilization at your target.

            Traffic shapers already have this notion of "real time" traffic, where the packet scheduler can pretty much guarantee that latency sensitive traffic gets scheduled ahead of the bulk traffic.
            • Still, if the front of the queue doesn't have priority, you can't guarantee that a given task will ever be performed - you can play roulette 1000 times and still never see the ball land on number 1....
  • if you bump into another packet you back off a random wait time. The other sender will back off, also a random wait time. Less likely to bump again.

    • by Bengie ( 1121981 )
      Collisions with Ethernet happen because more than one thing is trying to send at the same time, which means there is too much going on. The problem they're trying to solve is keeping these cores from having to back off in the first place; the backing off is making them sit idle for way too long.
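The Ethernet scheme the parent describes is binary exponential backoff; a minimal sketch (the function name, slot time, and cap are made up for illustration):

```python
import random

def backoff_delay(collisions, slot_time=1.0, cap=10):
    """After the k-th collision, wait a random number of slot times
    drawn uniformly from [0, 2^k - 1]; the cap keeps the window from
    growing without bound. The randomness is what makes a repeat
    collision between the same two senders unlikely."""
    k = min(collisions, cap)
    return random.randrange(2 ** k) * slot_time
```

The first attempt waits zero slots; each further collision doubles the window, so persistent contenders spread themselves out in time instead of re-colliding forever.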
  • by Wrath0fb0b ( 302444 ) on Monday February 02, 2015 @12:13PM (#48958999)

    This is old hat in the CS world that gets re-discovered fairly often: you can increase throughput at the cost of ravaging your latency. For some tasks, this is an acceptable tradeoff -- for others (especially anything interactive) it's completely unacceptable. Moreover, any synchronization point in the program converts the worst-case latency of the previous tasks into limits on throughput, e.g. the time it takes to join a set of N threads is equal to the maximum latency of any single thread in the list.

    The best analogy is an elevator (sorry car folks): you can optimize your elevator for throughput by having it always select the closest floor with an active call. The cost, obviously, is that if people are shuffling between floors 1 & 5 a lot, then the poor guy up on 30 might wait a really long time. The throughput is still maximized though, since the elevator can do more operations per unit time by avoiding the long trip to and from floor 30.

    In some cases this is fine, in the vast majority of cases you want to ensure that all tasks complete in a more bounded amount of time, even if that reduces the total number of tasks completed per unit time.

    • by Bengie ( 1121981 )
      I agree, the classic throughput vs latency. In the case of a high throughput 80 core CPU, I assume the trade-off is worth it, but for the quad-core desktop CPU, they might care more about not having their game stuttering.

      They might be able to do a hybrid design where they break up latency-sensitive tasks into smaller groups, where each group has a max number of shared cores to reduce contention, and the cores can be close to each other, because physical distance becomes an issue with large many-core CPUs.
    • The whole point is that they're claiming they can get throughput that is approaching best-case while keeping latency acceptable. It's just a matter of setting the probabilities such that one of the processors is likely to hit the head of the list in an acceptably short timeframe, while getting the probability of two trying to hit the head simultaneously as near to zero as possible.

      As I've pointed out a few times -- this is a stochastic system, not a truly random one.
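The elevator analogy upthread can be made concrete with a throughput-greedy routing sketch (a hypothetical function; real elevator controllers use scan-style algorithms, not pure nearest-call):

```python
def greedy_route(start, calls):
    """Always serve the nearest outstanding call. Total travel distance
    stays small, but a distant floor can be starved behind a stream of
    nearby requests -- latency is sacrificed for throughput."""
    pos, pending, order = start, list(calls), []
    while pending:
        nearest = min(pending, key=lambda floor: abs(floor - pos))
        pending.remove(nearest)
        order.append(nearest)
        pos = nearest
    return order
```

Starting at floor 1 with calls [5, 1, 30, 5], the route is 1, 5, 5, 30: the guy on 30 is served last even though he called before the second rider on 5.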

  • I guess the headline of AMD's strategy meeting is gonna be something like "MOAR CORESSS!!!!!"
  • I have never really 100% understood why it is better for sorting than, say, bubble sort. Is there any relation between this approach and core usage?
  • by LostMyBeaver ( 1226054 ) on Monday February 02, 2015 @02:13PM (#48960385)
    I have been working directly and indirectly with massive scale computing for decades:

    1) break single threaded tasks into smaller jobs that can be distributed.
    2) test that they still work
    3) establish a chain of dependencies so that tasks which must be performed in order will be.
    4) establish pipeline cost and cache coherency cost for each job.
    5) distribute tasks which share common data sets to cores which have the lowest latency between them
    6) take jobs which depend on output of other jobs and offset their start time so the input data from the other job is available before it is needed in the current job, decreasing the cost of synchronization.
    8) optimize job distribution so that tasks running in parallel with dependencies will be distributed to cores nearest to one another in the coherency ring.

    This is old school thinking. Randomizing should not offer better performance than properly scheduled tasks.
    • Oh... And 7 is not lucky :)
    • but the typical workload of general purpose system is missing a few of those requirements, kind of like in the movie Sneakers where a character says of organized crime "it ain't that well organized". Our devices get more and more cores each year....

  • by Anonymous Coward

    When I eat my dinner I leave the good stuff until last; that way I end the meal with a pleasant taste in my mouth.
