MIT Randomizes Tasks To Speed Massive Multicore Processors 63
itwbennett writes: Researchers at the Massachusetts Institute of Technology have created a data structure that they claim can help large multicore processors churn through their workloads more effectively. Their trick? Do away with the traditional first-come, first-served work queue and assign tasks more randomly. The SprayList algorithm allows processors with many cores to spread out their work so they don't stumble over one another, creating bottlenecks that hamper performance.
I don't know enough about this stuff (Score:3)
It always seemed wrong that out of order execution was more efficient than in order. I'll have to read up on it more. You'd think that the results coming in out of order would mess things up.
Re:I don't know enough about this stuff (Score:4, Informative)
If I'm not mistaken, you are thinking about branch prediction, not out-of-order execution in an otherwise serial pipe.
To elaborate, OOE deals with computing as much as possible without having to wait for a result first.
Branch prediction is a separate mechanism that attempts to predict the outcome of an if/switch or other branching evaluation, and then fills the pipelines feeding the execution units with the computations that follow that branch. Since evaluating an if/switch can take a long time, without a prediction the CPU would have to stall until the evaluation is complete.
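For a concrete feel of how a predictor works, here is a toy 2-bit saturating-counter predictor (a textbook design, not any particular CPU's):

```python
class TwoBitPredictor:
    """Classic textbook 2-bit saturating counter: states 0-3,
    predict 'taken' whenever the counter is 2 or more."""

    def __init__(self):
        self.state = 2                      # start weakly 'taken'

    def predict(self):
        return self.state >= 2

    def update(self, taken):
        # Move one step toward the observed outcome, saturating at 0 and 3.
        self.state = min(3, self.state + 1) if taken else max(0, self.state - 1)

p = TwoBitPredictor()
# A loop branch: taken 8 times, one exit, then taken 8 more times.
outcomes = [True] * 8 + [False] + [True] * 8
hits = 0
for taken in outcomes:
    hits += (p.predict() == taken)
    p.update(taken)
# The single loop exit costs only one misprediction: 16 of 17 correct.
```

The 2-bit hysteresis is the point: one surprising outcome (the loop exit) does not flip the prediction, so the next iteration of the loop is still predicted correctly.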
Re: (Score:2)
It always seemed wrong that out of order execution was more efficient than in order.
The CPU is otherwise just sitting around waiting for stuff to happen. Sometimes the stuff it was waiting for means that what it did out-of-order has to be thrown away and done again, which means that it used that energy (and thus dissipated that heat) for nothing.
Re:I don't know enough about this stuff (Score:4, Insightful)
Results don't come in out-of-order. Imagine two variables, A and B, each of which undergoes a number of calculation steps that don't refer to the other variable, i.e. A=A+5/2*13-29 and B=B*B*3+12/N, then finally adding them together as Z=A+B. Normal execution would first do all the calculations for A, then all the calculations for B, then finally Z. Out-of-order execution would calculate both A and B simultaneously, wait for both to finish, then calculate Z. Out-of-order execution involves a lot of this type of waiting, but since it's waiting only for the slowest calculation instead of the sum of the slowest and fastest, it ends up being done sooner. If things cannot be calculated like this, an out-of-order capable processor will simply do things in order.
At least that's how I understand it at a very abstract level.
Re: (Score:3)
The reason out-of-order execution is faster is that in most cases the compiler doesn't know the full pipeline delay of each instruction.
This helps with making binaries compatible with different processors implementing the same instruction set.
For instance, the compiler might assume that +, -, *, and / all take the same amount of cycles to calculate, when in reality of course some of those will be faster than others.
A calculation like A = B*C; B = A+D; C= C+1 could the
Re: (Score:2)
The article is about scheduling access to thread prioritization lists implemented as linked lists, not scheduling individual instructions. The issue involves multiple CPUs contending for the root (start) of a linked list, which causes cache-invalidation in the CPUs, thereby causing systemic inefficiencies.
A very dumbed-down description of their solution is to have multiple "root" lists in order to reduce the frequency of cache-contention/invalidation.
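Sticking with the dumbed-down version, the effect of multiple "roots" can be sketched single-threaded (the class, names, and shard count are mine, not from the paper):

```python
import random
from collections import deque

class ShardedQueue:
    """Toy sketch, not the actual SprayList: several independent
    sub-queues stand in for the multiple roots, so concurrent
    consumers would usually contend on different heads."""

    def __init__(self, num_shards=4):
        self.shards = [deque() for _ in range(num_shards)]

    def push(self, task):
        # Spread insertions across shards at random.
        random.choice(self.shards).append(task)

    def pop(self):
        # Start at a random shard and scan until a task is found.
        start = random.randrange(len(self.shards))
        for i in range(len(self.shards)):
            shard = self.shards[(start + i) % len(self.shards)]
            if shard:
                return shard.popleft()
        return None

q = ShardedQueue()
for t in range(8):
    q.push(t)
drained = sorted(q.pop() for _ in range(8))
# All eight tasks come back exactly once, though not in FIFO order.
```

The trade the comment describes is visible here: no single hot head to invalidate across caches, in exchange for giving up strict FIFO order.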
IBM patents on multiple run queues (Score:2)
If you want to read up on this stuff, google on "ibm patent multiple run queues" . . . they have done a lot of work in this area.
Re:I don't know enough about this stuff (Score:4, Interesting)
Dependency between instructions can temporarily delay execution of some instructions. This can be easily explained with an example:
1) A = A + 1
2) B = A + 2
3) C = D + 3
Above is pseudo code of 3 instructions being executed by a CPU. While instr #1 is being executed, the CPU will also try to execute instr #2, but since instr #2 needs the current value of register A, the CPU waits until instr #1 is complete and the value of A can be used to calculate B. This is why in-order CPUs can be simpler to create, but slightly slower.
Out-of-order CPUs work around this problem by deferring instr #2. Instead, the CPU executes instr #3, which has no dependency on instr #1 or #2. An OOO CPU can therefore execute two instructions in parallel in this case, whereas the in-order CPU may not be able to execute multiple instructions in some cases.
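The gain only shows up when an instruction stalls; a toy single-issue model (the latencies and the three-instruction program are invented for illustration) makes that visible:

```python
# Each entry: (name, latency in cycles, indices of instructions it needs).
instrs = [
    ("load A",  3, []),    # slow, e.g. a cache miss
    ("B = A+2", 1, [0]),   # must wait for the load
    ("C = D+3", 1, []),    # independent of both
]

def finish_cycle(out_of_order):
    """Issue at most one instruction per cycle; return the cycle in
    which the last result becomes ready."""
    ready_at = {}                          # index -> cycle result is ready
    pending = list(range(len(instrs)))
    cycle = 0
    while pending:
        cycle += 1
        issuable = [i for i in pending
                    if all(ready_at.get(d, float("inf")) < cycle
                           for d in instrs[i][2])]
        if not out_of_order and issuable[:1] != pending[:1]:
            continue                       # oldest instr stalled: whole pipe stalls
        if issuable:
            i = issuable[0] if out_of_order else pending[0]
            pending.remove(i)
            ready_at[i] = cycle + instrs[i][1] - 1
    return max(ready_at.values())

in_order = finish_cycle(False)   # everything waits behind the slow load
ooo = finish_cycle(True)         # "C = D+3" slips into the stall cycles
```

Under this model the in-order machine finishes a cycle later than the out-of-order one, purely because the independent instruction was allowed to jump the queue.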
Oooh I have an idea (Score:5, Funny)
3D print it too, that always helps!
Prevents optimization (Score:1)
Re: (Score:2)
It never ceases to amaze me how .... (Score:3)
Re: (Score:2)
Interesting thought. It reminds me of quantum mechanics and probabilities. Perhaps the Big Bang was just another iteration of rnd(God).
Avoiding bottlenecks (Score:4, Interesting)
Couldn't you just distribute the tasks ahead of time, giving every core a new task before its current task is finished?
Also, the article says:
Random assignment has traditionally been frowned upon by those who think a lot about computer processors, the researchers noted in a paper explaining the work. A random scheduling algorithm takes longer to jump around the queue than a conventional one does. Caches can't be used to store upcoming work items. And if a set of tasks needed to perform one job are executed out of order, then the computer needs additional time to reassemble the final results.
I would think these problems are the same for the priority queue that they compare performance to. And I guess there are other ways which avoid these problems, which might produce faster results.
Re: (Score:3)
"Distributing" requires something to push. You don't want to push because you might send a task to an already overloaded core. You could keep track of which core has the least amount of work remaining, assuming that's even possible to know, but then you're back
Re: (Score:2)
But assuming you have a known number of worker threads, you could predistribute tasks in round-robin fashion, so that when thread #3 asks for a task it gets nextTask[3]; off you go, and we'll repopulate nextTask lazily once we've got the lock on the main queue again. I assume that if it's acceptable to pick from a range, any one task will finish within an acceptable delay, so the next task still gets done in time too. I wouldn't want to put a call to rand() anywhere in there.
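The nextTask[i] scheme described above might look something like this (the class name and the refill policy are my guesses at the commenter's intent):

```python
import threading
from collections import deque

class PreDistributor:
    """Each worker owns a one-task mailbox (nextTask[i]), refilled from
    the shared queue under a lock only after the mailbox is emptied."""

    def __init__(self, tasks, num_workers):
        self.lock = threading.Lock()
        self.queue = deque(tasks)
        self.next_task = [None] * num_workers
        for i in range(num_workers):
            self._refill(i)

    def _refill(self, i):
        # The lock protects only the shared queue, never the mailboxes.
        with self.lock:
            self.next_task[i] = self.queue.popleft() if self.queue else None

    def take(self, i):
        task = self.next_task[i]
        self.next_task[i] = None
        self._refill(i)          # refill now; a real version could defer this
        return task

d = PreDistributor(range(6), num_workers=3)
got = [d.take(i % 3) for i in range(6)]
```

This keeps the fast path (reading your own mailbox) contention-free, at the cost of the head-of-line problems discussed in the replies below.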
Re: (Score:2)
Re: (Score:3)
But assuming you have a known number of worker threads, you could predistribute tasks in round-robin fashion, so that when thread #3 asks for a task it gets nextTask[3]; off you go, and we'll repopulate nextTask lazily once we've got the lock on the main queue again. I assume that if it's acceptable to pick from a range, any one task will finish within an acceptable delay, so the next task still gets done in time too. I wouldn't want to put a call to rand() anywhere in there.
It's not about assigning tasks to threads, though, it's about assigning threads to cores. There is no authority above the core, and so the cores have to manage their own scheduling. If you assign each core an index, then you no longer have a true queue, and if core 1 hangs, the head of the list will just sit there indefinitely as core 2 continues to take task 2, etc. And what about dynamic systems that power down cores? Do you reassign core indexes every time a core shuts down? Now suddenly you're introducin
Re: (Score:2)
Keeping in mind that I'm NOT a programmer that has spent a lot of time thinking about this...
But pre-distributing tasks just sounds like a queue becoming many little queues without any particular advantage. Moreover, if you're not sure when a processor is going to be done its work, pre-assigning work may lead to bigger bottlenecks and starvation. If you try to pre-analyse the tasks so you know how much work is needed, you're effectively wasting cycles on a problem that doesn't need to be solved--it's almost
Re: (Score:2)
Yes . . . avoid bottlenecks . . . but do you really know what they are . . . ?
Is the workload I/O bound? Are multiple processors hanging in spin locks? Can you detect lock contention on your system? How about local cache invalidation on memory addresses, because all the processors want to update the same memory address. Can you split this global address into several local per-processor addresses? Re-align variables so they are not in the same cache line . . . ?
You really need to know your enemy, if y
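The "split a global address into several per-processor addresses" trick, sketched in Python (this shows only the pattern; a real fix in C/C++ would also pad each counter onto its own cache line):

```python
import threading

NUM_THREADS = 4
INCREMENTS = 10_000

# One counter slot per thread instead of one shared counter: each
# thread writes only its own slot, so there is no shared hot address
# and no lock on the increment path.
local_counts = [0] * NUM_THREADS

def worker(idx):
    for _ in range(INCREMENTS):
        local_counts[idx] += 1

threads = [threading.Thread(target=worker, args=(i,))
           for i in range(NUM_THREADS)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# Combine only when the aggregate value is actually needed.
total = sum(local_counts)
```

The read (`sum`) gets slower and is only eventually consistent while writers run; that is the usual price of removing write contention.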
Re: (Score:3)
Rule number one about computer performance . . . know your workload!
I think of the algorithm in TFA as "Socratic computing performance" -- the only true knowledge about your workload is knowing that you know nothing about your workload. The more the computer has to introspect about individual tasks on the fly, the higher the computational load. Eventually you end up with the cost of optimisation being higher than the cost of the inefficiencies in the scheduling.
Too Many Cooks in the scheduler (Score:2)
RTFA, it was just an elaborate troll to get this surrealist theme song parody stuck in your head:
https://www.youtube.com/watch?... [youtube.com]
Good for CPUs numbering up into the 80's. Sheesh.
SprayList paper (Score:1)
SprayList explained here [mit.edu].
This is a 1980's solution (Score:1)
The 68000 had one approach to memory management that was to randomly delete a cell, and repeat. It was more efficient than all the "top-down" solutions of the day. Statisticians compare against a "random" approach when trying to determine if an effect is statistically meaningful. They ask "does the data support that approach x yields more than would be done by chance". It would be nice if computer science of this century could catch up to or surpass the basic statistical ideas developed in the late 1800
Re: This is a 1980's solution (Score:2)
Yep, I had a feeling of deja vu from computer science textbooks of times past when I saw this. First we're applauding the invention of the swamp cooler, and now this great breakthrough!
Re: This is a 1980's solution (Score:2)
I love to type in portrait but my thumbs are too fat!
Sounds familiar... (Score:2)
Like a hash table?
If random selection is bad... (Score:2)
Then why did it still beat SprayList at 32+ cores? The only time SprayList beats the competition is at 16 cores.
Re: (Score:2)
If it pans out in day-to-day operation, something like SprayList might pave the way for more effective use of new, many-core processors coming our way, such as Intel's new 18-core server chip, the E5 2600v3.
That's silly. SprayList only optimizes the case where the workload can be divided in independent jobs that can be executed out of order. Most real-life problems don't work that way.
Re: (Score:2)
Most real-life problems don't work that way.
[Morbo] Goodnight!
Re: (Score:2)
Re: (Score:2)
That's silly. SprayList only optimizes the case where the workload can be divided in independent jobs that can be executed out of order. Most real-life problems don't work that way.
Playing an MP3, drawing the desktop, listening for new mail on POP, loading Slashdot, polling the keyboard... there are lots of things my computer does at one time that are genuinely independent of each other. This is even more true of a multi-user cloud system. But more to the point, it's generally considered bad programming style to parallelise code that is inherently serial. If B relies on A, do A, then do B. Simple. If you need to thread them, then the threads should be self-synchronising, and if B needs
Re: (Score:2)
Then why did it still beat SprayList at 32+ cores? The only time SprayList beats the competition is at 16 cores.
The graph shows total throughput, but that isn't a perfect measure, because it ignores the delay to high priority tasks. If you used random scheduling on a games console, you would risk rockets freezing in midair, erratic screen refresh rates, sound-effects that were out of sync with the action, server lag, etc etc etc.
The researchers claim that SprayList is almost as good as non-prioritised scheduling, but that it respects priority enough that the worst side-effects of truly random scheduling are mitigated.
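A loose, single-threaded sketch of that compromise (my own simplification; the published SprayList is a lock-free skip list, not a heap):

```python
import heapq
import random

class SprayishQueue:
    """Delete-min returns one of the `width` smallest elements at
    random, so concurrent callers would rarely collide on the exact
    head, at the cost of only approximately respecting priority."""

    def __init__(self, width=4):
        self.width = width
        self.items = []                    # maintained as a heap

    def push(self, priority):
        heapq.heappush(self.items, priority)

    def spray_pop(self):
        # "Spray" into the first few positions instead of always
        # taking the exact minimum.
        k = min(self.width, len(self.items))
        choice = random.choice(heapq.nsmallest(k, self.items))
        self.items.remove(choice)
        heapq.heapify(self.items)          # restore the heap invariant
        return choice

q = SprayishQueue(width=4)
for p in [5, 1, 9, 3, 7, 2]:
    q.push(p)
first = q.spray_pop()    # one of the 4 smallest priorities, not always the min
```

High-priority items can be passed over a few times, but never by much: a caller always lands within `width` positions of the true head.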
Re: (Score:2)
Re: (Score:2)
Re: (Score:3)
Traffic shapers already have this notion of "real time" traffic, where the packet scheduler can pretty much guarantee that latency sensitive traffic can get schedul
Re: (Score:2)
Sounds like ethernet (Score:2)
If you bump into another packet, you back off a random wait time. The other sender also backs off a random wait time. You're less likely to bump again.
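The classic shared-medium Ethernet version of this is truncated binary exponential backoff; a rough sketch (the cap of 10 doublings follows the textbook scheme, simplified):

```python
import random

def backoff_slots(attempt, max_exp=10):
    """After `attempt` consecutive collisions, wait a random number of
    slot times drawn uniformly from [0, 2**min(attempt, max_exp) - 1].
    Each collision doubles the window, so two senders become ever less
    likely to pick the same slot again."""
    return random.randrange(2 ** min(attempt, max_exp))

# Two colliding senders choose their delays independently.
a = backoff_slots(3)   # somewhere in 0..7
b = backoff_slots(3)
```

After the first collision the window is only {0, 1}, so repeat collisions are common early on; the doubling is what makes the scheme settle quickly under load.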
Re: (Score:2)
Throughput versus Latency ... Again ... (Score:3)
This is old hat in the CS world that gets re-discovered fairly often: you can increase throughput at the cost of ravaging your latency. For some tasks, this is an acceptable tradeoff -- for others (especially anything interactive) it's completely unacceptable. Moreover, any synchronization point in the program converts the worst-case latency of the previous tasks into limits on throughput, e.g. the time it takes to join a set of N threads is equal to the maximum latency of any single thread in the list.
The best analogy is an elevator (sorry car folks): you can optimize your elevator for throughput by having it always select the closest floor with an active call. The cost, obviously, is that if people are shuffling between floors 1 & 5 a lot, then the poor guy up on 30 might wait a really long time. The throughput is still maximized though, since the elevator can do more operations per unit time by avoiding the long trip to and from floor 30.
In some cases this is fine, but in the vast majority of cases you want to ensure that all tasks complete in a more bounded amount of time, even if that reduces the total number of tasks completed per unit time.
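The elevator policy is easy to simulate; a small sketch (the floors and the distance-as-wait metric are invented for illustration):

```python
def nearest_first(start, calls):
    """Greedy 'closest active call first' policy from the analogy
    above. Returns, per floor, how many floors the car had travelled
    when it arrived there (a stand-in for each caller's wait)."""
    pos, travelled, waits = start, 0, {}
    remaining = set(calls)
    while remaining:
        nxt = min(remaining, key=lambda f: abs(f - pos))
        travelled += abs(nxt - pos)
        waits[nxt] = travelled
        pos = nxt
        remaining.remove(nxt)
    return waits

# Heavy traffic near the bottom starves the caller on floor 30:
waits = nearest_first(1, [2, 3, 4, 5, 30])
```

Total travel is minimised here, but the wait for floor 30 is many times that of every other caller, which is exactly the throughput-versus-latency trade described above.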
Re: (Score:2)
They might be able to do a hybrid design where they break up latency-sensitive tasks into smaller groups, where each group has a max number of shared cores to reduce contention, and the cores can be close to each other, because physical distance becomes an issue with large many-core CPUs.
Re: (Score:2)
The whole point is that they're claiming they can get throughput that is approaching best-case while keeping latency acceptable. It's just a matter of setting the probabilities such that one of the processors is likely to hit the head of the list in an acceptably short timeframe, while getting the probability of two trying to hit the head simultaneously as near to zero as possible.
As I've pointed out a few times -- this is a stochastic system, not a truly random one.
Meanwhile, inside AMD daily meeting... (Score:1)
rnd sort? (Score:2)
Is this news? (Score:3)
1) Break single-threaded tasks into smaller jobs that can be distributed.
2) Test that they still work.
3) Establish a chain of dependencies so that tasks which must be performed in order will be.
4) Establish the pipeline cost and cache-coherency cost for each job.
5) Distribute tasks that share common data sets to cores which have the lowest latency between them.
6) Take jobs which depend on the output of other jobs and offset their start time so the input data from the other job is available before it is needed in the current job, decreasing the cost of synchronization.
7) Optimize job distribution so that tasks running in parallel with dependencies are distributed to cores nearest one another in the coherency ring.
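Step 3 above (ordering tasks along a dependency chain) is classic topological scheduling; a minimal sketch with invented job names:

```python
from graphlib import TopologicalSorter   # stdlib, Python 3.9+

# A toy job graph: each job maps to the set of jobs it depends on.
deps = {
    "decode": set(),
    "filter": {"decode"},
    "encode": {"filter"},
    "thumbs": {"decode"},   # independent of filter/encode: free to run in parallel
}

# Any order this yields respects every dependency edge.
order = list(TopologicalSorter(deps).static_order())
```

`TopologicalSorter` also exposes a `get_ready()`/`done()` interface, which is the shape a scheduler would actually use to hand independent ready jobs to different cores.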
This is old school thinking. Randomizing should not offer better performance than properly scheduled tasks.
Re: Is this news? (Score:2)
Re: (Score:2)
but the typical workload of a general-purpose system is missing a few of those requirements, kind of like in the movie Sneakers where a character says of organized crime, "it ain't that well organized". Our devices get more and more cores each year....
Speaking of prioritizing algorithms (Score:1)
When I eat my dinner I leave the good stuff until last; that way I end the meal with a pleasant taste in my mouth.