MIT's Swarm Chip Architecture Boosts Multi-Core CPUs, Offering Up To 18x Faster Processing (gizmag.com) 55
An anonymous reader writes from a report via Gizmag: MIT's new Swarm chip could help unleash the power of parallel processing for up to 75-fold speedups, while requiring programmers to write a fraction of the code that is usually necessary for programs to take full advantage of their hardware. Swarm is a 64-core chip developed by Prof. Daniel Sanchez and his team that includes specialized circuitry for both executing and prioritizing tasks in a simple and efficient manner. Neowin reports: "For example, when using multiple cores to process a task, one core might need to access a piece of data that's being used by another core. Developers usually need to write code to avoid these types of conflict, and direct how each part of the task should be processed and split up between the processor's cores. This almost never gets done with normal consumer software, hence the reason why Crysis isn't running better on your new 10-core Intel. Meanwhile, when such optimization does get done, mainly for industrial, scientific and research computers, it takes a lot of effort on the developer's side and efficiency gains may sometimes still be minimal." Swarm is able to take care of all of this, mostly through its hardware architecture and customizable profiles that can be written by developers in a fraction of the time needed for regular multi-core silicon. The 64-core version of Swarm came out on top after MIT researchers tested it out against some highly-optimized parallel processing algorithms, offering three to 18 times faster processing. The most impressive result was when Swarm achieved results 75 times better than the regular chips, because that particular algorithm had failed to be parallelized on classic multi-core processors. There's no indication as to when this technology will be available for consumer devices.
Parallelization... (Score:5, Insightful)
Re: (Score:2)
Yup, many big cores do that today. It's called speculative execution.
Re:Parallelization... (Score:4, Informative)
No. The P6 does branch prediction. When you get to a branch, the processor guesses which one is taken and executes that. If it guessed wrong, it throws away all of the speculative results. The grandparent is talking about executing both branches. The upside of this is that you never mispredict a branch. The downside is that it's not really feasible and gives a huge increase in power consumption. A modern superscalar processor can easily have 50 instructions in flight at once (the Pentium 4 could have 140, which is partly why it rarely hit its peak performance). You have a branch, on average, every 7 instructions. To fill a pipeline of 50 instructions, you need to speculatively execute past 7 branches. Often these are loops, so branch prediction does a good job. Now imagine that you executed every path. After 7 branches, there are 128 possible places you could be. Each one of those includes an average of 7 instructions, so to be able to do all of that you'd need 18 times as many functional units. Register renaming (which is already one of the largest costs on the chip) would become vastly more complicated. Your processor would need liquid helium poured on it to keep it at a stable temperature. And, at the end of this, you'd still not have much better performance.
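The arithmetic above checks out; a quick back-of-the-envelope sketch using the same assumed numbers (a branch every 7 instructions, a 50-instruction speculative window, eager execution of both sides of every branch):

```python
# Back-of-the-envelope cost of eagerly executing both sides of every branch.
# The inputs are the assumptions from the comment above, not measurements.

INSNS_PER_BRANCH = 7   # average distance between branches
WINDOW = 50            # instructions in flight on a typical superscalar core

# Branches needed to fill the window along one predicted path.
branches_deep = WINDOW // INSNS_PER_BRANCH      # 7

# Eager execution forks at every branch: 2^7 = 128 live paths.
live_paths = 2 ** branches_deep                 # 128

# Instructions in flight if every path runs its ~7 instructions, relative
# to the 50 a normal speculative core keeps in flight.
total_insns = live_paths * INSNS_PER_BRANCH     # 896
blowup = total_insns / WINDOW                   # ~18x the functional units

print(branches_deep, live_paths, round(blowup))  # 7 128 18
```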
And that is assuming that all branches are simple conditionals, not computed branches (C++ virtual calls, cross-library calls via a PLT, function pointer calls, and so on). You can't execute all of the possible targets for a computed branch, so you'd still need the branch predictor infrastructure to handle this case, so you're not even saving much on hardware.
A few experimental chips have tried doing this for branches where the predictor doesn't give a high confidence of either path. In this kind of limited use, executing both branches at half speed, rather than executing one with a 50% chance of needing to discard the result, gives slightly better performance.
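The trade-off for those low-confidence branches is just expected-value arithmetic. A sketch with made-up but illustrative numbers (10 cycles of useful work per path, a 25-cycle flush penalty, a 50/50 predictor):

```python
# Expected cost of a low-confidence (50/50) branch under two strategies.
# All numbers are illustrative assumptions, not measurements of any chip.

work = 10          # cycles of useful work on the correct path
flush = 25         # assumed mispredict penalty (pipeline flush and refill)
p_correct = 0.5    # predictor confidence on this particular branch

# Strategy 1: predict one path; pay the flush when wrong.
cost_predict = p_correct * work + (1 - p_correct) * (work + flush)  # 22.5

# Strategy 2: run both paths at half throughput; never flush.
cost_both = 2 * work                                                # 20.0

print(cost_predict, cost_both)  # dual-path wins here, but only because the
                                # flush penalty exceeds the halved throughput
```

With a high-confidence predictor (say 95%), the same arithmetic flips and single-path prediction wins easily, which is why the technique only helps on the low-confidence cases.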
Re: (Score:2)
You have a branch, on average, every 7 instructions. To fill a pipeline of 50 instructions, you need to speculatively execute past 7 branches.
Oh, that gives me an idea of spacing my branches out better to speed things up. At least experimenting with it.
Re: (Score:2)
Re: (Score:2)
In any case, if you're going for efficiency, it's worth experimenting with.
Re:Parallelization... (Score:4, Informative)
Branch prediction integrated with the pipeline. Most CPUs do not execute both branches so much as they perform all the work required to quickly switch to the alternate branch should a branch not go as predicted. This implies an alternate pipeline into which the instructions for the alternate branch are queued. This might not sound like much but it actually constitutes >90% of the work a CPU must perform. The ALU is fast and simple but getting the correct data to and from the ALU is challenging.
CPUs can also support multiple ALUs - but this is not to speed branches. Multiple ALUs are used when the CPU detects that incoming instructions are not dependent on one another and can be executed concurrently. When detected, instructions are executed in parallel. The benefits gained are limited and it comes at the cost of extra transistors. However, because you have less movement of data, power requirements are reduced.
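The independence check described above boils down to comparing register read/write sets. A toy model (a deliberate simplification of real issue logic, with a hypothetical instruction format of dest register plus source registers):

```python
# Toy model of the hazard check a superscalar issue stage performs: two
# instructions can issue in the same cycle only if there is no RAW, WAR,
# or WAW dependence between them. Instruction format here is a made-up
# (dest_register, source_registers) tuple, purely for illustration.

def can_dual_issue(a, b):
    dest_a, srcs_a = a
    dest_b, srcs_b = b
    return (dest_a != dest_b             # WAW: both write the same register
            and dest_a not in srcs_b     # RAW: b reads what a writes
            and dest_b not in srcs_a)    # WAR: b writes what a reads

i1 = ("r1", ["r2", "r3"])   # r1 = r2 + r3
i2 = ("r4", ["r5", "r6"])   # r4 = r5 + r6  (no shared registers)
i3 = ("r7", ["r1", "r2"])   # r7 = r1 + r2  (needs i1's result)

print(can_dual_issue(i1, i2))  # True: independent, can execute in parallel
print(can_dual_issue(i1, i3))  # False: i3 must wait for i1
```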
Look at the Apple A9 CPU compared to alternate multi-core ARM chips that are available. The A9 is just as fast while running fewer cores at a lower clock rate while consuming less power. It is able to do so by using the previously mentioned techniques. It uses billions of transistors and costs more to produce than other chips that are just as fast. Not a good choice for making devices with low profit margins, but an excellent choice if you can afford it.
Re: (Score:3)
While true, multi-processor systems are considerably more responsive while busy with another task. So, e.g., you can be downloading upgrades, compressing files, and word processing all at the same time without penalty. Admittedly, it's hard to see how that particular scenario would be better with 100 cores than with 5 or 6. But a batch of them could be rendering an animation or some such.
FWIW, I have a task in mind where 1,000 cores would not be overkill, but most users would never do it. However they m
Re: (Score:2)
Re: Parallelization... (Score:1)
Re: (Score:3)
Re: (Score:2)
How did they claim a 75x speedup using 64 cores?
Re: (Score:2)
Re: (Score:2)
Don't underestimate the overhead expense of context switching.
Re: (Score:2)
When you subdivide a problem, each core works on a smaller subset. If those subsets fit into a cache that the bigger problem didn't, you can easily get superlinear increase as a result. In many cases you could actually rewrite the bigger problem to be more cache-friendly and get a similar speedup, so you generally don't make much of such "extra" performance increases.
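The cache effect described above is easy to see with a little arithmetic. A sketch with hypothetical sizes (a 64 MB working set, 1 MB of private cache per core):

```python
# Why subdividing can yield superlinear speedup: once each core's slice of
# the working set fits in its private cache, main-memory traffic drops
# sharply, so per-core throughput rises on top of the parallelism itself.
# The sizes below are illustrative assumptions, not measurements.

working_set_mb = 64     # the whole problem
l2_per_core_mb = 1      # hypothetical per-core private cache

def fits_in_cache(cores):
    """Does each core's share of the data fit in its own cache?"""
    return working_set_mb / cores <= l2_per_core_mb

print(fits_in_cache(8))    # False: 8 MB per core still spills to memory
print(fits_in_cache(64))   # True: 1 MB per core now fits entirely
```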
Re: (Score:1)
Most apps that need more processing benefit from multithreading, which you get with multiple cores. Parallel code is when a thread is broken down into mini threads and spread over multiple cores and then recombined to get the result. It creates overhead in a way, but for BIG number crunching it's especially useful.
If you want to run a simple process as fast as possible, you just run it and it runs and it's done. You can't really benefit from parallel code for most tasks, as you say.
But you seem to fail to re
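The split/recombine pattern described above (a task broken into pieces, spread over cores, then merged back) is the classic fork-join model. A minimal sketch with Python's standard library:

```python
# Fork-join sketch: split a big sum into chunks, process the chunks on a
# worker pool, then recombine the partial results. For pure-Python number
# crunching a process pool would be the realistic choice (the GIL serializes
# threads), but the structure is identical either way.
from concurrent.futures import ThreadPoolExecutor

def partial_sum(chunk):
    return sum(chunk)

def parallel_sum(data, workers=4):
    size = max(1, len(data) // workers)
    chunks = [data[i:i + size] for i in range(0, len(data), size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(partial_sum, chunks))   # recombine the pieces

print(parallel_sum(list(range(1001))))  # 500500, same as sum(range(1001))
```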
Re: (Score:2)
Special-Purpose chips (Score:4, Insightful)
I guess the world is rediscovering that special-purpose chips will always be faster at their special purpose than a general-purpose chip will be.
Re: (Score:2)
can't this hardware be translated to software? (Score:4, Interesting)
i am dumb on this, but if 'hardware architecture' can be made to take care of avoiding conflicts and "direct how each part of the task should be processed and split up between the processor's cores", same can be done through software that imitate whatever 'hardware architecture' is doing?
if this can be done, basically this software would be another step in compiling/assembling process?
as i said, i am ignorant on this, but why not?
Re: (Score:3)
I've only had a quick look at their press release, is there a pre-print of their paper anywhere?
This looks like a hardware implementation of something like "Grand Central Dispatch". Combined with transactional memory.
The basic idea seems to be that you can take a serial-ish process, break it up into tasks. Start running the first few tasks that should obviously run first. Then if you have spare CPU cores, you can also start speculatively executing later tasks. But if these speculative tasks hit a conflict
Re: (Score:2)
http://livinglab.mit.edu/wp-co... [mit.edu]
They use individual cores to speculatively execute very short sequences of instructions, for instance, a function call or loop iteration. The algorithms they benchmark resemble the architecture -- there are a lot of very small code sequences that aren't usually very dependent on each other, but the individual code sequences aren't large enough for traditional thread-based solutions with their high synchronization overhead to work.
One wonders how this would compare to a loc
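The speculative-task scheme described above can be sketched in software. This is an assumption-laden simulation of the general idea (run tasks optimistically, track read/write sets, count the tasks that would need re-execution), not Swarm's actual mechanism:

```python
# Software simulation of speculative task execution with conflict detection:
# tasks are committed in program order, and any task whose read set overlaps
# a logically-earlier task's write set counts as an abort (in real hardware
# it would be squashed and re-executed with fresh data).

def run_speculatively(tasks, memory):
    """tasks: ordered list of (reads, writes, fn); fn(memory) applies effects."""
    aborts = 0
    committed_writes = set()
    for reads, writes, fn in tasks:
        if reads & committed_writes:
            aborts += 1           # conflict detected: would roll back and retry
        fn(memory)                # commit (the re-executed result, in effect)
        committed_writes |= writes
    return aborts

mem = {"x": 0, "y": 0}
tasks = [
    (set(),  {"x"}, lambda m: m.__setitem__("x", 1)),             # writes x
    ({"x"},  {"y"}, lambda m: m.__setitem__("y", m["x"] + 1)),    # reads x: conflict
    (set(),  {"z"}, lambda m: m.__setitem__("z", 42)),            # independent
]
print(run_speculatively(tasks, mem), mem)  # 1 abort; x=1, y=2, z=42
```

The interesting part is the last task: it touches no shared data, so it can run arbitrarily early with no risk, which is exactly the kind of parallelism the hardware is mining.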
Any hardware can be software. Doesn't mean it shou (Score:3)
Sure it -could- be done in software. Essentially any design can be implemented as hardware, software, or a hybrid of the two. (A major problem for those complaining about "software patents".) I wouldn't be surprised if someone does take some of their ideas and implement them in software.
In general, hardware will be faster and in some ways more reliable than a software implementation of the same algorithm. It also means software doesn't have to be recompiled for lots of different types of hardware, if the
Re: (Score:2)
Actually, not so much a problem for software patents. Software is, generally speaking, a general solution, an algorithm. That is to say, math - something explicitly exempted from patent protection because it would inherently be overbroad and cut off all further development in that direction. Hardware is a machine - a specific implementation. Make some slight modifications, and it's no longer protected by the original patent.
If software patents followed the same rules as hardware patents they'd be far les
Look up Verilog, SystemC (Score:2)
To your point, look up "SystemC". It's the C programming language, used to write programs which are often compiled as pure hardware. Often, but not always - the same code can be rendered as either pure hardware or pure software. See also Verilog and PLAs. PLAs start and end their life as pure hardware devices. In between, connections in the hardware are destroyed to create a new hardware array as specified by programming language code.
What you're missing is that any algorithm, most any code, can be c
Re: (Score:2)
The thing is - the compiler could potentially generate a long list of different binaries or hardware configurations that all result in the same functionality within some performance envelope. As hardware, every one of those different assemblies would potentially require a separate patent as it does the same thing in a different manner, and hardware patents only protect specific implementations. As a software patent though, as they stand now, you don't even need to offer the source code that could generate
Re: (Score:3)
i am dumb on this, but if 'hardware architecture' can be made to take care of avoiding conflicts and "direct how each part of the task should be processed and split up between the processor's cores", same can be done through software that imitate whatever 'hardware architecture' is doing?
From reading the MIT page, I gather that it should be possible but it would result in substantial overhead. The bloom filter alone would also need its own core.
if this can be done, basically this software would be another step in compiling/assembling process?
Yes, however, this would not be helpful for 99% of software because most software simply cannot benefit from parallel processing. The one area that benefits the most from parallel processing is graphics, specifically manipulation and rendering. That said, where this may be able to help is in creating a better GPU, so it should be no surprise that
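For reference, the Bloom filter mentioned above is a cheap membership structure well suited to conflict checks: it never misses an address that was added, but occasionally reports a false positive (which in this setting just means an unnecessary abort, never a missed conflict). A minimal sketch with illustrative hash choices:

```python
# Minimal Bloom filter sketch: set membership with no false negatives but
# occasional false positives. Sizes and hash scheme are illustrative only.

class BloomFilter:
    def __init__(self, bits=1024, hashes=3):
        self.bits = bits
        self.hashes = hashes
        self.array = 0  # bit vector packed into a Python int

    def _positions(self, item):
        # Derive several bit positions per item; real hardware would use
        # fixed hash functions over the address bits.
        for i in range(self.hashes):
            yield hash((i, item)) % self.bits

    def add(self, item):
        for pos in self._positions(item):
            self.array |= 1 << pos

    def might_contain(self, item):
        return all(self.array >> pos & 1 for pos in self._positions(item))

writes = BloomFilter()
writes.add(0x1000)                   # record an address that was written
print(writes.might_contain(0x1000))  # True: added items are always found
```

An empty filter always answers no; a loaded one can occasionally say yes for an address that was never added, which is why the false-positive rate has to be engineered down with enough bits.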
Re: (Score:2)
I am going to assume that you are new to programming, because claiming that most software cannot benefit from parallel processing is hilariously false. It's just that most programmers can't do it, or can't do it well, that is the issue. Almost all software today can benefit from parallel processing; it's just a matter of how much, and whether it is worth the expense of actually getting a programmer who can do it rather than throwing 10 code monkeys in a room to bang out barely functional code.
Re: (Score:2)
your arrogance is astounding.
Not as you describe it (Score:1)
If this hardware does something that could be done at compile time, it is IMHO indeed pretty useless. That's why I hope it is "runtime-smart", meaning that it reacts to data access conflicts as they actually happen while the program is running. That would be something that is much harder to achieve, in an efficient manner at least, through software. The talk about the profiles devs have to declare doesn't sound good to me: people who don't bother writing software that uses proper locking or libs implementin
Hmm this sounds like it should be software (Score:2)
Sounds much more like something that should be refinements to code generation than baked into chip architecture. That said it's good to see work being done on better parallel methods rather than just bigger.
an older paper describing Swarm (Score:3, Informative)
Processes vs threads (Score:3)
Second, they allow a single program to do more than one thing at a time. Lots of programs will have a separate thread to handle the user interface while another does background tasks, but few will try to break big tasks into multiple pieces. For example, many database programs will be able to run several independent queries at the same time, but few will run a single query faster on a multi-core machine than on a single-core one.
I am working on a new data management system that does both. It can let lots of queries run at the same time, and it can break a single query into smaller pieces. The more cores the better. A query that takes 1 minute on a single core can often do the same thing in about 1/5 the time on a quad core (8 threads).
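That single-query parallelism can be sketched as map/reduce over table partitions: each worker scans its own slice, then the partial results are merged. The table and predicate below are made up for illustration, not from any real system:

```python
# Intra-query parallelism sketch: each worker scans one partition of the
# table and the partial counts are summed, like a parallel
# "SELECT COUNT(*) FROM rows WHERE amount > 90".
from concurrent.futures import ThreadPoolExecutor

rows = [{"id": i, "amount": i % 100} for i in range(10_000)]

def scan_partition(part):
    # one worker's share of the WHERE-clause scan
    return sum(1 for r in part if r["amount"] > 90)

def parallel_count(table, workers=4):
    size = len(table) // workers
    parts = [table[i:i + size] for i in range(0, len(table), size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(scan_partition, parts))  # merge partial counts

print(parallel_count(rows))  # 900: amounts 91..99 each appear 100 times
```

Counts, sums, and min/max merge trivially like this; joins and sorts need a smarter merge step, which is where most of the engineering effort in a parallel query executor goes.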
Headline has 18% More Unresolved References... (Score:2)
...than other articles. No, really, "more" and "less" only work when comparing things.
Crysis? (Score:2)
I'm pretty sure the developers of Crysis did put in the work to parallelize it effectively. Game engines are one of the most heavily optimized types of software out there, and CryEngine is one of the fastest game engines out there.
Mandatory comment (Score:2)