Hardware Technology

Researchers Unveil Experimental 36-Core Chip

Posted by samzenpus
from the we-need-another-core dept.
rtoz writes: The more cores — or processing units — a computer chip has, the bigger the problem of communication between cores becomes. For years, Li-Shiuan Peh, the Singapore Research Professor of Electrical Engineering and Computer Science at MIT, has argued that the massively multicore chips of the future will need to resemble little Internets, where each core has an associated router, and data travels between cores in packets of fixed size. This week, at the International Symposium on Computer Architecture, Peh's group unveiled a 36-core chip that features just such a "network-on-chip." In addition to implementing many of the group's earlier ideas, it also solves one of the problems that has bedeviled previous attempts to design networks-on-chip: maintaining cache coherence, or ensuring that cores' locally stored copies of globally accessible data remain up to date.


  • by nimbius (983462) on Monday June 23, 2014 @08:55AM (#47297223) Homepage
    All this performance in just one chip. I mean, sure, it has 36 cores, but let's be rational here... do you seriously expect it to run Crysis?
    • Re: (Score:1, Funny)

      by Anonymous Coward

      You'd need to imagine a beowulf cluster of 'em to accomplish that.

    • To really run Crysis well, you'd probably need something like the GeForce GTX Titan, which has 896 double-precision cores. However, if you raytrace the graphics, you might be able to run it on a 72-core Knights Landing chip.

  • http://www.adapteva.com/epipha... [adapteva.com]
    64 cores, mesh network that extends off the chip, in production.

    Try harder, MIT :-p

    • by itzly (3699663)
      Adding cores is easy. Keeping all the cores busy with useful work in a typical range of high performance applications is the difficult part.
    • by Melkhior (169823)
      http://www.adapteva.com/epiphanyiv/ [adapteva.com]
      64 cores, mesh network that extends off the chip, in production.

      Try harder MIT :-p

      They already tried harder: http://www.tilera.com/. And as another post mentioned, Intel Knights Corner is cache coherent on 61 cores (62 architected).

      The summary doesn't get the point of the article: what's novel is not the presence of cache coherency, it's just the
    • by TheRaven64 (641858) on Monday June 23, 2014 @01:00PM (#47298847) Journal

      The core count isn't the interesting thing about this chip. The cores themselves are pretty boring off-the-shelf parts too. I was at the ISCA presentation about this last week and it's actually pretty interesting. I'd recommend reading the paper (linked to from the press release) rather than the press release, because the press release is up to MIT's press department's usual standards (i.e. completely content-free and focussing on totally the wrong thing). The cool stuff is in the interconnect, which uses the bounded latency of the longest path multiplied by single-cycle one-hop delivery times to define an ordering, allowing you to implement a sequentially consistent view of memory relatively cheaply.
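      The ordering idea described above can be modeled in a few lines (a hypothetical Python sketch, not the chip's actual logic; the MAX_LATENCY value and message format are made up): stamp every message at injection, and since the mesh delivers any message within a known worst-case number of cycles, every node can commit messages in the same global order simply by waiting out that bound.

```python
import heapq

MAX_LATENCY = 8  # hypothetical worst-case delivery time in cycles (made up)

class Node:
    """Commits received messages in a globally agreed order.

    If the network guarantees delivery within MAX_LATENCY cycles and each
    message carries its injection cycle, then once the local clock passes
    t + MAX_LATENCY no message stamped earlier than t can still be in
    flight, so pending messages can be committed in timestamp order.
    """
    def __init__(self):
        self.pending = []    # min-heap of (inject_cycle, sender, payload)
        self.committed = []

    def receive(self, inject_cycle, sender, payload):
        heapq.heappush(self.pending, (inject_cycle, sender, payload))

    def tick(self, now):
        # Commit everything whose latency bound has expired.
        while self.pending and self.pending[0][0] + MAX_LATENCY < now:
            self.committed.append(heapq.heappop(self.pending))

# Two nodes receive the same messages in different orders...
a, b = Node(), Node()
a.receive(3, 1, "store X"); a.receive(2, 0, "store Y")
b.receive(2, 0, "store Y"); b.receive(3, 1, "store X")
for cycle in range(20):
    a.tick(cycle); b.tick(cycle)

# ...yet both commit them in the same global order.
assert a.committed == b.committed == [(2, 0, "store Y"), (3, 1, "store X")]
```

      The real chip gets this property from single-cycle hops and a bounded longest path rather than explicit timestamps, but the logical consequence is the same: a cheap, globally agreed ordering.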

      Since I'm here, I'll also throw out a plug for the work we presented at ISCA, The CHERI capability model: Revisiting RISC in an age of risk [cam.ac.uk]. We've now open sourced (as a code dump, public VCS coming soon) our (64-bit) MIPS softcore, which is the basis for the experimentation in CHERI. It boots FreeBSD and there are a few sitting around the place that we can ssh into and run. This is pretty nice for experimentation, because it takes about 2 hours to produce and boot a new revision of the CPU.

  • by SirDrinksAlot (226001) on Monday June 23, 2014 @09:09AM (#47297289) Journal

    So what's special about this chip that Intel's Xeon Phi (whose 80-core research-chip precursor was demonstrated back in 2007) isn't already doing? Or is this just a rehash of 7-year-old technology that's already in production? It sounds like a copy/paste of Intel's research.

    "Intel's research chip has 80 cores, or "tiles," Rattner said. Each tile has a computing element and a router, allowing it to crunch data individually and transport that data to neighboring tiles." - Feb 11, 2007

    • Presumably the novel way they address (pun intended) cache coherence is what's new here. More efficiency = greater performance. Time will tell.

      • by Trepidity (597) <delirium-slashdot@@@hackish...org> on Monday June 23, 2014 @09:24AM (#47297359)

        Yes, as usual, the MIT press release oversells the research, while the original paper [pdf] [mit.edu] is a bit more careful in its claims. The paper makes clear that the novel contribution isn't the idea of putting "little internets" (as the press release calls them) on a chip, but acknowledges that there is already a lot of research in the area of on-chip routing between cores. The paper's contribution is to propose a new cache coherence scheme which they claim has scalability advantages over existing schemes.

        • by epine (68316)

          The paper's contribution is to propose a new cache coherence scheme which they claim has scalability advantages over existing schemes.

          Somehow this was obvious to me even from the press release. I've never yet seen details of an ordering model laid bare where it wasn't the core novelty. Ordering models are inherently substantive. Ordering models beget theorems. Cute little Internets drool and coo.

    • by gman003 (1693318)

      It does seem rather similar - a large cluster of cores, laid out in a grid topology. Perhaps they're doing something different with the cache coherency? I couldn't find too much on how Intel's handling that, and it seems to be a focus of the articles on this chip.

  • I would be curious to know more about the architecture and overall chip specs they are using in their prototype: clock speed, memory interface, etc. The article states they are developing a version of Linux to test it on, so it's safe to say it's an established architecture. Anyway, I am excited to see the results once they have tested it on Linux. While this does not help with the density-per-core problem, perhaps it will help extend Moore's Law from the perspective of speed increase with respect to micro
  • So, in one die, it's a little interesting, though GPU stream processors and Intel's Phi would seem to suggest this is not that novel. The latter even lets you ssh in and see the core count for yourself in a very familiar way (though it's not exactly the easiest of devices to manage, it's still a very much real-world example of how this isn't new to the world).

    The 'not all cores are connected' idea is even older. In the commodity space, HyperTransport and QPI can be used to construct topologies that are not fully connected.

    • by Trepidity (597) <delirium-slashdot@@@hackish...org> on Monday June 23, 2014 @09:39AM (#47297429)

      The basic idea isn't new. What the paper is really claiming is new is their particular cache coherence scheme, which (to quote from the Conclusion) "supports global ordering of requests on a mesh network by decoupling the message delivery from the ordering", making it "able to address key coherence scalability concerns".

      How novel and useful that is I don't know, because it's really a more specialist contribution than the headline claims, to be evaluated by people who are experts in multicore cache coherence schemes.

      • by enriquevagu (1026480) on Monday June 23, 2014 @02:14PM (#47299371)

        Some knowledge about multicore cache coherence here. You are completely right: Slashdot's summary does not introduce any novel idea. In fact, a cache-coherent mesh-based multicore system with one router associated with each core was brought to market years ago by Tilera [tilera.com], a startup out of MIT. Also, the article claims that today's cores are connected by a single shared bus; that's far outdated, since most processors today employ some form of switched communication (an arbitrated ring, a single crossbar, a mesh of routers, etc.).

        What the actual ISCA paper [mit.edu] presents is a novel mechanism to guarantee total ordering on a distributed network. Essentially, when your network is distributed (i.e., not a single shared bus, which describes most current on-chip networks), there are several problems with guaranteeing ordering: i) it is really hard to provide a global ordering of messages (like a bus does) without making all messages cross a single centralized point, which becomes a bottleneck, and ii) if you employ adaptive routing, it is impossible to provide point-to-point ordering of messages.

        Coherence messages are divided into different classes in order to prevent deadlock. Depending on the coherence protocol implementation, messages of certain classes need to be delivered in order between the same pair of endpoints, and for this, some of the virtual networks can require static routing (e.g. Dimension-Ordered Routing in a mesh). Note that a "virtual network" is a subset of the network resources used by the different classes of coherence messages to prevent deadlock. This is a remedy for the second problem.

        However, a network that provided global ordering would allow for potentially huge simplifications of the coherence mechanisms, since many races would disappear (the devil is in the details), and a snoopy mechanism would become possible, as they implement. Additionally, this might also impact the consistency model. In fact, their model implements sequential consistency, which is the most restrictive, yet simplest to reason about, consistency model.

        Disclaimer: I am not affiliated with their research group, and in fact, I have not read the paper in detail.
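        The Dimension-Ordered Routing mentioned above is simple to illustrate (a toy Python sketch, not any real router's code): route along X until the column matches, then along Y. Every source/destination pair gets exactly one path, so two messages between the same endpoints can never take different routes and arrive out of order.

```python
def xy_route(src, dst):
    """Dimension-Ordered (XY) Routing on a mesh: move along X first,
    then along Y. The path between any (src, dst) pair is fixed, which
    is what gives point-to-point ordering on that virtual network."""
    x, y = src
    dx, dy = dst
    path = [(x, y)]
    while x != dx:
        x += 1 if dx > x else -1
        path.append((x, y))
    while y != dy:
        y += 1 if dy > y else -1
        path.append((x, y))
    return path

# The route is deterministic: repeated messages follow the identical path.
assert xy_route((0, 0), (2, 1)) == xy_route((0, 0), (2, 1))
print(xy_route((0, 0), (2, 1)))  # [(0, 0), (1, 0), (2, 0), (2, 1)]
```

        Adaptive routing, by contrast, picks among several minimal paths per message depending on congestion, which is exactly why it cannot give this guarantee.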

  • While adding an extra core or two brought big jumps in performance (because you are almost always running at least two applications), there comes a point where most users won't see a boost. While I may now be able to throw 36 processors at a problem, you have to program all those cores to work together. Right now that's a lot of effort, and until programming languages catch up and can optimize code by making it massively parallel, this is going to be a non-starter.

    • by itzly (3699663)
      A "new programming language" isn't a magical solution to make a non-parallel algorithm work well on a multi processor architecture.
      • by Z00L00K (682162)

        The question is: do you always need parallel-tasking software? Most tasks are bread-and-butter tasks; no need to chew them up. Put your energy into the few things that do need to be broken up.

        But mostly it's a chicken-and-egg problem: you can't do multi-core software since there aren't enough serious multi-core machines, or the owners of software companies don't see a benefit in it.

    • Maybe Scala can be your language. It supports creating your code out of mostly immutable objects, which makes it good for parallelism.
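    The point about immutability generalizes beyond Scala: a pure function over immutable inputs can be fanned out across a worker pool with no locks at all. A minimal sketch (in Python rather than Scala, with a made-up workload):

```python
from concurrent.futures import ThreadPoolExecutor

def price(order):
    """Pure function: touches no shared mutable state, so no locks are needed."""
    qty, unit = order
    return qty * unit

orders = [(3, 10.0), (1, 99.5), (7, 2.5)]  # immutable tuples

# Fan the work out across a pool; the result matches the serial map exactly,
# precisely because price() has no side effects.
with ThreadPoolExecutor(max_workers=4) as pool:
    totals = list(pool.map(price, orders))

assert totals == list(map(price, orders))  # == [30.0, 99.5, 17.5]
```

    The moment the function mutates shared state, this equivalence breaks and the locking headaches the thread is discussing come back.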

  • by magsol (1406749) on Monday June 23, 2014 @09:27AM (#47297373) Journal
    pointer arithmetic, cache invalidation, and off-by-one errors
  • Interesting (Score:4, Informative)

    by Virtucon (127420) on Monday June 23, 2014 @09:31AM (#47297395)

    Cache coherency has been one of the banes of multicore architecture for years. It's nice to see a different approach, but chip manufacturers are already getting high-performance results without introducing additional complexity. The Oracle (Sun) SPARC T5 [oracle.com] architecture has 16 cores with 128 threads running at 3.6GHz. It gives a few more years to Solaris at least, and it's a hell of a processor. For you Intel fans, the E7-2790 v2 [intel.com] sports 15 cores and 30 threads with a 37.5MB cache, and they're doing something right because it screams and is capable of 85GB/s memory throughput.

    I'm sure the chip architects are looking at this research, but somehow I think they're already ahead of the curve, because these kinds of core/thread counts are jumps ahead of where we were just a few years ago. Anybody remember the first Pentium Dual-Core [wikipedia.org] and the UltraSPARC T1 [wikipedia.org]?

    • by Bengie (1121981)
      High thread-count cores are good for workloads where there is little inter-thread communication and lots of memory stalls. By having a lot of threads running at once, whenever there is a memory stall you can just switch to another thread, and the chance of that thread also being stalled is very low. This also means lots more cache thrashing, so you need larger caches, but they can be tuned for high throughput at high latency. The entire design of these CPUs is geared for high-throughput, high-latency workloads, which
      • by Virtucon (127420)

        Oh, no question, high thread counts make sense for, say, a web-service application server vs. something more compute-intensive. None of these architectures will ever be in the teraflop or petaflop range, so there will still be a need to push highly compute-intensive workloads to specialized systems. One thing that could kill this architecture is software compatibility, so it'd be interesting to see if it does take off. In the meantime Moore's law will keep pushing the Sparc and

  • Parallel processing has made big strides, but only in some limited areas: graphics rendering, where each pixel can be updated independently of the others; fluid mechanics (CFD) using time-marching techniques, where updating the solution at one point needs data from only a limited set of neighbors; or iterative matrix solvers. Even for something very structured without if statements, like inverting a matrix, parallel methods have suffered.

    Basic problem is this: even if just 5% of the work has to be serial, the maximum speedup is 20x; that is the theoretical maximum. YMMV, and it does. The Internet and search have opened up another vast area where a thread can do lots of work and send just a very small set of results back to the caller. Hits are so small compared to misses that you can make some headway. Even then, we have found very few applications suitable for massively parallel solutions.
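    The 5% / 20x figure above is Amdahl's law: speedup = 1 / (s + (1 - s) / N), which approaches 1/s as the core count N grows. A quick sanity check:

```python
def amdahl_speedup(serial_fraction, cores):
    """Amdahl's law: overall speedup given a serial fraction s and N cores."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / cores)

# With 5% serial work the ceiling is 1 / 0.05 = 20x, no matter the core count.
assert abs(amdahl_speedup(0.05, 10**9) - 20.0) < 1e-3

# And 36 cores with 5% serial work deliver only about 13x, nowhere near 36x.
assert abs(amdahl_speedup(0.05, 36) - 13.09) < 0.01
```

    So even for the chip in the article, a mere 5% serial fraction throws away almost two-thirds of the theoretical 36x.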

    We need a big breakthrough. If you divide a 3D domain into a number of subdomains, the interfaces between the subdomains are 2D. The volume of the 3D domain represents computational load, and the interface area represents communication load. If we could come up with domain-division algorithms that guarantee the interfaces would be an order of magnitude smaller, even as we go from 3D to higher numbers of dimensions, and if we could organize these subdomains into hierarchies, we would be able to deploy more and more computational work and be confident the communication load would not overwhelm the algorithm. This breakthrough is yet to come. Delaunay tessellations (and their dual, Voronoi polygons) have been defined in higher dimensions, but the ratio of "cells" to "vertices" explodes in higher dimensions; last time we tried, we could not even fit a 10-dimensional mesh of 10 points into all the available memory of the machine. It did not look promising.
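    The volume-vs-interface argument above is easy to quantify for the simple cubic case (a toy sketch, not the tessellation machinery the poster describes): a cubic subdomain of side m in d dimensions has m^d interior cells (compute) and 2*d*m^(d-1) boundary faces (communication), so the ratio simplifies to 2d/m: it shrinks only linearly as subdomains grow, and climbs with the dimension.

```python
def comm_to_compute_ratio(m, d):
    """Boundary faces vs. interior cells for a cubic subdomain of side m
    in d dimensions: surface 2*d*m**(d-1) divided by volume m**d."""
    return (2 * d * m ** (d - 1)) / (m ** d)  # algebraically just 2*d/m

# Doubling the subdomain edge halves the communication burden...
assert comm_to_compute_ratio(10, 3) == 0.6
assert comm_to_compute_ratio(20, 3) == 0.3

# ...but raising the dimension raises it: in 10D the boundary dominates.
assert comm_to_compute_ratio(10, 10) == 2.0
```

    This is why the poster wants interfaces "an order of magnitude smaller": the cubic split only ever buys a factor of m, not an order-of-magnitude structural win.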

  • ...the Transputer. Great idea, but a giant market failure.
  • There are hundreds of processors with 64 cores or more, each of them claiming to have solved the scalability problem.

  • This is a nice little trick. It has the potential to extend shared consistent-memory multiprocessor designs to far larger numbers of processors. Whether it is a performance win remains to be seen. Good idea, though. Note that the prototype chip is just a feasibility test; they used an off-the-shelf Power CPU design, added their interconnect network, and sent the job off to a fab. A production chip would have optimizations this one does not.

    We know only two general-purpose multiprocessor architectures t

  • I don't see what the big deal is. I'm currently working with early silicon on a cache coherent 48-core 64-bit MIPS chip with NUMA support and built-in 40Gbps Ethernet support. The chip also has a lot of extended instructions for encryption and hashing plus a lot of hardware engines for things like zip compression, RAID calculations, regular expression engines and networking support among other things. It also has built-in support for content addressable memory.

    It also has a network on-chip where each core

  • MIT is expert at making these sorts of PR stunts, where they claim they invented something novel when they replicate some boring old result from 10 years ago. Well, here it is 30 years ago.

