MIT Startup Unveils New 64-Core CPU
Posted by
ScuttleMonkey
on Mon Aug 20, 2007 03:57 PM
from the tech-is-neat-but-using-it-is-neater dept.
single-threaded writes "Tilera, a startup out of MIT, has announced that it is shipping a 64-core CPU. Called the TILE64, the CPU is fabbed on a 90nm process and is clocked at anywhere from 600MHz to 900MHz. 'What will make or break Tilera is not how many peak theoretical operations per second it's capable of (Tilera claims 192 billion 32-bit ops/sec), nor how energy-efficient its mesh network is, but how easy it is for programmers to extract performance from the device. That's the critical piece of TILE64's launch story that's missing right now, and it's what I'll keep an eye out for as I watch this product make its way in the market. Though there are any number of questions about this product that remain to be answered, one thing is for certain: TILE64 has indeed brought us into the era of 64 general-purpose, mesh-networked processor cores on a single chip, and that's a major milestone.'"
Related Stories
Submission: MIT startup unveils new 64-core CPU by Anonymous Coward
This discussion has been archived.
No new comments can be posted.
The Fine Print: The following comments are owned by whoever posted them. We are not responsible for them in any way.
Oblig... (Score:5, Funny)
Re: (Score:2, Funny)
Re:Oblig... (Score:4, Funny)
Parent
Instruction Set (Score:5, Insightful)
I'll be interested to see what they're going to do about making it easier to program. Wire delay's going to be exposed as hops on the on-chip network. IMHO, the toolchain side's far more interesting to me than shoving a bunch of cores together on an on-die network....
Assuming they did anything interesting on the toolchain side.
Re:Instruction Set (Score:4, Informative)
Also FTA: "I'm due to talk to the head of Tilera's software team, which is actually larger than the company's hardware team."
I'll be very curious what their development toolchain ends up looking like, but it seems clear they understand the issue.
Parent
Re:Instruction Set (Score:5, Funny)
He must be a really fat guy!
Parent
Re: (Score:3, Funny)
No, he just has a REALLY big head.
The real question is (Score:3, Insightful)
Re: (Score:2, Interesting)
This is the same problem we've been working with on clusters forever...How do you tune and load balance the jobs to the point where you're getting the most out of your hardware, and nothing is sitting idle while other parts of the system are running at 100%? What do you do when the task is already reduced to the simplest level and there is no be
Re:Instruction Set (Score:4, Interesting)
The solution, of course, is to move away from the imperative programming model to dataflow [wikipedia.org] or functional [wikipedia.org] one. That way the compiler can automatically parallelize the task, instead of the programmer having to do so manually.
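A rough sketch of why that model helps, in Python rather than a real dataflow language. The `brighten` function and the pixel data are made up for illustration. Because the function is pure (its output depends only on its input), the sequential map can be swapped for a parallel one without changing the result, which is the property an auto-parallelizing compiler would rely on. CPython threads will not actually speed this up; the point is only that the ordering no longer matters.

```python
from concurrent.futures import ThreadPoolExecutor


def brighten(pixel: int) -> int:
    """Pure function: no shared state, output depends only on the input."""
    return min(pixel + 40, 255)


def run_sequential(pixels):
    # Plain map: elements are processed one after another.
    return list(map(brighten, pixels))


def run_parallel(pixels, workers=4):
    # Executor.map returns results in input order, so the output matches
    # the sequential version even though the calls may interleave.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(brighten, pixels))


if __name__ == "__main__":
    pixels = list(range(0, 256, 8))
    print(run_sequential(pixels) == run_parallel(pixels))
```

With an imperative loop that mutates shared state, the same swap would need locks or a rewrite, which is the manual work the comment is talking about.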
Parent
Re:Instruction Set (Score:4, Interesting)
A chip is basically built as follows
poly
metal
poly
Si
While there are some technologies (SOI for example) that may allow this in theory, you start to run into other issues like trying to punch through the insulator in specific areas and with high precision (neither of which is easy), heat dissipation (transistors are transistors, and switching produces heat, doesn't matter if it's an ALU or a SRAM). And finally before someone suggests using the other side of the wafer, how do you connect the two sides? A wafer is *very* thick in the scale we are discussing. It would be like mining a hole through the earth.
More useful would perhaps be distributing L0 cache (register memory) a little more liberally in key areas of the processor, but then addressing gets in the way. In theory, having an MCM (multi-chip module) with Cache - Processor - Cache, so there is ample L3 cache running at core/4 clock, may help, but costs get prohibitive.
There is no really good solution to moving data around once you start getting to this kind of density. Eventually wire delay may be the limiting factor in CPU throughput.
-nB
Parent
Re:Instruction Set (Score:4, Informative)
Intel did this as well and redesigned the Pentium 4 on it.
The old method of bonding two wafers also works. Smart sensors, for instance, bond a photodetector material (a semiconductor like InGaAs or InSb) onto the top of a CMOS chip. The bonding was very expensive, of course, but it is definitely possible to grow a semiconductor on top of existing metal/polysilicon.
Parent
Re: (Score:2)
Re:Instruction Set (Score:5, Informative)
"If you have an application written for any multi-core or single processor architecture that's written to work with Linux, you can take it, compile it and have it running on our chip in minutes," he said. "Now, if you want to ratchet up the performance, we provide libraries and interface mechanisms that customers can use to tune code." From here [theregister.co.uk]
Parent
Re: (Score:3, Informative)
Until I see some results of dynamically-compiled C code that runs really fast on this thing, I don't see it offering better solutions than, say, an FPGA. The exception would be if this was much lower-powered.
It's not theoretically impossible to do. Instead of treating it like a CPU, treat it like a network with micro-ops treated like packets. Run ea
Re: (Score:3, Insightful)
Contrary to the summary and your remark, I'm not sure it's Tile64's problem to bring parallel programming to the masses. First, because many-core chips are already useful (and present no special difficulties) for servers that handle many simultaneous connections - in other words, reducing the space and electricity requirements of server farms. That's a sign
bulk pricing (Score:2, Funny)
Correct! 6000 cores (Score:5, Funny)
Now that I have a 64 core CPU... (Score:2, Funny)
obligatory (Score:2, Redundant)
Re: (Score:2)
It will take only 1024 of these to have the same number of processors as the Connection Machine.
http://en.wikipedia.org/wiki/Connection_Machine [wikipedia.org]
--
BMO
I Did RTFM, and there's key info missing (Score:5, Insightful)
Without those bits of information, it's impossible to gauge exactly who might need this chip, and how successful it might be.
Re: (Score:2)
Judging from the applications they mention (networking / video stuff) I'm guessing it doesn't have much floating point performance.
Re:I Did RTFM, and there's key info missing (Score:5, Informative)
The wattage isn't missing:
TFA says it's between 175 and 300 milliwatts per core - do the math: roughly 11 to 19 watts. They're targeting the embedded market (and with those low power consumption figures, I think a super laptop would be a no-brainer).
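Spelling out the math with the per-core figures quoted above. This is a back-of-the-envelope sketch: it assumes all 64 cores draw the quoted figure and ignores whatever the mesh routers, memory controllers and I/O add on top, which TFA as quoted here does not break out.

```python
CORES = 64
MW_PER_CORE_LOW = 175   # milliwatts per core, low end quoted from TFA
MW_PER_CORE_HIGH = 300  # milliwatts per core, high end quoted from TFA


def total_watts(mw_per_core: float, cores: int = CORES) -> float:
    """Total core power in watts for a given per-core draw in milliwatts."""
    return mw_per_core * cores / 1000


if __name__ == "__main__":
    low = total_watts(MW_PER_CORE_LOW)
    high = total_watts(MW_PER_CORE_HIGH)
    print(f"{low:.1f} W to {high:.1f} W")  # prints "11.2 W to 19.2 W"
```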
Parent
Re: (Score:3, Informative)
FPGAs (particularly ones from Xilinx) that offer similar logic horsepower (assuming you had a digital designer to write your VHDL for you) for less than 500mW.
The latest
Re: (Score:3, Informative)
Re: (Score:3, Funny)
Warning: Sarcasm above may cause irritation of skin and explosion of monitor.
Re: (Score:2)
1. Does it matter? It's useful for computer architects to know for comparison but doesn't matter for the end user. I'm curious, too, but that can wait. Doesn't matter for system designers even.
2. They list 170-300 mW/core, but that's not clear as to what the base power is for the peripherals and routers. Is that (900 MHz) 300 mW * 64 (about 20 W) for the whole
I'm ready for it (Score:2, Interesting)
> ps aux | wc -l
281
Of course not all those processes are in runnable state. On the other hand, many of those processes have multiple threads. A typical Java Swing GUI app may have a dozen threads, for example. A web server process can easily have dozens of runnable threads. Software is going to take a little bit of catching up, but nothing huge.
Re:I'm ready for it (Score:4, Insightful)
It's very hard to take advantage of multiple cores because very often, there isn't more than one thing for a program to be doing at the same time, and for most desktop users, there are rarely more than 1 or 2 programs running actively at a time. Many code paths are not explicitly parallelizable, and many more are parallelizable but not easily so. Just as clock speed is not the holy grail of processor performance, core count isn't either.
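One standard way to put numbers on this is Amdahl's law. The comment doesn't name it, and the parallel fractions below are made-up illustration values, not measurements of any real program.

```python
def amdahl_speedup(parallel_fraction: float, cores: int) -> float:
    """Upper bound on speedup when only part of the work can run in parallel."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / cores)


if __name__ == "__main__":
    for fraction in (0.50, 0.95):
        for cores in (2, 8, 64):
            speedup = amdahl_speedup(fraction, cores)
            print(f"{fraction:.0%} parallel, {cores:2d} cores: {speedup:.2f}x")
```

Under these assumptions a program that is half serial tops out below 2x no matter how many cores you throw at it, which is the point about core count not being the holy grail.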
Parent
Rumored... (Score:5, Funny)
*Required 32 GB of RAM not included.
Instruction set? (Score:3, Insightful)
They'll probably market running Java as a strong point.
(Then again, does it run Linux?)
Re: (Score:3, Insightful)
wow. (Score:3, Funny)
i wouldn't hold my breath.
Tequila128 (Score:4, Funny)
The Tequila128. Free copy of virtual beer pong included.
But does it... (Score:5, Informative)
Re:But does it... (Score:5, Informative)
If you look at their block diagram this looks more like an FPGA-on-drugs than a CPU.
The individual blocks are probably programmed with GCC, since it should be trivial to port it to a MIPS-like architecture. I wonder if the interconnect uses a VHDL type language or if they rely on their weird cache to build efficient shared memory.
Either way, it looks like you have to keep in mind the architecture while designing your software. I doubt they can build a compiler that can manage the division of labor.
Unlike a typical multicore design you wouldn't use this to parallelize a multithreaded application or a multiprocess workload. The center processors will have a very different latency characteristic than the edge ones, and you want the parts that interact with the network to be on the points adjacent to the controllers, for example.
So it should work great for a specially designed system, but not so great as a general-purpose CPU.
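To put a number on the center-versus-edge point, here is a small sketch. It assumes the 64 tiles form an 8x8 grid and that a message travels along rows and columns only, so the hop count is the Manhattan distance. Both the grid shape and the routing rule are assumptions for illustration, not taken from Tilera's documentation.

```python
GRID = 8  # assumption: 64 tiles laid out as an 8x8 mesh


def hops(a, b):
    """Hop count between tiles a and b when routing along rows and columns."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def mean_hops_from(tile):
    """Average hop count from one tile to every other tile on the mesh."""
    others = [(x, y) for x in range(GRID) for y in range(GRID) if (x, y) != tile]
    return sum(hops(tile, other) for other in others) / len(others)


if __name__ == "__main__":
    corner, center = (0, 0), (3, 3)
    print(f"corner tile: {mean_hops_from(corner):.2f} hops on average")
    print(f"center tile: {mean_hops_from(center):.2f} hops on average")
    print(f"worst case, corner to corner: {hops((0, 0), (7, 7))} hops")
```

Under these assumptions a corner tile is on average almost twice as far from everyone else as a center tile, so where you place a task on the grid matters.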
Parent
Tilera MDE (Score:3, Informative)
ummm... Isn't Sun's T2 running 256 threads? (Score:2)
Re: (Score:3, Informative)
Because this has 64 cores as opposed to 8 cores on either the T1 or T2?
Because the total number of threads supported by an 8 core T2 is 64 and not 256 as you wrote above?
This was my companys idea in 2001 (Score:5, Interesting)
It was called Enumera (www.enumera.com).
I started to work with Chuck Moore, the author of the FORTH language, on a 7x7 array of very fast small processors.
From a talk I gave on February 16, 2001:
From http://www.dnull.com/~sokol/amorp/emtalk.ppt [dnull.com]
build. Co-processors could also be added.
Each CPU would operate at 2400 MIPS, x 49 for a total of about 117 billion operations per second.
The power consumption would be about 1 watt (1.8 volts at 500 mA).
With this level of computing power, new applications that were unthinkable before now become possible.
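The figures from the talk multiply out as follows. This only redoes the arithmetic on the numbers quoted above; it says nothing about what the chip would have sustained in practice.

```python
CORES = 49            # 7x7 array
MIPS_PER_CORE = 2400  # per-CPU figure from the talk
VOLTS = 1.8
AMPS = 0.5            # 500 mA

total_mips = CORES * MIPS_PER_CORE  # 117600 MIPS, i.e. about 117.6 billion ops/sec
watts = VOLTS * AMPS                # 0.9 W, rounded to "1 watt" in the talk

if __name__ == "__main__":
    print(f"{total_mips / 1000:.1f} billion ops/sec at {watts:.1f} W")
```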
http://developers.slashdot.org/comments.pl?sid=13
And earlier here:
http://www.colorforth.com/ [colorforth.com] 25x Multicomputer Chip
This eventually became IntellaSys after Enumera failed.
http://www.findarticles.com/p/articles/mi_m0EIN/i
http://www.findarticles.com/p/articles/mi_m0EIN/i
Also for older info see:
Specifically look at the P21 / I21/ F21 chips...
http://www.enumera.com/chip/ [enumera.com]
http://www.ultratechnology.com/ml0.htm [ultratechnology.com]
http://www.ultratechnology.com/f21.html#f21 [ultratechnology.com]
http://www.ultratechnology.com/store.htm#stamp [ultratechnology.com]
http://www.ultratechnology.com/cowboys.html#cm [ultratechnology.com]
Re: (Score:3, Interesting)
I put $100,000 Cash and almost 2 years worth of work into this and got nothing, no one was even interested.
But then I see a Bunch of MIT weenies do it and they get all kinds of attention as something new and revolutionary 6 1/2 years later.
There is also a real chance they took the idea right off my web site or slashdot
Re:This was my companys idea in 2001 (Score:4, Insightful)
I put $100,000 Cash and almost 2 years worth of work into this and got nothing, no one was even interested.
I'm not sure why the frustration. I'm sure multi-core was not just your original idea. If you're in the industry you know that:
1. IT is rich on ideas, poor on implementation.
2. Marketing a product is just as (if not more) important than making a product.
3. Most businesses fail in the first 5 years. And this one may be no exception. They haven't exactly enjoyed massive success just yet. They got a few crappy articles and landed on Slashdot. Kind of hard for a hardware company to cash in on that alone.
Their design really looks like it was lifted straight off my paper. So I guess at least I am exposing some plagiarism.
You don't expose plagiarism by venting frustration on Slashdot: where are your patents? What guarantee is there that you're the originator, and that they *stole* your work rather than reinventing it independently, which happens often with technology that's in a boom (i.e. multi-core designs)? There's a reason the patent system exists; forget the crap you read about patents here on Slashdot.
Parent
Re:This was my companys idea in 2001 (Score:5, Informative)
Up till now there were only two types of parallel processing.
1.) Loosely coupled. Thinking Machines and Beowulf clusters, for example, use this; the nodes are interconnected with Ethernet or some other network medium and send messages back and forth.
2.) Tightly coupled. This is SMP, NUMA, SNOOPY: basically a shared-memory system where each processor shares the same global memory space.
Each requires very different programming strategies and is limited to certain types of problems.
There is also a third form that is less well known: systolic arrays. An example of this is TimeLogic, and many DOD-type projects.
This is usually done with a bunch of FPGAs, and the math computations are done as a series of hardware pipelines without any CPU.
With the parallel core processor it's possible to make it like an SMP (shared-memory) type system, but you really get hammered by the memory bottleneck, so after about 4 CPUs you don't really gain much.
What I had proposed was doing systolic-array-type processing, but with simple but fast CPUs on one chip.
They would be connected with CPU registers that would pass data directly from one CPU to the next.
Its design would allow super-tight coupling between each processor, so a programming problem wouldn't need to process a buffer at a time but could tackle problems that can't normally be broken up into parallel operations. For example, a bignum math operation like multiplying two numbers that are 1024 bits long. Or a large FFT, fast DVT, or matrix operations where each CPU could process part of a single operation that must be done serially and cannot be done using traditional parallel processing.
Specifically, my interest was in video compression and image processing in real time. This is where DCT, motion vector searches, Huffman coding and other operations that don't parallelize well would really get a boost from this type of processor.
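A software sketch of the bignum case, to show what "passing registers from one CPU to the next" could look like. This is a simulation under my own assumptions, not the Enumera design: the 16-bit limb size, the function names and the cell layout are all hypothetical. Each simulated cell holds one limb of `b`; on every tick it multiplies the incoming limb of `a` by its own limb and adds the partial sum its right-hand neighbour produced on the previous tick. For simplicity the incoming limb is broadcast to all cells, where a stricter systolic array would pipeline it too.

```python
BASE = 1 << 16  # limb size; an arbitrary choice for this sketch


def to_limbs(n: int) -> list:
    """Split a non-negative integer into little-endian base-2^16 limbs."""
    limbs = []
    while n:
        limbs.append(n % BASE)
        n //= BASE
    return limbs or [0]


def from_limbs(limbs: list) -> int:
    """Reassemble little-endian limbs into an integer."""
    return sum(limb * BASE**i for i, limb in enumerate(limbs))


def systolic_multiply(a: int, b: int) -> int:
    """Multiply two non-negative integers with a simulated chain of cells."""
    a_limbs, b_limbs = to_limbs(a), to_limbs(b)
    n = len(b_limbs)
    regs = [0] * (n + 1)  # regs[j]: partial sum held by cell j; regs[n] stays 0
    carry = 0
    out = []
    # Trailing zeros flush the partial sums still sitting in the chain.
    for a_t in a_limbs + [0] * n:
        # All cells fire on the same tick, each reading only the value its
        # right-hand neighbour held on the previous tick.
        regs = [regs[j + 1] + b_limbs[j] * a_t for j in range(n)] + [0]
        # Cell 0 emits one finished column per tick; resolve carries as we go.
        total = regs[0] + carry
        out.append(total % BASE)
        carry = total // BASE
    return from_limbs(out)


if __name__ == "__main__":
    x = (1 << 1024) - 159
    y = (1 << 1023) + 12345
    print(systolic_multiply(x, y) == x * y)
```

Each tick does one multiply-add per cell, so the whole chain works on a single multiplication at once instead of each CPU taking its own buffer.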
Parent
Let the geeks solve the problem (Score:3, Interesting)
Build a USD1000 desktop workstation, port Debian Linux to run on it and let the geeks out there adopt it.
There is no better way to explore a device's capabilities than to let the market do it.
I want one for myself. I am tired of the x86 architecture.
I for one (Score:4, Funny)
I, for one, parallel welcome our new beowulf joke superseding overlords.
I, for one, parallel welcome our new beowulf joke superseding overlords.
I, for one, parallel welcome our new beowulf joke superseding overlords.
I, for one, parallel welcome our new beowulf joke superseding overlords.
Re:I for one (Score:4, Funny)
IIIIIIII,,,,,,,, ffffffffoooooooorrrrrrrr oooooooonnnnnnnneeeeeeee,,,,,,,, etc.
fava
Parent
Re: (Score:2)
Re: (Score:3, Insightful)
Re: (Score:2)
It will be interesting to see how this works out. In practice, the development tools seem to be primary. If people can't develop for it easily, they won't, and the product fails. Hence, you need development / cross-development tools.