Tilera To Release 100-Core Processor 191
angry tapir writes "Tilera has announced new general-purpose CPUs, including a 100-core chip. The two-year-old startup's Tile-GX series of chips are targeted at servers and appliances that execute Web-related functions such as indexing, Web search and video search. The Gx100 100-core chip will draw close to 55 watts of power at maximum performance."
This is great ! (Score:5, Interesting)
I can't wait to see the output of :
cat /proc/cpuinfo
I guess we will need to use:
cat /proc/cpuinfo | less
When we reach 1 million cores, we will need to rearrange the output of cat /proc/cpuinfo to eliminate redundant information ;-))
By the way, I just typed "make menuconfig" and it will let you enter a number up to 512 in the "Maximum number of CPUs" field, so the Linux kernel seems ready for up to 512 CPUs (or cores; Linux appears to handle them the same way) as far as I can tell by this simple test. Entering a number greater than 512 gives the "You have made an invalid entry" message ;-(
Note: You need to turn on "Support for big SMP systems with more than 8 CPUs" flag as well.
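For kernels of that era (around 2.6.31), the options described above end up in .config looking roughly like this. The exact symbol names are an assumption on my part and vary by architecture and kernel version; on x86 there was also a MAXSMP option that raised the limit further:

```
CONFIG_SMP=y
# "Support for big SMP systems with more than 8 CPUs" (32-bit x86)
CONFIG_X86_BIGSMP=y
# the "Maximum number of CPUs" field from menuconfig
CONFIG_NR_CPUS=512
```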
obligatory (Score:2, Funny)
... and just imagine a Beowulf cluster of them.
Re: (Score:3, Insightful)
It IS a Beowulf cluster.
Obligatory Princess Bride quote:
Miracle Max: Go away or I'll call the brute squad!
Fezzik: I'm ON the brute squad.
Miracle Max: [opens door] You ARE the brute squad!
Re: (Score:2)
but we already have... (Score:2)
Re:This is great ! (Score:5, Informative)
Re: (Score:3, Funny)
But the more important question is...
Will it run Windows 7?
I know, I know, it's the wrong question, but the answer to the other one is always "yes".
Re: (Score:3, Funny)
Give it a break, shillboy
We've all seen more than enough paid endorsements of Microsoft's latest exercise in blandness.
Settle down, Linus.
Re:This is great ! (Score:5, Insightful)
By the way, I just typed "make menuconfig" and it will let you enter a number up to 512 in the "Maximum number of CPUs" field, so the Linux kernel seems ready for up to 512 CPUs (or cores; Linux appears to handle them the same way) as far as I can tell by this simple test. Entering a number greater than 512 gives the "You have made an invalid entry" message
Whoa. If you change the source a little, you can enter 1000000 into the Maximum number of CPUs field! Linux is ready for up to a million cores.
If you change the code a little more, when I enter a number that's too high for menuconfig, it says "We're not talking about your penis size, Holmes"
Re: (Score:2)
And if you change the code a little more, it takes single-threaded tasks and automatically finds an efficient parallelization of them, distributing the work out to those million cores!
Re: (Score:3, Insightful)
Actually, some algorithms (like fluid simulation and a very large neural net) are not that hard to parallelize to run on a million cores.
Re: (Score:2)
Yes, but taking an arbitrary single-threaded algorithm and automatically figuring out what the parallelization is is the hard part. =]
Re: (Score:2)
Well, you could analyze the data dependencies and put them into a dependency graph, and then figure out what can be parallelized without having too much synchronization overhead. However, that's probably something for a theoretical scientific paper, and I'd be surprised if you could parallelize most algorithms into more threads than you could count on one hand...
As soon as you're doing linear I/O (like network access), you've hit a barrier anyways.
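The dependency-graph idea above can at least be sketched: group tasks into "levels" where everything in a level depends only on earlier levels, so each level's tasks could run concurrently. A toy sketch (the task names are invented for illustration):

```python
def parallel_levels(deps):
    """deps maps task -> set of tasks it depends on.
    Returns a list of levels; tasks within one level are
    independent of each other and could run in parallel."""
    remaining = {t: set(d) for t, d in deps.items()}
    levels = []
    while remaining:
        # tasks whose dependencies have all been scheduled already
        ready = [t for t, d in remaining.items() if not d]
        if not ready:
            raise ValueError("cyclic dependencies")
        levels.append(sorted(ready))
        for t in ready:
            del remaining[t]
        for d in remaining.values():
            d.difference_update(ready)
    return levels

# a -> b, a -> c, then b and c -> d
deps = {"a": set(), "b": {"a"}, "c": {"a"}, "d": {"b", "c"}}
print(parallel_levels(deps))  # [['a'], ['b', 'c'], ['d']]
```

Even here, the width of each level (not the total task count) bounds how many cores you can actually keep busy.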
Bottlenecks-R-Bugs (Score:2)
Re:This is great ! (Score:4, Interesting)
Re: (Score:3, Informative)
Re: (Score:3, Insightful)
Actually, some algorithms (like fluid simulation and a very large neural net) are not that hard to parallelize to run on a million cores.
Building the memory backplane and communication system (assuming you're going for a cluster) to support a million CPUs is non-trivial. Without those, you'll go faster with fewer CPUs. That's why supercomputers are expensive (it's not in the processors, but in the rest of the infrastructure to support them).
Yep (Score:5, Informative)
Unfortunately these days the meaning of supercomputer gets a bit diluted by many people calling clusters "supercomputers". They aren't really. As you noted what makes a supercomputer "super" isn't the number of processors, it is the rest, in particular the interconnects. Were this not the case, you could simply use cheaper clusters.
So why does it matter? Well, certain kinds of problems can't be solved by a cluster, while others can. To help understand how that might work, take something more people are familiar with, like the difference between a cluster and just a bunch of computers on the Internet.
Some problems are extremely bandwidth non-intensive. They need no inter-node communication, and very little communication with the head node. A good example would be the Mersenne Prime Search, or Distributed.net. The problem is extremely small; the program is larger than the data itself. All the head node has to do is hand out ranges for clients to work on, and the clients only need to report the results, affirmative or negative. As such, it is suited to work over the Internet. The nodes can be low bandwidth, they can drop out of communication for periods of time, and it all works fine. Running on a cluster would gain you no speed over the same group of computers on modems.
However, the same is not true for video rendering. You have a series of movie files you wish to composite into a final production, with effects and so on. This sort of work is suited to a cluster. While the nodes can work independently (the work of one node doesn't depend on the others), they do require a lot of communication with the head node. The problem is very large; the video data can be terabytes. The result is also not small. So you can do it on many computers, but the bandwidth needs to be pretty high, with low latency. Gigabit Ethernet is likely what you are looking at. Trying to do it over the Internet, even broadband, would waste more time in data transfer than you'd gain in processing. You need a cluster.
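A back-of-envelope check of that claim, with purely illustrative numbers (real sustained throughput is lower than line rate):

```python
def transfer_hours(total_bytes, bits_per_second):
    """Idealized transfer time in hours, ignoring protocol overhead."""
    return total_bytes * 8 / bits_per_second / 3600

tb = 10**12  # say, 1 TB of source footage

print(f"gigabit LAN: {transfer_hours(tb, 10**9):.1f} h")  # ~2.2 h
print(f"10 Mbit DSL: {transfer_hours(tb, 10**7):.0f} h")  # ~222 h
```

At broadband speeds the data movement alone takes over a week of wall-clock time, which is why this workload wants a cluster on a fast LAN rather than the Internet.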
OK, well, supercomputers are the next level of that. What happens when you have a problem where you DO have a lot of inter-node communication? The results of the calculations on one node are influenced by the results on all the others. This happens in things like physics simulations. In this case, a cluster can't handle it. You can saturate your bandwidth, but worse, you have too much latency. You spend all your time waiting on data, and thus computation speed isn't any faster.
For that, you need a supercomputer. You need something where nodes can directly access the memory of other nodes. It isn't quite as fast as local memory access, but nearly. Basically you want them to play like they are all the same physical system.
That's what separates a true supercomputer from a big cluster. You can have lots of CPUs and that's wonderful, there are a lot of problems you can solve on that. However, it isn't a supercomputer unless the communication between nodes is there.
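The "you spend all your time waiting on data" effect can be put into a crude model: if each step splits t_compute of work across n nodes but pays a fixed communication cost t_comm, the speedup flattens out no matter how many nodes you add. The constants below are invented purely to show the shape of the curve:

```python
def speedup(n, t_compute=1.0, t_comm=0.01):
    """Speedup on n nodes when every step pays a fixed
    communication/synchronization cost t_comm."""
    return t_compute / (t_compute / n + t_comm)

for n in (1, 10, 100, 10000):
    print(n, round(speedup(n), 1))
```

With these numbers the speedup can never exceed t_compute / t_comm = 100, which is exactly why supercomputer designers spend their money on low-latency interconnects rather than more CPUs.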
Re: (Score:2)
Re: (Score:3, Interesting)
Re: (Score:2)
How many cores does it take to run a parallel algorithm?
100 - 1 to do the processing, 1 to fetch the data and 98 to calculate an efficient way to make the whole thing run in parallel.
Re: (Score:3, Funny)
Whoa. If you change the source a little, you can enter 1000000 into the Maximum number of CPUs field! Linux is ready for up to a million cores.
640K cores is more than anyone will ever need.
Re: (Score:3, Funny)
No, you really need 16,777,216 cores, so you have one core for every cell in a standard Excel 2003 sheet (yeah, I know 2007 finally gave us more space).
That's 65,536 rows by 256 columns, a CPU for each cell processing its own value. Excel may almost run fast.
Re: (Score:2)
I know 2007 finally gave us more space
My god... You're one of THEM!!!!
Re: (Score:2)
"More seriously, do you have any reference for "Linux is ready for up to a million cores" ?"
SGI has 4096-core monsters, as MrMr pointed out.
Do you have a million-core machine we can use to invalidate this hypothesis?
Re: (Score:2)
Those 4096 core SGI machines are clusters of 4-core machines with a very fast interconnect. Each cluster node runs its own local software with some quite evil stuff (custom memory controller and some extra logic in the VM subsystem for cache coherency across nodes) to handle distributed shared memory and process migration. These are not SMP machines and, although most of the relevant code is in the mainstream kernel sources, it is so tied to SGI's architecture that it is almost completely useless from the
Re:This is great ! (Score:5, Informative)
Sources are always appreciated when you tell us something.
Here is the source: http://www.kernel.org/ [kernel.org]
Re: (Score:2)
Come on! Quit being such a tough guy and let us know where it says so...
grep -r "1,000,000" /usr/src/linux
/usr/src/linux/drivers/net/qlge/qlge_ethtool.c: * We do this by using a basic throughput of 1,000,000 frames per
/usr/src/linux/kernel/cpuset.c: * per msec it maxes out at values just under 1,000,000. At constant
grep -ri "one million" /usr/src/linux
/usr/src/linux/arch/x86/math-emu/README:found at a rate of 133 times per one million measurements for fsin.
/usr/src/linux/arch/x86/math-emu/README:was obt
Allow ia64 to CONFIG_NR_CPUS up to 4096 (Score:5, Informative)
CC.
Re:This is great ! (Score:5, Informative)
The information in cpuinfo is only redundant like that on x86/amd64...
On Sparc or Alpha, you get a single block of text where one of the fields means "number of cpus", example:
cpu : TI UltraSparc IIi (Sabre)
fpu : UltraSparc IIi integrated FPU
prom : OBP 3.10.25 2000/01/17 21:26
type : sun4u
ncpus probed : 1
ncpus active : 1
D$ parity tl1 : 0
I$ parity tl1 : 0
Cpu0Bogo : 880.38
Cpu0ClkTck : 000000001a3a4eab
MMU Type : Spitfire
number of cpus active and number of cpus probed (includes any which are inactive)... a million cpus wouldn't present a problem here.
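Parsing that per-machine block is one split per line; a sketch against a trimmed copy of the sample above (the helper name is mine, not from any library):

```python
SAMPLE = """\
cpu           : TI UltraSparc IIi (Sabre)
ncpus probed  : 1
ncpus active  : 1
MMU Type      : Spitfire
"""

def cpuinfo_fields(text):
    """Parse SPARC-style 'key : value' cpuinfo lines into a dict.
    partition() splits on the first colon only, so values
    containing colons (like timestamps) survive intact."""
    fields = {}
    for line in text.splitlines():
        if ":" in line:
            key, _, value = line.partition(":")
            fields[key.strip()] = value.strip()
    return fields

info = cpuinfo_fields(SAMPLE)
print(info["ncpus active"])  # 1
```

A million CPUs would just change one value here, instead of producing a million repeated blocks as on x86.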
Re:This is great ! (Score:5, Insightful)
Re: (Score:2)
Good point. But since it wouldn't be hard to add this to /sys, (and I see some of that info already there) I suspect that nobody has really needed it in that format yet. Also, if you're going to get more than a couple pieces of that, /proc/cpuinfo has it nicely in one place and is far from hard to parse.
Re: (Score:2)
... on x86. Now port your code to PowerPC. Oh, sorry, different format, fields have different names. Write a new parser. Now port it to ARM. Oh, sorry, different format, fields have different names, some of the information isn't there. Now try porting it to SPARC, oh, sorry, can't be bothered supporting Linux, waste of developer time.
Re: (Score:3, Informative)
Take a look at /sys/devices/system/cpu: it has information about cpu topology, cpu hot-swap, cache sizes and layout across cores, current power state, etc.
It's all there, in an architecture-independent way in /sys/devices.
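Several of those sysfs files (e.g. /sys/devices/system/cpu/online) use the kernel's compact range-list format, like "0-3,5,7-9"; expanding it takes only a few lines. A sketch parsing a literal string rather than the live file, so it runs anywhere:

```python
def expand_cpu_list(s):
    """Expand a sysfs CPU range list like '0-3,5' into [0, 1, 2, 3, 5]."""
    cpus = []
    for part in s.strip().split(","):
        if "-" in part:
            lo, hi = part.split("-")
            cpus.extend(range(int(lo), int(hi) + 1))
        else:
            cpus.append(int(part))
    return cpus

print(expand_cpu_list("0-3,5,7-9"))  # [0, 1, 2, 3, 5, 7, 8, 9]
```

This format scales fine to a million cores: a fully-online machine is just "0-999999".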
Re: (Score:2)
cat /proc/cpuinfo | less
That gets modded interesting these days? The use of a pipe?
If that's not too basic to be considered interesting, then moderators have got an odd idea about what interesting actually means.
Re: (Score:2)
Re: (Score:2)
Re: (Score:2)
because in 2009 CPU power and memory are cheaper than dirt. or didn't you notice we're discussing a 100-core CPU ?
with capacities like that, even firing up MS Word to edit a plain text file, instead of Notepad, is not too costly anymore... and no, I won't apologize, say I was kidding, or any other shenanigans. I really mean it.
Re: (Score:2)
And then when you use this same pattern in a concurrent find operation, and you end up with 2,000 processes running instead of 1,000, and each read operation being turned into a read, copy, write, read sequence (which is what happens if you use cat like this), is it still a good idea?
No matter how fast computers become, a complex and slow solution to a problem is never better than a simple and slow one. At the very least, typing 'cat /proc/cpuinfo | less' takes more time than typing 'less /proc/cpuin
Re: (Score:2)
I think that "Useless Use of cat" is funny. I really do. I go back and read it every once in a while just for grins.
But we're in the future, now. Spawning that extra process isn't going to hurt anything. Yeah, it's fun to poke at people who do silly things like that, but in reality, there's rarely harm in doing things this way. Even if you're using a shell script which will run "cat file | grep" over and over, you're probably not going to start thrashing on a modern CPU.
Re: (Score:2)
First, and importantly, it is more to type. Getting into the habit of doing more work than you need to is never a good idea.
Secondly, it is a much bigger overhead than you might think. With 'less {file}' the less process just reads the data directly. The kernel copies it out of the VM cache and into the process's buffer. Sometimes it doesn't even do that. Both less and grep will sometimes use mmap(). In that case, the kernel just updates the page tables and the data is never copied, it's just DMA'd f
Re: (Score:2)
It's not less to type once you've already typed "cat /proc/cpuinfo" and then realized: dangit, I have to paginate that.
Basically, your post is equivalent to advocating writing your own bubblesort implementation, because it's fast enough on small data sets with modern processors, rather than using the system-provided quicksort function. It's a bad habit, and the fact that it isn't too bad in certain situations doesn't mean it's something that should be encouraged.
It's like using system-implemented bubblesort over system-implemented quicksort because you're used to typing bubblesort. When you realize that you actually need something faster, you can switch. You're advocating Premature Optimization [wikipedia.org], which Knuth warns against.
Re: (Score:2)
Re: (Score:2)
When we reach 1 million cores, we'll probably be able to ask the computer what's on his mind...
Re: (Score:2)
Nah! I am lazy... when I realize the file is too big, it is faster for me to add the pipe at the end of the line than to edit the beginning of the line ... ;-)
Re: (Score:2)
that's what the 'home' key is for :p
Awfully generous with the term "core" (Score:2, Insightful)
Yes, I suppose technically any FPGA could be considered a "core" in its own right, but it's a far cry from the CPU cores that you typically associate with the term.
Putting a stock on a semi-automatic rifle makes it an "assault weapon", but c'mon. It's still a pea shooter.
When does a CPU become the CPU? (Score:5, Interesting)
Re:When does a CPU become the CPU? (Score:5, Interesting)
How does this fix the fact that the apps they ported are mostly IO-bound in a lot of cases, with 99% of the cores still just sitting there picking their noses?
Loads and loads of RAM/cache, possibly?
Re:When does a CPU become the CPU? (Score:5, Informative)
The Register [channelregister.co.uk] goes into more detail than this article, as usual.
So it seems pretty standard and they're using existing open & closed source MIPS toolchains, however there's still "will" and "are being" in that sentence which brings a little unease...
Custom ISA? (Score:5, Insightful)
Re: (Score:3, Informative)
In general, new instruction sets are mostly interesting in the custom software and the open source software areas. But the latter is quite a large chunk of the server market, so I suppose they could live with that.
They would need to get support into gcc first, though.
Re:Custom ISA? (Score:5, Informative)
From a quick Google: it's based on the ARM core (an easily licensable CPU core)
Re: (Score:2)
From a quick Google: it's based on the ARM core (an easily licensable CPU core)
Must be a coincidence, but I was just thinking a week ago why nobody's tried to make a many-core CPU by doing a cookie-cutter job and just replicating a simple ARM core a bunch of times... looks like someone has!
Re:Custom ISA? (Score:5, Informative)
Re: (Score:2)
64-bit VLIW instructions, 2 ALUs, 1 load/store unit (3 ops/clock). I'm going to guess 32 registers (a la MIPS): that's 3+3+2 = 8 register fields at log2(32) = 5 bits each, so 40 bits to encode registers, leaving 8+8+8 = 24 bits to encode opcodes, which seems maybe too many. Perhaps 64 registers: 48 bits of registers and 16 of opcodes?
no FPU though sadly
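The bit budget being guessed at above works out as follows. All figures are the parent poster's speculation, not Tilera's published encoding; this just checks the arithmetic:

```python
import math

BUNDLE_BITS = 64             # one VLIW instruction bundle
reg_fields = 3 + 3 + 2       # 3 register fields per ALU op, 2 for load/store

for nregs in (32, 64):
    reg_bits = reg_fields * int(math.log2(nregs))
    opcode_bits = BUNDLE_BITS - reg_bits
    print(f"{nregs} registers: {reg_bits} bits of register fields, "
          f"{opcode_bits} bits left for opcodes")
```

With 32 registers you get 40 + 24; with 64 registers, 48 + 16, matching the two guesses in the comment.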
Re: (Score:2)
You can always offload your number crunching to a GPU with OpenCL...
Re: (Score:2)
That's a coincidence; I was thinking that when you get to that many cores, you're effectively producing something akin to a VLIW processor, with each instruction handed to its own execution system.
why not go to the source? (Score:3, Informative)
The company [tilera.com] website claims...
64-bit VLIW processors with 64-bit instruction bundle
3-deep pipeline with up to 3 instructions per cycle
I don't know how this could be considered ARM or MIPS-derived...
A better description might have been in this article [linuxfordevices.com]...
Re: (Score:2)
how about those in some netbooks and a beowulf cluster of those?
LoB
Re:Custom ISA? (Score:4, Informative)
Why was this modded Informative? Can we have any links? Because both the article here as well as Wikipedia and an old Ars Technica story claim that it's based on MIPS.
Re:Custom ISA? (Score:4, Insightful)
1. LLVM backend
2. Grand central
3. ???
4. Done.
Seriously though, this is exactly what Apple have been working towards recently in the compiler space. You write your application and explicitly break up the algorithm into little tasks that can be executed in parallel, using a syntax that is lightweight and expressive. Then your compiler toolchain and runtime JIT manage the runtime threads and determine which processor is best equipped to run each task. It might run on the normal CPU, or it might run on the graphics card.
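GCD itself is Apple's libdispatch, but the "break the work into small tasks and let a runtime map them onto cores" model is portable. A rough cross-platform analogue using Python's concurrent.futures (a stand-in for illustration, not Apple's API):

```python
from concurrent.futures import ThreadPoolExecutor

def process_chunk(chunk):
    # stand-in for one small, independent task
    return sum(x * x for x in chunk)

data = list(range(1000))
chunks = [data[i:i + 100] for i in range(0, len(data), 100)]

# the pool plays the role of GCD's queues: you submit tasks and
# the runtime decides how to schedule them onto available cores
with ThreadPoolExecutor() as pool:
    total = sum(pool.map(process_chunk, chunks))

print(total)  # same answer as the sequential sum(x*x for x in data)
```

The programmer's job is only the decomposition into tasks; how many of the 100 cores get used is the runtime's problem.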
FreeBSD and GCD (Score:3, Interesting)
Although I don't expect Apple to release an Apple Server edition with a Tilera multicore processor, I would be interested to see a version of FreeBSD running with Grand Central Dispatch on a Tilera multicore chip. It would give a good idea of how effective GCD would be in allocating cores for execution. Any machine with 100 cores must have a considerable amount of RAM, perhaps 8GB+, even with large caches.
Apple has been very active in developing LLVM compilers, and has recently added CLANG front end, inste
Re: (Score:2)
Oh, and the version of clang that Apple ships as 1.0 is a branch from the main tree from a few weeks before the official 1.0 release was branched. Apple puts a lot of developer effort into clang, but so do other people (including myse
Re: (Score:2)
"Seriously though, this is exactly what Apple have been working towards recently in the compiler space. You write your application and explicitly break up the algorithm into little tasks that can be executed in parallel, using a syntax that is lightweight and expressive. Then your compiler toolchain and runtime JIT manage the runtime threads and determine which processor is best equipped to run each task."
AAAAAAAAHHHHH!!!! It's the iPod all over again! Apple did not invent the thread pool! I'm sure Gran
Re: (Score:2)
Re: (Score:2)
Oh I'm not saying it's not innovative, I'm not saying they don't (or didn't) do good, interesting and cutting edge research, it just annoys me that some folks think that they invented the thread pool/job queue model.
Re: (Score:2)
Re: (Score:2)
"...if the instruction set isn't any standard type..."
No problem; use the processor for a 'speak and spell'-type toy, a drug store reusable digital camera or a scientific calculator and someone will hack a decent Linux kernel onto it over a weekend.
Re: (Score:2)
They have a C compiler. That's all we need.
100? (Score:3, Insightful)
Wouldn't it have been better to make it a power of 2? Some work is more easily divided when you can just keep halving it. 64 or 128 would have been more logical, I would have thought. I'm not an SMP programmer though, so perhaps it doesn't make any difference.
Re:100? (Score:5, Funny)
Re:100? LOL (Score:2)
Wish I had mod points today. I wonder how many people will get just how funny this fantastically sarcastic and totally on target comment was. Bravo.
Re:100? (Score:5, Informative)
SMP FAQ.
Q: Does the number of processors in a SMP system need to be a power of two/divisible by two?
A: No.
Q: Does the number of processors in a SMP system...
A: Any number of CPUs/cores that is larger than one will make the system an SMP system*.
(* except when it's an asymmetrical architecture)
Q: How do these patterns (power of 2, divisible by 2, etc) of numbers of cores affect performance?
A: Performance depends on the architecture of the system. You cannot judge by simply looking at the number of cores, just as you can't simply look at MHz.
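To the FAQ's point: splitting N work items across a non-power-of-two core count is trivial. The classic near-even split gives each of the 100 cores either floor(N/100) or ceil(N/100) items; a sketch:

```python
def split_work(n_items, n_workers):
    """Near-even split: the first (n_items % n_workers) workers
    each get one extra item."""
    base, extra = divmod(n_items, n_workers)
    return [base + 1 if i < extra else base for i in range(n_workers)]

sizes = split_work(1030, 100)
print(sizes[:3], sizes[-1], sum(sizes))  # [11, 11, 11] 10 1030
```

Nothing about this needs the worker count to be a power of two; repeated halving is a convenience for some divide-and-conquer algorithms, not a hardware requirement.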
Re:100? (Score:5, Funny)
Re: (Score:2)
Sounds Like (Score:2)
But does it run linux? (Score:2)
TFA sez it's been ported to Apache. Might be useful.
What ISA? (Score:2)
Re: (Score:3, Informative)
No, they are derived from the MIPS architecture.
Been there, done that, got the T-Shirt... (Score:5, Interesting)
OK, so big disclaimer: I work for Sun (not Oracle, yet!)
The Sun Niagara T1 chip came out over 3 years ago, and it did 32 threads on 8 cores.
And drew something around 50W (200W for a fully-loaded server). And under $4k.
The T2 systems came out last year, do 64 threads/CPU for a similar power budget. And even less $/thread.
The T3 systems likely will be out next year (I don't know specifically when, I'm not In The Know), and the threads/chip should double again, with little power increase.
Of course, per-thread performance isn't equal to anything like a modern "standard" CPU. Though, it's now "good enough" for most stuff - the T2 systems have a per-thread performance equal to about the old Pentium3 chips. I would be flabbergasted if this GX chip had a per-core performance anywhere near that.
I'm not sure how Intel's Larrabee is going to show (it's still nowhere near release), but the T-series chips from Sun are cheap, open, and available now. And they run Solaris AND Linux. So unless this new GX chip is radically more efficient/higher-performance/less costly, I don't see this company making any impact.
-Erik
It would be clever (Score:3, Insightful)
Since a) developing a processor is insanely expensive and b) they need it to run lots of software ASAP, it would be very clever if they spent a marginal part of the overall development costs in making sure every key Linux and *BSD kernel developer gets some hardware they can use to port the stuff over. Make it a nice desktop workstation with cool graphics and it will happen even faster.
They are going up against Intel... The traditional approach (delivering a faster processor with a better power consumption at a lower price) simply will not work here.
I think Movidis taught us a lesson a couple years back. Users will not move away from x86 for anything less than a spectacular improvement. Even the Niagara SPARC servers are a hard sell these days...
Chips target tasks (Score:2)
Can someone explain to me how a chip can be targeted at much higher-level tasks like these?
I realize there are surely technical means to achieve this goal, I just can't imagine myself what these means could be.
Re: (Score:2)
Re: (Score:2)
An associative memory requirement could be better served by a custom high-core count, CPU ... if it has sufficient memory on board (e.g. sufficient total memory bus bandwidth).
hmm... (Score:2, Funny)
Re: (Score:2)
Makes me glad I've been learnig Clojure (Score:2)
Clojure is a lisp on the JVM designed for multi-threading. From:
http://clojure.org/ [clojure.org]
"""
Clojure is a dynamic programming language that targets the Java Virtual Machine (and the CLR ). It is designed to be a general-purpose language, combining the approachability and interactive development of a scripting language with an efficient and robust infrastructure for multithreaded programming. Clojure is a compiled language - it compiles directly to JVM bytecode, yet remains completely dynamic. Every
Comment removed (Score:4, Funny)
15-bladed shaving razor (Score:2, Interesting)
Re: (Score:3, Insightful)
Yes, indeed. The memory bus is usually the bottleneck here... unless you switch from SMP to NUMA architecture, which seems necessary for anything with more than, say, 8 to 16 cores.
asymmetric (Score:3, Interesting)
It's been reported that these cores will be relatively underpowered, though both the total processing power and cost per watt will be quite impressive. This makes the chip appropriate for putting in a server but not so much a desktop machine, where CPU-intensive single-threads may bog things down.
So what about one of these in combination with a 2-, 3- or 4-core AMD/Intel chip? The serious threads can be run on the faster chip, while all the background stuff can be spread among the slower cores? Does Windows have the ability to prioritize like that? Does Linux?
Dancing Hamsters... (Score:3, Funny)
It is like 100 Dancing Hamsters in your CPU.
Re: (Score:2)
100 cores plus some room on the chip for management, connections, global cache etc ....
Plus if you say 100 cores and put 128 cores on the chip then 28 can fail before you have to bin the chip as a dud ....
Re: (Score:2)
They aren't going to intentionally roast up to 28 cores on every unit just to hit their advertised number.
Power of two is not at all necessary (Score:2)
It is done only out of convenience, really. So you have your regular 1-core processor, of course (2^0); the next step up is a second core (2^1). From there, an easy step is to simply duplicate your dual-core setup: you just make a second copy and put it on the same chip, giving you 4 cores (2^2). This is as far as most chips go; more than 4 cores isn't all that common. However, you might notice we have a really small sample set: we've only covered 3 powers of two, two of them by necessity. This trend thus isn't one bec
Re: (Score:2)
11x11 + 2x2 + 1x1 + 1 layers of cpus.
The third dimension called, they are suing flatland for prior art and copyright infringement.
Re: (Score:2)
The third dimension called, they are suing flatland for prior art and copyright infringement.
The fourth dimension called, they already have (wioll haven) the judgment from the lawsuit, and flatland stands (willan on-stand) on parody.
Re: (Score:2)
Where's the law that says the core layout, or even the die itself, has to be square? Square, or nearly square, might be the most convenient for minimum paths and such. Still, you need to have space somewhere for "between core" control circuits. Even if you lay out the die in a nice square grid, you don't have to make each cell be a core. Getting data lines into the cores in the middle can be an interesting challenge. But then, 100 cores trying to load a word from different locations in RAM all at the s
Re: (Score:2)