
Princeton Researchers Announce Open Source 25-Core Processor (pcworld.com) 114

An anonymous reader writes: Researchers at Princeton announced their 25-core Piton processor at Hot Chips this week. The processor was designed specifically to increase data center efficiency, with novel architecture features enabling over 8,000 of these processors to be connected together to build a system with over 200,000 cores. Fabricated on IBM's 32nm process and with over 460 million transistors, Piton is one of the largest and most complex academic processors ever built. The Princeton team has opened up their design and released all of the chip source code, tests, and infrastructure as open source in the OpenPiton project, enabling others to build scalable, manycore processors with potentially thousands of cores.


  • by Anonymous Coward

    just shit his pants!

  • Does it mean any sanctioned country can order their own processors from a generic manufacturer?

    • by John Smith ( 4340437 ) on Thursday August 25, 2016 @06:11PM (#52771991)
      Relax. Between the architectural basis and the relatively low performance, it's insignificant: a few hundred million transistors for a 25-core chip, in a day when your stock chip is multibillion in terms of transistor count.
    • by AHuxley ( 892839 )
      Any hardware they buy in with an eye to expanding their ability to build a supercomputer comes with free NSA and GCHQ hardware added during shipping, e.g.
      people may recall DEITYBOUNCE, IRONCHEF, MONTANA, BULLDOZER, KONGUR, NIGHTSTAND.
      So it then becomes a race to buy in safe top-end consumer kit and fill a hall, or use older supercomputers, without attracting the FBI, CIA, or MI6 while exporting.
      Nothing allowed to float around in the educational or consumer realm will really help.
      • by gtall ( 79522 )

        You do know that everyone in the U.S. and Britain has a government "minder" assigned to them to watch their every move, yes?

        • You do know that everyone in the U.S. and Britain has a government "minder" assigned to them to watch their every move, yes?

          Yes, but they can't be trusted. That's why we have minder minders who keep an eye on the minders.

          But who keeps an eye on them you ask? Fool! It's minders all the way down!

    • What are you talking about? What "sanctioned countries"? Unless you're talking about US manufacturers, of course; however, the prominent one, meaning Intel, is not going to act as a fab for random other people anyway, so that's a moot point.
      • Uh, that's precisely what they do. They have their Custom Foundry [intel.com], where one can work out an agreement with them. And if it's a non-US company that needs its parts manufactured outside the US, it can have Intel make them in Israel.
        • I said RANDOM people. Intel may choose to agree to manufacture something for you, but if it smells of competing with Intel, well, tough luck. And manufacturing CPUs that replace Intel models is definitely going to smell to Intel like competing with Intel.
          • No, if they stand to get money from me by making something for me, they'll do it. They may refuse if I'm violating anyone's patents, or if I wanted to make an x64 after getting an AMD license, they may balk, but otherwise, they'd be just fine. The custom foundry business is an independent business unit.
  • I've been hearing about massive numbers of cores for years ... the problem, however, is that they are great for demonstrating that you can put a bunch of 'cores' on a chip ... not that they are actually useful for anything.

    Connecting 8k of these things together? You've just proven you actually don't understand how the real world does things.

    If you have 8 million cores that can add 20 super floating point numbers a second ... that's WORTHLESS, because I need to do things other than add two numbers.

    If you have 8k core

    • by rubycodez ( 864176 ) on Thursday August 25, 2016 @08:55PM (#52772711)

      Real computers solving real problems with large core counts exist, and they have non-bus architectures, by the way.
      So according to you, the CPUs in the Sunway TaihuLight supercomputer, with 256 cores per CPU, don't really do anything?

      I don't think you have the background in the field to be making such pronouncements; you're spewing out of your ass.

    • by D.McG. ( 3986101 ) on Thursday August 25, 2016 @10:35PM (#52773055)
      Nvidia has a wonderful 3840 core processor with a wonderful scheduler and interconnect. Two can be bridged for 7680 cores. Hmmm... Your argument of 8000 cores being a pipe dream is complete rubbish.
    • by AHuxley ( 892839 )
      Wrap it up in great marketing and sell it back to a government for crypto? No need to buy in a big-brand supercomputer; lots of our small CPUs can do that and be expanded as needed. At a per-core price that's some nice payback to a contractor.
    • funny how that solution kills the theoretical performance

      Yeah, what are those dumbasses at NVidia thinking about?

      Blah blah blah I made this awesome processor but it only works for one tiny problem domain

      Yeah. These things don't work at all. Much like the brains of ignorant idiots posting this kind of drivel in /.

    • There's absolutely nothing wrong with designing chips with a larger number of smaller cores, especially if it removes a lot of the core-overengineering pressure present in the legacy x86 chip market and improves power efficiency for smaller applications, which are going to be more numerous in the future. The ability to customize ISAs for specific applications such as modern mobile robotics would also significantly improve the odds when competing against using generic large CPUs.
  • by wierd_w ( 1375923 ) on Thursday August 25, 2016 @06:16PM (#52772019)

    While being able to leverage that many compute units all at once is quite impressive, most tasks are still serial by nature. Computers are not clairvoyant, so they cannot know in advance what a branched logic chain will tell them to do for any arbitrary path depth, nor can they perform a computation on data that doesn't exist yet.

    The benefits of more cores come from parallel execution, not from doing individual tasks faster. As such, most software is not going to benefit from having access to 8,000 more threads.
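
    A rough way to quantify that intuition is Amdahl's law: speedup = 1 / ((1 - p) + p/n), where p is the fraction of the work that parallelizes and n is the core count. A minimal sketch, with made-up parallel fractions rather than numbers for any real workload:

    /* Amdahl's law for a few hypothetical parallel fractions and core counts. */
    #include <stdio.h>

    static double amdahl(double p, double n) {
        return 1.0 / ((1.0 - p) + p / n);   /* serial part + parallel part */
    }

    int main(void) {
        double fractions[] = { 0.50, 0.90, 0.99 };   /* made-up workloads */
        int cores[] = { 25, 200, 8000 };
        for (int i = 0; i < 3; i++)
            for (int j = 0; j < 3; j++)
                printf("p=%.2f  cores=%5d  speedup=%6.1fx\n",
                       fractions[i], cores[j], amdahl(fractions[i], cores[j]));
        return 0;
    }

    Even with 8,000 cores, a program that is half serial tops out below 2x, which is the parent's point.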

    • With a multiuser, multitasking OS you can have 25 different unrelated processes running on something with 25 cores. Or you could have 25 threads in a dataflow arrangement where each is a consumer of what the last just produced. Or you could go over the members of an array or matrix 25 members at a time with the same transformation. Some things are serial, but there are plenty of ways more cores can actually be used.
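
      A minimal sketch of the last case (one transformation applied to an array, N members at a time) using POSIX threads; the 25-way split and the scale() transformation are placeholders rather than anything Piton-specific:

      /* Hypothetical sketch: apply one transformation to an array in 25 chunks,
       * one POSIX thread per chunk. */
      #include <pthread.h>
      #include <stdio.h>

      #define NTHREADS 25
      #define LEN      (25 * 40000)

      static double data[LEN];

      struct chunk { int begin, end; };

      static void *scale(void *arg) {
          struct chunk *c = arg;
          for (int i = c->begin; i < c->end; i++)
              data[i] = data[i] * 2.0 + 1.0;   /* stand-in transformation */
          return NULL;
      }

      int main(void) {
          pthread_t tid[NTHREADS];
          struct chunk chunks[NTHREADS];
          int per = LEN / NTHREADS;

          for (int t = 0; t < NTHREADS; t++) {
              chunks[t].begin = t * per;
              chunks[t].end   = (t == NTHREADS - 1) ? LEN : (t + 1) * per;
              pthread_create(&tid[t], NULL, scale, &chunks[t]);
          }
          for (int t = 0; t < NTHREADS; t++)
              pthread_join(tid[t], NULL);

          printf("data[0] = %f\n", data[0]);
          return 0;
      }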

        • Or you can fake it with really good scheduling and context switching. Why is it that I was able to simultaneously watch a RealPlayer video in Netscape 4.x while editing a doc in OpenOffice with NO lag or skips or jitters, on a 200 MHz box with 1 gig of RAM in 1998, but I can't do that now with 2.6 GHz and 4 gigs? This is starting to really bug me, like where the FUCV is all that horsepower going???
        NOTE all of this is on Linux, various flavors. My Time-Warner cable is easily able to saturate the box during of

      • by Dadoo ( 899435 )

        With a multiuser, multitasking OS you can have 25 different unrelated processes running on something with 25 cores.

        In practice, most jobs running on a computer have some relation to each other, and the more jobs you have - and this CPU clearly expects to be able to run a lot of jobs - the more likely that will be. (Where I work, we actually have an application that gets slower when you add more cores.) Like most CPUs with high core counts, this one looks like it'll be great at compute-intensive tasks, but a

      • by goose-incarnated ( 1145029 ) on Thursday August 25, 2016 @07:33PM (#52772385) Journal

        With a multiuser, multitasking OS you can have 25 different unrelated processes running on something with 25 cores. Or you could have 25 threads in a dataflow arrangement where each is a consumer of what the last just produced. Or you could go over the members of an array or matrix 25 members at a time with the same transformation. Some things are serial, but there are plenty of ways more cores can actually be used.

        Nope. You'll generally hit the wall at around 16-20 cores using shared memory. You need distinct processors with dedicated memory to make multi-processing scale beyond 20 or so processors. Those huge servers with 32 cores apiece hit their point of diminishing returns per processor after around 20 cores.

        First, for the reason you aren't going to be doing multithreading/shared memory on any known computer architecture, read this [slashdot.org].

        Secondly, let's say you aren't multithreading, so you don't run into the problems in the link I posted above. Let's assume you run 25 separate tasks. You still run into the same problem, but at a lower level. The shared memory is the throttle, because the memory only has a single bus. Say you have 1000 cores. Each time an instruction has to be fetched[1] for one of those processors, it needs exclusive access to the address lines that go to the memory. The odds of a core getting access to memory are roughly 1/n (n = number of cores/processors).

        On an 8-core machine, a processor will be placed into a wait queue roughly 7 out of 8 times that it needs access. Further, the expected length of time in the queue is (1-(1/8)). This is, of course, for an 8-core system. Adding more cores results in the waiting time increasing asymptotically towards infinity.

        So, no. More cores sharing the same memory is not the answer. More cores with private memory is the answer but we don't have any operating system that can actually take advantage of that.

        A project that I am eyeing for next year is putting together a system that can effectively spread the operating system out over multiple physical memories. While I do not think that this is feasible, it's actually not a bad hobby to tinker with :-)

        [1] Even though they'd be fetched in blocks, they still need to be fetched; a single incorrect speculative path will invalidate the entire cache.
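
        One existing software answer for the private-memory model is explicit message passing. A minimal sketch using MPI (nothing OpenPiton-specific; the rank count and the workload are arbitrary):

        /* Each rank owns its own memory and communicates only by explicit
         * messages; the reduction at the end is a message-passing combine. */
        #include <mpi.h>
        #include <stdio.h>

        int main(int argc, char **argv) {
            int rank, nprocs;
            MPI_Init(&argc, &argv);
            MPI_Comm_rank(MPI_COMM_WORLD, &rank);
            MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

            /* Compute on private memory only: no shared bus to fight over. */
            double local_sum = 0.0;
            for (int i = rank; i < 1000000; i += nprocs)
                local_sum += (double)i;

            double total = 0.0;
            MPI_Reduce(&local_sum, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

            if (rank == 0)
                printf("total = %.0f across %d ranks\n", total, nprocs);
            MPI_Finalize();
            return 0;
        }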

          • On an 8-core machine, a processor will be placed into a wait queue roughly 7 out of 8 times that it needs access. Further, the expected length of time in the queue is (1-(1/8)). This is, of course, for an 8-core system. Adding more cores results in the waiting time increasing asymptotically towards infinity.

          Sorry, that doesn't sound right. The expected length of time in the queue should be on the order of nt, where n is the number of cores and t is the average time required to process a memory request. (A better formula would use the average length of the queue instead of n, but to first order it would still be roughly linear in n.) So the time required would increase linearly with the number of cores.
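
          A toy simulation of the single-memory-port model being argued about; the issue probability per cycle is made up, and the only point is that the average wait per request grows roughly linearly with core count once the port saturates:

          /* n cores share one memory port. Each cycle, a core with no
           * outstanding request issues one with probability P_REQ; the port
           * serves one outstanding request per cycle (round robin). */
          #include <stdio.h>
          #include <stdlib.h>

          #define CYCLES 1000000L
          #define P_REQ  0.2

          static double avg_wait(int n) {
              int *waiting = calloc(n, sizeof *waiting);  /* 1 = request pending */
              long total_wait = 0, requests = 0;
              int next = 0;                               /* round-robin pointer */

              for (long c = 0; c < CYCLES; c++) {
                  for (int i = 0; i < n; i++)
                      if (!waiting[i] && (double)rand() / RAND_MAX < P_REQ) {
                          waiting[i] = 1;
                          requests++;
                      }
                  for (int k = 0; k < n; k++) {           /* serve one request */
                      int i = (next + k) % n;
                      if (waiting[i]) { waiting[i] = 0; next = (i + 1) % n; break; }
                  }
                  for (int i = 0; i < n; i++)             /* everyone else waits */
                      if (waiting[i]) total_wait++;
              }
              free(waiting);
              return requests ? (double)total_wait / requests : 0.0;
          }

          int main(void) {
              int cores[] = { 2, 4, 8, 16, 32 };
              for (int i = 0; i < 5; i++)
                  printf("%2d cores: avg wait %.1f cycles/request\n",
                         cores[i], avg_wait(cores[i]));
              return 0;
          }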

          • by fyngyrz ( 762201 )

            Also, there is caching, and also, some loads are heavy on longish FPU operations.

            So... it doesn't quite work out that way. Also, multicore designs can have separate memory.

            One example of a multicore design that's both interesting and functional is the vector-processor graphics core. Lots of 'em in there, and they get to do a lot of useful work you couldn't really do any other way with similar clock speeds and process tech.

            • by Bengie ( 1121981 )

              Also, multicore designs can have separate memory.

              NUMA comes to mind, but it adds complexity to the OS and the application. Accessing another CPU's memory is expensive, so the OS needs to try to keep the threads close to the data. The applications need to handle cross-socket communication by using a system API, assuming it exists, to find out which socket the thread is on, and try their best to stick to current-socket threads. Cross-socket communication is probably best done by passing messages instead of references, because copying and storing the da
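
              As a concrete illustration of keeping the data next to the thread, a minimal Linux/libnuma sketch (the buffer size is arbitrary; link with -lnuma):

              /* Pin this thread to its current NUMA node and allocate the
               * buffer it touches on that same node, keeping accesses off
               * the cross-socket link. Linux-specific. */
              #define _GNU_SOURCE
              #include <numa.h>
              #include <sched.h>
              #include <stdio.h>

              int main(void) {
                  if (numa_available() < 0) {
                      fprintf(stderr, "no NUMA support here\n");
                      return 1;
                  }
                  int node = numa_node_of_cpu(sched_getcpu());
                  numa_run_on_node(node);          /* stay on this node */

                  size_t len = 64UL << 20;         /* arbitrary 64 MiB */
                  double *buf = numa_alloc_onnode(len, node);
                  if (!buf) { perror("numa_alloc_onnode"); return 1; }

                  for (size_t i = 0; i < len / sizeof(double); i++)
                      buf[i] = (double)i;          /* node-local accesses */

                  printf("filled %zu MiB on node %d\n", len >> 20, node);
                  numa_free(buf, len);
                  return 0;
              }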

            • On an 8-core machine, a processor will be placed into a wait queue roughly 7 out of 8 times that it needs access. Further, the expected length of time in the queue is (1-(1/8)). This is, of course, for an 8-core system. Adding more cores results in the waiting time increasing asymptotically towards infinity.

            Sorry, that doesn't sound right. The expected length of time in the queue should be on the order of nt, where n is the number of cores and t is the average time required to process a memory request. (A better formula would use the average length of the queue instead of n, but to first order it would still be roughly linear in n.) So the time required would increase linearly with the number of cores.

            You're right, I worded it incorrectly (it's late, and I've been working 80hrs/week for the last year due to a private project. Forgive me). What I meant to say was "The expected delay when accessing memory is (1-(1/n))", but even that is off by an entire exponent.

            The expected delay is (probability of queueing) × (probable queue length). The probability of queueing is (1-(1/n)):

            With 2 processors, you have a 1/2 chance of getting exclusive access and a (1-(1/2)) chance of queueing.

            With 3 processors, you have a 1/3 chanc

            • by Bengie ( 1121981 )
              That's why the many-core server CPUs have massive L3 caches and quad-channel memory. A 24-core x86 CPU with around 60 MiB of L3 cache? Why not? More memory channels allow more concurrency of access. Intel NICs support writing packets directly to L3 cache so as to skip memory writes. Large on-NIC buffers make better use of DMA coalescing and reduce memory operations, transferring in larger chunks to make use of that high-bandwidth memory.

              In case it's not clear, I'm not trying to say your point isn't valid, ju
        • It's an interesting idea, and one I have given a little thought to (it would enable a very fault-tolerant computer architecture); however, unless you implement highly redundant interconnects/buses, you still have the N-devices-fighting-for-a-shared-resource problem.

          If you make the assertion that all nodes have a private direct connection to all other nodes, and thus eliminate the bottleneck that way, you now have to gracefully decide how to handle a downed private link.

          I suppose a hybrid might work. Ful

        • So, no. More cores sharing the same memory is not the answer. More cores with private memory is the answer but we don't have any operating system that can actually take advantage of that. A project that I am eyeing for next year is putting together a system that can effectively spread the operating system out over multiple physical memories. While I do not think that this is feasible, it's actually not a bad hobby to tinker with :-)

          I thought Plan 9 was actually doing this?

        • by epine ( 68316 )

          On an 8-core machine, a processor will be placed into a wait queue roughly 7 out of 8 times that it needs access.

          You just snuck into your analysis the assumption that every core is memory-saturated, and I don't think the memory path aggregates in many designs until the L3 cache (L1 is usually not shared, L2 perhaps shared a bit). The real bottleneck ends up being the cache coherency protocol and cache snoop events, handled on a separate bus, which might even have concurrent request channels.

          I think i

    • Unless the computer is figuring out every possible combination one, two, or more steps ahead. That's how computers beat chess, and it could really improve predictive modeling. Depth/lookahead depends on how fast the logic flow branches. To think about it more, it'd be a great factor in human predictive analysis, from driving to combat.

      • by Megol ( 3135005 )

        That isn't possible. First, the number of possibilities explodes fast, and second, we are already at a power wall. Modern processors already do speculative computation, but only in cases where it is likely the result is correct and needed. Just adding more speculative execution would make the computer slower, partly due to extra data movement (caches etc.) and partly because it would consume more power on a chip that is already difficult to cool.

        Branch predictors are doing most of the work already and doing it well.

    • Instead of branch prediction picking the most often used branch, and stalling when it gets it wrong, just take all possible branches and toss out the ones that turn out to be wrong.
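
      A software analogue of "take both paths and throw one away" is if-conversion/predication: evaluate both results and then select one, so there is no branch to mispredict, at the cost of always doing the extra work. Purely illustrative; real eager execution happens in hardware:

      /* Both "paths" are computed; a mask selects the survivor. Assumes the
       * usual two's-complement int. */
      #include <stdio.h>

      static int branchy_max(int a, int b)    { return (a > b) ? a : b; }

      static int branchless_max(int a, int b) {
          int take_a = -(a > b);               /* all ones if a > b, else 0 */
          return (a & take_a) | (b & ~take_a);
      }

      int main(void) {
          printf("%d %d\n", branchy_max(3, 7), branchless_max(3, 7));
          printf("%d %d\n", branchy_max(9, 2), branchless_max(9, 2));
          return 0;
      }
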
      • by Megol ( 3135005 )

        That's called eager execution (well, it has many names, but that's the most common).

        In general it is a dumb idea. It requires more instruction throughput, more execution units, larger caches, etc., for a small gain which in the real world is probably negative. Doing more things means more switching; switching means more power consumption (and the added resources will add to the leakage current too -> more power), and this means a lower effective clock frequency.

        Even the limited form of eager execution where o

      • by Chaset ( 552418 )
        That sounds like we're starting to re-invent the Itanium.
        • Uh, no, in the original Itanium, it was the compiler that was supposed to do the branch prediction
      • Speculative execution? That's already happening, isn't it?
      • That's what a number of RISC processors used to do - execute both sides of a branch and flush out the one that turned out wrong.
    • by godrik ( 1287354 ) on Thursday August 25, 2016 @09:39PM (#52772927)

      That is not really true. Most workloads can be executed in parallel. Pretty much all the fields of scientific computing (be it physics, chemistry, or biology) are typically quite parallel. Databases and data analytics are very parallel as well; if you are building topic models of the web, or trying to find correlations in Twitter posts, these things are highly parallel.

      Even on your machine, you are certainly using a fair amount of parallel computing; most likely video decompression is done in parallel (or it should be). It is the old argument that by decreasing frequency you can increase core count within the same power envelope while increasing performance.

      For sure, some applications are not parallel. Most likely, they are not the ones we really care about. Otherwise, hire me, and I'll write them in parallel :)
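
      A back-of-envelope version of that frequency-versus-core-count argument, using the rough model that dynamic power scales like f^3 (P ~ C*V^2*f with V scaling with f); the numbers are illustrative, not measurements:

      /* One full-clock core versus several half-clock cores in the same
       * power budget, under the crude P ~ f^3 model. */
      #include <stdio.h>

      int main(void) {
          double p_budget = 1.0;                       /* one core at full clock */
          double f_slow   = 0.5;
          double p_slow   = f_slow * f_slow * f_slow;  /* ~1/8 power per core */
          double cores    = p_budget / p_slow;         /* ~8 cores fit */
          double parallel_throughput = cores * f_slow; /* ~4x on parallel work */
          printf("%.0f half-clock cores, ~%.1fx parallel throughput\n",
                 cores, parallel_throughput);
          return 0;
      }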

  • by Anonymous Coward on Thursday August 25, 2016 @06:36PM (#52772091)

    The type of cores:
    Some of the OpenPiton® features are listed below:

            OpenSPARC T1 Cores (SPARC V9)
            Written in Verilog HDL
            Scalable up to 1/2 Billion Cores
            Large Test Suite (>8000 tests)
            Single Tile FPGA (Xilinx ML605) Prototype

    The bit that may put some people off:
    This work was partially supported by the NSF under Grants No. CCF-1217553, CCF-1453112, and CCF-1438980, AFOSR under Grant No. FA9550-14-1-0148, and DARPA under Grants No. N66001-14-1-4040 and HR0011-13-2-0005. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of our sponsors.

    So an interesting and possibly FPGA-synthesizable test processor it may be. Trustworthy computer core it may *NOT* be. (You would have to compare it to the original T1 cores, and have had those independently audited, to ensure no nefarious timing attacks, etc. were in place.)

    Now, having said that, if this interconnect is even a fraction as good as they claim, it could make for an AWESOME libre SPARC implementation competitive with Intel/AMD for non-Wintel computing uses. Bonus for someone taping out an AM3+ socket chip (or AM4, if all the signed firmware is SoC-side and not motherboard/southbridge-side) that can be initialized on a commercially available board with standard expansion hardware. AM3/3+ would offer both IGP and discrete graphics options if a chip could be spun out by the middle of 2017, and if AMD were convinced to continue manufacturing their AM3 chipset lines we could have 'libreboot/os' systems for everything except certain hardware initialization blobs. IOMMUv1 support on the 9x0 (!960) chipsets could handle most of the untrustworthy hardware in a sandbox as well, although you would lose out on HSA/Xeon Phi support due to the lack of 64-bit BARs and memory ranges.

    • by Anonymous Coward

      Now that was a good /. post, reminiscent of yesteryear.

      Get off my lawn.

    • by wbr1 ( 2538558 )
      So the NSF is not to be trusted? Are they the sister of the NSA?

      Tinfoil is apt in many circumstances, but geez keep it where it belongs.

      • I think he meant the AFOSR and DARPA involvement.
      • by rthille ( 8526 )

        I imagine the poster was referring to DARPA more than the NSF, but I imagine that any association with the US Govt. could engender distrust in such matters these (post Snowden) days.

        • I imagine that any association with the US Govt. could engender distrust in such matters these (post Snowden) days.

          It might engender quantitatively more distrust than in pre-Snowden days, but probably not deeper distrust. The US government has been deeply distrusted by the rest of the world for a long time.

    • For those wondering why the distrust, here [ieee.org] is a good article describing why the US govt is not to be trusted.
      • They're not the government, they're funded by government grants. As someone who has been funded by government grants, I can assure you it is completely different.

    • Ok, so it's OpenSPARC based? That's cool. So what do they have running on it - Linux, BSD or Solaris?
    • Branch delay slots? Register windows? This is one of the first RISC architectures, and it has warts [jwhitham.org]. Fujitsu just abandoned [pcworld.com] it.

  • by Areyoukiddingme ( 1289470 ) on Thursday August 25, 2016 @07:56PM (#52772463)

    Perhaps more interesting is the semi-detailed presentation [youtube.com] about AMD's Zen. Other people have already pointed out that a paltry few hundred million transistors doesn't get you very far. What are the billions of transistors used for? The Zen presentation is quite informative. Loads of cache is a fair chunk of it. Überfancy predictive logic is another big chunk of it. The rest is absorbed by 4 completely parallel ALUs, two parallel AGUs, and a completely independent floating point section with two MUL and two ADD logics. And after all that, what you get is parity with Intel's Broadwell. Barely.

    So for perspective, that took a decade of hard labor by quite well paid engineers, and there was no low-hanging fruit in the form of the register-starved x86 architecture for AMD to pluck this time. The difference between half a billion and two billion transistors is very very substantial.

    • by Megol ( 3135005 )

      Perhaps more interesting is the semi-detailed presentation [youtube.com] about AMD's Zen. Other people have already pointed out that a paltry few hundred million transistors doesn't get you very far. What are the billions of transistors used for? The Zen presentation is quite informative. Loads of cache is a fair chunk of it. Überfancy predictive logic is another big chunk of it. The rest is absorbed by 4 completely parallel ALUs, two parallel AGUs, and a completely independent floating point section with two MUL and two ADD logics. And after all that, what you get is parity with Intel's Broadwell. Barely.

      Intel Broadwell-E. There's a big difference. And barely being at parity with one of the best-performing processors in the world (classified by Intel as an "enthusiast" processor) is a good thing.

      So for perspective, that took a decade of hard labor by quite well paid engineers, and there was no low-hanging fruit in the form of the register-starved x86 architecture for AMD to pluck this time. The difference between half a billion and two billion transistors is very very substantial.

      Yes, it is a factor of 4. Given that Zen is (or is to be) mass-produced on a smaller process, the price per chip is probably skewed strongly in AMD's favor. Performance is likely to be better for Zen on real-world code (read: not embarrassingly parallel).

  • The good news is that this thing uses an existing processor core, OpenSPARC T1 (SPARC V9), so there's plenty of software around for it. (Yes, it runs -- or I imagine it will soon -- Linux.)

    The bad news is that this thing uses an existing processor core, instead of a more secure architecture (say, something segment based with tag bits, like the B6700 among others) which would render it much more resistant (dare I say immune?) to things like buffer overflows and such.
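
    For anyone who hasn't seen the class of bug being referred to: on a conventional flat-memory machine nothing stops the write below from running past the end of the buffer, whereas a segment/tag-based design of the kind described would trap at the out-of-bounds access. Deliberately broken, illustrative code:

    /* Classic out-of-bounds write: silently corrupts whatever sits after
     * buf on a flat-memory machine. Do not reuse. */
    #include <string.h>

    int main(void) {
        char buf[8];
        const char *input = "this string is much longer than eight bytes";
        strcpy(buf, input);   /* writes past the end of buf: undefined behavior */
        return 0;
    }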

    • The bad news is that this thing uses an existing processor core, instead of a more secure architecture (say, something segment based with tag bits, like the B6700 among others) which would render it much more resistant (dare I say immune?) to things like buffer overflows and such.

      Sounds like it's intended for HPC though, so security's not much of an issue.

    • Plenty of software, yes - given that SunOS and Solaris were overwhelmingly the most popular UNIXes of the day. Linux? RedHat dropped support for it some versions ago, and I'm not sure whether Debian still does or not. I know that all 3 BSDs - OpenBSD, NetBSD and FreeBSD - do.
