Become a fan of Slashdot on Facebook

 



Forgot your password?
typodupeerror
×
Intel Hardware

Intel's Knights Landing — 72 Cores, 3 Teraflops 208

New submitter asliarun writes "David Kanter of Realworldtech recently posted his take on Intel's upcoming Knights Landing chip. The technical specs are massive, showing Intel's new-found focus on throughput processing (and possibly graphics). 72 Silvermont cores with beefy FP and vector units, mesh fabric with tile based architecture, DDR4 support with a 384-bit memory controller, QPI connectivity instead of PCIe, and 16GB on-package eDRAM (yes, 16GB). All this should ensure throughput of 3 teraflop/s double precision. Many of the architectural elements would also be the same as Intel's future CPU chips — so this is also a peek into Intel's vision of the future. Will Intel use this as a platform to compete with nVidia and AMD/ATI on graphics? Or will this be another Larrabee? Or just an exotic HPC product like Knights Corner?"
This discussion has been archived. No new comments can be posted.

Intel's Knights Landing — 72 Cores, 3 Teraflops

Comments Filter:
  • Programmability? (Score:5, Informative)

    by gentryx ( 759438 ) * on Saturday January 04, 2014 @07:20PM (#45867509) Homepage Journal

    I wonder how nice these will be to program. The "just recompile and run" promise for Knights Corner was little more than a cruel joke: to get any serious performance out of the current generation of MICs you have to wrestle with vector intrinsics and that stupid in-order architecture. At least the latter will apparently be dropped in Knights Landing.

    For what it's worth: I'll be looking forward to NVIDIA's Maxwell. At least CUDA got the vectorization problem sorted out. And no: not even the Intel compiler handles vectorization well.

  • Requires parallelism (Score:5, Informative)

    by tepples ( 727027 ) <tepples.gmail@com> on Saturday January 04, 2014 @07:42PM (#45867617) Homepage Journal
    Multicore implies more speed only if your process is parallelized. Not all interactive processes on a single-user computer can be, wrote Amdahl [wikipedia.org].
  • by rsmith-mac ( 639075 ) on Saturday January 04, 2014 @08:04PM (#45867711)

    "eDRAM" in this article is almost certainly an error for that reason.

    eDRAM isn't very well defined, but it basically boils down to "DRAM manufactured on a modified logic process," allowing it to be placed on-die alongside logic, or at the very least built using the same tools if you're a logic house (Intel, TSMC, etc). This is as opposed to traditional DRAM, which is made on dedicated processes that is optimized for space (capacitors) and follows its own development cadence.

    The article notes that this is on-package as opposed to on-die memory, which under most circumstances would mean regular DRAM would work just fine. The biggest example of on-package RAM would be SoCs, where the DRAM is regularly placed in the same package for size/convenience and then wire-bonded to the processor die (although alternative connections do exist). Conversely eDRAM is almost exclusively used on-die with logic - this being its designed use - chiefly as a higher density/lower performance alternative to SRAM. You can do off-die eDRAM, which is what Intel does for Crystalwell, but that's almost entirely down to Intel using spare fab capacity and keeping production in house (they don't make DRAM) as opposed to technical requirements. Which is why you don't see off-die eDRAM regularly used.

    Or to put it bluntly, just because DRAM is on-package doesn't mean it's eDRAM. There are further qualifications to making it eDRAM than moving the DRAM die closer to the CPU.

    But ultimately as you note cost would be an issue. Even taking into account process advantages between now and the Knight's Landing launch, 16GB of eDRAM would be huge. Mind bogglingly huge. Many thousands of square millimeters huge. Based on space constraints alone it can't be eDRAM; it has to be DRAM to make that aspect work, and even then 16GB of DRAM wouldn't be small.

  • by tepples ( 727027 ) <tepples.gmail@com> on Saturday January 04, 2014 @08:20PM (#45867773) Homepage Journal
    You saw a speed-up because video and 3D are in a class of problems that are very easy to parallelize [wikipedia.org]. So is decompressing all the images in an HTML document. Laying out the document, on the other hand, isn't so easy to parallelize, if only because every floating box theoretically affects all the boxes that follow it.
  • by Animats ( 122034 ) on Saturday January 04, 2014 @08:33PM (#45867821) Homepage

    OK, we have yet another mesh of processors, an idea that comes back again and again. The details of how processors communicate really matter. Is this is a totally non-shared-memory machine? Is there some shared memory, but it's slow? If there's shared memory, what are the cache consistency rules?

    Historically, meshes of processors without shared memory have been painful to program. There's a long line of machines, from the nCube to the Cell, where the hardware worked but the thing was too much of a pain to program. Most designs have suffered from having too little local memory per CPU. If there's enough memory per CPU to, well, run at least a minimal OS and some jobs, then the mesh can be treated as a cluster of intercommunicating peers. That's something for which useful software exists. If all the CPUs have to be treated as slaves of a control machine, then you need all-new software architectures to handle them. This usually results in one-off software that never becomes mature.

    Basic truth: we only have three successful multiprocessor architectures that are general purpose - shared-memory multiprocessors, clusters, and GPUs. Everything other than that has been almost useless except for very specialized problems fitted to the hardware. Yet this problem needs to be cracked - single CPUs are not getting much faster.

  • by Anonymous Coward on Saturday January 04, 2014 @09:50PM (#45868113)

    It may not be eDRAM, but I'm not sure what else Intel would easily package with the chip. We know the 128 MB of eDRAM on 22 nm is ~80 mm^2 of silicon, currently Intel is selling ~100 mm^2 of N-1 node silicon for ~$10 or less (See all the ultra cheap 32 nm clover trail+ tablets where they're winning sockets against allwinner, rockchip, etc., indicating that they must be selling them for equivalent or better prices than these companies.) By the time this product comes out 22 nm will be the N-1 node. In addition, a dedicated eDRAM chip is probably cheaper than a typical SoC/logic chip due to the smaller number of metal levels that are needed. Assuming N-1 node prices hold for a given area of silicon, 16 GB will need 12000 mm^2 of silicon (likely less as the current 128 MB die likely uses a not insignificant area for readout circuitry and PHY interface), coming out to around $1200. Add an extra $1000 for your actual processor and you have the current price of a low end Xeon Phi.

  • by joib ( 70841 ) on Sunday January 05, 2014 @03:47AM (#45869219)
    The mesh replaces the ring bus used in the current generation MIC as well as mainstream Intel x86 CPU's. Each node in the mesh is 2 CPU cores and L2 cache. The mesh is used for connecting to the DRAM controllers, external interfaces, L3 cache, and of course, for cache coherency. The memory consistency model is the standard x86 one. So from a programmability point of view, it's a multi-core x86 processor, albeit with slow serial performance and beefy vector units.
  • by Guy Harris ( 3803 ) <guy@alum.mit.edu> on Sunday January 05, 2014 @03:57AM (#45869253)

    Where are you getting Atom cores from?

    From this Extremetech article [extremetech.com], which has a slide speaking of the Knights Landing processor architecture having "up to 72 Intel Architecture cores based on Silvermont (Intel(R) Atom processor)"?

  • by Bengie ( 1121981 ) on Sunday January 05, 2014 @03:33PM (#45872355)
    You're both correct. The original Atom cpu was built separately and started before the i7 arch. The new Silvermont "Atom" is based a lot of the i7 arch. It is a huge upgrade to the Atom line. It's like the original i7 fine tuned for power and running on 22nm. Very strong OoO pipeline design. The low power usage is great for a many core design because efficiently is more important than single-threaded performance.

Those who can, do; those who can't, write. Those who can't write work for the Bell Labs Record.

Working...