Intel Hardware

Intel's Knights Landing — 72 Cores, 3 Teraflops 208

Posted by Soulskill
from the go-big-or-go-home dept.
New submitter asliarun writes "David Kanter of Realworldtech recently posted his take on Intel's upcoming Knights Landing chip. The technical specs are massive, showing Intel's new-found focus on throughput processing (and possibly graphics). 72 Silvermont cores with beefy FP and vector units, mesh fabric with tile based architecture, DDR4 support with a 384-bit memory controller, QPI connectivity instead of PCIe, and 16GB on-package eDRAM (yes, 16GB). All this should ensure throughput of 3 teraflop/s double precision. Many of the architectural elements would also be the same as Intel's future CPU chips — so this is also a peek into Intel's vision of the future. Will Intel use this as a platform to compete with nVidia and AMD/ATI on graphics? Or will this be another Larrabee? Or just an exotic HPC product like Knights Corner?"
This discussion has been archived. No new comments can be posted.

  • Imagine (Score:3, Funny)

    by Konster (252488) on Saturday January 04, 2014 @06:14PM (#45867485)

    Imagine a Beowulf cluster of these!

  • Summary asks:

    Will Intel use this as a platform to compete with nVidia and AMD/ATI on graphics?

    ...but first it says it has 16GB of eDRAM. The 128MB of eDRAM in their "Iris Pro" adds almost $200 to the price tag.

    This chip is going to cost MANY THOUSANDS OF DOLLARS.
    • by rsmith-mac (639075) on Saturday January 04, 2014 @07:04PM (#45867711)

      "eDRAM" in this article is almost certainly an error for that reason.

      eDRAM isn't very well defined, but it basically boils down to "DRAM manufactured on a modified logic process," allowing it to be placed on-die alongside logic, or at the very least built using the same tools if you're a logic house (Intel, TSMC, etc). This is as opposed to traditional DRAM, which is made on dedicated processes that are optimized for space (capacitors) and follow their own development cadence.

      The article notes that this is on-package as opposed to on-die memory, which under most circumstances would mean regular DRAM would work just fine. The biggest example of on-package RAM would be SoCs, where the DRAM is regularly placed in the same package for size/convenience and then wire-bonded to the processor die (although alternative connections do exist). Conversely eDRAM is almost exclusively used on-die with logic - this being its designed use - chiefly as a higher density/lower performance alternative to SRAM. You can do off-die eDRAM, which is what Intel does for Crystalwell, but that's almost entirely down to Intel using spare fab capacity and keeping production in house (they don't make DRAM) as opposed to technical requirements. Which is why you don't see off-die eDRAM regularly used.

      Or to put it bluntly, just because DRAM is on-package doesn't mean it's eDRAM. There are further qualifications to making it eDRAM than moving the DRAM die closer to the CPU.

      But ultimately, as you note, cost would be an issue. Even taking into account process advantages between now and the Knights Landing launch, 16GB of eDRAM would be huge. Mind-bogglingly huge. Many thousands of square millimeters huge. Based on space constraints alone it can't be eDRAM; it has to be DRAM to make that aspect work, and even then 16GB of DRAM wouldn't be small.
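The "many thousands of square millimeters" estimate can be checked with a rough back-of-the-envelope sketch. It assumes the figure quoted elsewhere in this thread (128 MB of eDRAM on 22 nm is about 80 mm^2) and linear scaling with capacity. The `shrink_factor` parameter is a hypothetical knob for a denser future process, not a number from the article.

```python
# Figure quoted in this thread: 128 MB of eDRAM on 22 nm is ~80 mm^2.
EDRAM_MB = 128
EDRAM_AREA_MM2 = 80.0


def edram_area_mm2(capacity_gb, shrink_factor=1.0):
    """Estimate die area in mm^2 for capacity_gb of eDRAM.

    Assumes area scales linearly with capacity at the density above.
    shrink_factor is a hypothetical density gain from a newer process
    (1.0 means no gain, 2.0 means twice as dense).
    """
    capacity_mb = capacity_gb * 1024
    return capacity_mb / EDRAM_MB * EDRAM_AREA_MM2 / shrink_factor


print(edram_area_mm2(16))       # 10240.0
print(edram_area_mm2(16, 2.0))  # 5120.0
```

Even with an assumed 2x density gain the total stays in the thousands of mm^2, which is consistent with the argument above.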

      • Re: (Score:3, Informative)

        by Anonymous Coward

        It may not be eDRAM, but I'm not sure what else Intel would easily package with the chip. We know the 128 MB of eDRAM on 22 nm is ~80 mm^2 of silicon, currently Intel is selling ~100 mm^2 of N-1 node silicon for ~$10 or less (See all the ultra cheap 32 nm clover trail+ tablets where they're winning sockets against allwinner, rockchip, etc., indicating that they must be selling them for equivalent or better prices than these companies.) By the time this product comes out 22 nm will be the N-1 node. In additi

    • An Nvidia Quadro card costs $8,000 for an 8GB card. I would consider $8,000 "many thousands of dollars". Nobody is suggesting Knights ____ is competing with any consumer chips CPU or GPU. I have a $1,500 Raytracing card in my system along with a $1,000 GPU as well as a $1,000 CPU. If this could replace the CPU and GPU but compete with a dual CPU system for rendering performance I would be a happy camper even if it cost $3-4k.

  • Programmability? (Score:5, Informative)

    by gentryx (759438) * on Saturday January 04, 2014 @06:20PM (#45867509) Homepage Journal

    I wonder how nice these will be to program. The "just recompile and run" promise for Knights Corner was little more than a cruel joke: to get any serious performance out of the current generation of MICs you have to wrestle with vector intrinsics and that stupid in-order architecture. At least the latter will apparently be dropped in Knights Landing.

    For what it's worth: I'll be looking forward to NVIDIA's Maxwell. At least CUDA got the vectorization problem sorted out. And no: not even the Intel compiler handles vectorization well.

    • by godrik (1287354)

      Actually the in-order execution isn't so much of a problem in my experience. The vectorization is a real problem. But you essentially have the same problem except it is hidden in the programming model. But the performance problems are there as well.

      Anybody that understands GPU architecture well enough to write efficient code there won't have much of a problem using the MIC architecture. The programming model is different but the key difficulties are essentially the same. If you think about mic simd element as a CUDA th

      • It's not entirely syntactical. Local shared memory is exposed to the CUDA programmer (e.g., __syncthreads()). CUDA programmers also have to be mindful of register pressure and the L1 cache. These issues directly affect the algorithms used by CUDA programmers. CUDA programmers have control over very fast local memory---I believe that this level of control is missing from MIC's available programming models. Being closer to the metal usually means a harder time programming, but higher performance potenti

        • by godrik (1287354)

          I don't understand. Mic is your regular cache based architecture. Accessing L1 cache in mic is very fast (3 cycle latency if my memory is correct). You have similar register constraints on mic with 32 512-bit vectors per thread(core maybe). Both architectures overlap memory latency by using hardware threading.

          I programmed both mic and gpu, mainly on sparse algebra and graph kernels. And quite frankly there are differences but i find much more alike than most people acknowledge. The main difference in my op

    • by imevil (260579)

      I wonder how nice these will be to program. The "just recompile and run" promise for Knights Corner was little more than a cruel joke

      I tried recompiling and running some OpenCL code (that previously was running on GPUs). It was "just recompile and run" and the promises about performances were kept. But still, OpenCL is not what most people consider "nice to program".

      • by gentryx (759438) *
        Yeah, OpenCL is a different thing. But if you talk to laymen, they will often repeat the marketing speak that you take your OpenMP(!) code written for traditional multi-cores, recompile and enjoy... Not true, in my experience.
    • Intel's AVX-512 is really friggin cool, and a huge departure from their SIMD of the past. It adds some important features -- most notably mask registers to optimally support complex branching -- which make it nearly identical to GPU coding so that compilers will have a dramatically easier time targeting it. I doubt it will kill discrete GPUs any time soon, but it's a big step in that long-term direction.
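To illustrate the mask-register point, here is a minimal sketch of predicated execution written in plain Python rather than real AVX-512 intrinsics. The function names are hypothetical, and the lists stand in for vector lanes. It computes both sides of a branch for every lane and then blends the results with a per-lane mask; real hardware masks can also suppress work on inactive lanes, which this sketch does not model.

```python
def branchy_scalar(xs):
    """Reference version: one data-dependent branch per element."""
    out = []
    for x in xs:
        if x < 0.0:
            out.append(-x * 2.0)
        else:
            out.append(x + 1.0)
    return out


def masked_select(mask, if_true, if_false):
    """Per-lane blend: take if_true where the mask bit is set, else if_false."""
    return [t if m else f for m, t, f in zip(mask, if_true, if_false)]


def branchy_masked(xs):
    """Branch-free version: compare once, compute both paths, blend by mask."""
    mask = [x < 0.0 for x in xs]       # one compare yields a mask for all lanes
    then_path = [-x * 2.0 for x in xs]  # "then" side evaluated on every lane
    else_path = [x + 1.0 for x in xs]   # "else" side evaluated on every lane
    return masked_select(mask, then_path, else_path)


lanes = [-1.5, 0.0, 2.0, -4.0]
print(branchy_scalar(lanes))  # [3.0, 1.0, 3.0, 8.0]
print(branchy_masked(lanes))  # [3.0, 1.0, 3.0, 8.0]
```

This is the same transformation a GPU applies when threads in a group diverge, which is presumably why the comment calls the two models nearly identical.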

    • The recently revealed Mill architecture [ootbcomp.com] is far more interesting, and also offers a much more attractive programming model. It is a highly orthogonal architecture naturally capable of wide MIMD and SIMD. Vectorization and software pipelining of loops is discussed in the "metadata" talk, and is very clever and elegant. Those who have personally experienced the tedium of typical vector extensions will appreciate it all the more.

      Based on sim, the creators expect an order of magnitude improvement of performan

  • In my opinion, the point of using x86 in order to reuse units from desktop/server CPUs is the base of these experiments. The counterpart is to deal with the x86-mess everywhere. This seems a desperate reaction to AMD's CPU+GPGPU, which also has drawbacks. I bet that both Intel and AMD prefer to keep the memory controller as simple as possible, having a comfortable long run, without burning their ships too early. E.g. a CPU+GPGPU in the same die, with 8 x 128 bit separate memory controllers configured as NUMA (
    • by cnettel (836611)
      20 years? I would be very doubtful regarding any prediction beyond the point where current process scaling trends finally break. Note, they might break the other way. Switching to a non-silicon material might allow higher frequencies which will again shift the tradeoff between locality, energy, and production cost. But there is no reason, no reason at all, to expect the current style to last for more than ten years, while you could be quite right that it could stay much the same for the next five years or s
    • 8 128bit memory controllers? 1024 pins just for the memory bus? you've got to be kidding.

  • You aren't ever going to see this at Newegg.
  • Too bad most Intel CPUs don't have it and just about all 2011 boards don't use it. The ones that do use it for dual CPU.

    Too bad the Apple Mac Pro does not have this and is not likely to use it any time soon.

  • Unobtainium (Score:3, Insightful)

    by Anonymous Coward on Saturday January 04, 2014 @07:12PM (#45867745)

    This is another one of those IBM things made from the most rare element in the universe: unobtainium. You can't get it here. You can't get it there either. At one point I would have argued otherwise, but no. CUDA cores I can get. This crap I can't get. It's just like the Cell Broadband engine. Remember that? If you bought a PS3, then it had a (slightly crippled) one of those in it. Except that it had no branch prediction. And one of the main cores was disabled. And you couldn't do anything with the integrated graphics. And if you wanted to actually use the co-processor functions, you had to re-write your applications. And you needed to let IBM drill into your teeth and then do a rectal probe before you could get any of the software to make it work. And it only had 256MB of RAM. And you couldn't upgrade or expand that. With IBM's new wonder, we get the promise of 72 cores. If you have a dual-Xeon processor. And give IBM a million dollars. And you sign a bunch of papers letting them hook up the high voltage rectal probes. Or you could buy a Kepler NVIDIA card which you can install into the system you already own, and it costs about the same as a half-decent monitor. And NVIDIA's software is publicly downloadable. So is this useful to me or 99.999% of the people on /.? No. It's news for nerds, but only four guys can afford it: Bill G., Mark Z., Larry P. and Sergey B.

    • by Guy Harris (3803)

      This is another one of those IBM things made from the most rare element in the universe: unobtainium

      Presumably meaning "this is like those IBM things", given that, while the first word of the title begins with "I", it doesn't have "B" or "M" following it, it has "n", "t", "e", and "l", instead.

    • This is x86. Theoretically your program already runs on this. You don't have to rewrite your entire application to run on CUDA.

    • "It's news for nerds, but only four guys can afford it: Bill G., Mark Z., Larry P. and Sergey B."

      I would rather have that market than all of the rest.

  • by Animats (122034) on Saturday January 04, 2014 @07:33PM (#45867821) Homepage

    OK, we have yet another mesh of processors, an idea that comes back again and again. The details of how processors communicate really matter. Is this a totally non-shared-memory machine? Is there some shared memory, but it's slow? If there's shared memory, what are the cache consistency rules?

    Historically, meshes of processors without shared memory have been painful to program. There's a long line of machines, from the nCube to the Cell, where the hardware worked but the thing was too much of a pain to program. Most designs have suffered from having too little local memory per CPU. If there's enough memory per CPU to, well, run at least a minimal OS and some jobs, then the mesh can be treated as a cluster of intercommunicating peers. That's something for which useful software exists. If all the CPUs have to be treated as slaves of a control machine, then you need all-new software architectures to handle them. This usually results in one-off software that never becomes mature.

    Basic truth: we only have three successful multiprocessor architectures that are general purpose - shared-memory multiprocessors, clusters, and GPUs. Everything other than that has been almost useless except for very specialized problems fitted to the hardware. Yet this problem needs to be cracked - single CPUs are not getting much faster.

    • by dbIII (701233)

      Historically, meshes of processors without shared memory have been painful to program

      Which is why we don't see those GPU cards in absolutely every place where there is a massively parallel problem to solve. Even 8GB is not enough for some stuff and you spend so much time trying to keep the things fed that the problem could already be solved on the parent machine.

    • by joib (70841) on Sunday January 05, 2014 @02:47AM (#45869219)
      The mesh replaces the ring bus used in the current generation MIC as well as mainstream Intel x86 CPUs. Each node in the mesh is 2 CPU cores and L2 cache. The mesh is used for connecting to the DRAM controllers, external interfaces, L3 cache, and of course, for cache coherency. The memory consistency model is the standard x86 one. So from a programmability point of view, it's a multi-core x86 processor, albeit with slow serial performance and beefy vector units.
  • by Required Snark (1702878) on Saturday January 04, 2014 @08:51PM (#45868115)
    This will have the same usability as the CELL CPU. From TFA:

    Second, while Knights Landing can act as a bootable CPU, many applications will demand greater single threaded performance due to Amdahl’s Law. For these workloads, the optimal configuration is a Knights Landing (which provides high throughput) coupled to a mainstream Xeon server (which provides single threaded performance). In this scenario, latency is critical for communicating results between the Xeon and Knights Landing.

    So there will be a useful mainstream CPU closely coupled with a bunch of vector oriented processors that will be hard to use effectively. (Also from TFA).

    The rumors also state that the KNL core will replace each of the floating point pipelines in Silvermont with a full blown 512-bit AVX3 vector unit, doubling the FLOPs/clock to 32.

    So unless there is a very high compute to memory access ratio this monster will spend most of its time waiting for memory and converting electrical energy to heat. Plus writing software that uses 72 cores is such a walk in the park...
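The numbers quoted above can be tied together with some rough arithmetic. This sketch works out the clock speed implied by 72 cores, 32 FLOPs/clock per core, and the 3 teraflop/s headline figure, and applies Amdahl's Law to show why the article pairs the chip with a Xeon. The 95% parallel fraction is an illustrative assumption, not a figure from the article, and the law is applied as if all cores were identical.

```python
def implied_clock_ghz(target_tflops, cores, flops_per_clock):
    """Clock (GHz) needed to reach target_tflops at peak issue rate."""
    flops_per_second = target_tflops * 1e12
    return flops_per_second / (cores * flops_per_clock) / 1e9


def amdahl_speedup(parallel_fraction, cores):
    """Amdahl's Law: speedup over one core when only part of the work scales."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / cores)


# Figures from the summary and TFA: 72 cores, 32 FLOPs/clock, 3 TFLOP/s.
print(f"implied clock: {implied_clock_ghz(3, 72, 32):.2f} GHz")  # about 1.30 GHz

# Hypothetical workload that is 95% parallel (an assumption for illustration).
print(f"speedup on 72 cores: {amdahl_speedup(0.95, 72):.1f}x")
```

Under that assumed 95% figure the speedup comes out well short of 72x, so the serial 5% dominates, which is the case for handing that part to a faster conventional core.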

    • by dbIII (701233)

      Plus writing software that uses 72 cores is such a walk in the park

      Some stuff actually is. It depends on how trivially parallel the problem is. With some stuff there is no interaction at all between the threads - feed it the right subset of the input - process the data - dump it out.
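The "feed it the right subset, process, dump it out" pattern can be sketched as follows. This is a minimal illustration using Python's standard `concurrent.futures` module; `process_chunk`, `split`, and `run_parallel` are hypothetical names, and the sum of squares is just a placeholder workload. Threads are used only to show the structure: in CPython, CPU-bound work would need processes or native code to see a real speedup.

```python
from concurrent.futures import ThreadPoolExecutor


def process_chunk(chunk):
    """Placeholder per-worker job: needs nothing from any other chunk."""
    return sum(x * x for x in chunk)


def split(data, parts):
    """Cut data into at most `parts` contiguous chunks of near-equal size."""
    size = max(1, (len(data) + parts - 1) // parts)
    return [data[i:i + size] for i in range(0, len(data), size)]


def run_parallel(data, workers=72):
    """Feed each worker its own subset and collect the results in order."""
    chunks = split(data, workers)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(process_chunk, chunks))


data = list(range(1000))
partials = run_parallel(data)
print(sum(partials) == sum(x * x for x in data))  # True
```

Because no chunk depends on another, the only coordination is the final collection of results, which is what makes this class of problem easy to spread over many cores.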

      • by cnettel (836611)

        Plus writing software that uses 72 cores is such a walk in the park

        Some stuff actually is. It depends on how trivially parallel the problem is. With some stuff there is no interaction at all between the threads - feed it the right subset of the input - process the data - dump it out.

        More importantly, for some applications a limited amount of very low-latency/high-bandwidth communication is enough to give spectacular performance improvements. In those cases, the fully coherent x86 model, kept up by this kind of cache and memory architecture, will do wonders, compared to an MPI implementation with weaker individual nodes, but also possibly against (current) nVidia offerings. It's harder to say how it will stack up against Maxwell.

  • My slow ass typing in MS Word will be FASTER than ever!
