Intel's Knights Landing — 72 Cores, 3 Teraflops 208
New submitter asliarun writes "David Kanter of Realworldtech recently posted his take on Intel's upcoming Knights Landing chip. The technical specs are massive, showing Intel's new-found focus on throughput processing (and possibly graphics). 72 Silvermont cores with beefy FP and vector units, mesh fabric with tile based architecture, DDR4 support with a 384-bit memory controller, QPI connectivity instead of PCIe, and 16GB on-package eDRAM (yes, 16GB). All this should ensure throughput of 3 teraflop/s double precision. Many of the architectural elements would also be the same as Intel's future CPU chips — so this is also a peek into Intel's vision of the future. Will Intel use this as a platform to compete with nVidia and AMD/ATI on graphics? Or will this be another Larrabee? Or just an exotic HPC product like Knights Corner?"
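As a back-of-the-envelope check on that 3 teraflop/s figure (the per-core vector width below is my assumption, not from the article: two 512-bit FMA units would give 32 double-precision flops per cycle per core):

```python
# Sanity check: what clock speed does 3 TFLOP/s across 72 cores imply?
# Assumption (not confirmed by the article): two 512-bit FMA units per
# core, i.e. 2 units * 8 doubles * 2 ops (mul+add) = 32 DP flops/cycle.
cores = 72
flops_per_cycle_per_core = 2 * 8 * 2
target_flops = 3e12

clock_hz = target_flops / (cores * flops_per_cycle_per_core)
print(f"required clock: {clock_hz / 1e9:.2f} GHz")  # required clock: 1.30 GHz
```

A ~1.3 GHz clock is plausible for a throughput part, which suggests the 3 TFLOP/s number hangs together under those assumptions.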
Imagine (Score:3, Funny)
Imagine a Beowulf cluster of these!
Re: (Score:2)
It wouldn't be very different from the most powerful supercomputer in the world: http://top500.org/system/177999 [top500.org]
Imagine, 2 (Score:3, Funny)
ipad (Score:5, Funny)
They tested this for the next iPad. While Apple felt the 5-second battery life was too short to be practical, the beta testers were more concerned about the Apple-shaped 3rd-degree burns imprinted on their thighs and palms.
Re: (Score:2)
the beta testers were more concerned about the Apple-shaped 3rd-degree burns imprinted on their thighs and palms
Some people would see this as a feature.
Re: ipad (Score:5, Funny)
To be fair, Apple are very committed to branding.
Re: (Score:2)
LMAO
Re: (Score:3, Funny)
Forget about Linux! With this baby, I can finally run Crysis.
No it cannot compete with nVidia and AMD/ATI (Score:2)
Will Intel use this as a platform to compete with nVidia and AMD/ATI on graphics?
This chip is going to cost MANY THOUSANDS OF DOLLARS.
Re:No it cannot compete with nVidia and AMD/ATI (Score:5, Informative)
"eDRAM" in this article is almost certainly an error for that reason.
eDRAM isn't very well defined, but it basically boils down to "DRAM manufactured on a modified logic process," allowing it to be placed on-die alongside logic, or at the very least built using the same tools if you're a logic house (Intel, TSMC, etc). This is as opposed to traditional DRAM, which is made on dedicated processes that are optimized for space (capacitors) and follow their own development cadence.
The article notes that this is on-package as opposed to on-die memory, which under most circumstances would mean regular DRAM would work just fine. The biggest example of on-package RAM would be SoCs, where the DRAM is regularly placed in the same package for size/convenience and then wire-bonded to the processor die (although alternative connections do exist). Conversely eDRAM is almost exclusively used on-die with logic - this being its designed use - chiefly as a higher density/lower performance alternative to SRAM. You can do off-die eDRAM, which is what Intel does for Crystalwell, but that's almost entirely down to Intel using spare fab capacity and keeping production in house (they don't make DRAM) as opposed to technical requirements. Which is why you don't see off-die eDRAM regularly used.
Or to put it bluntly, just because DRAM is on-package doesn't mean it's eDRAM. There are further qualifications to making it eDRAM than moving the DRAM die closer to the CPU.
But ultimately as you note cost would be an issue. Even taking into account process advantages between now and the Knights Landing launch, 16GB of eDRAM would be huge. Mind bogglingly huge. Many thousands of square millimeters huge. Based on space constraints alone it can't be eDRAM; it has to be DRAM to make that aspect work, and even then 16GB of DRAM wouldn't be small.
Re: (Score:3, Informative)
It may not be eDRAM, but I'm not sure what else Intel would easily package with the chip. We know the 128 MB of eDRAM on 22 nm is ~80 mm^2 of silicon; currently Intel is selling ~100 mm^2 of N-1 node silicon for ~$10 or less (see all the ultra-cheap 32 nm Clover Trail+ tablets where they're winning sockets against Allwinner, Rockchip, etc., indicating that they must be selling them at equivalent or better prices than these companies). By the time this product comes out, 22 nm will be the N-1 node. In additi
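Running the parent's own numbers forward (a naive linear scaling that ignores density gains from newer processes):

```python
# Naive scaling of the parent's figure: 128 MB of 22 nm eDRAM ~ 80 mm^2.
# Ignores any density improvement from process shrinks.
mm2_per_128mb = 80
scale = (16 * 1024) / 128       # 16 GB expressed in 128 MB units
area_mm2 = scale * mm2_per_128mb
print(f"{area_mm2:.0f} mm^2")   # 10240 mm^2 of silicon
```

That's over ten thousand square millimeters, which is exactly the "many thousands of square millimeters" objection above: it can't plausibly be eDRAM.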
Re: (Score:3)
An Nvidia Quadro with 8GB costs $8,000. I would consider $8,000 "many thousands of dollars". Nobody is suggesting Knights ____ is competing with any consumer chips, CPU or GPU. I have a $1,500 raytracing card in my system along with a $1,000 GPU as well as a $1,000 CPU. If this could replace the CPU and GPU but compete with a dual-CPU system for rendering performance, I would be a happy camper even if it cost $3-4k.
Programmability? (Score:5, Informative)
I wonder how nice these will be to program. The "just recompile and run" promise for Knights Corner was little more than a cruel joke: to get any serious performance out of the current generation of MICs you have to wrestle with vector intrinsics and that stupid in-order architecture. At least the latter will apparently be dropped in Knights Landing.
For what it's worth: I'll be looking forward to NVIDIA's Maxwell. At least CUDA got the vectorization problem sorted out. And no: not even the Intel compiler handles vectorization well.
Re: (Score:2)
Actually the in-order execution isn't so much of a problem in my experience. The vectorization is a real problem. But you essentially have the same problem on GPUs, except it is hidden in the programming model, and the performance problems are there as well.
Anybody who understands GPU architecture enough to write efficient code there won't have much problem using the MIC architecture. The programming model is different but the key difficulties are essentially the same. If you think about a MIC SIMD element as a CUDA th
Re: (Score:3)
It's not entirely syntactical. Local shared memory is exposed to the CUDA programmer (e.g., __syncthreads()). CUDA programmers also have to be mindful of register pressure and the L1 cache. These issues directly affect the algorithms used by CUDA programmers. CUDA programmers have control over very fast local memory---I believe that this level of control is missing from MIC's available programming models. Being closer to the metal usually means a harder time programming, but higher performance potenti
Re: (Score:3)
I don't understand. MIC is your regular cache-based architecture. Accessing L1 cache on MIC is very fast (3-cycle latency, if my memory is correct). You have similar register constraints on MIC, with 32 512-bit vector registers per thread (per core, maybe). Both architectures overlap memory latency by using hardware threading.
I programmed both MIC and GPU, mainly on sparse algebra and graph kernels. And quite frankly, there are differences, but I find them much more alike than most people acknowledge. The main difference in my op
Re: (Score:2)
I wonder how nice these will be to program. The "just recompile and run" promise for Knights Corner was little more than a cruel joke
I tried recompiling and running some OpenCL code (that previously was running on GPUs). It was "just recompile and run" and the promises about performance were kept. But still, OpenCL is not what most people consider "nice to program".
Re: (Score:3)
Intel's AVX-512 is really friggin cool, and a huge departure from their SIMD of the past. It adds some important features -- most notably mask registers to optimally support complex branching -- which make it nearly identical to GPU coding so that compilers will have a dramatically easier time targeting it. I doubt it will kill discrete GPUs any time soon, but it's a big step in that long-term direction.
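The mask-register idea can be sketched in a few lines of plain Python (lists standing in for 4-wide vector lanes; none of this is actual AVX-512 syntax):

```python
# Branchy scalar code:  y = x*2 if x > 0 else x - 1
# Predicated (mask-register) style: compute BOTH branch results for
# every lane, then select per lane under a mask -- the execution model
# AVX-512 masking and GPU divergence handling have in common.
x = [-3.0, 1.0, 4.0, -1.0]
mask = [v > 0 for v in x]                # plays the role of a mask register
both = [(v * 2, v - 1) for v in x]       # both branch results, all lanes
y = [t if m else f for m, (t, f) in zip(mask, both)]
print(y)  # [-4.0, 2.0, 8.0, -2.0]
```

Because the select happens per lane, the compiler never needs a real branch inside the vectorized loop body, which is what makes complex control flow tractable to auto-vectorize.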
Re: (Score:3)
The recently revealed Mill architecture [ootbcomp.com] is far more interesting, and also offers a much more attractive programming model. It is a highly orthogonal architecture naturally capable of wide MIMD and SIMD. Vectorization and software pipelining of loops is discussed in the "metadata" talk, and is very clever and elegant. Those who have personally experienced the tedium of typical vector extensions will appreciate it all the more.
Based on sim, the creators expect an order of magnitude improvement of performan
Not going to work (Score:2)
Re: (Score:2)
8 128-bit memory controllers? 1024 pins just for the memory bus? You've got to be kidding.
Calm down (Score:2)
QPI? (Score:2)
Too bad most Intel CPUs don't have it and just about all 2011 boards don't use it. The ones that do use it for dual CPU.
Too bad the Apple Mac Pro does not have this and is not likely to use it any time soon.
Unobtainium (Score:3, Insightful)
This is another one of those IBM things made from the most rare element in the universe: unobtainium. You can't get it here. You can't get it there either. At one point I would have argued otherwise, but no. Cuda cores I can get. This crap I can't get. It's just like the Cell Broadband Engine. Remember that? If you bought a PS3, then it had a (slightly crippled) one of those in it. Except that it had no branch prediction. And one of the main cores was disabled. And you couldn't do anything with the integrated graphics. And if you wanted to actually use the co-processor functions, you had to re-write your applications. And you needed to let IBM drill into your teeth and then do a rectal probe before you could get any of the software to make it work. And it only had 256MB of RAM. And you couldn't upgrade or expand that. With IBM's new wonder, we get the promise of 72 cores. If you have a dual-Xeon processor. And give IBM a million dollars. And you sign a bunch of papers letting them hook up the high voltage rectal probes. Or you could buy a Kepler NVIDIA card which you can install into the system you already own, and it costs about the same as a half-decent monitor. And NVIDIA's software is publicly downloadable. So is this useful to me or 99.999% of the people on /.? No. It's news for nerds, but only four guys can afford it: Bill G., Mark Z., Larry P. and Sergey B..
Re: (Score:2)
This is another one of those IBM things made from the most rare element in the universe: unobtainium
Presumably meaning "this is like those IBM things", given that, while the first word of the title begins with "I", it doesn't have "B" or "M" following it, it has "n", "t", "e", and "l", instead.
Re: (Score:2)
This is x86. Theoretically your program already runs on this. You don't have to rewrite your entire application to run on CUDA.
Re: (Score:3)
"It's news for nerds, but only four guys can afford it: Bill G., Mark Z., Larry P. and Sergey B."
I would rather have that market than all of the rest.
How does the intercommunication work? (Score:5, Informative)
OK, we have yet another mesh of processors, an idea that comes back again and again. The details of how processors communicate really matter. Is this a totally non-shared-memory machine? Is there some shared memory, but it's slow? If there's shared memory, what are the cache consistency rules?
Historically, meshes of processors without shared memory have been painful to program. There's a long line of machines, from the nCube to the Cell, where the hardware worked but the thing was too much of a pain to program. Most designs have suffered from having too little local memory per CPU. If there's enough memory per CPU to, well, run at least a minimal OS and some jobs, then the mesh can be treated as a cluster of intercommunicating peers. That's something for which useful software exists. If all the CPUs have to be treated as slaves of a control machine, then you need all-new software architectures to handle them. This usually results in one-off software that never becomes mature.
Basic truth: we only have three successful multiprocessor architectures that are general purpose - shared-memory multiprocessors, clusters, and GPUs. Everything other than that has been almost useless except for very specialized problems fitted to the hardware. Yet this problem needs to be cracked - single CPUs are not getting much faster.
Re: (Score:2)
Which is why we don't see those GPU cards in absolutely every place where there is a massively parallel problem to solve. Even 8GB is not enough for some stuff and you spend so much time trying to keep the things fed that the problem could already be solved on the parent machine.
Re:How does the intercommunication work? (Score:5, Informative)
Intel's version of a IBM/Sony Cell CPU (Score:3)
So there will be a useful mainstream CPU closely coupled with a bunch of vector oriented processors that will be hard to use effectively. (Also from TFA).
So unless there is a very high compute-to-memory-access ratio, this monster will spend most of its time waiting for memory and converting electrical energy to heat. Plus writing software that uses 72 cores is such a walk in the park...
Re: (Score:2)
Some stuff actually is. It depends on how trivially parallel the problem is. With some stuff there is no interaction at all between the threads - feed it the right subset of the input - process the data - dump it out.
Re: (Score:2)
Some stuff actually is. It depends on how trivially parallel the problem is. With some stuff there is no interaction at all between the threads - feed it the right subset of the input - process the data - dump it out.
More importantly, for some applications a limited amount of very low-latency/high-bandwidth communication is enough to give spectacular performance improvements. In those cases, the fully coherent x86 model, kept up by this kind of cache and memory architecture, will do wonders, compared to an MPI implementation with weaker individual nodes, but also possibly against (current) nVidia offerings. It's harder to say how it will stack up against Maxwell.
Wow. (Score:2)
Re: (Score:2, Insightful)
Because you can never have too many cores that you aren't using most of the time.
Ask the NSA, they might have a (SECRET) opinion on that.
Re: (Score:3, Insightful)
Yes, it's too hard. The future is in concurrency. The actor model will probably take off since it's easy to pick up and use.
Re: (Score:3, Insightful)
Because you can never have too many cores that you aren't using most of the time.
How about more speed? Or is that too hard?
Pretty sure it wasn't meant for you (or me).
Re:Yay more cores that I won't be using much of! (Score:5, Insightful)
Because you can never have too many cores that you aren't using most of the time.
How about more speed? Or is that too hard?
Pretty sure it wasn't meant for you (or me).
However, for servers, including hypervisors, it would be very interesting. There are lots of client/server products that scale better with more cores.
Re:Yay more cores that I won't be using much of! (Score:4, Informative)
Where are you getting Atom cores from?
From this Extremetech article [extremetech.com], which has a slide speaking of the Knights Landing processor architecture having "up to 72 Intel Architecture cores based on Silvermont (Intel(R) Atom processor)"?
Re:Yay more cores that I won't be using much of! (Score:4, Informative)
Re: (Score:2)
Couple things:
1) The 22nm Silvermont Atom cores are a complete redesign over the badly aging atom cores. They are much more powerful, and much more powerful per watt.
2) These chips aren't going to replace CPUs, they are most likely going to compete with Nvidia Tesla - a PCIe card that highly parallel workloads can be offloaded to. One CUDA core isn't very powerful but stick 2688 active ones on a chip and for certain tasks you have a lot of power. The K20X Tesla is capable of 1.3 trillion double-precision operations per second.
Re: (Score:3)
Pretty sure it wasn't meant for you (or me).
Obviously -- 64 cores should be enough for any one person.
Re: (Score:2)
That's an HPC processor. You are unlikely to see it in a classical desktop/laptop for a while. Think about it as a classical coprocessor.
Re: (Score:2)
It depends on the use case. There are many applications where this would shine. Sure if you want to play Quake 3 Arena it's not going to give you much at all, but if you're doing parallel processing for scientific or engineering applications this would rock.
Re:Yay more cores that I won't be using much of! (Score:5, Funny)
Because you can never have too many cores that you aren't using most of the time.
Install McAfee Antivirus, and problem solved: no more unused cores.
Re: (Score:2)
This isn't intended for you if you can't think of what to do with all those cores.
This is for the high performance physics folks to whom the difference between 16 cores, 256 cores, and maybe even 8192 cores is a line in a config file.
It's also for the folks developing 24 megapixel RAW files (which Nikon's cheapest SLR spits out these days), where splitting the image into 64 sectors is no more difficult than splitting it into four, or for the folks doing video encoding, which is pretty trivially parallelizable.
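The sector-splitting idea, sketched with toy "pixels" standing in for real RAW development:

```python
from multiprocessing import Pool

def brighten(rows):
    # Toy per-sector operation standing in for RAW development work.
    return [[min(255, px + 10) for px in row] for row in rows]

def split(rows, n):
    # Cut the image into n contiguous bands of rows.
    k = max(1, len(rows) // n)
    return [rows[i:i + k] for i in range(0, len(rows), k)]

if __name__ == "__main__":
    image = [[i] * 4 for i in range(64)]   # 64 rows of fake pixels
    with Pool(4) as pool:                  # 4 sectors; 64 works the same way
        parts = pool.map(brighten, split(image, 4))
    result = [row for part in parts for row in part]
    print(result[0], result[-1])  # [10, 10, 10, 10] [73, 73, 73, 73]
```

The point of the parent stands: going from 4 workers to 64 is changing one number, because each sector is independent.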
Requires parallelism (Score:5, Informative)
Re: (Score:2)
I think you'd be surprised how many real-world day-to-day tasks can be and are parallelized: almost everything concerning audio and video (images or movies), searching, analyzing, rendering web pages, compiling, computing physics and AI for games.
I can't think of one computing-intensive day-to-day action that is not parallelized or wouldn't be easy to parallelize.
I fail to see parallelism in CSS flow (Score:4, Insightful)
I think you'd be surprised how many real-world day-to-day tasks can be and are parallelized: [...] searching
I thought searching a large collection of documents was disk-bound, and traversing an index was an inherently serial process. Or what parallel data structure for searching did I miss?
rendering web pages
I don't see how rendering a web page can be fully parallelized. Decoding images, yes. Compositing, yes. Parsing and reflow, no. The size of one box affects every box below it, especially when float: is involved. And JavaScript is still single-threaded unless a script is 1. being displayed from a web server (Chrome doesn't support Web Workers in file:// for security reasons), 2. being displayed on a browser other than IE on XP, IE on Vista, and Android Browser <= 4.3 (which don't support Web Workers at all), and 3. not accessing the DOM.
compiling
True, each translation unit can be combined in parallel if you choose not to enable whole-program optimization. But I don't see how whole-program optimization can be done in parallel.
Re: (Score:2)
High performance RDBMS indexes do indeed parallelize scans and index searches.
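A toy sketch of such a sharded scan (threads here only keep the example short; a CPU-bound scan in Python would want processes, and a real engine does this natively):

```python
from concurrent.futures import ThreadPoolExecutor

DOCS = ["the quick fox", "lazy dog", "quick brown dog", "fox trot"]

def scan(shard, term):
    # Each worker scans only its own shard; results merge trivially.
    return [d for d in shard if term in d]

def parallel_search(docs, term, workers=2):
    k = max(1, len(docs) // workers)
    shards = [docs[i:i + k] for i in range(0, len(docs), k)]
    with ThreadPoolExecutor(workers) as ex:
        hits = ex.map(scan, shards, [term] * len(shards))
    return [d for h in hits for d in h]

print(parallel_search(DOCS, "fox"))  # ['the quick fox', 'fox trot']
```

The structure answers the "inherently serial" worry above: the index or collection is partitioned, each partition is traversed independently, and only the small result sets are merged serially.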
Re: (Score:2)
WTF is going on here? I typed "engines", not "indexes".
Is slashdot now EDITING posts before publishing them, or is Firefox screwing with me?
Re: (Score:2, Funny)
Are you in Colorado?
Re: (Score:2)
Confronted with the fact that I proof-read my post, hit submit, and the comment posted was different than what I'd just proof-read, yes, I do presume something is fucking with the system.
compilation often not just one single program (Score:2)
In my experience, most cases where compilation takes a long time involve multiple compilation units. I have a fair bit of experience with compiling linux distros professionally...when you're building glibc and the kernel and five hundred other packages it'll use as many cores as you can throw at it.
-fwhole-program --combine (Score:2)
True, each translation unit can be combined in parallel if you choose not to enable whole-program optimization. But I don't see how whole-program optimization can be done in parallel.
In my experience, most cases where compilation takes a long time involve multiple compilation units.
That's what I said. But a lot of times nowadays, the compiler is set to perform whole-program optimization [wikipedia.org] on release builds to try to save cycles even in calls from a function in one translation unit of a program to a function in another. Mozilla's Firefox web browser, for example, is so big that it can't be compiled with profile-guided whole-program optimization on 32-bit machines [slashdot.org]. But I'll grant that a multi-core CPU speeds up debug builds.
when you're building glibc and the kernel and five hundred other packages
Not many people are maintainers of an operating system distribution.
Re: (Score:2)
I don't see how rendering a web page can be fully parallelized.
Parsing and reflow can be efficiently parallelized if sufficient parents have their heights determined by something other than their contents: for example, if the main parts of the document have heights explicitly defined. Then they can be processed in parallel efficiently. Even without that, couldn't the children each be processed in parallel for a good portion of them, but possibly needing updating for properties that have dependencies outside of themselves? Yes, floats can cause some issues,
Re: (Score:2)
If your data is not indexed you are likely to be faster with multiple threads (if there is no other bottleneck like, for example, disk throughput).
Or RAM throughput.
Parsing: Why not?
Sure, the browser can parse multiple CSS files or multiple HTML files or multiple JavaScript files at once, just as the browser can decode multiple images at once. But the parser for a single file is a state machine. In order to "drop the needle" halfway into the byte stream and start parsing the second half on the second core, the parser would first have to know what state the state machine was in as of halfway into the stream. What parallelization were you thinking of?
And if I have multiple boxes at the same layer?
Once the browser fi
Re: (Score:2)
Are there really that many interactive processes on a single-user computer that are
1) CPU-bound
2) not parallelizable
3) take long enough that waiting on them gets annoying?
I ask out of genuine curiosity; I can't think of many times when I wind up waiting on my computer to do anything that fits.
Re: (Score:2)
edit: Compiling is one, definitely. Forgot about that.
Reflow in web browsers and word processors (Score:2)
As I wrote elsewhere [slashdot.org]: laying out a web page that includes float-styled elements. That fits 1) and 2), and it fits 3) on a netbook or tablet with an ARM or Atom processor. Or repaginating a document in a word processor, which happens every time the user enters enough text to make the current paragraph one line longer, deletes enough to make it one line shorter, or changes the styling of any span of text. Repagination may affect figures, references to page numbers elsewhere in the document, etc. Repaginating
Re: (Score:2)
There are parallel strategies to do some of these things. As far as I know, text layout is mostly done with dynamic programming algorithms. These algorithms are usually very parallel.
Even if they are not, you can always use some kind of speculative algorithm to deal with that. You assume the 3 most likely scenarios for line 1, and while line 1 is being processed, you lay out line 2 multiple times using different assumptions about line 1. This will not give you perfect parallelism but it will give you some improvement
Re: (Score:2)
I don't see how the DEFLATE codec used by, say, PNG can be parallelized.
There are multiple ways to implement the deflate codec; some compress better than others on different source materials. The best implementations would try multiple variants in parallel and discard all but the best result. For a current example: run PNGOUT, OptiPNG, and DeflOpt in parallel for each PNG and discard the other two results. Better approaches trying more variants are possible and likely to produce even smaller files (albeit with diminishing gains).
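A sketch of the trial-in-parallel idea, using zlib's compression levels as stand-ins for the different tools (threads keep the sketch short; the real tools would run as separate processes):

```python
import zlib
from concurrent.futures import ThreadPoolExecutor

# Try several deflate parameter sets concurrently, keep the smallest
# output -- one way to spend cores on an otherwise serial codec.
data = b"abcabcabc" * 1000

def trial(level):
    return zlib.compress(data, level)

with ThreadPoolExecutor() as ex:
    best = min(ex.map(trial, [1, 6, 9]), key=len)

assert zlib.decompress(best) == data   # lossless, just smaller
print(len(data), "->", len(best))
```

Since every variant produces a valid DEFLATE stream of the same data, the only serial step is picking the winner.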
Apparently ... (Score:2)
you aren't doing much on your computer. Try doing special effects graphics, or stock market analysis. Or even just start up an Android emulator - it's excruciatingly slow.
Re: (Score:2)
I'd love some of these if they come out with better price/performance than an AMD system, or even if they just beat it a lot on performance without being ten times the cost (sad state of the very top end of Xeons now).
Re: (Score:2)
That many cores implies a big fat pipe to memory as well. Sure they have local cache, but memory is going to be the bottleneck here even with parallelized computation.
Re: (Score:2)
This isn't for general-purpose use. See those floating-point specs? Those tell you exactly where this is going, because there is one class of user that just can't get enough floating point performance. Scientific HPC. Protein folding, molecular biology modeling, cosmological simulations, higher resolution seismic analysis, neural network simulation, quantum system modeling. All things that thrive on processing power. A chip like this could have a lot of scientific applications.
Re: (Score:3)
Keep in mind, Amdahl's law can be expanded to all processes that make up a system. Even if you are using a single-process program, it can benefit from not having to share its core with the various system processes.
If the program uses async I/O, that counts as parallelism.
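For reference, Amdahl's law in a few lines, showing why the serial fraction dominates long before you reach 72 cores:

```python
# Amdahl's law: overall speedup on n cores when a fraction p of the
# runtime is parallelizable:  S(n) = 1 / ((1 - p) + p / n)
def amdahl(p, n):
    return 1 / ((1 - p) + p / n)

# Even a 95%-parallel program is capped at 20x regardless of core count.
print(round(amdahl(0.95, 72), 1))     # 15.8 -- well short of 72
print(round(amdahl(0.95, 10**9), 1))  # 20.0 -- the serial fraction's ceiling
```

This is why offloading the serial parts (system processes, async I/O completion) to other cores matters even for a "single-threaded" workload.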
Then a dual core should be plenty (Score:2)
Even if you are using a single-process program, it can benefit from not having to share its core with the various system processes.
Then there's not really much of a benefit to adding more than a dual core, which will probably end up running the application with which the user is interacting on one core and the background applications and system processes on the other. To go beyond that, you have to either parallelize the application, run more than one CPU-bound application at once (which most desktop PC users tend not to do), or run more than one user at once using dual monitors, dual keyboards, and dual mice (which most desktop PC ope
Re: (Score:2)
Then there's not really much of a benefit to adding more than a dual core, which will probably end up running the application with which the user is interacting on one core and the background applications and system processes on the other.
Wow, I just realized you are right, and got depressed.
Re: (Score:3)
Not necessarily. A process could be CPU bound and prefer not to make it worse by also waiting for I/O completion. Let another core drive the filesystem and talk to the block device (which might be a soft RAID).
My system frequently enough is busy compressing video or doing large compiles in the background while I work in the foreground.
If all you're doing is word processing, single thread speed isn't all that important either since it's mostly waiting for you to press a key.
Re: (Score:3)
To go beyond that, you have to either parallelize the application, run more than one CPU-bound application at once (which most desktop PC users tend not to do)
Let another core drive the filesystem and talk to the block device (which might be a soft RAID).
My system frequently enough is busy compressing video or doing large compiles in the background while I work in the foreground.
Then you're not most users. I was under the impression that most users tend not to use soft RAID 5/6 or CPU-intensive file systems, compress large videos, or do large compiles. I too compress video and do compiles, but geeks such as you and myself are edge cases.
Re: (Score:2)
While most people probably don't do large compiles, the video compression is just for shows I record. In my case, it just happens to happen on a PC, others might use an appliance for that. My filesystem isn't particularly CPU intensive but no filesystem uses zero cycles.
The people not doing any of that probably wouldn't fully utilize the full speed of a single core either, so it's not much of an issue.
Re: (Score:2)
the video compression is just for shows I record.
For shows you record from OTA, cable, or satellite, it doesn't have to be significantly faster than real time. How many tuners does your PC have? You could put one video encode on each core, plus another core for the audio encodes. But then I confess ignorance as to how much CPU power it takes to encode video at, say, full 1080p/24.
My filesystem isn't particularly CPU intensive but no filesystem uses zero cycles.
True, which is why the file system would probably run on the second core of a dual core along with the rest of the "system processes".
Re: (Score:2)
Games are parallelizable and since game programming has a long history of going to extremes for performance, parallel code isn't much of an ask.
Re: (Score:2)
I haven't studied it very carefully, but I do know that 5 cores were significantly faster than real time (re-encoding MPEG2 to MP4) but 2 cores fall behind even if I don't do anything else. There's a lot of trade-off there; if I accept less compression or lower-quality video, it needs less CPU to accomplish it.
Re: (Score:2)
But then I confess ignorance as to how much CPU power it takes to encode video at, say, full 1080p/24
A lot. For 1080p H.264 on a Core 2 Duo running at 3.17 GHz, you are looking at 3-5 FPS at medium quality using both cores. The i5s didn't get significantly (read: usably) faster, and I doubt the i7s did either.
With H.264 the more cores the better, you get roughly 60-80% speedup per core added. This translates to higher quality encodes at realtime if you start throwing more cores at the encoder.
Does everyone need this? Hell no, but to those of us that could use more cores it would be awesome.
Re: (Score:2)
I'd say probably 80-90% of normal users are doing video transcoding these days.
The bubble you are living in seems very opaque.
Re: (Score:2)
Sad but true.
Re: (Score:2)
For a lot of people, bucketloads of memory is a better deal than large numbers of cores. For others, there is no problem pegging all cores at 100% for days on end.
There's been this sort of discussion here ever since th
Re: (Score:2)
Memory access is also a shared resource here, so it can be treated as I/O in a way since it requires going through a shared bus. Some local calculation can be done with local instruction/data cache but there is going to be a lot of banging on that bus. Some modern popular languages are really terrible at making effective use of caches (heavily templated stuff for example). That many cores using a typical asynchronous threading model (ie, the stuff people run on PCs) will be a waste of the chip, better to
Re: (Score:2)
Many multi-core CPUs also have multiple memory channels. If the program has good data locality, that's a big win.
Embarrassingly parallel (Score:5, Informative)
Re:Bitcoin/Litecoin Performance (Score:5, Interesting)
Bitcoin has ASIC miners with ~10X the mining power per watt of most programmable alternatives such as GPGPU and FPGA. Anything less efficient than that is, or soon will become, cost-prohibitive to run.
The newer Bitcoin alternatives use memory-bound algorithms to prevent such a steep mining-power escalation, since memory capacity and bandwidth scale much more slowly than processing power but much more quickly in cost: with Bitcoin, increasing throughput by 10X simply required 10X the processing power, but with the memory-bound alternatives you also need 10X the RAM and 10X the memory bandwidth.
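For illustration, Python's standard library exposes a memory-hard KDF of this family (the parameters below are common interactive-use values, not Litecoin's actual scrypt settings):

```python
import hashlib

# scrypt-style memory-hard hashing: the n and r parameters force
# roughly 128 * r * n bytes of RAM per hash, so scaling throughput
# means scaling memory too, not just ALUs.
n, r, p = 2**14, 8, 1
print("approx RAM per hash:", 128 * r * n // 1024, "KiB")  # 16384 KiB

digest = hashlib.scrypt(b"block header", salt=b"nonce",
                        n=n, r=r, p=p, dklen=32)
print(digest.hex()[:16])
```

An ASIC can pack arbitrarily many hash cores, but it can't cheat on the 16 MiB per in-flight hash, which is the whole point of the design.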
Re: (Score:2)
Any day now I'm expecting media reports about how there is this nefarious web site called slashdot that is the hub of the bitcoin scam. Let's not have that please.
Re: (Score:2)
Btc and ltc are best run on ASiCs or perhaps AMD GPUs
btc hardware [bitcoin.it]
ltc hardware [litecoin.info]
Perhaps this chip will change things, but for now, cpu mining is pretty inefficient
QPI vs PCIe? (Score:2)
I just read up on QPI on wiki, and it's a point-to-point processor interconnect, which replaces the front-side bus in Xeon and certain desktop platforms - presumably the Core i7s. PCIe, OTOH, is a serial computer expansion bus standard, which can take in things like graphics cards, SSDs, network cards and other such peripheral controllers. I just don't see how QPI is any sort of a replacement for PCIe. That would almost be like arguing for PCIe being superseded by USB4 or something.
Essentially, QPI is Int
Re: (Score:2)
QPI is not meant as a replacement for PCIe. That's just the technology that links multiple processors together (and memory controllers). KNL is essentially the next-generation MIC processor. The current generation is KNC, which is a separate PCIe card. I think it is in that sense that QPI replaces PCIe.