Intel Defends AVX-512 Against Critics Who Hope It 'Dies a Painful Death' (pcworld.com) 132

Posted by EditorDavid on Saturday August 22, 2020 @02:34PM from the live-long-and-prosper dept.

"I hope AVX512 dies a painful death," Linus Torvalds said last month, "and that Intel starts fixing real problems instead of trying to create magic instructions to then create benchmarks that they can look good on."

Friday PC World published some reactions from Intel: Torvalds wasn't the only person to kick AVX-512 in the shins either. Former Intel engineer Francois Piednoel also said the special instruction simply didn't belong in laptops, as the power and die space area trade-offs just aren't worth it.
But Intel Chief Architect Raja Koduri says their community loves it because they're seeing a huge performance boost: "AVX-512 is a great feature. Our HPC community, AI community, love it," Koduri said, responding to a question from PCWorld about the AVX-512 kerfuffle during Intel's Architecture Day on August 11. "Our customers on the data center side really, really, really love it." Koduri said Intel has been able to help customers achieve a 285X increase in performance in "our good old CPU socket" just by taking advantage of the extension...

Koduri acknowledged some validity to Torvalds' heat, too. "Linus' criticism from one angle that 'hey, are there client applications that leverage this vector bit yet?' may be valid," he said. Koduri explained further that Intel has to maintain a hardware-software contract all the way from servers to laptops, because that's been the magic of the ecosystem. "(That's) the great thing about the x86 ecosystem, you could write a piece of software for your notebook and it could also run on the cloud," Koduri said. "That's been the power of the x86 ecosystem..."

And no, hate on AVX-512 and special instructions all you want, Intel isn't going to change direction. Koduri said it will continue to lean on AVX-512 as well as other instructions. "We understand Linus' concerns, we understand some of the issues with first generation AVX-512 that had impact on the frequencies etc, etc," he said, "and we are making it much much better with every generation."
They also summarize some performance testing by blogger Travis Downs, saying it found AVX-512 "doesn't appear to enforce much of a penalty at all on a laptop." Downs' testing found the clock speed only dropped 100MHz when using one active core under AVX-512.

"At least, it means we need to adjust our mental model of the frequency related cost of AVX-512 instructions," Downs concluded. "Rather than 'generally causing significant downclocking,' on this Ice Lake chip we can say that AVX-512 causes insignificant or zero licence-based downclocking and I expect this to be true on other Ice Lake client chips as well."

This discussion has been archived. No new comments can be posted.

AMD vs. Intel (Score:4, Insightful)

by ebrandsberg ( 75344 ) writes: on Saturday August 22, 2020 @02:42PM (#60429959)

With AMD, they didn't implement the AVX-512 instructions, and yet, the newer chips provide basically the same performance for a cheaper price simply by throwing more cores at it. Likewise, significant AVX-512 workloads can likely be moved to GPU processing for even better price/performance. As such, it isn't providing much benefit for the vast majority of cases--you can simplify the cpu and throw more cpus, pair it with a decent gpu, and you have a solution that can provide benefits for more users, while still giving the same overall performance even for the specialized cases. It is a solution looking for a problem.

- Re:AMD vs. Intel (Score:5, Informative)
  
  by thegarbz ( 1787294 ) writes: on Saturday August 22, 2020 @02:51PM (#60429983)
  
  the newer chips provide basically the same performance for a cheaper price simply by throwing more cores at it
  Not really. You're comparing a specific instruction for an edge case to general performance. AMD absolutely are pantsing Intel. Intel still have an edge in IPC and single threaded workloads but even that edge may disappear when Zen 3 hits the market.
  If you don't make use of AVX-512 and you have a parallel problem then AMD will win with its core count. If you occasionally execute an AVX-512 instruction the gap becomes even larger because there are actual performance impacts in executing that instruction.
  However if you have a workload that makes heavy use of AVX-512, even if the workload is parallel, Intel ends up wiping the floor with AMD. The problem is that those workloads are usually executed on GPUs anyway, so really the AVX-512 thing makes little sense.
  
  - Re: AMD vs. Intel (Score:2)
    
    by BAReFO0t ( 6240524 ) writes:
    
    AFAIK the single-threaded edge is already gone.
    - Re: (Score:2)
      
      by thegarbz ( 1787294 ) writes:
      
      Not quite yet, but honestly it's so close as to be irrelevant in practical terms clock for clock. The only reason Intel really still tops all the single core charts is due to having a far higher single core turbo boost than AMD.
      - Re: (Score:2)
        
        by DamnOregonian ( 963763 ) writes:
        
        The only reason Intel really still tops all the single core charts is due to having a far higher single core turbo boost than AMD.
        That's part of it, but the 6% higher core speed doesn't account for the nearly 15% performance delta.
        Intel cores also have vastly superior IPC, which is offset by better data path subsystems tacked onto Zen2.
        Really- the performance of the 2 cores is wildly different than benchmarks allude to, and both really do perform different tasks at wildly different levels of proficiency.
    - - Re: (Score:2)
        
        by DamnOregonian ( 963763 ) writes:
        
        The ipc edge is gone
        Negative, it is not.
        Amd chips are getting more done per clock
        Now this is in some cases true, but it's not related to how many instructions it can pump through. It has more to do with how much data they can access and how quickly.
  - Re: (Score:2)
    
    by account_deleted ( 4530225 ) writes:
    
    Comment removed based on user account deletion
    - Re: (Score:2)
      
      by thegarbz ( 1787294 ) writes:
      
      From what I understand the AVX-512 instruction is an attempt to shoehorn GPU style processing in a CPU. There's no point doing it on the GPU since that's what the GPU already excels at and doesn't need a special instruction.
    - Re: (Score:2)
      
      by DrMrLordX ( 559371 ) writes:
      
      No, and yes. GPGPU doesn't need/want anything to do with AVX512.
      Maybe in some dystopian alternate reality where Larrabee took off as a GPU, yes, that could have happened.
    - Other way around (Score:3)
      
      by DrYak ( 748999 ) writes:
      
      Speaking of...could Intel's implementation of AVX-512 set the foundation for later integration into the iGPU for better graphic performance?
      The other way around.
      AVX512 was born out of their older failed attempt at making a dGPU (project Larrabee):
      as GPGPU computation started to be popular back then, but was still a bit cumbersome (most was done by abusing OpenGL, plus a little of the early low-level APIs available, such as the BrookGPU implementation running atop AMD's CTM), the idea Intel had was to pair a very large number of very simple cores, each with extremely large SIMD units: you still got the ultra-wide SIMD popular on GPUs, but as
  - - Re: (Score:2)
      
      by DamnOregonian ( 963763 ) writes:
      
      No, 10th generation Core intel has between 1.5x-2x higher IPC, where IPC means Instructions Per Cycle.
      That is of course caveated by the fact that there is more to the performance of a CPU than merely IPC.
      AMDs are generally paired with faster RAM as an example, coming very close to evening out its severe IPC deficit.
- Re: (Score:2)
  
  by Z00L00K ( 682162 ) writes:
  
  I'm not sure that a GPU is the right place to handle that level of precision though.
  There are likely some use cases where it's useful, but I don't think that mainstream processors shall have the AVX-512.
  - Re:AMD vs. Intel (Score:5, Interesting)
    
    by Rockoon ( 1252108 ) writes: on Saturday August 22, 2020 @04:28PM (#60430149)
    
    Mainstream processors can effectively use AVX-512 .. in about 5 years
    
    The entire thing was born out of the Larrabee project, when that project was about rendering. What Intel found was that no matter what they did they could not feed that much data to the CPU without changing the cache architecture, and that such changes to the cache architecture would negatively affect regular performance with crushing memory latency.
    
    So we end up in a situation where Intel knew that they would not be able to process entire AVX-512 registers in one go on all threads, so did not include the execution units necessary to do it even on a single core, let alone have the bandwidth to do it on all of them.
    
    So as Linus rightly notes, the shit is more or less useless right now, and costs a lot of execution time because AVX-512 registers are enormous and like all registers need saving between context switches, saving that is slow because of that lack of bandwidth. A single AVX-512 register is as large as eight 64-bit general purpose registers combined.
    
    There is no solution other than time.
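For a sense of scale, the register sizes behind the context-switch argument can be tallied directly. This is a back-of-the-envelope sketch that counts only the 16 general purpose registers and the 32 ZMM registers available in 64-bit mode; it ignores opmask registers and all other saved state, and the helper name `state_bytes` is made up for illustration.

```python
# Register file sizes in 64-bit mode (architectural registers only).
GPR_COUNT, GPR_BITS = 16, 64    # rax through r15
ZMM_COUNT, ZMM_BITS = 32, 512   # zmm0 through zmm31

def state_bytes(count, bits):
    """Bytes needed to save `count` registers of `bits` bits each."""
    return count * bits // 8

gpr_bytes = state_bytes(GPR_COUNT, GPR_BITS)
zmm_bytes = state_bytes(ZMM_COUNT, ZMM_BITS)
one_zmm_bytes = state_bytes(1, ZMM_BITS)

print("all GPRs:", gpr_bytes, "bytes")
print("one ZMM register:", one_zmm_bytes, "bytes")
print("all ZMM registers:", zmm_bytes, "bytes")
print("ZMM state / GPR state:", zmm_bytes // gpr_bytes)
```

How much of that vector state actually has to be written out on a given context switch depends on the operating system and on whether the thread touched those registers at all, so treat the totals as an upper bound on this slice of state.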
    
    - Re: (Score:2)
      
      by radarskiy ( 2874255 ) writes:
      
      "like all registers need saving between context switches, saving that is slow because of that lack of bandwidth."
      Architectural registers are not physical registers. While the register allocation table is usually thought of as enabling out-of-order execution, it also aids context switching.
  - Re: (Score:2)
    
    by DrMrLordX ( 559371 ) writes:
    
    Some "mainstream processors" already use AVX512. It's implemented in Cannonlake, IceLake, and TigerLake (with TigerLake having the broadest selection of AVX512 instructions; AVX512 is a mess of different sub-standards). The trick behind the "consumer" implementations of AVX512 is that total width of vector units didn't improve at all from Haswell onward. You aren't going to see a whole lot of performance benefit using AVX512 over AVX2 on any modern Intel design, since the only way Cannonlake/IceLake/TigerL
- Re: AMD vs. Intel (Score:2)
  
  by Rewind ( 138843 ) writes:
  
  I think I agree with Linus here, but as far as "just let the GPU do it!", I think the whole point is for systems that don't have a powerful / discrete GPU to throw those to.
  - Re: AMD vs. Intel (Score:2)
    
    by BAReFO0t ( 6240524 ) writes:
    
    Unless they are AMD APUs, of course. :)
    See: Game consoles.
- Re: (Score:2)
  
  by DrMrLordX ( 559371 ) writes:
  
  Same performance? Intel doesn't have anything faster than Rome, and Rome was LAST year's server CPU. Milan would like a word with you.
Re: (Score:1)

by account_deleted ( 4530225 ) writes:

Comment removed based on user account deletion
- Re:Linus isn't really a floating-point kind of guy (Score:5, Interesting)
  
  by Bengie ( 1121981 ) writes: on Saturday August 22, 2020 @03:16PM (#60430019)
  
  Linus doesn't care about the kernel, he cares about user space, and through the kernel is how he cares for user space. There is software that runs in the user space that uses AVX-512, like CHACHA acceleration that makes it comparable with AES-NI in performance. Except that AVX-512 code only executes for a few microseconds, but the entire CPU slows down for milliseconds. Because there is a constant stream of packets, the CPU is permanently slowed down. This is bad enough that the overall server performance is actually lower than not using AVX-512. It's incredibly difficult to debug and required getting the kernel people involved to understand why their shiny new server cpu was running slower.
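The "microseconds of work, milliseconds of slowdown" effect described above can be put into a toy model. This is only a sketch under stated assumptions: each short AVX-512 burst is assumed to hold the clock at a reduced frequency for a fixed time, the burst itself is treated as negligibly short, and every number (3.0 GHz base, 2.5 GHz reduced, 1 ms hold) is invented for illustration rather than measured. The function names are hypothetical.

```python
def slow_fraction(burst_interval_ms, hold_ms):
    """Fraction of wall time spent downclocked when a short AVX-512 burst
    arrives every burst_interval_ms and each one holds the clock down
    for hold_ms afterwards."""
    return min(1.0, hold_ms / burst_interval_ms)

def average_ghz(base_ghz, reduced_ghz, burst_interval_ms, hold_ms):
    """Time-weighted average clock speed under the model above."""
    slow = slow_fraction(burst_interval_ms, hold_ms)
    return slow * reduced_ghz + (1.0 - slow) * base_ghz

# Made-up figures for illustration only, not measurements of any real chip.
BASE_GHZ, REDUCED_GHZ, HOLD_MS = 3.0, 2.5, 1.0

for interval_ms in (100.0, 10.0, 1.0, 0.1):
    avg = average_ghz(BASE_GHZ, REDUCED_GHZ, interval_ms, HOLD_MS)
    print(f"burst every {interval_ms} ms -> average {avg:.3f} GHz")
```

Once bursts arrive more often than the hold time, the model pins the core at the reduced frequency all the time, which is the "permanently slowed down" case: everything else running on that core pays the penalty, not just the vectorized code.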
  
  - Re: (Score:3)
    
    by Kaenneth ( 82978 ) writes:
    
    So what is " licence-based downclocking", does it downclock some CPUs not for heat or power management, but because it would be too fast for the tier the user paid for?
    - Re: (Score:2)
      
      by canajin56 ( 660655 ) writes:
      
      Intel puts instructions into "licence classes" as an analogy to drivers licences. The slow instructions (AVX-512) are in Licence 2 in Sky Lake chips which restricts the CPU from certain frequency scaling ranges (300-800 MHz penalty). Supposedly Ice Lake's licence penalties are less severe.
- Re: (Score:2, Informative)
  
  by phantomfive ( 622387 ) writes:
  
  Is this really something you would use for a tensor flow model? Who is using 512 bit FP operations in machine learning? AFAIK everyone is reducing precision in the machine learning space, to improve performance. Google's TPUs went as far as omitting FP and doing only integer math (useful when running the model, but you probably want more precision during training).
  - Re: (Score:2)
    
    by Z00L00K ( 682162 ) writes:
    
    From my perspective the only really practical use of 512bit FP would be for astronomical data processing, but I think that a completely different HW architecture overall would be needed anyway for that to avoid the constraint that a 64 bit data bus is in the main processor. Or maybe reserve it for processors with 8 memory channels and make those processors more or less 512 bit processors.
    For general purpose processing I'd say that Linus is right. In general purpose processing it's more useful with more core
    - Re: (Score:3)
      
      by DrMrLordX ( 559371 ) writes:
      
      AVX512 can provide benefits in any number of HPC applications - but so can competing ISA extensions like SVE2. SVE2 is a much-more-elegant system. Also, Intel's implementation of AVX512 on processors that can actually use it fully (Skylake-SP, Cascade Lake, Cooper Lake, upcoming IceLake-SP, etc.) have to downclock due to the extra power draw/heat generation. Intel hasn't developed a sophisticated method for determining how many cores are running those instructions or how often, so any time those instructio
  - Re: (Score:2, Funny)
    
    by Rockoon ( 1252108 ) writes:
    
    Who is using 512 bit FP operations in machine learning?
    Nobody... so why are you asking?
    AFAIK everyone is reducing precision in the machine learning space, to improve performance.
    I see. You are asking because you are completely ignorant. You think AVX-512 does 512-bit floating point. Then, with that so accurate knowledge, decided that you would pretend to be an expert on slashdot.
    
    Its called SIMD you fucking pretending dishonest fuck.
    - Re: (Score:2)
      
      by phantomfive ( 622387 ) writes:
      
      Who is using 512 bit FP operations in machine learning?
      Nobody... so why are you asking?
      Because the guy from Intel said the AI community likes it. Apparently you didn't even read the summary, you dumb person.
      - Re: (Score:2)
        
        by serviscope_minor ( 664417 ) writes:
        
        Because the guy from Intel said the AI community likes it. Apparently you didn't even read the summary, you dumb person.
        Nobody in the AI community is using 512 bit floating point, which is fine because AVX isn't doing 512 bit floating point. What people do like is being able to perform 512 bits worth of float32 operations in parallel.
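The distinction drawn above (one 512-bit register holding many small values, not one 512-bit number) can be shown with nothing but Python's `struct` module. This is a sketch of the data layout only; it does not execute any vector instructions, and the helper names are made up for illustration.

```python
import struct

REGISTER_BYTES = 512 // 8  # one 512-bit register is 64 bytes

# struct format codes: d = float64, f = float32, h = int16, b = int8
ELEMENT_FORMATS = {"float64": "d", "float32": "f", "int16": "h", "int8": "b"}

def elements_per_register(fmt_code):
    """Number of elements of this type that fill 64 bytes exactly."""
    return REGISTER_BYTES // struct.calcsize("<" + fmt_code)

def pack_register(values, fmt_code):
    """Pack values into the 64-byte image a 512-bit register would hold."""
    count = elements_per_register(fmt_code)
    if len(values) != count:
        raise ValueError(f"need exactly {count} values")
    return struct.pack(f"<{count}{fmt_code}", *values)

for name, code in ELEMENT_FORMATS.items():
    print(name, "->", elements_per_register(code), "elements per register")

image = pack_register([float(i) for i in range(16)], "f")
print(len(image) * 8, "bits hold", len(image) // 4, "float32 values")
```

The narrower the element, the more of them one instruction touches, which is why reduced precision and wide vectors go together rather than against each other.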
  - Re: (Score:2)
    
    by noodler ( 724788 ) writes:
    
    You really should read what this AVX-512 stuff is. Hint, it is not a 512 bit long float.
  - Re: (Score:2)
    
    by serviscope_minor ( 664417 ) writes:
    
    Is this really something you would use for a tensor flow model?
    Yes, definitely.
    Who is using 512 bit FP operations in machine learning?
    No one, but what's that got to do with AES?
    AFAIK everyone is reducing precision in the machine learning space, to improve performance.
    Yes, and...? AVX does integer operations too.
    - Re: (Score:2)
      
      by phantomfive ( 622387 ) writes:
      
      Yeah I don't know what I was thinking when I posted. I wish I could blame pot but I wasn't even high.
      - Re: (Score:2)
        
        by serviscope_minor ( 664417 ) writes:
        
        lol fair enough. I've said some pretty dumb shit on here stone cold sober.
  - Re: (Score:2)
    
    by DrMrLordX ( 559371 ) writes:
    
    AVX512 is a mess of instruction subsets. The ML-related ones are stuff like bfloat16 (present in Cooper Lake). Look it up.
    - Re: (Score:2)
      
      by phantomfive ( 622387 ) writes:
      
      Thanks
- - Re: Linus isn't really a floating-point kind of gu (Score:4, Insightful)
    
    by NagrothAgain ( 4130865 ) writes: on Saturday August 22, 2020 @03:00PM (#60429997)
    
    You need to differentiate between his personal opinions and his technical opinions. Simply saying "if Linus likes/hates it, so should you!" is Religion and has no business in a discussion of the technical merits of a platform.
    
  - Re: (Score:3)
    
    by account_deleted ( 4530225 ) writes:
    
    Comment removed based on user account deletion
  - Re: Linus isn't really a floating-point kind of gu (Score:2)
    
    by BAReFO0t ( 6240524 ) writes:
    
    You seriously need to learn your logical fallacies, kid.
    Like "argument from authority".
    Parent's argument was that it's the sub-area where Linus does not have a clue. Which was the same fallacy as well.
    In the end, Linus was the only one making actual arguments. You may show counter-arguments, if you got some. Otherwise, why don't you two monkey brains shut up?
That's not a retort. (Score:2)

by thegarbz ( 1787294 ) writes:

The critics say AVX-512 has no place in desktops and laptops. Countering that the HPC and AI community love it is sort of supporting the points of the critics. Throw it in Xeons and special purpose chips, leave it out of the rest. Maybe then you could save some money and be price competitive with AMD.
Right now you'd be mad to build a general purpose computer based on an Intel CPU.
- Re: (Score:1)
  
  by Lonewolf666 ( 259450 ) writes:
  
  Right now, AVX-512 is only supported in Intel's HEDT systems and some Xeon Phi models: https://en.wikipedia.org/wiki/AVX-512 [wikipedia.org]. Which makes it very niche indeed. No testing something at home on your old i7.
  - Re: (Score:2)
    
    by AzN1337c0d3r ( 997208 ) writes:
    
    Ice Lake client is out and has more support for AVX-512 than even the HEDT systems right now (see VBMI2).
    https://en.wikichip.org/wiki/i... [wikichip.org]
    - Re: (Score:2)
      
      by DrMrLordX ( 559371 ) writes:
      
      Intel actually implemented AVX512 in Cannonlake as well. TigerLake will have it. Sadly, it won't offer much performance since, unlike those HEDT systems and server systems (Skylake-SP, Cascade Lake, Cooper Lake), the consumer CPUs I mentioned above only support 512b SIMD via op fusion. 2x256b vs 2x512b and all that.
      You're not THAT much better off running AVX512 on IceLake-U than you are AVX2.
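The "2x256b vs 2x512b" point can be made concrete with a little arithmetic. This sketch takes the commenter's unit counts at face value (they are not verified against Intel documentation here) and assumes a 512-bit operation on 256-bit hardware simply occupies two units at once; the function names are hypothetical.

```python
FLOAT32_BITS = 32

def ops_per_cycle(units, unit_bits, op_bits):
    """Vector operations that can start each cycle when an operation wider
    than one unit has to occupy several units at once."""
    return units * unit_bits / max(op_bits, unit_bits)

def float32_per_cycle(units, unit_bits, op_bits):
    """float32 lanes processed per cycle for the given operation width."""
    return ops_per_cycle(units, unit_bits, op_bits) * (op_bits // FLOAT32_BITS)

# Configurations as described in the comment above (assumed, not verified).
consumer_avx2 = float32_per_cycle(units=2, unit_bits=256, op_bits=256)
consumer_avx512 = float32_per_cycle(units=2, unit_bits=256, op_bits=512)
server_avx512 = float32_per_cycle(units=2, unit_bits=512, op_bits=512)

print("2x256b running AVX2:   ", consumer_avx2, "float32/cycle")
print("2x256b running AVX-512:", consumer_avx512, "float32/cycle")
print("2x512b running AVX-512:", server_avx512, "float32/cycle")
```

Under these assumptions the fused configuration gains nothing in raw lanes per cycle over AVX2; any advantage would have to come from elsewhere, such as fewer instructions to decode or the extra registers and masking.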
- Re: (Score:2)
  
  by fazig ( 2909523 ) writes:
  
  Given how useful AVX 2 has been except for some fringe applications, it seems reasonable to assume something similar for AVX 512.
  There may be some uses where it really shines, but it's simply not common enough to advertise it as a useful feature for the broader market of even professionals.
  
  The only application in my personal workflow that makes notable use of AVX 2 is Blender in the Cycles renderer. And at that my Ryzen 9 3900X at stock speeds has still more processing power than a 25% more expensive i
  - Re: (Score:2)
    
    by DamnOregonian ( 963763 ) writes:
    
    You have no idea all the places you're using AVX2.
    Every compiler around supports auto-vectorization and uses it in their equivalent of -O3.
    In kernels, they're used to move data around/compare/transform with fewer instructions, resulting in significant speedups, like Netfilter's ~420% speedup using AVX2 on Rome cores.
    And at that my Ryzen 9 3900X at stock speeds has still more processing power than a 25% more expensive i9 10900k.
    Aggregate, yes. But core-for-core, the Intel walks you like a dog.
    That has its own applications (generally anything that performs better with lower instruction latency will perform better in ca
    - Re: (Score:2)
      
      by fazig ( 2909523 ) writes:
      
      According to Chromium compile times the 3900X is also about 27% faster than the 10900k.
      
      Do you mean Blender viewport performance? I explicitly mentioned Blender Cycles. That's a situation where people spend a lot of time waiting for the renders to be finished at a high power draw of their system unless they send their files to a remote render farm. Show me the benchmarks where the 10900k is faster there.
      I'll go ahead and start by showing data that supports what I wrote by simply googling "10900k blender"
      - Re: (Score:2)
        
        by DamnOregonian ( 963763 ) writes:
        
        According to Chromium compile times the 3900X is also about 27% faster than the 10900k.
        I never argued that the 3900X didn't have more aggregate CPU cycles to throw at a problem. It has 2 more cores and 4 more hardware threads.
        I argued that core-per-core, it walks the 3900X like a dog. And it does.
        But let's address your 27% claim- because frankly, it's horse shit.
        Let's look at the source of it- the gamersnexus article.
        Notice that the 3900X (stock) is literally the same as the 3900X overclocked? Ya. They fucked up.
        They're not measuring stock speed. So it's logical we should be comparing ov
        
        Re: (Score:2)
        
        by fazig ( 2909523 ) writes:
        
        You realize that I didn't argue otherwise when it comes to core-to-core performance?
        Apparently not. You saw a post by someone you thought to be an Intel hater and had to counter with some "Real World Performance" nobody asked for and which isn't backed up by data either.
        
        My point is that I care little for core-to-core performance here. In these workloads, that make me and many others that work in that field money, AMD offers more cores with a higher total performance which also have a lower power draw fo
- Re: (Score:2)
  
  by DrMrLordX ( 559371 ) writes:
  
  There's nothing about AVX512 that's making Intel consumer CPUs "more expensive". Cannonlake/IceLake/TigerLake support AVX512 via op fusion. They have the same 2x256b config that Intel has used since Haswell. AVX512 provides no real performance advantage for any of Intel's current lineup outside of Cascade Lake/Cooper Lake (where Intel has implemented 2x512b configs).
  Skylake-S, CoffeeLake-S, and Comet Lake-S (and any of the mobile derivatives of those CPU groups) do not support AVX512 at all, so it's a mo
Summary (Score:2)

by JustAnotherOldGuy ( 4145623 ) writes:

So, one of these things has more things than the other thing, and the other thing doesn't even have the thing so everyone is arguing about the thing and if it's really a thing. Intel spokesperson says, "Yes", it will be a thing.
At least that's what I got out of it. Oh and it has something to do with graphics or numbers or something like that.
- Re: Summary (Score:2)
  
  by BAReFO0t ( 6240524 ) writes:
  
  Sure, if you lossy-reduce a discussion to uselessness, it's gonna be useless. You might as well reduce it further, to "x". Or to (bottom) aka undefined. (In Haskell parlance.)
  Maybe it's just you not getting things. Maybe you need some coffee. Or do something more fun to you.
  - Re: (Score:2)
    
    by JustAnotherOldGuy ( 4145623 ) writes:
    
    No no I get it- the first thing has twice as many things that the other one doesn't have.
  - Re: (Score:3)
    
    by ufgrat ( 6245202 ) writes:
    
    The irony is that "EditorDavid" committed one of the most fundamental errors an "editor" (in the journalistic sense) can make.
    A whole effin' summary THAT NEVER EXPLAINS WHAT THE *EXPLETIVE* AVX-512 EXTENSION IS!!!!!!
    This thing! This terrible thing! Linus hates it! Intel Loves it! And we can't be arsed to tell you what it is!!!
    I used to program in Assembly. I consider myself mildly knowledgeable about x86 architecture. I don't know what avx-512 is. My first thought was that it must be an audio/video c
    - Re: (Score:2)
      
      by DrMrLordX ( 559371 ) writes:
      
      AVX512 is a mess of instruction sub-sets that's actually hard to explain. In its simplest terms, it's supposed to be 512 bit SIMD. A simple example would be adding 32-bit fp values to one another to produce a single sum.
      128b SIMD lets you add four values at once. 256b SIMD lets you add 8 values at once. 512b SIMD lets you add 16, and so forth. The wider your vector implementation, the more stress you put on load/store blah blah blah you should know the drill by now right?
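The lane counts quoted above follow directly from dividing the register width by the element width. The sketch below uses plain Python to stand in for the hardware (nothing here runs real vector instructions) and counts how many simulated vector adds it takes to process 64 float32 values at each width; `lanes` and `vector_add` are names invented for this example.

```python
# Plain Python standing in for SIMD hardware; no real vector instructions run.
FLOAT32_BITS = 32

def lanes(vector_bits, element_bits=FLOAT32_BITS):
    """How many elements fit in one vector register of the given width."""
    return vector_bits // element_bits

def vector_add(a, b, vector_bits):
    """Add two equal-length lists one register-width chunk at a time.

    Returns the sums and the number of simulated vector add instructions.
    """
    width = lanes(vector_bits)
    sums, instructions = [], 0
    for start in range(0, len(a), width):
        chunk_a = a[start:start + width]
        chunk_b = b[start:start + width]
        sums.extend(x + y for x, y in zip(chunk_a, chunk_b))
        instructions += 1  # one instruction handles the whole chunk
    return sums, instructions

a = [float(i) for i in range(64)]
b = [1.0] * 64
for bits in (128, 256, 512):
    _, count = vector_add(a, b, bits)
    print(f"{bits}-bit SIMD: {lanes(bits)} float32 lanes, {count} adds for 64 values")
```

Each doubling of width halves the instruction count, but the same 64 values still have to be loaded and stored, which is where the load/store pressure mentioned above comes from.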
I don't care! My next custom built (Score:2)

by oldgraybeard ( 2939809 ) writes:

workstations will be using AMD. I had to smile when I saw
in "our good old CPU socket"
and
"Our customers on the data center side really, really, really love it."

The second and third "really" sold me right there. Such a powerful selling point coming from a marketing dweeb. More useless marketing double speak.
- Re: (Score:2)
  
  by account_deleted ( 4530225 ) writes:
  
  Comment removed based on user account deletion
  - Re: (Score:2)
    
    by oldgraybeard ( 2939809 ) writes:
    
    I stand corrected! thxs for the smile ;)
- Re: (Score:2)
  
  by DrMrLordX ( 559371 ) writes:
  
  AMD will probably support AVX512 eventually, unless they do something radical and try to force the x86 world to switch to SVE2 (which I wish they would).
What the hell is AVX-512? (Score:3)

by bluegutang ( 2814641 ) writes: on Saturday August 22, 2020 @02:56PM (#60429989)

Can someone explain, in language understandable by a tech person with no particular experience in CPU or compiler design, what AVX-512 is, what need it is supposed to fill, and why exactly people are criticizing it?

- Re: (Score:3, Informative)
  
  by douglasfir77 ( 6439950 ) writes:
  
  Intel AVX-512 is a set of new (2013) CPU instructions that impacts compute, storage, and network functions. ... Intel AVX-512 enables twice the number of floating point operations per second (FLOPS) per clock cycle compared to its predecessor
  - Re: (Score:2)
    
    by Rockoon ( 1252108 ) writes:
    
    Intel AVX-512 enables twice the number of floating point operations per second (FLOPS) per clock cycle compared to its predecessor
    Incorrect.
    
    Execution units do that. 512 bit registers are not required.
    - Re: (Score:2)
      
      by DamnOregonian ( 963763 ) writes:
      
      Huh?
      Are you really about to argue that we can forgo vectorization and simply add more execution units?
      
      That'll be awesome. I, personally, can't wait for pipelines so long that a stall takes seconds to resolve.
      Or were you hoping the execution units would peer into the future?
- Re:What the hell is AVX-512? (Score:5, Informative)
  
  by Carewolf ( 581105 ) writes: on Saturday August 22, 2020 @03:15PM (#60430013) Homepage
  
  GPU-like processing on the CPU, but not as fast as on a GPU. And it still requires specially optimized programs.
  
  - Re: (Score:2)
    
    by DamnOregonian ( 963763 ) writes:
    
    And it still requires specially optimized programs.
    This made me laugh. Only someone who has never written an OpenCL/CUDA kernel would say something that fucking stupid.
    Perhaps you should limit yourself to offering opinions on things you actually know something about.
  - - Re: (Score:3)
      
      by Carewolf ( 581105 ) writes:
      
      Except when it needs to warm up the AVX-512 unit. It really only works when it is doing big batches, for small pieces of AVX-512 code you have to wait for it to spin the special unit on the CPU up and clock the CPU down, then run the code.
      - Re: (Score:2)
        
        by DamnOregonian ( 963763 ) writes:
        
        Except when it needs to warm up the AVX-512 unit.
        Ya, you have a good point.
        Warming up the AVX-512 unit is super expensive compared to the startup cost of the shader cores. </sarcasm>
- Re: (Score:2)
  
  by account_deleted ( 4530225 ) writes:
  
  Comment removed based on user account deletion
  - Re: What the hell is AVX-512? (Score:2, Informative)
    
    by BAReFO0t ( 6240524 ) writes:
    
    And that is deliberately misleading, as, in a typical Intel fashion "It's better because we shifted the shit out to the stuff around it, that we hid, so it looks better". In other words: The instruction needs a special mode and special caching behavior and even then the CPU cannot feed it constantly, so you only get that speed for a short time, before and after which the CPU will waste so much time on related stuff, that in total it is much slower. And if you try to do it for longer, to make it worth it, it
    - Re: (Score:2)
      
      by account_deleted ( 4530225 ) writes:
      
      Comment removed based on user account deletion
      - Re: (Score:2)
        
        by DrMrLordX ( 559371 ) writes:
        
        You should be looking at SVE2 instead.
  - - Re: (Score:2)
      
      by account_deleted ( 4530225 ) writes:
      
      Comment removed based on user account deletion
      - Re: (Score:2)
        
        by default luser ( 529332 ) writes:
        
        They didn't hobble Titan RTX.
        The Titan V uses Volta, while Titan RTX uses Turing.
        One is a rebranded professional card, while the other is a rebranded RTX 2080 Ti.
        The pro card has dedicated 64-bit units, while the consumer card ditched that for a bit higher FP32 performance and dedicated raytracing units.
        
        Re: (Score:2)
        
        by account_deleted ( 4530225 ) writes:
        
        Comment removed based on user account deletion
    - Re: (Score:2)
      
      by Jeremy Erwin ( 2054 ) writes:
      
      Nvidia A100:
      9.7 TFLOPS FP64
      19.5 TFLOPS FP32
      78 TFLOPS FP16
      Nvidia marketing materials [nvidia.com]
- Re: (Score:2)
  
  by AmiMoJo ( 196126 ) writes:
  
  These are vector instructions, meaning that they operate on several things at once. So for example say you have 16 sets of multiplications to do, you can do them all with one instruction.
  The advantage is that the CPU only needs to read and decode one instruction and then all 16 multiplications happen at the same time in parallel. That's really useful for a lot of things.
  The earliest of these extensions (MMX) had 64 bit registers and Intel kept increasing their size, to 128 with SSE and 256 with AVX. 512 is huge and few things benefit from
  - Re: (Score:2)
    
    by account_deleted ( 4530225 ) writes:
    
    Comment removed based on user account deletion
- Re:What the hell is AVX-512? (Score:5, Informative)
  
  by drkshadow ( 6277460 ) writes: on Saturday August 22, 2020 @05:06PM (#60430223)
  
  Current responses are lacking, so..
  AVX-512 is a SIMD instruction set extension. Single-instruction, multiple-data: rather than having to decode instructions, fetch data from memory, and execute each instruction in sequence, it'll fetch a block of memory and perform the (limited set of) instructions on that block.
  It started with MMX, which back in 1997 (according to Wikipedia) allowed you to process an array of integer data -- something like four pieces of data at once, perform an integer operation on all of them, and in the space of a single multiply or add, you get four of them done. Great! This helped with image decoding, audio processing (how many voices in a game?), DVD playback, and the list goes on.
  AMD had their own implementation with 3DNow!. These grew up, through MMX2, MMX3, 3dNow2, and blah.
  It was seen how useful these SIMD instructions were, but they were just so limited - add, subtract, multiply, divide of integers. What about floating point? This is where the successors come in -- SSE. According to the 'pedia, SSE2 was introduced with the P4, so I'm guessing SSE itself was introduced on the Pentium 3 (which seems to have introduced a _lot_ of new architecture).
  Well, SSE was killer, too -- it helped with better precision for floating point, started getting used in games, and cheap AI purposes. Cool!
  SSE is old news, though. SSE4 is probably the last I've seen mentioned (Sept 2006?), and Intel keeps introducing new things, and growing up. Really, Intel was hurting -- all of these nVidia cards being used for AI and HPC purposes -- Intel wanted a cut of that HPC market, which they were completely locked out of. Intel doesn't have a discrete graphics processor, and the Intel Iris Graphics just isn't going to cut it for HPC purposes.
  And so: Intel AVX instructions.
  The AVX instruction set is really about doing on the CPU -- without OpenCL, CUDA, or another compiler/language/processing environment, without copying data over the PCIe bus, without referencing a discrete component -- HPC-things that needed to be done on the graphics cards. Remember, graphics cards are about processing _lots_ of data at once (300-2000 shaders consumer, up to 7000 cores HPC), and AVX is trying to pick up some of that (256-bit AVX and AVX2, then 512-bit AVX-512). These are in-CPU, smaller versions of the graphics card, without the linguistic changes, or scheduling, or DMA, or other complexities associated therewith.
  The benefits are clear -- if you're hashing something, you can do that very quickly with the AVX instructions, especially without any latency of copying data to another device.
  The drawbacks are less clear, but very apparent: graphics cards are rated to 300 watts. You're now trying to stuff a portion of that processing power into the CPU, and back in the early 2010s, benchmarking showed this to cause the CPUs to run VERY hot. Much hotter, much more quickly than the heat sink could cool them. (I worked at a computer manufacturer -- running Prime95 with the AVX instruction set would regularly cause problems.) Apparently, from other comments, the CPU also doesn't have the memory bandwidth to fetch the data quickly enough. Remember, graphics cards use High Bandwidth Memory now to supply up to 1500 shader cores. Really, with AVX, the memory bus can't keep up -- unless you're doing thousands of iterations over the same, cached data, you can do one instruction and then you have to wait.
  So -- with AVX, Intel got a serious performance boost for games, graphics, RAID, AI workloads, encryption, compression, and so on.
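For anyone who wants the "one instruction, many data elements" idea above in concrete form, here is a minimal Python sketch. It is not real SIMD (Python runs it serially); it only models how many elements fit in one vector register and what a lane-wise add looks like. The names `lanes` and `vector_add` are made up for this illustration.

```python
def lanes(register_bits: int, element_bits: int) -> int:
    """Number of elements of element_bits that fit in one vector register."""
    if register_bits % element_bits != 0:
        raise ValueError("element size must divide register size")
    return register_bits // element_bits


def vector_add(a, b):
    """Model of one SIMD add: every lane is added independently."""
    if len(a) != len(b):
        raise ValueError("operands must have the same number of lanes")
    return [x + y for x, y in zip(a, b)]


# 32-bit floats per register for SSE (128), AVX/AVX2 (256) and AVX-512 (512).
widths = {"SSE": 128, "AVX/AVX2": 256, "AVX-512": 512}
floats_per_register = {name: lanes(bits, 32) for name, bits in widths.items()}

# One modeled AVX-512 add covers all float32 lanes of a 512-bit register.
n = floats_per_register["AVX-512"]
result = vector_add(list(range(n)), [10] * n)
```

On real hardware the whole `vector_add` step would be a single instruction on one register pair, which is where the speedup comes from, provided memory can feed the data fast enough.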
  
  - Re: (Score:2)
    
    by pendolino ( 6185100 ) writes:
    
    That's a great post, but it doesn't explain why SSE and AVX were great extensions that everybody adopted without question and AVX-512 is not.
    SSE allowed processing of (among other things) vectors of four 32-bit FP numbers (16 128-bit registers, 256 bytes total). You don't need to look very far to find applications: anything 3D. So, very useful.
    What SSE _also_ allowed was to finally forget about the x87 FPU. That was a mess because it used a stack architecture, which meant _a lot_ of x87 instructions wer
- Re: (Score:3)
  
  by Linux Torvalds ( 647197 ) writes:
  
  If your computer sits around all day solving linear algebra problems, you'd like your dot products to run in O(1) time instead of O(number of elements). That's what AVX and similar instruction sets do. They make your CPU act less like a Commodore and more like a Cray.
  However, if you do anything else, you would prefer your CPU vendor to spend the considerable transistor budget associated with vector instruction sets on something else.
  The thing is, almost every interesting problem boils down to linear algeb
- Re: (Score:2)
  
  by DrMrLordX ( 559371 ) writes:
  
  It's SIMD - Single Instruction Multiple Data
  Intel has been working on SIMD since MMX in 1995/96.
  You should be familiar with some of these terms: SSE, SSE2, SSE3, SSE4/4.1a/4.1b, XOP, AVX, AVX2
  Each one is a selection of CPU instructions you can use in some capacity to carry out various floating-point (or in some cases, integer) operations on multiple points of data without using more than one instruction.
  In general terms, the bit-width of the SIMD standard determines how much data can be processed in one ins
- - Re: (Score:2)
    
    by Kaenneth ( 82978 ) writes:
    
    How many AVX-512 units are there on a die? If I'm trying to do Mandelbrot on 8 threads (4 cores), will each thread stall until an AVX unit is available or what?
    - Re: (Score:2)
      
      by DrMrLordX ( 559371 ) writes:
      
      Per core? Depends on the CPU!
      Cannonlake, IceLake, and Tigerlake have 2x256b per core, and perform AVX512 via op fusion in hardware. No real gain there over AVX2 unless there's some instruction in AVX512 you're just dying to use.
      Skylake-SP, Cascade Lake-SP/AP, and Cooper Lake all have 2x512b per core. They carry out AVX512 natively, though they are the ones that downclock at the first sign of an AVX512 instruction. Sometimes substantially. Unless you're smashing all cores non-stop with AVX512 instruction
How? (Score:2)

by phantomfive ( 622387 ) writes:

"Our customers on the data center side really, really, really love it." Koduri said Intel has been able to help customers achieve a 285X increase in performance in "our good old CPU socket" just by taking advantage of the extension...
The only way I can think of where this would be useful at all in a datacenter would be with encryption, so encrypting HTTPS connections. That's a small portion of total cost in most datacenters, though.
Only 100MHz (Score:1)

by GioMac ( 862536 ) writes:

From 2200? That's a lot!
- Re: (Score:2)
  
  by evanh ( 627108 ) writes:
  
  More importantly, the dude is being deceptive, since the one thing you want to do with such instructions is run them in parallel, and that includes using all cores!
accomplishments (Score:1)

by Mr.Bosski ( 7153993 ) writes:

You're allowed to use your accomplishments as credentials to bolster a position you take. Obviously you still need to back it up, but it certainly gives your position credibility.
Effect on task switching? (Score:2)

by johannesg ( 664142 ) writes:

Pushing / popping that many bytes must give a noticeable hit on task switching as well, and everyone is paying it, even if you aren't using AVX512. Anybody have any numbers on how large the effect is?
- Re: (Score:3)
  
  by Rockoon ( 1252108 ) writes:
  
  Pushing / popping that many bytes must give a noticeable hit on task switching as well, and everyone is paying it, even if you aren't using AVX512. Anybody have any numbers on how large the effect is?
  That's one of the things Linus was complaining about. Real negative performance metrics combined with almost no realizable benefit. There are no AVX-512 supporting processors that can't do the same work in the same time using SSE's 128-bit registers instead.
  - Re: (Score:2)
    
    by DamnOregonian ( 963763 ) writes:
    
    There are no AVX-512 supporting processors that can't do the same work in the same time using SSE's 128-bit registers instead.
    That's downright absurd.
    Why are you making shit up?
- Re: (Score:2)
  
  by Carewolf ( 581105 ) writes:
  
  I assume the CPU has some way of telling the OS if it has touched the AVX-512 registers. They are not just double size, but there are also twice as many of them. Saving them all would take 4 times as long as normal.
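The "4 times" estimate above can be checked with a bit of arithmetic. This is a rough sketch that counts only raw register bytes in 64-bit mode (16 XMM, 16 YMM, 32 ZMM registers, plus the eight 64-bit opmask registers k0-k7). It ignores the actual XSAVE layout and any "in use" tracking the CPU and OS may apply to skip untouched state, so treat it as an upper bound on the extra data, not a measured cost. The names below are made up for illustration.

```python
# (register count, bytes per register) in 64-bit mode
VECTOR_STATE = {
    "SSE (XMM)": (16, 16),
    "AVX (YMM)": (16, 32),
    "AVX-512 (ZMM)": (32, 64),
}
OPMASK_BYTES = 8 * 8  # k0-k7, 64 bits each, AVX-512 only


def vector_bytes(name: str) -> int:
    """Raw bytes held in the vector registers of the given extension."""
    count, size = VECTOR_STATE[name]
    return count * size


def state_bytes(name: str) -> int:
    """Vector register bytes plus opmask registers where they exist."""
    extra = OPMASK_BYTES if name.startswith("AVX-512") else 0
    return vector_bytes(name) + extra


# Twice the registers at twice the width: 4x the YMM state, 8x the XMM state.
ratio_vs_avx = vector_bytes("AVX-512 (ZMM)") // vector_bytes("AVX (YMM)")
ratio_vs_sse = vector_bytes("AVX-512 (ZMM)") // vector_bytes("SSE (XMM)")
```

So the factor of 4 holds when comparing against AVX's YMM registers; against plain SSE it is closer to 8, before counting the mask registers.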
Sure, sure... now show us that HPC laptop... (Score:2)

by BAReFO0t ( 6240524 ) writes:

HPC is a marketeering wank term anyway, like "cloud".
No person of clue uses it.
AVX-512's problem is that it is incoherent (Score:5, Interesting)

by brunos ( 629303 ) writes: on Saturday August 22, 2020 @04:51PM (#60430205)

When I first read Linus' post, I dismissed it as one of his rants. Then I tried to optimize our code (image compression) for AVX-512. The problem is that each different "AVX-512" Intel processor implements only part of the spec, making it a nightmare from all points of view: programming, testing, and even marketing, i.e., I can't even say to customers "if you have AVX-512, you will get performance X". That makes AVX2 a much better option, as it is the same across most processors (including AMD). Where performance and parallelism are really needed, a GPU is at least one order of magnitude faster. I agree that the idea of having a 512-bit (or more) SIMD is nice, but the implementation is just terrible.
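To illustrate the fragmentation problem described above, here is a small Python sketch that checks whether a CPU's feature flags cover the AVX-512 subsets a code path needs. The flag names follow the style Linux prints in /proc/cpuinfo; the two flag lines are shortened, hand-written examples modeled on a Knights Landing style part (F, CD, ER, PF) and a Skylake-SP style part (F, CD, BW, DQ, VL), and the helper names are hypothetical.

```python
def parse_flags(flags_line: str) -> set:
    """Split a space-separated CPU flags line into a set of flag names."""
    return set(flags_line.split())


def missing_features(cpu_flags, required):
    """Return the required flags the CPU does not report, sorted by name."""
    return sorted(set(required) - set(cpu_flags))


# Shortened, illustrative flag lines (not copied from real machines).
KNL_LIKE = parse_flags("sse2 avx avx2 avx512f avx512cd avx512er avx512pf")
SKX_LIKE = parse_flags("sse2 avx avx2 avx512f avx512cd avx512bw avx512dq avx512vl")

# Assumed requirement of some byte/word-heavy kernel, e.g. image code.
REQUIRED = {"avx512f", "avx512bw", "avx512vl"}

knl_missing = missing_features(KNL_LIKE, REQUIRED)
skx_missing = missing_features(SKX_LIKE, REQUIRED)
avx2_everywhere = all("avx2" in cpu for cpu in (KNL_LIKE, SKX_LIKE))
```

Both example CPUs "have AVX-512", yet only one can run the hypothetical kernel, while a plain AVX2 check passes on both. That is the testing and marketing headache in miniature.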

- Mod this post up to 11 (Score:2)
  
  by Latent Heat ( 558884 ) writes:
  
  Spot on. If Intel actually implemented this in a way useful to the developer of a software application that was not restricted to one processor of one generation, maybe there would be a use for it.
- Re: (Score:2)
  
  by DrMrLordX ( 559371 ) writes:
  
  Take a look at SVE2. It's much cleaner than AVX512.
What some blogger found is irrelevant (Score:2)

by ReneR ( 1057034 ) writes:

It is a fact that AVX* goes mostly unused, because it requires specific support in each build, and most of the OS, browser, email client, you name it, will not have special AVX support. Given this, the rather large die area occupied by AVX* would be better used for another core, or a higher clock if it were absent, or larger buffers and caches, etc. A proper, scalable vector extension would also have been nicer than the mess Intel created from MMX, over SSE* to AVX* and soon bfloat16 and soon AMX, sigh: ht [youtube.com]
Intel Who? (Score:4, Insightful)

by nagora ( 177841 ) writes: on Saturday August 22, 2020 @05:57PM (#60430331)

When was the last time anyone cared what Intel engineers claimed?

- Re: (Score:2)
  
  by shentino ( 1139071 ) writes:
  
  Probably the same day Intel engineers cared about anything but the benchmarks they were whoring out to.
It's a processor with added purpose built instr. (Score:2)

by Malays2 bowman ( 6656916 ) writes:

Wow, I can't believe that people are getting so fanatical (as in religious) over this!
Really, now we need to rag on a processor because it has added instructions meant to accelerate a specific kind of task?!
Give it a rest!
nt (Score:2)

by shentino ( 1139071 ) writes:

My biggest concern with Intel is that its marketing and/or feedback metrics might be too closely tied to benchmarks and not closely tied enough to practical performance concerns. I'd also ask, however, who their intended market is. If their customers care enough about benchmarks to push their sales, then Intel kinda has to follow the money if it wants to stay in business. If they're selling to people who care about benchmarks, then their marketing needs to satisfy them.
The real reason why Intel has AVX-512... (Score:2)

by fintux ( 798480 ) writes:

The real reason why Intel has AVX-512 and keeps presenting it as a very special, needed thing is that it's probably the only differentiating feature they have compared to AMD. Otherwise, they have fewer features. They don't have PCIe 4.0, they don't have more cores, they don't have support for ECC RAM in as wide a range of products, they don't have full memory encryption - those are some things I can name off the top of my head. They only have a bit higher single-threaded performance (at the cost of
- - Re: IDS (Score:3)
    
    by BAReFO0t ( 6240524 ) writes:
    
    Don't you mean the deep state and branch prediction? ;)
- Re: (Score:2)
  
  by shentino ( 1139071 ) writes:
  
  LOL goat.cx got scooped by an ad-pumping domain scalper!
