
Mac Studio's M1 Ultra Chip Outperforms on Computational Fluid Dynamics Benchmarks (hrtapps.com) 63

Dr. Craig Hunter is a mechanical/aerospace engineer with over 25 years of experience in software development. And now Dixie_Flatline (Slashdot reader #5,077) describes Hunter's latest experiment:

Craig Hunter has been running Computational Fluid Dynamics (CFD) benchmarks on Macs for years--he has results going back to 2010 with an Intel Xeon 5650, with the most recent being a 28-core Xeon W from 2019. He has this to say about why he thinks CFD benchmarks are a good test:

"As shown above, we see a pretty typical trend where machines get less and less efficient as more and more cores join the computation. This happens because the computational work begins to saturate communications on the system as data and MPI instructions pass between the cores and memory, creating overhead. It's what makes parallel CFD computations such a great real world benchmark. Unlike simpler benchmarks that tend to make CPUs look good, the CFD benchmark stresses the entire system and shows us how things hold up as conditions become more and more challenging."

With just 6 cores, the Mac Studio's M1 Ultra surpasses the 2019 Xeon before literally going off the original chart. He had to double the x-axis just to fit the M1's performance in. Unsurprisingly, he seems impressed:

"We know from Apple's specs and marketing materials that the M1 Ultra has an extremely high 800 GB/sec memory bandwidth and an even faster 2.5 TB/sec interface between the two M1 Max chips that make up the M1 Ultra, and it shows in the CFD benchmark. This leads to a level of CPU performance scaling that I don't even see on supercomputers."

  • Nice that he's found the right tool for doing CFD. For the rest of us, if you want the best gaming rig, find the machines that handle games best. If you do CFD, likewise, find the machines that handle that best. Doing deep neural nets? Yeah, you guessed it. Surfing the web, typing documents, watching videos - get the one that you like the look of, doesn't cost too much, & has a long battery life. Simply put, choose the right tool for the job.
    • by dfghjk ( 711126 )

      or...

      "It's what makes parallel CFD computations such a great real world benchmark. Unlike simpler benchmarks that tend to make CPUs look good, the CFD benchmark stresses the entire system and shows us how things hold up as conditions become more and more challenging."

      It's what makes parallel CFD computations such a great real world benchmark IF THAT'S WHAT YOUR APPLICATION DOES. Otherwise, simpler benchmarks may be more indicative.

      Also, this Apple solution may be good if you need to do CFD ON A DESKTOP. I

      • Re:yes (Score:5, Interesting)

        by drinkypoo ( 153816 ) <drink@hyperlogos.org> on Saturday April 30, 2022 @03:21PM (#62492500) Homepage Journal

        It's kind of hard to buy what he's saying about supercomputers, because he's only claiming 800GB/sec memory bandwidth while modern compute GPUs deliver twice that or more, and you can do CFD on them. Are people who really need maximum performance still even doing CFD on CPUs? I'm seeing claims that single high-end compute-specific GPUs with HBM are handily outperforming CPUs with many cores, or even whole clusters of PCs doing CPU-based CFD. For example, Performance comparison of CFD-DEM solver MFiX-Exa, on GPUs and CPUs [arxiv.org]:

        A single GPU was observed to be about 10 times faster compared to a single CPU core. The use of 3 GPUs on a single compute node was observed to be 4x faster than using all 64 CPU cores.

        Or perhaps GPU Acceleration of CFD Algorithm: HSMAC and SIMPLE [sciencedirect.com] ...

        [...]we implemented the HSMAC and SIMPLE algorithms on GPU. For the simulation of 2D lid-driven cavity flow, the GPU version could get a speedup up to 58x and 21x respectively with double precision, and 78x and 32x with single precision, compared to the sequential CPU version.

        It seems like CFD on CPU is irrelevant ATM. But maybe someone who does it for a living will tell me why I'm wrong :)

        • Re:yes (Score:4, Interesting)

          by serviscope_minor ( 664417 ) on Saturday April 30, 2022 @03:46PM (#62492552) Journal

          It seems like CFD on CPU is irrelevant ATM.

          Fujitsu might like a word with you.

          On the other hand those aren't exactly run of the mill CPUs. Very wide vector units, HBM2 memory and 68GB/s of networking.

        • Re:yes (Score:5, Informative)

          by DamnOregonian ( 963763 ) on Saturday April 30, 2022 @05:30PM (#62492788)
          I don't do CFD for a living, but I do do processing of very, very large datasets with GPU assist (OpenCL).
          I have begun doing these on my MacBook Pro (M1 Max, 10 core (8P+2E), 64GB) just to see how it does.

          CPU wise, the Max is pretty fucking good. It lays pretty solid waste to my desktop counterparts.
          GPU is another story, though.
          I just upgraded from a 2080 Ti to a 3090 Ti (Because it's the first fucking graphics card I've seen for sale at MSRP in years, now)
          Still waiting on a power supply beefy enough to run it.

          My 2080Ti outperformed my M1 Max in computational tasks by around 400%. I expect the 3090 Ti will approximately double to triple that.

          The author describes the memory bandwidth, which I find interesting, because the Max can't use more than about half of its 400GB/s bandwidth (for CPU tasks, anyway). I suspect the Ultra exactly doubles that, meaning you get nowhere near 800GB/s on CPU tasks.
          The GPU on my Max is capable of moving around 330GB/s, so I expect the Ultra, with properly NUMA-aware data sets can at least double that. Which is fantastic, except, it's nothing compared to a good GPU.

          Ultimately, I've written my M1 Max off for computation as a neat toy that performs leagues past any other laptop, but is still woefully deficient when compared against a Desktop.
          • For high speed GPU compute, it's only comparable on a performance per watt metric. It gets the same or better performance than the 30xx Nvidia chips but Nvidia still has more compute hardware available so it wins in total speed. You just also need a better cooling system and power system.

            • For high speed GPU compute, it's only comparable on a performance per watt metric. It gets the same or better performance than the 30xx Nvidia chips

              Same, better, and worse, depending on the benchmark. It's fair to say that architecturally, they're similar. There's a caveat though, and I'll get to that at the end.

              It gets the same or better performance than the 30xx Nvidia chips but Nvidia still has more compute hardware available so it wins in total speed.

              Yup.

              You just also need a better cooling system and power system.

              Absolutely. There's another thing to take into consideration.
              The NV cards are matching the M1 Max GPU in perf per watt, despite being on a significantly less efficient process node (Samsung 8nm, which is comparable to, though slightly better than, TSMC 10nm, but barely over half of TSMC 7nm).
              A 5nm NV is going to walk the M1 GPU like a dog in

        • I do CFD simulations for a living. As always, the devil is in the details. DEM-CFD, i.e. Discrete Element Modeling of the 1st paper, is an algorithm that is ideal for parallelisation on a GPU. As long as you keep your data amount low, a GPU will outperform a CPU. Note that in the second paper, they did only 2D simulations of the classical CFD algorithms. That's cheating, if you ask me. In a real-life scenario the GPU would get overwhelmed by the data amount of a fine 3D mesh. Multiphase simulations also mak
      • by Jeremi ( 14640 )

        Macs traditionally aren't known as scientific workstations, nor is Apple likely to invest in that area.

        I don't know about that; I know a lot of scientists who prefer having a Mac on their desk to work on. Granted, they may be using the Mac largely as a GUI front-end to programs whose computations are performed on larger machines elsewhere, but isn't that what a workstation is? The station that you work at?

    • Test it for what you need it for

      Wise words. However, this raises the question as to why nobody on /. has created benchmarks for shitposting. ;)

      • ... this raises the question ....

        Thank you for not writing "this begs the question".

        • Thank you for not writing "this begs the question".

          I've given up that fight. If most people decide that "begs the question" is another way to say "raises the question" then that's just the way it is. I've also accepted that "awesome" now can mean "oh, that's kinda nice."

    • by dbialac ( 320955 )
      This is not as significant an achievement as they claim. Moore's law has been in play over the ensuing years.
    • What a long-winded comment just to say something backhanded.

    • The important point is that internal bandwidth is essential for the chiplet idea to work. Intel, AMD, and ARM understand that. [engadget.com] Now Apple is on the bandwagon.

  • The article link doesn't automatically map to an HTTPS link. Am I the only one having this problem? Forcing the link to HTTPS in Firefox shows a domain certificate for pairsite.com (owned by pair.com).
  • M1 has better CPU performance scaling than super computers?

    What a load of absolute nonsense. That's for an extremely limited subset of "CPU performance scaling".

    For a single core (which for most super computers nowadays is a concept that doesn't really exist anymore) in combination with another one, I'm sure that's valid for some performance metrics.

    The article is... light on details, and heavy on buzzwords. He might have 25 years of experience, but doing benchmarking is obviously a hobby :D

    • Re: (Score:2, Insightful)

      You mean memory bandwidth limited workloads scale with increased caching and memory bandwidth?

      Who knew!

      CPUs have been memory bandwidth limited for over two decades now. It's like this guy is just cluing in.

      • CPUs have been memory bandwidth limited for over two decades now. It's like this guy is just cluing in.

        Or maybe, just maybe, he's trying to clue in the rest of the world that does not realize that, and compares an Intel laptop with the same amount of memory as an M1 laptop as if they were equivalent...

    • M1 has better CPU performance scaling than super computers?

      What a load of absolute nonsense.

      I concur.

      Take the A64FX, for example. Also an ARM chip, but it has superior bandwidth, using pricey HBM2 instead of LPDDR: 1TB per second (and also, didn't benchmarks show the M1's CPU can't remotely saturate its bus anyway...). Head to head, I'd put my money on a 48-core A64FX versus an M1 for CFD.

      The only vaguely relevant thing is it "only" has 68GB/s off board bandwidth to other machines over the networ

  • by 140Mandak262Jamuna ( 970587 ) on Saturday April 30, 2022 @02:51PM (#62492430) Journal
    First, let us get the basic facts about CFD benchmarks. His benchmark is most probably a time-marching code solving the Navier-Stokes equations (or the Euler equations, dropping the viscous terms). These are the most popular codes used for parallel processing. The CFD-parallel processing connection is deep and has a very rich history going back to nearly the beginning of parallel computing; more specifically, explicit time marching.

    Explicit time marching is simplicity itself and can be explained to anyone with a basic knowledge of physics. Divide the fluid domain into a large number of "control volumes". Satisfy the conservation of mass, momentum, and energy in each control volume by calculating the mass, momentum, and energy flux (the amount carried across the boundary of the control volume). As long as the time step is small enough that a sound wave from one "cell" cannot cross the width of that cell within the step, it is safe to use the values from the previous time step for the quantities from the "other" cell.

    So each "cell" is really totally independent of ALL other cells in the domain other than the ones that are immediately adjacent to it, for each time step.

    The conservation equations are so simple they fit into the tiny CPU power of GPUs calculating the OpenGL Z depth. Very little memory and CPU power is needed per cell. Extremely parallel and data independent. So the CS guys loved these equations and have been banging at them from day 1. I was hearing about "computation vs communication" bottlenecks back when I was in college, as long ago as 1979!

    Transputers, an early form of parallel computing hardware, go back to 1985! This "computation vs communication" trade-off has been researched for a long time, and the final conclusion is: for a given CFD problem (Large Eddy Simulation, flow past a NACA0012 airfoil, under-the-hood air flow of a car, a gas turbine, combusting flow, ...) and a given implementation, we can tune the computing nodes, memory, etc. and get very good scaling. But scaling of one problem usually does not translate into equally good scaling on other implementations or other problems.
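    For readers who want to see the structure the parent comment describes, here is a minimal 1-D sketch (not from the comment, and far simpler than a real Navier-Stokes solver): an explicit, neighbour-only finite-volume update for linear advection, with the time step limited by the condition mentioned above that information cannot cross a whole cell in one step.

        # Minimal illustration of explicit time marching on control volumes:
        # each cell is updated from its own value and its immediate neighbours only,
        # and the time step is kept small enough (CFL condition) that information
        # cannot cross more than one cell per step.
        import numpy as np

        def advect_upwind(u, dx, a, t_end, cfl=0.9):
            """Explicit upwind finite-volume update of du/dt + a*du/dx = 0, a > 0, periodic."""
            t = 0.0
            while t < t_end:
                dt = min(cfl * dx / a, t_end - t)              # CFL-limited time step
                flux = a * u                                   # flux leaving each cell to the right
                u = u - (dt / dx) * (flux - np.roll(flux, 1))  # update uses neighbours only
                t += dt
            return u

    Because every cell touches only its neighbours, the domain can be split across cores almost arbitrarily; the communication cost the comment mentions comes from exchanging those cell-boundary values between cores every time step.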

    • by AmiMoJo ( 196126 )

      Looking at his benchmark results there are a few issues. Firstly he doesn't mention how much RAM the Xeon had, but it would doubtless be DDR4 if it is from 2019. The M1 that Apple lent him has 128GB, so may well have more memory channels than the Xeon did.

      Oh right, Apple lent him the Mac. I'm sure they expected this carefully selected task would produce good numbers for them.

      The M1 also has very large caches relative to other CPUs, necessary because ARM performance is heavily dependent on memory bandwidth.

    • What you are describing are FVM (finite volume) methods. They are very important for industrial applications, but as you mention, FVM methods require very little communication, especially low spatial-order schemes. If he wanted to showcase how the M1 solved the communication problem, he should have tried a pseudospectral solver, which requires all-to-all communication (see the sketch after this comment). It should showcase much better results than the FVM case.

      For USD 7,000, I'd like to see how this stacks up to an EPYC node of similar pricing, which ac
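      A tiny single-node illustration (not from the comment) of why pseudospectral methods stress communication so much more: a spectral derivative is computed through a global FFT, so every output point depends on every input point. Once the field is distributed across ranks, that dependency becomes an all-to-all exchange rather than the neighbour-only halo swaps of an FVM stencil.

          # Spectral derivative of u(x) = sin(3x) on a periodic domain.
          # The FFT/IFFT pair operates on the entire field at once -- there is no
          # neighbour-only structure to exploit, which is why a distributed version
          # needs all-to-all communication (e.g. a global transpose).
          import numpy as np

          n, L = 256, 2.0 * np.pi
          x = np.linspace(0.0, L, n, endpoint=False)
          u = np.sin(3.0 * x)
          k = 2.0 * np.pi * np.fft.fftfreq(n, d=L / n)     # angular wavenumbers
          dudx = np.fft.ifft(1j * k * np.fft.fft(u)).real  # should approximate 3*cos(3x)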
    • A very nice explanation.
  • Most software engineers working on parallel computing know that you should overlap communication and computation, and that if you do it right, communication overhead becomes irrelevant.

    Apparently he's not doing it, and his code is just bad?
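    As a rough sketch of the overlap being described (assuming mpi4py and a 1-D domain decomposition; this is illustrative, not the benchmark's code): post non-blocking halo exchanges, do the interior work that needs no neighbour data while the messages are in flight, then finish the boundary cells once the halos arrive.

        # Overlapping communication and computation with non-blocking MPI.
        from mpi4py import MPI
        import numpy as np

        comm = MPI.COMM_WORLD
        rank, size = comm.Get_rank(), comm.Get_size()
        left, right = (rank - 1) % size, (rank + 1) % size

        u = np.random.rand(1_000_000)                    # this rank's slab of the field
        halo_l, halo_r = np.empty(1), np.empty(1)

        reqs = [comm.Isend(u[:1],  dest=left,  tag=0),   # start halo exchange...
                comm.Isend(u[-1:], dest=right, tag=1),
                comm.Irecv(halo_l, source=left,  tag=1),
                comm.Irecv(halo_r, source=right, tag=0)]

        interior = 0.5 * (u[2:] + u[:-2])                # ...and do interior work meanwhile
        MPI.Request.Waitall(reqs)                        # halos have arrived
        edge_l = 0.5 * (u[1] + halo_l[0])                # finish the boundary cells
        edge_r = 0.5 * (u[-2] + halo_r[0])

    Whether this actually hides the communication depends on the arithmetic intensity of the interior work, which is what the rest of this thread argues about.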

    • This is so different in different problem domains that I don't think the generalization is either useful or valid.

      But if you *can* overlap well, you absolutely should.

      (In others, you just have way more data that needs operating on than you have RAM or cores for, and if the CPU can keep up, there's not much you can do to make the problem better.)

    • Most software engineers working on parallel computing know that you should overlap communication and computation, and that if you do it right, communication overhead becomes irrelevant.

      That's not even close to what he's saying. He's saying that as parallel computing starts scaling up, communication starts becoming an issue. This is the main reason supercomputers use high-speed interconnects like InfiniBand between nodes instead of 1 Gigabit Ethernet.

      • You clearly did not understand what I wrote.

        I guess Slashdot commenters aren't what they used to be.

        • And you clearly do not understand his problem, based on what you wrote. Somehow, magically, software engineering is supposed to solve a fundamental problem that supercomputer scientists are still dealing with today.
          • There is nothing in what I wrote that is specific to his problem or that shows any misunderstanding.

            The guy claims the extra memory bandwidth made a huge difference. I'm arguing the code should be designed so that it runs I/O concurrently with compute, and such that each unit of CPU work is larger than each unit of I/O work. Then memory bandwidth becomes irrelevant, since the final runtime would be defined by CPU performance only.
            Now it's true that this is only possible if the arithmetic density of the compute is high enough, since otherwise I/O would take longer than whatever it is you do on the data.

            • The guy claims the extra memory bandwidth made a huge difference. I'm arguing the code should be designed so that it runs I/O concurrently with compute, and such that each unit of CPU work is larger than each unit of I/O work. Then memory bandwidth becomes irrelevant, since the final runtime would be defined by CPU performance only.

              That's as idiotic as saying the way to solve traffic congestion is to leave earlier for your destination. It does nothing to actually solve the bottleneck issue. The problem AGAIN is that with more and more cores working on parallel computations, communication between cores either on the die or outside of the die becomes a bottleneck. Part of that communication issue is using memory more effectively. Increasing the memory bandwidth helps. Speeding up the memory helps.

              Now it's true that this is only possible if the arithmetic density of the compute is high enough, since otherwise I/O would take longer than whatever it is you do on the data.

              You seem not to understand parallel comp

    • First, this is usually somewhere between hard and impossible for most useful physical problems. You can't send the data for the currently computed timestep until it is, well, computed. Sometimes you are lucky enough that you can send some partial computation data ahead while other things are being computed. Second, even when you can overlap them, unless the computation is longer than the communication time per step, you can't hide it completely anyway. The cases where it is "irrelevant" are few and far betw
      • You adjust the size of your timestep such that you can send the previous one while you compute the next one (of course you can have multiple steps that unlock once you have a given step done).
        There are even tools that can automatically deduce the right sizing by running benchmarks and auto-tuning themselves.

        This is not difficult, I worked in labs where we achieved 90%+ scaling on petascale computers with those kinds of wavefront problems.

        The issue is that often the physicists working on this aren't teaming

        • We will overlook the fact that CFD is by and large not the same as wavefront algorithms... ...and that to "compute the next one" you need the previous one from your neighbor, unless you are using some complex implicit scheme - which none of the fastest codes do.

          Both classes of algorithms have timesteps that are determined by the physics. You don't adjust them to optimize your communications.

          I am familiar with all the major public petascale codes of this sort, and several that are classified (there aren't

          • You can always adjust how much work you do per computing unit, it's at the core of all parallel algorithms.

            The fact that you need the adjacent cells to compute the next one is what makes it a wavefront problem, and isn't particularly unique.

            Just contact your local parallel computing lab to get help, there's lot of literature as well.

            • I _am_ at your local (actually national) computing lab, and I am where people go to get parallel programming help.

              You are oversimplifying what are complex problems, but if you want to name one of those codes you imply achieve this, I would be delighted to follow up. I may well have already contributed if it is really a petascale code - there aren't many.

              • Every single nation is but a local community on the world's stage.
                But anyway that level of self-importance signals you must be from the USA.

                I didn't personally work on testing our tooling on petascale problems; somebody else on the team did. Also, isn't exascale the challenge that people have moved on to nowadays?
                Myself I left academia and went for a different industry.

                • Well, although I wish it were otherwise, there are only two nations that play at the top level, the US and China, both of which should shortly announce exascale machines - but no, we aren't there yet. I try not to come across as arrogant, but I also don't want to see completely incorrect information get propagated. Only one of us above was making assertions they weren't qualified to make.
  • "It's what makes parallel CFD computations such a great real world benchmark"

    No, it doesn't. You don't have even HALF the access patterns any other piece of software has. Give me a fucking break.

    Benchmarks are nothing more than advertising at the highest level of falsity.

  • Yay! I think of all the good times and wild parties I had with computational fluid dynamics. And the benchmarks! Oh boy...
