Mac Studio's M1 Ultra Chip Outperforms on Computational Fluid Dynamics Benchmarks (hrtapps.com)
Dr. Craig Hunter is a mechanical/aerospace engineer with over 25 years of experience in software development. And now Dixie_Flatline (Slashdot reader #5,077) describes Hunter's latest experiment:
Craig Hunter has been running Computational Fluid Dynamics (CFD) benchmarks on Macs for years--he has results going back to 2010 with an Intel Xeon 5650, with the most recent being a 28-core Xeon W from 2019. He has this to say about why he thinks CFD benchmarks are a good test: "As shown above, we see a pretty typical trend where machines get less and less efficient as more and more cores join the computation. This happens because the computational work begins to saturate communications on the system as data and MPI instructions pass between the cores and memory, creating overhead. It's what makes parallel CFD computations such a great real world benchmark. Unlike simpler benchmarks that tend to make CPUs look good, the CFD benchmark stresses the entire system and shows us how things hold up as conditions become more and more challenging."
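The "less and less efficient as more cores join" trend in that quote is usually measured as parallel efficiency: speedup over the single-core run, divided by the core count. A minimal sketch of that arithmetic follows. The function name and the timings are made up for illustration and are not taken from Hunter's results.

```python
def parallel_efficiency(timings):
    """Return {cores: (speedup, efficiency)} relative to the 1-core run.

    timings maps a core count to a wall-clock time in seconds.
    """
    t1 = timings[1]
    results = {}
    for cores, t in sorted(timings.items()):
        speedup = t1 / t                      # how much faster than one core
        results[cores] = (speedup, speedup / cores)  # 1.0 would be perfect scaling
    return results


# Hypothetical wall-clock times in seconds, NOT from the article.
example_timings = {1: 100.0, 2: 52.0, 4: 28.0, 8: 16.0}

for cores, (speedup, eff) in parallel_efficiency(example_timings).items():
    print(f"{cores:2d} cores: speedup {speedup:5.2f}x, efficiency {eff:6.1%}")
```

With numbers like these, efficiency slides downward as cores are added, which is the overhead from communication and memory traffic that the quote describes.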
With just 6 cores, the Mac Studio's M1 Ultra surpasses the 2019 Xeon before literally going off the original chart. He had to double the x-axis just to fit the M1's performance in. Unsurprisingly, he seems impressed:
"We know from Apple's specs and marketing materials that the M1 Ultra has an extremely high 800 GB/sec memory bandwidth and an even faster 2.5 TB/sec interface between the two M1 Max chips that make up the M1 Ultra, and it shows in the CFD benchmark. This leads to a level of CPU performance scaling that I don't even see on supercomputers."
Re: (Score:2, Insightful)
Re: Test it for what you need it for (Score:4, Funny)
a few hours.
For pr0n? That doesn't sound right. 30 mins max should be enough for anyone.
Re: (Score:1)
I seem to recall a stat from my work in digital television (before internet video was so common) that the average length of a hotel porn movie rental -- that is, how long someone actually watched it -- was nine minutes.
Re: (Score:2)
real world applications.
pr0n is good.
i do not need an m 1 to watch pr0n.
i do need something better for.
call of duty.
apple knows this.
they just do not know how to solve this
Re: (Score:1)
There is a little blue pill to assist with that.
Or
From a skit on SNL a few years (or decades) ago. Dr. Porkenheimer's Boner Juice
https://www.nbc.com/saturday-n... [nbc.com]
"If you experience an erection lasting longer than twenty-four hours, call up your friends and brag about it."
Re: (Score:2)
So basically, you need a laptop with the most PPS*.
* porns per second.
yes (Score:2)
or...
"It's what makes parallel CFD computations such a great real world benchmark. Unlike simpler benchmarks that tend to make CPUs look good, the CFD benchmark stresses the entire system and shows us how things hold up as conditions become more and more challenging."
It's what makes parallel CFD computations such a great real world benchmark IF THAT'S WHAT YOUR APPLICATION DOES. Otherwise, simpler benchmarks may be more indicative.
Also, this Apple solution may be good if you need to do CFD ON A DESKTOP. I
Re:yes (Score:5, Interesting)
It's kind of hard to buy what he's saying about supercomputers, because he's only claiming 800GB/sec memory bandwidth while modern compute GPUs deliver twice that or more, and you can do CFD on them. Are people who really need maximum performance still even doing CFD on CPUs? I'm seeing claims that single high-end compute-specific GPUs with HBM are handily outperforming CPUs with many cores, or even whole clusters of PCs doing CPU-based CFD. For example, Performance comparison of CFD-DEM solver MFiX-Exa, on GPUs and CPUs [arxiv.org] , e.g.
Or perhaps GPU Acceleration of CFD Algorithm: HSMAC and SIMPLE [sciencedirect.com] ...
It seems like CFD on CPU is irrelevant ATM. But maybe someone who does it for a living will tell me why I'm wrong :)
Re:yes (Score:4, Interesting)
It seems like CFD on CPU is irrelevant ATM.
Fujitsu might like a word with you.
On the other hand those aren't exactly run of the mill CPUs. Very wide vector units, HBM2 memory and 68GB/s of networking.
Re:yes (Score:5, Informative)
I have begun doing these on my MacBook Pro (M1 Max, 10 core (8P+2E), 64GB) just to see how it does.
CPU wise, the Max is pretty fucking good. It lays pretty solid waste to my desktop counterparts.
GPU is another story, though.
I just upgraded from a 2080 Ti to a 3090 Ti (Because it's the first fucking graphics card I've seen for sale at MSRP in years, now)
Still waiting on a power supply beefy enough to run it.
My 2080Ti outperformed my M1 Max in computational tasks by around 400%. I expect the 3090 Ti will approximately double to triple that.
The author describes the memory bandwidth, which I find interesting, because the Max can't use more than about half of its 400GB/s bandwidth (for CPU tasks, anyway). I suspect the Ultra exactly doubles that, meaning you get nowhere near 800GB/s on CPU tasks.
The GPU on my Max is capable of moving around 330GB/s, so I expect the Ultra, with properly NUMA-aware data sets, can at least double that. Which is fantastic, except it's nothing compared to a good GPU.
Ultimately, I've written my M1 Max off for computation as a neat toy that performs leagues past any other laptop, but is still woefully deficient when compared against a Desktop.
Re: (Score:1)
For high speed GPU compute, it's only comparable on a performance per watt metric. It gets the same or better as the 30xx Nvidia chips but Nvidia still has more compute hardware available so it wins in total speed. You just also need a better cooling system and power system.
Re: (Score:2)
For high speed GPU compute, it's only comparable on a performance per watt metric. It gets the same or better as the 30xx Nvidia chips
Same, better, and worse, depending on the benchmark. It's fair to say that architecturally, they're similar. There's a caveat though, and I'll get to that at the end.
It gets the same or better as the 30xx Nvidia chips but Nvidia still has more compute hardware available so it wins in total speed.
Yup.
You just also need a better cooling system and power system.
Absolutely. There's another thing to take into consideration.
The NV cards are matching the M1 Max GPU in perf per watt, despite being on a significantly less efficient process node (Samsung 8nm, which is comparable to, though slightly better than, TSMC 10nm, but barely over half of TSMC 7nm).
A 5nm NV is going to walk the M1 GPU like a dog in
Re: (Score:3)
Re: (Score:2)
Macs traditionally aren't known as scientific workstations nor is Apple likely to invest in that area.
I don't know about that; I know a lot of scientists who prefer having a Mac on their desk to work on. Granted, they may be using the Mac largely as a GUI front-end to programs whose computations are performed on larger machines elsewhere, but isn't that what a workstation is? The station that you work at?
Re: (Score:3)
The 5800X3D is a neat gambit, but it doesn't always pan out.
If you can truly utilize all the cache, then its performance is stellar. If you can't, then it's sub-par, and you're better off getting a 12900K.
So far, it's hit and miss. Of course it does give you a target to optimize for. But it's by no means "the best" in an unqualified sense.
Re: (Score:3)
Test it for what you need it for
Wise words. However, this raises the question as to why nobody on /. has created benchmarks for shitposting. ;)
Re: (Score:2)
... this raises the question ....
Thank you for not writing "this begs the question".
Re: (Score:1)
Thank you for not writing "this begs the question".
I've given up that fight. If most people decide that "begs the question" is another way to say "raises the question" then that's just the way it is. I've also accepted that "awesome" now can mean "oh, that's kinda nice."
Re: (Score:2)
Re: (Score:3)
What a long-winded comment just to say something backhanded.
Re: (Score:2)
The important point is that internal bandwidth is important for chiplets as an idea to work. Intel, AMD, ARM understand that. [engadget.com] Now Apple is on the bandwagon.
OT: Link doesn't redirect to https (Score:2)
Re: (Score:2)
Similar thing here with Chrome if I try to force https on the URL.
Re: (Score:3)
CPU performance scaling? (Score:2)
M1 has better CPU performance scaling than super computers?
What a load of absolute nonsense. That's for an extremely limited subset of "CPU performance scaling".
For a single core (which for most super computers nowadays is a concept that doesn't really exist anymore) in combination with another one, I'm sure that's valid for some performance metrics.
The article is... light on details, and heavy on buzzwords. He might have 25 years of experience, but doing benchmarking is obviously a hobby :D
Re: (Score:2, Insightful)
Who knew!
CPUs have been memory bandwidth limited for over two decades now. It's like this guy is just cluing in.
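One common way to put numbers on "memory bandwidth limited" is the roofline model: attainable throughput is the lower of the compute peak and bandwidth times arithmetic intensity (flops per byte moved). A rough sketch follows. The 800 GB/s figure is Apple's number as quoted in the summary; the peak GFLOP/s value and the function name are hypothetical placeholders, not measured or published figures.

```python
def attainable_gflops(peak_gflops, bandwidth_gb_s, flops_per_byte):
    """Roofline estimate: the lower of the compute roof and the bandwidth roof."""
    return min(peak_gflops, bandwidth_gb_s * flops_per_byte)


BANDWIDTH_GB_S = 800.0  # Apple's quoted M1 Ultra figure, per the summary
PEAK_GFLOPS = 1000.0    # hypothetical round number, NOT a real spec

for intensity in (0.25, 1.0, 4.0):
    est = attainable_gflops(PEAK_GFLOPS, BANDWIDTH_GB_S, intensity)
    bound = "bandwidth" if est < PEAK_GFLOPS else "compute"
    print(f"{intensity:4.2f} flop/byte -> {est:6.1f} GFLOP/s ({bound}-bound)")
```

Under these assumed numbers, a kernel doing fewer flops per byte than the ratio of peak to bandwidth sits on the bandwidth roof, which is where extra memory bandwidth shows up directly in the result.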
Other way around (Score:2)
CPUs have been memory bandwidth limited for over two decades now. It's like this guy is just cluing in.
Or maybe, just maybe he's trying to clue in the rest of the world that does not realize that, and compares an intel laptop with an equal amount of memory as an M1 laptop as being the same...
Re: Other way around (Score:2)
Re: (Score:3)
Since I'm not gracing this idiot with a page hit, I'm going to guess his other Intel CPUs either had two or four channels, with all of them having a lower bus rate and transaction throughput.
You don't have to guess and could look up the Intel specs. The only 28 core Intel Xeon from 2019 [intel.com] has six channels. I believe 6 channels was the maximum Intel used at the time. So you guessed wrong.
Re: (Score:2)
M1 has better CPU performance scaling than super computers?
What a load of absolute nonsense.
I concur.
Take the A64FX for example. Also an ARM chip, but it has superior bandwidth using the pricey HBM2 instead of LPDDR: 1TB per second (and also, didn't benchmarks show the CPU can't remotely saturate the bus for the M1 anyway...). Head to head, I'd put my money on a 48 core A64FX CPU versus an M1 for CFD.
The only vaguely relevant thing is it "only" has 68GB/s off board bandwidth to other machines over the networ
It shows how *His* code scales. (Score:4, Informative)
The explicit time marching is simplicity itself and can be explained to anyone with basic knowledge of physics. Divide the fluid domain into a large number of "control volumes". Satisfy the conservation of mass, momentum and energy in each control volume by calculating the mass, momentum and energy flux (the amount carried across the boundary of the control volume). As long as the time step is so small that the sound wave from one "cell" does not cross the width of that cell, it is safe to use the values from the previous time step for the quantities from the "other" cell.
So each "cell" is really totally independent of ALL other cells in the domain other than the ones that are immediately adjacent to it, for each time step.
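To make that concrete, here is a stripped-down sketch of explicit time marching. It is a simplification, not the scheme from Hunter's benchmark: one conserved quantity instead of mass, momentum and energy, a 1D periodic domain, a constant signal speed standing in for the sound wave, and first-order upwind fluxes. All names are made up for illustration.

```python
def explicit_step(u, speed, dx, dt):
    """Advance cell averages u by one explicit finite-volume step.

    Each cell changes only by the flux through its two faces, and the fluxes
    are built from the previous time step's values (first-order upwind,
    speed > 0, periodic boundaries).
    """
    n = len(u)
    # flux[i] is the flux through the left face of cell i, carried in from
    # cell i-1 (index -1 wraps around, giving the periodic boundary).
    flux = [speed * u[i - 1] for i in range(n)]
    # The right face of cell i is the left face of cell i+1.
    return [u[i] - dt / dx * (flux[(i + 1) % n] - flux[i]) for i in range(n)]


def march(u, speed, dx, cfl, steps):
    """Take `steps` explicit steps with the time step fixed by the CFL number."""
    if not 0.0 < cfl <= 1.0:
        raise ValueError("CFL number must be in (0, 1] for this scheme")
    dt = cfl * dx / speed  # the signal crosses at most `cfl` of a cell per step
    for _ in range(steps):
        u = explicit_step(u, speed, dx, dt)
    return u


n_cells = 50
dx = 1.0 / n_cells
initial = [1.0 if 10 <= i < 20 else 0.0 for i in range(n_cells)]
final = march(initial, speed=1.0, dx=dx, cfl=0.5, steps=40)
print(f"total before: {sum(initial) * dx:.6f}, after: {sum(final) * dx:.6f}")
```

Every cell update reads only that cell and its immediate neighbor from the old step, so the loop over cells can be split across cores freely; only the cells on a partition boundary need data from another core, and that exchange is where the communication cost comes from.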
The conservation equations are so simple they fit into the tiny cpu power of GPUs calculating the OpenGL Z depth. Very little memory and cpu power needed per cell. Extremely parallel and data independent. So the CS guys loved this equation and have been banging at it from day 1. I have listened to "computation vs communication" bottlenecks back when I was in college so long ago, 1979!
An early form of parallel computing hardware, the Transputer, goes back to 1985! This "computation vs communication" trade-off has been researched for a long time, and the final conclusion is this: for a given CFD problem (Large Eddy Simulation, flow past a NACA0012 airfoil, under-the-hood air flow of a car, a gas turbine, combusting flow, ...) and a given implementation, we can tune the computing nodes, memory, etc. and get very good scaling. But good scaling on one problem usually does not translate into equally good scaling on other implementations or other problems.
Re: (Score:2)
Looking at his benchmark results there are a few issues. Firstly he doesn't mention how much RAM the Xeon had, but it would doubtless be DDR4 if it is from 2019. The M1 that Apple lent him has 128GB, so may well have more memory channels than the Xeon did.
Oh right, Apple lent him the Mac. I'm sure they expected this carefully selected task would produce good numbers for them.
The M1 also has very large caches relative to other CPUs, necessary because ARM performance is heavily dependent on memory bandwidth.
Re: (Score:1)
For USD 7,000, I'd like to see how this stacks up to a EPYC node of similar pricing, which ac
Re: It shows how *His* code scales. (Score:2)
Overlap communication and computation (Score:2)
Most software engineers working on parallel computing know that you should overlap communication and computation, and that if you do it right, communication overhead becomes irrelevant.
Apparently he's not doing it, and his code is just bad?
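For readers unfamiliar with the idea, overlapping means the result of step N is sent in the background while step N+1 is being computed. The toy sketch below only illustrates the pattern: `compute` and `send` are hypothetical stand-ins that sleep instead of doing real work, and a background thread plays the role that non-blocking MPI calls such as MPI_Isend would play in a real solver. It says nothing about how Hunter's code is actually written.

```python
from concurrent.futures import ThreadPoolExecutor
import time


def compute(step):
    """Stand-in for the numerical work of one step (hypothetical)."""
    time.sleep(0.01)
    return step * step


def send(result, outbox):
    """Stand-in for sending one result to a neighbor (hypothetical)."""
    time.sleep(0.01)
    outbox.append(result)


def run_serial(n_steps):
    """Compute, then send, then compute again: no overlap."""
    outbox = []
    for step in range(n_steps):
        send(compute(step), outbox)
    return outbox


def run_overlapped(n_steps):
    """Send step N in a background thread while step N+1 is computed."""
    outbox = []
    with ThreadPoolExecutor(max_workers=1) as pool:
        pending = None
        for step in range(n_steps):
            result = compute(step)   # runs while the previous send is in flight
            if pending is not None:
                pending.result()     # previous send must finish before the next
            pending = pool.submit(send, result, outbox)
        if pending is not None:
            pending.result()         # drain the last send
    return outbox


for runner in (run_serial, run_overlapped):
    start = time.perf_counter()
    runner(20)
    print(f"{runner.__name__}: {time.perf_counter() - start:.2f} s")
```

In this toy setup the overlapped loop costs roughly the larger of compute time and send time per step rather than their sum. The send only disappears from the total when it is no longer than the compute it hides behind, which is the arithmetic-density caveat raised further down the thread.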
Re: (Score:1)
This is so different in different problem domains that I don't think the generalization is either useful or valid.
But if you *can* overlap well, you absolutely should.
(In others, you just have way more data that needs operating on than RAM or core, and if the CPU can keep up, there's not much you can do to make the problem better.)
Re: (Score:1)
(I meant cache, not core, obviously.)
Re: (Score:2)
Most software engineers working on parallel computing know that you should overlap communication and computation, and that if you do it right, communication overhead becomes irrelevant.
That's not even close to what he's saying. He's saying when parallel computing starts scaling up, communications start becoming an issue. This is the main reason supercomputers use high-speed interconnects like InfiniBand between chips instead of 1 Gigabit Ethernet.
Re: Overlap communication and computation (Score:2)
You clearly did not understand what I wrote.
I guess Slashdot commenters aren't what they used to be.
Re: (Score:2)
Re: Overlap communication and computation (Score:2)
There is nothing in what I wrote that is specific to his problem or that shows any misunderstanding.
The guy claims the extra memory bandwidth made a huge difference. I'm arguing the code should be designed so that it runs I/O concurrently, and designed such that each CPU work is larger than each I/O work. Then memory bandwidth becomes irrelevant since the final runtime would be defined by CPU performance only.
Now it's true that this is only possible if the arithmetic density of the compute is high enough, s
Re: (Score:2)
The guy claims the extra memory bandwidth made a huge difference. I'm arguing the code should be designed so that it runs I/O concurrently, and designed such that each CPU work is larger than each I/O work. Then memory bandwidth becomes irrelevant since the final runtime would be defined by CPU performance only.
That's as idiotic as saying the way to solve traffic congestion is to leave earlier for your destination. It does nothing to actually solve the bottleneck issue. The problem AGAIN is that with more and more cores working on parallel computations, communication between cores either on the die or outside of the die becomes a bottleneck. Part of that communication issue is using memory more effectively. Increasing the memory bandwidth helps. Speeding up the memory helps.
Now it's true that this is only possible if the arithmetic density of the compute is high enough, since otherwise I/O would take longer than whatever it is you do on the data.
You seem not to understand parallel comp
Re: Overlap communication and computation (Score:2)
This is nonsense. The approach offers scaling in the 90+% range on thousands of cores on supercomputers.
I was a research engineer in academia working on parallelizing such problems. Just read the publications.
Re: (Score:1)
Re: Overlap communication and computation (Score:2)
You adjust the size of your timestep such that you can send the previous one while you compute the next one (of course you can have multiple steps that unlock once you have a given step done).
There are even tools that can automatically deduce the right sizing by running benchmarks and auto-tuning themselves.
This is not difficult, I worked in labs where we achieved 90%+ scaling on petascale computers with those kinds of wavefront problems.
The issue is that often the physicists working on this aren't teaming
Re: (Score:1)
We will overlook the fact that CFD is by and large not the same as wavefront algorithms... and that to "compute the next one" you need the previous one from your neighbor, unless you are using some complex implicit scheme - which none of the fastest codes do.
Both classes of algorithms have timesteps that are determined by the physics. You don't adjust them to optimize your communications.
I am familiar with all the major public petascale codes of this sort, and several that are classified (there aren't
Re: Overlap communication and computation (Score:2)
You can always adjust how much work you do per computing unit, it's at the core of all parallel algorithms.
The fact that you need the adjacent cells to compute the next one is what makes it a wavefront problem, and isn't particularly unique.
Just contact your local parallel computing lab to get help, there's a lot of literature as well.
Re: (Score:1)
I _am_ at your local (actually national) computing lab, and I am where people go to get parallel programming help.
You are oversimplifying what are complex problems, but if you want to name one of those codes you imply achieve this, I would be delighted to follow up. I may well have already contributed if it is really a petascale code - there aren't many.
Re: Overlap communication and computation (Score:2)
Every single nation is but a local community on the world's stage.
But anyway that level of self-importance signals you must be from the USA.
I didn't personally work on testing our tooling on petascale problems, somebody else in the team did. Also isn't exascale the challenge that people have moved to nowadays?
Myself I left academia and went for a different industry.
Re: (Score:1)
Benchmarks are fucking gamed (Score:2)
"It's what makes parallel CFD computations such a great real world benchmark"
No, it doesn't. You don't have even HALF the access patterns any other piece of software has. Give me a fucking break.
Benchmarks are nothing more than advertising at the highest level of falsity.
Woohoo! (Score:2)
Re: (Score:2)
Did you make your user name just for this post?
Re: (Score:2)
Re: (Score:2)
Hehe :P