A 20-Year-Old Chipset Workaround Has Been Hurting Modern AMD Linux Systems (phoronix.com)
AMD engineer K Prateek Nayak recently uncovered that a 20-year-old chipset workaround in the Linux kernel, still being applied to modern AMD systems, is responsible in some cases for hurting performance on modern Zen hardware. Fortunately, a fix is on the way to limit that workaround to old systems and in turn help performance on modern ones. Phoronix reports: Last week a patch was posted for the ACPI processor idle code to avoid an old chipset workaround on modern AMD Zen systems. Since ACPI support was added to the Linux kernel in 2002, there has been a "dummy wait op" to deal with some chipsets where STPCLK# doesn't get asserted in time. The dummy I/O read delays further instruction processing until the CPU is fully stopped. This was a problem with at least some AMD Athlon era systems with a VIA chipset... But not a problem with newer chipsets of roughly the past two decades.
With this workaround still being applied to even modern AMD systems, K Prateek Nayak discovered: "Sampling certain workloads with IBS on AMD Zen3 system shows that a significant amount of time is spent in the dummy op, which incorrectly gets accounted as C-State residency. A large C-State residency value can prime the cpuidle governor to recommend a deeper C-State during the subsequent idle instances, starting a vicious cycle, leading to performance degradation on workloads that rapidly switch between busy and idle phases. One such workload is tbench where a massive performance degradation can be observed during certain runs."
At least for Tbench, this long-time, unconditional workaround in the Linux kernel has been hurting AMD Ryzen / Threadripper / EPYC performance in select workloads. The workaround hasn't affected modern Intel systems, since those newer Intel platforms use the alternative MWAIT-based intel_idle driver code path instead. The AMD patch evolved into this patch by Intel Linux engineer Dave Hansen. That patch to limit the "dummy wait" workaround to old systems is already queued into TIP's x86/urgent branch. With it going the route of "x86/urgent" and fixing an overzealous workaround that isn't needed on modern hardware, it's likely this patch will still be submitted this week for the Linux 6.0 kernel rather than waiting for the next (v6.1) merge window.
Re:This won't affect me. (Score:5, Funny)
Thanks for clearing that up. Can I get your name, email and address so we can add you to the list of people unaffected?
Re: (Score:2)
Athlon started at 500MHz, P3 at 400.
150 MHz sounds like an overclocked K5, 166 MHz would be an original Pentium or a Pentium MMX.
Re: (Score:2)
No wait - these were the days of clocking down to make things work at all...oh wait maybe it was in the 90s...can't remember anymore ;-)
Re: (Score:2)
My first system back in the day was a K7 Slot A Athlon (the first Athlons were all slot-format processors rather than socketed; you slotted them in like you did expansion cards, except vertically rather than horizontally). I am quite certain that there was no way to arrange jumpers to get significantly below 500MHz on my motherboard back in the day, and I would wager that others wouldn't let you do it either. Back then frequency was set via two motherboard jumpers. First would determine the multiplier and second
Re: This won't affect me. (Score:2)
Ahh, you've just reminded me of the 486/Pentium era beige box cases that displayed the clock freq in LEDs on the front of the box.
And overclocking was as simple as toggling a switch or three on your mobo. No locked CPUs. 32MB RAM was beefy and there was no way known (until MPEG audio and video came along) of filling a 1.6GB HDD.
Re: (Score:2)
"and there was no way known (until MPEG audio and video came along) of filling a 1.6GB HDD." I'm sorry, what? There was a truism already back in the early 90's: when you work with graphics, music or video, there are three facts of life: you never have enough RAM, you'll always need more storage, and there's no such thing as "CPU is fast enough". And that is fairly true today too. As soon as you start going beyond simple editing, your system requirements increase rapidly.
Re: This won't affect me. (Score:2)
I can remember a colleague saying at the time, "How you gonna fill that drive?"
Yes, obviously we could have been running an NNTP server with 50,000 groups including all the binaries but I wasn't talking production or even workstation level computing. I gave the specs and use case and even mentioned a generic beige bo
Re: (Score:2)
Re: This won't affect me. (Score:1)
Sounds like the very beginning of the speculative execution nonsense, when Intel and AMD were letting developers access processor instructions like reads/writes to L1, L2 for their own hardware, software, proposed subprocessors.
Re: (Score:3, Interesting)
un-VIA-ble (Score:4, Insightful)
Re: (Score:2)
VIA had a very well earned HORRIBLE reputation with their motherboard chipsets back in the era between the Pentium II and Pentium III.
Re: (Score:2)
They were horrible well into the AthlonXP era, until they eventually fizzled out.
Re: (Score:2)
I think many people here made that mistake at some point in their life. I remember VIA made my following purchase an offer priced first party Intel reference motherboard, that's how much it soured my experience.
Re: (Score:2)
*over priced. Damn autocorrect.
Re: (Score:2)
Re: (Score:2)
Swipe typing on a phone most likely. Those two words would have a similar pattern.
Re: (Score:2)
A lot of VIA's problems (and AMD's by extension, since VIA made most of the chipsets back then, before AMD started making their own) were that they would get their clock rates by overclocking the PCI bus, which caused all kinds of shit to go sideways. Usually the fix for these systems was to get the PCI bus back into spec by clocking it at 33MHz, which got your AGP back to 66MHz since it was a hard 2x clock over PCI, but it meant underclocking the CPU from whatever the marketing guys sold you.
Now motherboards h
Re: (Score:2)
Intel PCI buses were also more reliable. If you owned a Creative Sound Blaster with a PCI interface, plugging it into an AMD or VIA chipset computer was a great way to c
Only Zen chips? (Score:3, Insightful)
There's been a lot of chipset releases between Zen and those old VIA chipsets... this is probably affecting a lot of other hardware in the wild unnecessarily too, isn't it?
yes and no (Score:2)
It's my understanding it does affect the in-between.
Zen is 5 years old. Are there a LOT of AMD CPUs running heavy workloads that are more than 5 years old? I guess it depends on your definition of "a lot".
Backport to older kernels? (Score:5, Interesting)
I hope this gets backported to older, but still supported, 4 and 5 kernels.
Also, someone please fix the SATA NCQ brokenness with EPYC and Samsung SSDs...
Re:Backport to older kernels? (Score:4, Interesting)
It'd likely be up to the maintainers of older kernels to decide if they want to port it or not. If the performance impact is significant, there's a good chance Red Hat etc. will pull in the patch anyway.
Also, upgrade your kernels :)
re the SATA thing, have you, or someone else, got a bug report into the SATA maintainer, whoever that is?
Re: (Score:2)
Having seen the drama from people upgrading from 4.x to 5.x kernels and finding older hardware no longer supported or suddenly broken, I'll stick with the 4.x kernel for as long as I can on my older hardware.
The EPYC/Samsung thing has been in limbo since about 2018, and neither AMD nor Samsung wants to take responsibility/fix it. Turning off NCQ altogether is the only thing that makes it go away, but it makes the drive much slower.
Re:Backport to older kernels? (Score:4, Interesting)
I know from experience the SATA NCQ Samsung SSD issue is not limited to EPYC; it occurs even on old hardware such as an AMD Athlon(tm) II X2 250 processor on a motherboard with a Marvell 9215 SATA controller.
Which in my opinion rules out the host hardware and points to a problem with the SATA driver and/or Samsung SSD controllers.
Re: (Score:2)
Gosh, I might even do this by hand if Debian doesn't. It's literally just:
processor_idle.c:
- if (boot_cpu_has(X86_FEATURE_HYPERVISOR))
+ if (boot_cpu_has(X86_FEATURE_HYPERVISOR) || boot_cpu_has(X86_FEATURE_ZEN))
The energy reduction across every Zen Debian machine could be huge.
Re:another worthless contribution to global warmin (Score:5, Funny)
And then you decided to make it worse by typing out this comment and hitting send. Have you no shame?
News For Nerds, Indeed (Score:2, Funny)
Can I say this is a return to core values?
Re: (Score:2)
An anecdote is not data. I wouldn't celebrate yet.
not a programmer (Score:2)
but why don't they just not call it when it's not needed? Can't they detect what processor and chipset the computer is running, and if it doesn't need the dummy call, not call it? And if it does, then call it. Is this somehow possible without compiling
Re: (Score:1)
Yeah, it's probably a simple check they didn't bother making specific enough, but I suspect it's the type of thing that would be a one-line patch that requires a full kernel rebuild.
observation (Score:2)
"Sampling certain workloads with IBS on AMD Zen3 system shows that a significant amount of time is spent in the dummy op, which incorrectly gets accounted as C-State residency. A large C-State residency value can prime the cpuidle governor to recommend a deeper C-State during the subsequent idle instances, starting a vicious cycle, leading to performance degradation on workloads that rapidly switch between busy and idle phases."
Doesn't the idle system seem overly complex?
Re: (Score:3)
You're saying that the idle system is particularly busy?
Kudos (Score:5, Informative)
to K Prateek Nayak, the AMD engineer who discovered the problem. This type of issue is incredibly hard to troubleshoot, particularly since it can hardly be replicated by simulation; you need to test it live on the actual hardware.
Re: (Score:1)
The (mis)behavior isn't caused by hardware, it's caused 100% by software.
Re: (Score:3)
This particular issue is pretty much impossible to replicate by simulation, since the behavior is a combination of software response and hardware response to the faulty software.
Re: (Score:2)
There is no hardware response.
That's why the issue is so easily fixable.
The C-state idle heuristics are chosen by the idle driver in use on AMD CPUs.
That is *not* automatic behavior by the CPU.
The C state heuristics in the idle driver were being misled by the imposed wait.
In short, you have no idea what you're talking about.
Re: (Score:2)
On ACPI systems, the idle state is controlled thusly:
The kernel asks the firmware (ACPI) to put the processor in a particular idle state, based on its polling of the time in each idle state.
The processor_idle driver (the basic generic x86 ACPI idle driver) had a wasted ACPI call designed to was
Re: (Score:2)
So sure, from a certain point of view, you're right- it required hardware.
Which was exactly what I claimed. It was not possible to emulate ones way to this issue (by which I do not mean running a VM, but an emulator). And boy, did that cause your cheerios to get a pissy taste. Glad I could affect your day to that extent.
Re: (Score:2)
This particular issue is pretty much impossible to replicate by simulation
That is patently false.
I have explained how.
If you disagree with that, you are wrong.
It was not possible to emulate ones way to this issue (by which I do not mean running a VM, but an emulator).
Flat ass backwards.
This behavior exists in any ACPI firmware you pair with qemu (because they behave correctly).
This issue is not dependent on hardware. It is a misbehavior of the processor_idle driver paired with the ACPI firmware.
This behavior does not exist in a VM, but only because the kernel purposefully skips the extraneous call if it's running with a supported hypervisor.
And boy, did that cause your cheerios to get a pissy taste. Glad I could affect your day to that extent.
Being wrong sucks. I get it. But
Re: (Score:2)
And yes, it is pretty much impossible to replicate. It took careful troubleshooting of specific benchmarks to even notice it was happening, and then a lot of deep diving with profiling to locate it.
Sure, once you know exactly what is happening, all you need is the code and running it through your head to understand what is going on. But that's not even remotely a feasible way to find out a problem like this even exists, much less pinpoint it.
Your argument is basically "when you know what is happening, the f
Re: (Score:2)
Your argument is basically "when you know what is happening, the fault is trivial to replicate", which is trivially correct, I'll give you that.
Correct.
I argued merely that it's easy to simulate, not that it was easy to run into.
Finding the culprit is also not that difficult.
Querying the C state residency immediately points you in the right direction.
The numbers are nonsense in "vicious cycle mode".
This points very directly at the idle driver, since it's responsible for selecting the C states.
Figuring out what is *wrong* with the idle driver is standard debugging, and that's where simulation comes in. This can be easily simulated with qemu,
Improvement (Score:5, Informative)
Fixed by an Intel engineer? (Score:3)
Well now we know the name of an Intel engineer that got yelled at today. ;)