[ Adding linux-arch, which is relevant but not very specific, and the arm64 and powerpc maintainers that are the more specific cases for an architecture where this might actually matter. See https://lore.kernel.org/all/CAGudoHHx0Nqg6DE70zAVA75eV-HXfWyhVMWZ-aSeOofkA_=WdA@xxxxxxxxxxxxxx/ for original full email, but it might be sufficiently clear just from this heavily cut-down context too ] Side note on your access() changes - if it turns out that you can remove all the cred games, we should possibly then revert my old commit d7852fbd0f04 ("access: avoid the RCU grace period for the temporary subjective credentials") which avoided the biggest issue with the unnecessary cred switching. I *think* access() is the only user of that special 'non_rcu' thing, but it is possible that the whole 'non_rcu' thing ends up mattering for cases where the cred actually does change because euid != uid (ie suid programs), so this would need a bit more effort to do performance testing on. On Thu, Jan 12, 2023 at 5:36 PM Mateusz Guzik <mjguzik@xxxxxxxxx> wrote: > > To my understanding on said architecture failed cmpxchg still grants you > exclusive access to the cacheline, making immediate retry preferable > when trying to inc/dec unless a certain value is found. I actually suspect that is _always_ the case - this is not like a contended spinlock where we want to pause because we're waiting for the value to change and become unlocked, this cmpxchg loop is likely always better off just retrying with the new value. That said, the "likely always better off" is purely about performance. So I have this suspicion that the reason Tony added the cpu_relax() was simply not about performance, but about other issues, like fairness in SMT situations. That said, evern from a fairness perspective the cpu_relax() sounds a bit odd and unlikely - we're literally yielding when we lost a race, so it hurts the _loser_, not the winner, and thus might make fairness worse too. I dunno. Tony may have some memory of what the issue was. > ... without numbers attached to it. Given the above linked thread it > looks like the arch this was targeting was itanium, not x86-64, but > the change landed for everyone. Yeah, if it was ia64-only, it's a non-issue these days. It's dead and in pure maintenance mode from a kernel perspective (if even that). > Later it was further augmented with: > commit 893a7d32e8e04ca4d6c882336b26ed660ca0a48d > Author: Jan Glauber <jan.glauber@xxxxxxxxx> > Date: Wed Jun 5 15:48:49 2019 +0200 > > lockref: Limit number of cmpxchg loop retries > [snip] > With the retry limit the performance of an open-close testcase > improved between 60-70% on ThunderX2. > > While the benchmark was specifically on ThunderX2, the change once more > was made for all archs. Actually, in that case I did ask for the test to be run on x86 hardware too, and exactly like you found: > I should note in my tests the retry limit was never reached fwiw. the max loop retry number just isn't an issue. It fundamentally only affects extremely unfair platforms, so it's arguably always the right thing to do. So it may be "ThunderX2 specific" in that that is where it was noticed, but I think we can safely just consider the max loop thing to be a generic safety net that hopefully simply never triggers in practice on any sane platform. > All that said, I think the thing to do here is to replace cpu_relax > with a dedicated arch-dependent macro, akin to the following: I would actually prefer just removing it entirely and see if somebody else hollers. You have the numbers to prove it hurts on real hardware, and I don't think we have any numbers to the contrary. So I think it's better to trust the numbers and remove it as a failure, than say "let's just remove it on x86-64 and leave everybody else with the potentially broken code" Because I do think that a cmpxchg loop that updates the value it compares and exchanges is fundamentally different from a "busy-loop, trying to read while locked", and with your numbers as ammunition, I think it's better to just remove that cpu_relax() entirely. Then other architectures can try to run their numbers, and only *if* it then turns out that they have a reason to do something else should we make this conditional and different on different architectures. Let's try to keep the code as common as possibly until we have hard evidence for special cases, in other words. Linus