Hello, [cc is probably woefully incomplete, just grabbed people from lockref history; should this land on a x86 list instead of vfs?] I intended to send a patch which fixes cred-related bottleneck in access(), and while getting the expected win for calls with different files, I got a *slowdown* when benchmarking against the same file and according to perf top it was all lockref. I'm going to post it after this issue is resolved, interested parties can take a peek here: https://dpaste.com/8SVDF8HJH . The problem is visible with open3 test ("Same file open/close") from will-it-scale. I ran the _processes variant against stock + no-pause kernel on Cascade Lake (2 sockets * 24 cores * 2 threads) running 6.2-rc3. Results are: proc stock no-pause 1 805603 814942 2 1054980 1054781 8 1544802 1822858 24 1191064 2199665 48 851582 1469860 96 609481 1427170 As you can see degradation already shows up at ~8 workers. While trying to do my homework regarding history in the area I found this thread: https://lkml.iu.edu/hypermail/linux/kernel/1309.0/02330.html It mentions a stat-based test, which I presume was multithreaded stat on the same dentry. will-it-scale somehow does not have stat benches, so I posted a PR to add some: https://github.com/antonblanchard/will-it-scale/pull/35/files With fstat2_processes (Same file fstat) I get: proc stock no-pause 1 3013872 3047636 2 4284687 4400421 8 3257721 5530156 24 2239819 5466127 48 1701072 5256609 96 1269157 6649326 To my understanding on said architecture failed cmpxchg still grants you exclusive access to the cacheline, making immediate retry preferable when trying to inc/dec unless a certain value is found. By doing pause instead one not only induces a delay, but also increases likelihood that the line will have to be grabbed E again. Something to that extent was even stated in thread and it definitely lines up with results above. I see pause first shoed up first here: commit d472d9d98b463dd7a04f2bcdeafe4261686ce6ab Author: Tony Luck <tony.luck@xxxxxxxxx> Date: Tue Sep 3 14:49:49 2013 -0700 lockref: Relax in cmpxchg loop ... without numbers attached to it. Given the above linked thread it looks like the arch this was targeting was itanium, not x86-64, but the change landed for everyone. Later it was further augmented with: commit 893a7d32e8e04ca4d6c882336b26ed660ca0a48d Author: Jan Glauber <jan.glauber@xxxxxxxxx> Date: Wed Jun 5 15:48:49 2019 +0200 lockref: Limit number of cmpxchg loop retries [snip] With the retry limit the performance of an open-close testcase improved between 60-70% on ThunderX2. While the benchmark was specifically on ThunderX2, the change once more was made for all archs. I should note in my tests the retry limit was never reached fwiw. That aside, the open-close testcase mentioned should match open3. All that said, I think the thing to do here is to replace cpu_relax with a dedicated arch-dependent macro, akin to the following: diff --git a/lib/lockref.c b/lib/lockref.c index 45e93ece8ba0..e057e1630e7c 100644 --- a/lib/lockref.c +++ b/lib/lockref.c @@ -2,6 +2,10 @@ #include <linux/export.h> #include <linux/lockref.h> +#ifndef arch_cpu_relax_cmpxchg_loop +#define arch_cpu_relax_cmpxchg_loop() cpu_relax() +#endif + #if USE_CMPXCHG_LOCKREF /* @@ -23,7 +27,7 @@ } \ if (!--retry) \ break; \ - cpu_relax(); \ + arch_cpu_relax_cmpxchg_loop(); \ } \ } while (0) Then x86-64 would simply: +#define arch_cpu_relax_cmpxchg_loop do { } while (0) Not an actual patch and I don't care about the name, just illustrating what I mean. I have to note there are probably numerous other cmpxchg loops without the pause/fallback treatment, quick grep reveals one in refcount_dec_not_one, if the fallback and/or cpu_relax thing is indeed desirable the other loops should probably get augmented to have it. Any comments? If you agree with the idea I'll submit a proper patch. -- Mateusz Guzik <mjguzik gmail.com>