On Thu, Jun 27, 2024 at 10:41:35AM +0800, kernel test robot wrote:
>
> Hello,
>
> kernel test robot noticed a -33.7% regression of unixbench.throughput on:
>
> commit: d042dae6ad74df8a00ee8a3c6b77b53bc9e32f64 ("lockref: speculatively spin waiting for the lock to be released")
> https://git.kernel.org/cgit/linux/kernel/git/next/linux-next.git master
[..]
> | testcase: change | stress-ng: stress-ng.getdent.ops_per_sec -56.5% regression |
[..]
> | testcase: change | stress-ng: stress-ng.handle.ops_per_sec -60.2% regression |
> | test machine | 64 threads 2 sockets Intel(R) Xeon(R) Gold 6346 CPU @ 3.10GHz (Ice Lake) with 256G memory |
[..]

The good and bad news is that this particular patch should be dropped
(but the struct dentry reordering should stay).

There is some gnarly stuff here, please read the entire thing, even
though it may seem rambly at times.

To recap: there can be a microbenchmark setting with continuous get/put
traffic where only sporadically someone takes the lock. The constant
traffic paired with the immediate fallback to locking makes the
primitives degrade to locked operation until the benchmark is concluded.
This is the problem the patch tried to address: with the dentry
reordering I found myself running into the degraded state by accident a
lot. There was worry the lkp machinery would run into it too,
disfiguring their results.

However, another microbenchmark setting can mix explicit spinlock
acquire with lockref get/put. With a high enough worker count the lock
can be continuously held, making any spinning waiting for it to be
released actively detrimental to performance. This is what the lkp
folks ran into.

So I got a 2 socket * 48 cores * 2 threads machine to play with.

Both the unixbench and stress-ng --getdent testcases are heavily mixed
with read-write locking and mutexes, while stress-ng --handle is all
about spinlocks and 1/3rd of it is lockref, thus I only benchmarked the
last one.

At a *low* scale of 24 workers on my usual test box performance improves
immensely (from ~165k ops/s to ~620k ops/s) as locking is avoided plenty
of times.

At 192 workers on the big box the stock kernel achieves ~134k ops/s,
while spinwait drops it to ~42k. As an experiment I added a dedicated
waiting loop with a configurable spin count. Doing merely one spin drops
the rate to ~124k ops/s.

I figured I would write the smallest patch which avoids the locked-only
degradation and call it a day -- for example it is possible to check
whether the lock is contested thanks to the arch_spin_is_contended
macro. Then the loop could be:

	if (lockref_locked(old)) {
		if (locked_contested(old))
			break;
		cpu_relax();
		old.lock_count = READ_ONCE(lockref->lock_count);
		continue;
	}

Note how this avoids any additional accesses to the cacheline if the
lock is under traffic.

While this would avoid degrading in certain trivial cases, it very much
can still result in lockref get/put being in the picture *and* only
taking locks, so I'm not particularly happy with it.

This is where I found that going to 60 workers issuing open or stat on
the same dentry *also* permanently degrades, for example:

# ./stat2_processes -t 60
testcase:Same file stat
warmup
min:11293 max:25058 total:1164504
min:10706 max:12977 total:679839
min:10303 max:12222 total:643437
min:8722 max:9846 total:542022
min:8514 max:9554 total:529731
min:8349 max:9296 total:519381
measurement
min:8203 max:9079 total:510768
min:8122 max:8913 total:504069

According to perf top it's all contention on the spinlock and it is the
lockref routines taking it.
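For reference, the dedicated waiting loop mentioned above was roughly
the following shape -- the helper and the limit knob are my own naming
for illustration, not something that exists in lib/lockref.c:

	/*
	 * Spin until the spinlock embedded in the lockref looks free
	 * again, giving up after a configurable number of iterations.
	 * Returns true if the cmpxchg loop is worth retrying.
	 */
	static bool lockref_wait_unlocked(struct lockref *lockref, int limit)
	{
		struct lockref old;

		while (limit--) {
			old.lock_count = READ_ONCE(lockref->lock_count);
			if (arch_spin_value_unlocked(old.lock.rlock.raw_lock))
				return true;
			cpu_relax();
		}

		return false;
	}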
So something sneaks in a lock/unlock cycle and there is enough traffic
that the default 100 spins with the patch are not sufficient to wait it
out. With the configurable spin count I found this can stay lockless if
I make it wait 1000 spins.

Getting the spin count "wrong" further reduces performance -- for
example a 700 spin limit was not enough to wait out the lock holders and
throughput went down to ~269k due to time wasted spinning, merely a
little more than half of the degraded state above.

That is to say messing with lockref itself is probably best avoided, as
getting a win depends on being at the right (low) scale or getting
"lucky".

My worry about the locked-only degradation showing up was not justified
in that for the scales used by the lkp folks it already occurs.

All in all the thing to do is to find out who takes the spinlock and
whether that can be avoided. Perhaps the usage can be further split so
that only spots depending on the refcount take the d_lock (for example
in the --handle test the lock is taken to iterate the alias list -- that
would not mind refs going up or down).

It may be (and I'm really just handwaving here) that dentries should
move away from lockref altogether in favor of storing flags in the
refcount. Even if get/put would still have to cmpxchg, they would never
have to concern themselves with locking of any sort, except when the
dentry is going down. And probably the unref side could merely lock xadd
into it (a very rough sketch is at the end of this mail). I recognize
this would convert some regular ops in other parts of the kernel into
atomics and it may be undesirable. Something to ponder nonetheless.

At the end of the day:
1. this patch should be dropped, I don't think there is a good
   replacement either
2. the real fix would do something about the dentry lock itself or its
   consumers (maybe the lock is taken way more than it should be?)
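As for the very rough sketch promised above -- the helper and flag names
below are made up and this is not meant as an actual struct dentry
layout, just an illustration of the "flags in the refcount" shape:

	/*
	 * Reserve a bit in the refcount which marks the dentry as going
	 * away, so that get/put never need to look at d_lock.
	 */
	#define D_REF_DYING	(1 << 30)

	static bool dget_cmpxchg(atomic_t *d_ref)
	{
		int old = atomic_read(d_ref);

		do {
			if (old & D_REF_DYING)
				return false;	/* fall back to the locked path */
		} while (!atomic_try_cmpxchg_relaxed(d_ref, &old, old + 1));

		return true;
	}

	static bool dput_xadd(atomic_t *d_ref)
	{
		/* the "lock xadd" part: unconditional decrement, no loop */
		int old = atomic_fetch_sub_release(1, d_ref);

		/* last reference gone: caller starts teardown under the lock */
		return (old & ~D_REF_DYING) == 1;
	}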