On Tue, Jul 02, 2024 at 09:19:10AM +0200, Mateusz Guzik wrote:
> On Thu, Jun 27, 2024 at 10:41:35AM +0800, kernel test robot wrote:
> >
> > Hello,
> >
> > kernel test robot noticed a -33.7% regression of unixbench.throughput on:
> >
> > commit: d042dae6ad74df8a00ee8a3c6b77b53bc9e32f64 ("lockref: speculatively spin waiting for the lock to be released")
> > https://git.kernel.org/cgit/linux/kernel/git/next/linux-next.git master
> [..]
> > | testcase: change | stress-ng: stress-ng.getdent.ops_per_sec -56.5% regression |
> [..]
> > | testcase: change | stress-ng: stress-ng.handle.ops_per_sec -60.2% regression |
> > | test machine | 64 threads 2 sockets Intel(R) Xeon(R) Gold 6346 CPU @ 3.10GHz (Ice Lake) with 256G memory |
> [..]
>
> The good and the bad news is that this particular patch should be
> dropped (but the struct dentry reordering should stay). There is some
> gnarly stuff here, please read the entire thing, even though it may
> seem rambly at times.
>
> To recap, there can be a microbenchmark setting with continuous get/put
> traffic where only sporadically someone takes the lock. The constant
> traffic paired with the immediate fallback to locking makes the
> primitives degrade to locked operation until the benchmark concludes.
> This is the problem the patch tried to address, as with the dentry
> reordering I found myself running into the degraded state by accident a
> lot. There was worry the lkp machinery would do so too, disfiguring
> their results.
>
> However, another microbenchmark setting can be mixing explicit spinlock
> acquire with lockref get/put. With a high enough worker count the lock
> can be continuously held, making any spinning waiting for it to be
> released actively detrimental to performance. This is what the lkp
> folks ran into.
>
> So I got a 2 socket * 48 cores * 2 threads machine to play with.
>
> Both the unixbench and stress-ng --getdent testcases are heavily mixed
> with read-write locking and mutexes, while stress-ng --handle is all
> about spinlocks and 1/3rd of it is lockref, thus I only benchmarked the
> last one.
>
> At a *low* scale of 24 workers on my usual test box performance
> improves immensely (from ~165k ops/s to ~620k ops/s) as locking is
> avoided plenty of times.
>
> At 192 workers on the big box the stock kernel achieves ~134k ops/s,
> while spinwait drops it to ~42k. As an experiment I added a dedicated
> waiting loop with a configurable spin count. Doing merely one spin
> drops the rate to ~124k ops/s.
>
> I figured I would write the smallest patch which avoids locked-only
> degradation and call it a day -- for example, it is possible to check
> whether the lock is contested thanks to the arch_spin_is_contended
> macro.
>
> Then the loop could be:
> 	if (lockref_locked(old)) {
> 		if (locked_contested(old))
> 			break;
> 		cpu_relax();
> 		old.lock_count = READ_ONCE(lockref->lock_count);
> 		continue;
> 	}
>
> Note how this avoids any additional accesses to the cacheline if the
> lock is under traffic. While this would avoid degrading in certain
> trivial cases, it very much can still result in lockref get/put being
> in the picture *and* only taking locks, so I'm not particularly happy
> with it.
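[For completeness, a minimal sketch of what those two helpers could look
like -- neither exists in lib/lockref.c today, and the whole point is
that both only inspect the snapshot already read with READ_ONCE(), so
there is no extra access to the shared cacheline. This reuses the
arch_spin_value_unlocked() test the current code already applies to the
snapshot and the generic spin_is_contended() macro on the local copy:

	#include <linux/lockref.h>	/* pulls in spinlock.h */

	static inline bool lockref_locked(struct lockref old)
	{
		/* same check lib/lockref.c applies before attempting the cmpxchg */
		return !arch_spin_value_unlocked(old.lock.rlock.raw_lock);
	}

	static inline bool locked_contested(struct lockref old)
	{
		/*
		 * Inspects the copied lock word only.  Note that on
		 * architectures which do not provide
		 * arch_spin_is_contended() this always reports "not
		 * contested", so a spin cap would still be needed.
		 */
		return spin_is_contended(&old.lock);
	}
]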
>
> This is where I found that going to 60 workers issuing open or stat on
> the same dentry *also* permanently degrades, for example:
>
> # ./stat2_processes -t 60
> testcase:Same file stat
> warmup
> min:11293 max:25058 total:1164504
> min:10706 max:12977 total:679839
> min:10303 max:12222 total:643437
> min:8722 max:9846 total:542022
> min:8514 max:9554 total:529731
> min:8349 max:9296 total:519381
> measurement
> min:8203 max:9079 total:510768
> min:8122 max:8913 total:504069
>
> According to perf top it's all contention on the spinlock, and it is
> the lockref routines taking it. So something sneaks in a lock/unlock
> cycle and there is enough traffic that the default 100 spins with the
> patch are not sufficient to wait it out. With the configurable spin
> count I found this can stay lockless if I make it wait 1000 spins.
> Getting the spin count "wrong" further reduces performance -- for
> example, a 700 spin limit was not enough to wait out the lock holders
> and throughput went down to ~269k due to time wasted spinning, merely
> a little more than half of the degraded state above.
>
> That is to say, messing with lockref itself is probably best avoided,
> as getting a win depends on being at the right (low) scale or getting
> "lucky".
>
> My worry about locked-only degradation showing up was not justified in
> that it already occurs at the scales used by the lkp folks.
>
> All in all, the thing to do is to find out who takes the spinlock and
> whether it can be avoided. Perhaps this can be further split so that
> only spots depending on the refcount take the d_lock (for example, in
> the --handle test the lock is taken to iterate the alias list -- it
> would not mind refs going up or down).
>
> It may be (and I'm really just handwaving here) that dentries should
> move away from lockref altogether in favor of storing flags in the
> refcount. Even if both get/put would still have to cmpxchg, they would
> never have to concern themselves with locking of any sort, except when
> the dentry is going down. And probably the unref side could merely
> lock xadd into it. I recognize this would convert some regular ops in
> other parts of the kernel into atomics and that may be undesirable.
> Something to ponder nonetheless.
>
> At the end of the day:
> 1. this patch should be dropped; I don't think there is a good
>    replacement either
> 2. the real fix would do something about the dentry lock itself or its
>    consumers (maybe the lock is taken way more often than it should
>    be?)

Well, there is also the option of going full RCU in the fast path,
which I mentioned last time around lockref. This would be applicable at
least for the stat, fstatx, readlink and access syscalls. It would not
help the above bench, but then again I don't think it is doing anything
realistic.

I'm going to poke around it when I find the time.

Maybe ditching lockref would be a good idea regardless, if the above
pans out.
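To put some (very rough) flesh on the "flags in the refcount" idea
quoted above, a hand-wavy sketch of what get/put could look like with
no embedded lock at all. Nothing like this exists in fs/dcache.c; the
DREF_DYING bit and the helper names are made up purely for
illustration:

	#include <linux/atomic.h>
	#include <linux/types.h>

	#define DREF_DYING	(1 << 30)	/* hypothetical "dentry is going down" flag */

	/* get: plain cmpxchg loop, never looks at any lock */
	static bool dref_get_unless_dying(atomic_t *ref)
	{
		int old = atomic_read(ref);

		do {
			if (old & DREF_DYING)
				return false;	/* caller falls back to the slow path */
		} while (!atomic_try_cmpxchg_relaxed(ref, &old, old + 1));

		return true;
	}

	/* put: a single lock xadd; only dropping the last ref needs more work */
	static bool dref_put(atomic_t *ref)
	{
		/* true if we dropped the last reference */
		return (atomic_fetch_sub_release(1, ref) & ~DREF_DYING) == 1;
	}

The cost is the one noted above: flag updates elsewhere that today are
plain stores done under the lock would have to become atomics on the
combined word.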