On Thu, Jun 27, 2024 at 10:41:35AM +0800, kernel test robot wrote:
>
> Hello,
>
> kernel test robot noticed a -33.7% regression of unixbench.throughput on:
>
> commit: d042dae6ad74df8a00ee8a3c6b77b53bc9e32f64 ("lockref: speculatively spin waiting for the lock to be released")
> https://git.kernel.org/cgit/linux/kernel/git/next/linux-next.git master
[..]
> | testcase: change | stress-ng: stress-ng.getdent.ops_per_sec -56.5% regression |
[..]
> | testcase: change | stress-ng: stress-ng.handle.ops_per_sec -60.2% regression |
> | test machine | 64 threads 2 sockets Intel(R) Xeon(R) Gold 6346 CPU @ 3.10GHz (Ice Lake) with 256G memory |
[..]

The good and bad news is that this particular patch should be dropped
(but the struct dentry reordering should stay).

There is some gnarly stuff here, please read the entire thing, even
though it may seem rambly at times.

To recap: there can be a microbenchmark setting with continuous get/put
traffic where only sporadically someone takes the lock. The constant
traffic paired with the immediate fallback to locking makes the
primitives degrade to locked operation until the benchmark is concluded.
This is the problem the patch tried to address: with the dentry
reordering I found myself running into the degraded state by accident a
lot. There was worry the lkp machinery would run into it too,
disfiguring their results.

However, another microbenchmark setting can mix explicit spinlock
acquire with lockref get/put. With a high enough worker count the lock
can be continuously held, making any spinning waiting for it to be
released actively detrimental to performance. This is what the lkp
folks ran into.

So I got a 2 socket * 48 cores * 2 threads machine to play with.

Both the unixbench and stress-ng --getdent testcases are heavily mixed
with read-write locking and mutexes, while stress-ng --handle is all
about spinlocks and 1/3rd of it is lockref, thus I only benchmarked the
last one.

At a *low* scale of 24 workers on my usual test box performance improves
immensely (from ~165k ops/s to ~620k ops/s) as locking is avoided plenty
of times.

At 192 workers on the big box the stock kernel achieves ~134k ops/s,
while spinwait drops it to ~42k. As an experiment I added a dedicated
waiting loop with a configurable spin count. Doing merely one spin drops
the rate to ~124k ops/s.

I figured I would write the smallest patch which avoids the locked-only
degradation and call it a day -- for example it is possible to check
whether the lock is contested thanks to the arch_spin_is_contended
macro. Then the loop could be:

	if (lockref_locked(old)) {
		if (locked_contested(old))
			break;
		cpu_relax();
		old.lock_count = READ_ONCE(lockref->lock_count);
		continue;
	}

Note how this avoids any additional accesses to the cacheline if the
lock is under traffic.

While this would avoid degrading in certain trivial cases, it very much
can still result in lockref get/put being in the picture *and* only
taking locks, so I'm not particularly happy with it.

This is where I found that going to 60 workers issuing open or stat on
the same dentry *also* permanently degrades, for example:

# ./stat2_processes -t 60
testcase:Same file stat
warmup
min:11293 max:25058 total:1164504
min:10706 max:12977 total:679839
min:10303 max:12222 total:643437
min:8722 max:9846 total:542022
min:8514 max:9554 total:529731
min:8349 max:9296 total:519381
measurement
min:8203 max:9079 total:510768
min:8122 max:8913 total:504069

According to perf top it's all contention on the spinlock and it is the
lockref routines taking it.
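For reference, the dedicated waiting loop mentioned above was roughly
the following shape -- the helper and the limit knob are my own naming
for illustration, not something that exists in lib/lockref.c:

	/*
	 * Spin until the spinlock embedded in the lockref looks free
	 * again, giving up after a configurable number of iterations.
	 * Returns true if the cmpxchg loop is worth retrying.
	 */
	static bool lockref_wait_unlocked(struct lockref *lockref, int limit)
	{
		struct lockref old;

		while (limit--) {
			old.lock_count = READ_ONCE(lockref->lock_count);
			if (arch_spin_value_unlocked(old.lock.rlock.raw_lock))
				return true;
			cpu_relax();
		}

		return false;
	}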
So something sneaks in a lock/unlock cycle and there is enough traffic
that the default 100 spins with the patch are not sufficient to wait it
out. With the configurable spin count I found this can stay lockless if
I make it wait 1000 spins.

Getting the spin count "wrong" further reduces performance -- for
example a 700 spin limit was not enough to wait out the lock holders and
throughput went down to ~269k due to time wasted spinning, merely a
little more than half of the degraded state above.

That is to say messing with lockref itself is probably best avoided, as
getting a win depends on being at the right (low) scale or getting
"lucky".

My worry about the locked-only degradation showing up was not justified
in that for the scales used by the lkp folks it already occurs.

All in all the thing to do is to find out who takes the spinlock and
whether that can be avoided. Perhaps the usage can be further split so
that only spots depending on the refcount take the d_lock (for example
in the --handle test the lock is taken to iterate the alias list -- that
would not mind refs going up or down).

It may be (and I'm really just handwaving here) that dentries should
move away from lockref altogether in favor of storing flags in the
refcount. Even if get/put would still have to cmpxchg, they would never
have to concern themselves with locking of any sort, except when the
dentry is going down. And probably the unref side could merely lock xadd
into it (a very rough sketch is at the end of this mail). I recognize
this would convert some regular ops in other parts of the kernel into
atomics and it may be undesirable. Something to ponder nonetheless.

At the end of the day:
1. this patch should be dropped, I don't think there is a good
   replacement either
2. the real fix would do something about the dentry lock itself or its
   consumers (maybe the lock is taken way more than it should be?)
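As for the very rough sketch promised above -- the helper and flag names
below are made up and this is not meant as an actual struct dentry
layout, just an illustration of the "flags in the refcount" shape:

	/*
	 * Reserve a bit in the refcount which marks the dentry as going
	 * away, so that get/put never need to look at d_lock.
	 */
	#define D_REF_DYING	(1 << 30)

	static bool dget_cmpxchg(atomic_t *d_ref)
	{
		int old = atomic_read(d_ref);

		do {
			if (old & D_REF_DYING)
				return false;	/* fall back to the locked path */
		} while (!atomic_try_cmpxchg_relaxed(d_ref, &old, old + 1));

		return true;
	}

	static bool dput_xadd(atomic_t *d_ref)
	{
		/* the "lock xadd" part: unconditional decrement, no loop */
		int old = atomic_fetch_sub_release(1, d_ref);

		/* last reference gone: caller starts teardown under the lock */
		return (old & ~D_REF_DYING) == 1;
	}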