On Sat, 2024-08-03 at 11:09 +0200, Mateusz Guzik wrote:
> On Sat, Aug 3, 2024 at 6:44 AM Mateusz Guzik <mjguzik@xxxxxxxxx> wrote:
> >
> > On Fri, Aug 02, 2024 at 05:45:04PM -0400, Jeff Layton wrote:
> > > In a later patch, we want to change the open(..., O_CREAT) codepath to
> > > avoid taking the inode->i_rwsem for write when the dentry already exists.
> > > When we tested that initially, performance degraded significantly
> > > due to contention for the parent's d_lockref spinlock.
> > >
> > > There are two problems with lockrefs today: First, once any concurrent
> > > task takes the spinlock, they all end up taking the spinlock, which is
> > > much more costly than a single cmpxchg operation. The second problem is
> > > that once any task fails to cmpxchg 100 times, it falls back to the
> > > spinlock. The upshot there is that even moderate contention can cause a
> > > fallback to serialized spinlocking, which worsens performance.
> > >
> > > This patch changes CMPXCHG_LOOP in two ways:
> > >
> > > First, change the loop to spin instead of falling back to a locked
> > > codepath when the spinlock is held. Once the lock is released, allow the
> > > task to continue trying its cmpxchg loop as before instead of taking the
> > > lock. Second, don't allow the cmpxchg loop to give up after 100 retries.
> > > Just continue infinitely.
> > >
> > > This greatly reduces contention on the lockref when there are large
> > > numbers of concurrent increments and decrements occurring.
> > >
> >
> > This was already tried by me and it unfortunately can reduce performance.
> >
>
> Oh wait, I misread the patch based on what I tried there. Spinning
> indefinitely waiting for the lock to be free is a no-go, as it loses
> the forward progress guarantee (and it is possible for the lock to be
> held continuously). Only spinning up to an arbitrary point wins some
> in some tests and loses in others.
>

I'm a little confused about the forward progress guarantee here. Does
that exist today at all? ISTM that falling back to spin_lock() after a
certain number of retries doesn't guarantee any forward progress. You
can still just end up spinning on the lock forever once that happens,
no?
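To make sure we're talking about the same thing, here's roughly what
today's fast path boils down to once CMPXCHG_LOOP is expanded for
lockref_get() (a simplified sketch of the current code, not the literal
source; the function name is made up):

	/*
	 * Simplified, de-macroized sketch of today's lockref_get()
	 * fast path (illustrative only).
	 */
	void lockref_get_sketch(struct lockref *lockref)
	{
		int retry = 100;
		struct lockref old;

		old.lock_count = READ_ONCE(lockref->lock_count);
		/* Only attempt the cmpxchg while the embedded lock looks free. */
		while (arch_spin_value_unlocked(old.lock.rlock.raw_lock)) {
			struct lockref new = old;

			new.count++;
			/* On failure, try_cmpxchg64 reloads "old" for us. */
			if (try_cmpxchg64_relaxed(&lockref->lock_count,
						  &old.lock_count,
						  new.lock_count))
				return;
			if (!--retry)
				break;	/* give up after 100 failed attempts... */
		}

		/* ...and fall back to serializing on the spinlock. */
		spin_lock(&lockref->lock);
		lockref->count++;
		spin_unlock(&lockref->lock);
	}

With my patch, the retry counter and the locked fallback go away, so a
task that keeps losing the cmpxchg race (or keeps observing the lock
held) just stays in the loop. Hence my question about what guarantee
the fallback actually provides today.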
> Either way, as described below, chances are decent that:
> 1. there is an easy way to not lockref_get/put on the parent if the
> file is already there, dodging the problem
> ... and even if that's not true
> 2. lockref can be ditched in favor of atomics. Apart from some minor
> refactoring this all looks perfectly doable and I have a WIP. I will
> try to find the time next week to sort it out.
>

Like I said in the earlier mail, I don't think we can stay in RCU mode
because of the audit_inode call. I'm definitely interested in your WIP
though!

> > Key problem is that in some corner cases the lock can be continuously
> > held and be queued on, making the fast path always fail and making all
> > the spins actively waste time (and notably pull on the cacheline).
> >
> > See this for more details:
> > https://lore.kernel.org/oe-lkp/lv7ykdnn2nrci3orajf7ev64afxqdw2d65bcpu2mfaqbkvv4ke@hzxat7utjnvx/
> >
> > However, I *suspect* that in the case you are optimizing here (open +
> > O_CREAT of an existing file) lockref on the parent can be avoided
> > altogether with some hackery, and that's what should be done here.
> >
> > When it comes to lockref in vfs in general, most uses can be elided
> > with some hackery (see the above thread), which is in early WIP (the
> > LSMs are a massive headache).
> >
> > For open calls which *do* need to take a real ref the hackery does not
> > help, of course.
> >
> > This is where I think decoupling the ref from the lock is the best way
> > forward. For that to work, the dentry must hang around after the last
> > unref (already done thanks to RCU, and dput even explicitly handles
> > that already!) and there needs to be a way to block new refs atomically
> > -- which can be done with a cmpxchg from the 0-ref state to a flag that
> > blocks new refs from coming in. I have that as a WIP as well.
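That scheme sounds plausible to me. Just to be sure I follow, I'd
imagine the core of it looks something like the sketch below (my own
guess with made-up names, not your actual WIP):

	/*
	 * Hypothetical sketch of a lock-decoupled refcount. The count
	 * lives in its own atomic; dropping to zero leaves the object
	 * around (RCU keeps it valid), and teardown cmpxchgs the 0-ref
	 * state to a DEAD value so that no new refs can sneak in.
	 */
	#include <linux/atomic.h>
	#include <linux/limits.h>
	#include <linux/types.h>

	#define DREF_DEAD	LONG_MIN

	struct dref {
		atomic_long_t count;
	};

	/* Take a new ref unless the object has been marked dead. */
	static inline bool dref_get_unless_dead(struct dref *r)
	{
		long old = atomic_long_read(&r->count);

		do {
			if (old == DREF_DEAD)
				return false;
		} while (!atomic_long_try_cmpxchg(&r->count, &old, old + 1));
		return true;
	}

	/* Drop a ref; returns true if this was the last one. */
	static inline bool dref_put(struct dref *r)
	{
		return atomic_long_dec_and_test(&r->count);
	}

	/* Atomically block new refs; only succeeds from the 0-ref state. */
	static inline bool dref_kill(struct dref *r)
	{
		long expected = 0;

		return atomic_long_try_cmpxchg(&r->count, &expected, DREF_DEAD);
	}

If dref_kill() loses the race to a new reference, teardown would
presumably just back off and let the new holder's final dref_put()
retry it. Is that roughly the shape of what you have?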
> >
> > >
> > > Signed-off-by: Jeff Layton <jlayton@xxxxxxxxxx>
> > > ---
> > >  lib/lockref.c | 85 ++++++++++++++++++++++-------------------------------------
> > >  1 file changed, 32 insertions(+), 53 deletions(-)
> > >
> > > diff --git a/lib/lockref.c b/lib/lockref.c
> > > index 2afe4c5d8919..b76941043fe9 100644
> > > --- a/lib/lockref.c
> > > +++ b/lib/lockref.c
> > > @@ -8,22 +8,25 @@
> > >   * Note that the "cmpxchg()" reloads the "old" value for the
> > >   * failure case.
> > >   */
> > > -#define CMPXCHG_LOOP(CODE, SUCCESS) do {				\
> > > -	int retry = 100;						\
> > > -	struct lockref old;						\
> > > -	BUILD_BUG_ON(sizeof(old) != 8);					\
> > > -	old.lock_count = READ_ONCE(lockref->lock_count);		\
> > > -	while (likely(arch_spin_value_unlocked(old.lock.rlock.raw_lock))) { \
> > > -		struct lockref new = old;				\
> > > -		CODE							\
> > > -		if (likely(try_cmpxchg64_relaxed(&lockref->lock_count,	\
> > > -						 &old.lock_count,	\
> > > -						 new.lock_count))) {	\
> > > -			SUCCESS;					\
> > > -		}							\
> > > -		if (!--retry)						\
> > > -			break;						\
> > > -	}								\
> > > +#define CMPXCHG_LOOP(CODE, SUCCESS) do {				\
> > > +	struct lockref old;						\
> > > +	BUILD_BUG_ON(sizeof(old) != 8);					\
> > > +	old.lock_count = READ_ONCE(lockref->lock_count);		\
> > > +	for (;;) {							\
> > > +		struct lockref new = old;				\
> > > +									\
> > > +		if (likely(arch_spin_value_unlocked(old.lock.rlock.raw_lock))) { \
> > > +			CODE						\
> > > +			if (likely(try_cmpxchg64_relaxed(&lockref->lock_count, \
> > > +							 &old.lock_count, \
> > > +							 new.lock_count))) { \
> > > +				SUCCESS;				\
> > > +			}						\
> > > +		} else {						\
> > > +			cpu_relax();					\
> > > +			old.lock_count = READ_ONCE(lockref->lock_count); \
> > > +		}							\
> > > +	}								\
> > >  } while (0)
> > >
> > >  #else
> > > @@ -46,10 +49,8 @@ void lockref_get(struct lockref *lockref)
> > >  	,
> > >  	return;
> > >  	);
> > > -
> > > -	spin_lock(&lockref->lock);
> > > -	lockref->count++;
> > > -	spin_unlock(&lockref->lock);
> > > +	/* should never get here */
> > > +	WARN_ON_ONCE(1);
> > >  }
> > >  EXPORT_SYMBOL(lockref_get);
> > >
> > > @@ -60,8 +61,6 @@ EXPORT_SYMBOL(lockref_get);
> > >   */
> > >  int lockref_get_not_zero(struct lockref *lockref)
> > >  {
> > > -	int retval;
> > > -
> > >  	CMPXCHG_LOOP(
> > >  		new.count++;
> > >  		if (old.count <= 0)
> > > @@ -69,15 +68,9 @@ int lockref_get_not_zero(struct lockref *lockref)
> > >  	,
> > >  	return 1;
> > >  	);
> > > -
> > > -	spin_lock(&lockref->lock);
> > > -	retval = 0;
> > > -	if (lockref->count > 0) {
> > > -		lockref->count++;
> > > -		retval = 1;
> > > -	}
> > > -	spin_unlock(&lockref->lock);
> > > -	return retval;
> > > +	/* should never get here */
> > > +	WARN_ON_ONCE(1);
> > > +	return -1;
> > >  }
> > >  EXPORT_SYMBOL(lockref_get_not_zero);
> > >
> > > @@ -88,8 +81,6 @@ EXPORT_SYMBOL(lockref_get_not_zero);
> > >   */
> > >  int lockref_put_not_zero(struct lockref *lockref)
> > >  {
> > > -	int retval;
> > > -
> > >  	CMPXCHG_LOOP(
> > >  		new.count--;
> > >  		if (old.count <= 1)
> > > @@ -97,15 +88,9 @@ int lockref_put_not_zero(struct lockref *lockref)
> > >  	,
> > >  	return 1;
> > >  	);
> > > -
> > > -	spin_lock(&lockref->lock);
> > > -	retval = 0;
> > > -	if (lockref->count > 1) {
> > > -		lockref->count--;
> > > -		retval = 1;
> > > -	}
> > > -	spin_unlock(&lockref->lock);
> > > -	return retval;
> > > +	/* should never get here */
> > > +	WARN_ON_ONCE(1);
> > > +	return -1;
> > >  }
> > >  EXPORT_SYMBOL(lockref_put_not_zero);
> > >
> > > @@ -125,6 +110,8 @@ int lockref_put_return(struct lockref *lockref)
> > >  	,
> > >  	return new.count;
> > >  	);
> > > +	/* should never get here */
> > > +	WARN_ON_ONCE(1);
> > >  	return -1;
> > >  }
> > >  EXPORT_SYMBOL(lockref_put_return);
> > > @@ -171,8 +158,6 @@ EXPORT_SYMBOL(lockref_mark_dead);
> > >   */
> > >  int lockref_get_not_dead(struct lockref *lockref)
> > >  {
> > > -	int retval;
> > > -
> > >  	CMPXCHG_LOOP(
> > >  		new.count++;
> > >  		if (old.count < 0)
> > > @@ -180,14 +165,8 @@ int lockref_get_not_dead(struct lockref *lockref)
> > >  	,
> > >  	return 1;
> > >  	);
> > > -
> > > -	spin_lock(&lockref->lock);
> > > -	retval = 0;
> > > -	if (lockref->count >= 0) {
> > > -		lockref->count++;
> > > -		retval = 1;
> > > -	}
> > > -	spin_unlock(&lockref->lock);
> > > -	return retval;
> > > +	/* should never get here */
> > > +	WARN_ON_ONCE(1);
> > > +	return -1;
> > >  }
> > >  EXPORT_SYMBOL(lockref_get_not_dead);
> > >
> > > --
> > > 2.45.2
> > >
> > >

-- 
Jeff Layton <jlayton@xxxxxxxxxx>