On Sat, Aug 3, 2024 at 12:59 PM Jeff Layton <jlayton@xxxxxxxxxx> wrote:
>
> On Sat, 2024-08-03 at 11:09 +0200, Mateusz Guzik wrote:
> > On Sat, Aug 3, 2024 at 6:44 AM Mateusz Guzik <mjguzik@xxxxxxxxx> wrote:
> > >
> > > On Fri, Aug 02, 2024 at 05:45:04PM -0400, Jeff Layton wrote:
> > > > In a later patch, we want to change the open(..., O_CREAT) codepath to
> > > > avoid taking the inode->i_rwsem for write when the dentry already exists.
> > > > When we tested that initially, the performance devolved significantly
> > > > due to contention for the parent's d_lockref spinlock.
> > > >
> > > > There are two problems with lockrefs today: First, once any concurrent
> > > > task takes the spinlock, they all end up taking the spinlock, which is
> > > > much more costly than a single cmpxchg operation. The second problem is
> > > > that once any task fails to cmpxchg 100 times, it falls back to the
> > > > spinlock. The upshot there is that even moderate contention can cause a
> > > > fallback to serialized spinlocking, which worsens performance.
> > > >
> > > > This patch changes CMPXCHG_LOOP in 2 ways:
> > > >
> > > > First, change the loop to spin instead of falling back to a locked
> > > > codepath when the spinlock is held. Once the lock is released, allow the
> > > > task to continue trying its cmpxchg loop as before instead of taking the
> > > > lock. Second, don't allow the cmpxchg loop to give up after 100 retries.
> > > > Just continue infinitely.
> > > >
> > > > This greatly reduces contention on the lockref when there are large
> > > > numbers of concurrent increments and decrements occurring.
> > > >
> > >
> > > This was already tried by me and it unfortunately can reduce performance.
> > >
> >
> > Oh wait, I misread the patch based on what I tried there. Spinning
> > indefinitely waiting for the lock to be free is a no-go as it loses
> > the forward progress guarantee (and it is possible for the lock to
> > end up being continuously held).
> > Only spinning up to an arbitrary point wins some in some tests and
> > loses in others.
> >
>
> I'm a little confused about the forward progress guarantee here. Does
> that exist today at all? ISTM that falling back to spin_lock() after a
> certain number of retries doesn't guarantee any forward progress. You
> can still just end up spinning on the lock forever once that happens,
> no?
>

There is the implicit assumption that everyone holds locks for a finite
time. I agree there are no guarantees otherwise, if that's what you
meant.

In this case, since spinlocks are queued, a constant stream of lock
holders will make the lock appear taken indefinitely, even if each of
them only holds it for a short period. Stock lockref will give up on
atomics immediately and make sure to change the ref thanks to queueing
up. Lockref as proposed in this patch won't be able to do anything as
long as the lock trading is taking place.

> > Either way, as described below, chances are decent that:
> > 1. there is an easy way to not lockref_get/put on the parent if the
> > file is already there, dodging the problem
> > .. and even if that's not true
> > 2. lockref can be ditched in favor of atomics. apart from some minor
> > refactoring this all looks perfectly doable and I have a wip. I will
> > try to find the time next week to sort it out
> >
>
> Like I said in the earlier mail, I don't think we can stay in RCU mode
> because of the audit_inode call. I'm definitely interested in your WIP
> though!
>

Well, audit may be hackable so that it works in RCU mode most of the
time, but that's not something I'm interested in.

Sorting out the lockref situation would definitely help other stuff
(notably opening the same file read-only).

Anyhow, one idea is to temporarily disable atomic ops with a flag in
the counter. A fallback plan is to loosen lockref so that it can do
transitions other than 0->1->2 with atomics, even if the lock is held.
I have not looked at this in over a month, so I'm going to need to
refresh my memory on the details; I do remember there was some stuff to
massage first. Anyhow, I expect a working WIP some time in the upcoming
week.

-- 
Mateusz Guzik <mjguzik gmail.com>