On Sat, 2024-08-03 at 13:21 +0200, Mateusz Guzik wrote:
> On Sat, Aug 3, 2024 at 12:59 PM Jeff Layton <jlayton@xxxxxxxxxx> wrote:
> >
> > On Sat, 2024-08-03 at 11:09 +0200, Mateusz Guzik wrote:
> > > On Sat, Aug 3, 2024 at 6:44 AM Mateusz Guzik <mjguzik@xxxxxxxxx> wrote:
> > > >
> > > > On Fri, Aug 02, 2024 at 05:45:04PM -0400, Jeff Layton wrote:
> > > > > In a later patch, we want to change the open(..., O_CREAT) codepath to
> > > > > avoid taking the inode->i_rwsem for write when the dentry already exists.
> > > > > When we tested that initially, the performance devolved significantly
> > > > > due to contention for the parent's d_lockref spinlock.
> > > > >
> > > > > There are two problems with lockrefs today: First, once any concurrent
> > > > > task takes the spinlock, they all end up taking the spinlock, which is
> > > > > much more costly than a single cmpxchg operation. The second problem is
> > > > > that once any task fails to cmpxchg 100 times, it falls back to the
> > > > > spinlock. The upshot there is that even moderate contention can cause a
> > > > > fallback to serialized spinlocking, which worsens performance.
> > > > >
> > > > > This patch changes CMPXCHG_LOOP in 2 ways:
> > > > >
> > > > > First, change the loop to spin instead of falling back to a locked
> > > > > codepath when the spinlock is held. Once the lock is released, allow the
> > > > > task to continue trying its cmpxchg loop as before instead of taking the
> > > > > lock. Second, don't allow the cmpxchg loop to give up after 100 retries.
> > > > > Just continue infinitely.
> > > > >
> > > > > This greatly reduces contention on the lockref when there are large
> > > > > numbers of concurrent increments and decrements occurring.
> > > > >
> > > >
> > > > This was already tried by me and it unfortunately can reduce performance.
> > > >
> > >
> > > Oh wait, I misread the patch based on what I tried there. Spinning
> > > indefinitely waiting for the lock to be free is a no-go as it loses
> > > the forward progress guarantee (and it is possible to get the lock
> > > being continuously held). Only spinning up to an arbitrary point wins
> > > some in some tests and loses in others.
> > >
> >
> > I'm a little confused about the forward progress guarantee here. Does
> > that exist today at all? ISTM that falling back to spin_lock() after a
> > certain number of retries doesn't guarantee any forward progress. You
> > can still just end up spinning on the lock forever once that happens,
> > no?
> >
> There is the implicit assumption that everyone holds locks for a
> finite time. I agree there are no guarantees otherwise, if that's what
> you meant.
>
> In this case, since spinlocks are queued, a constant stream of lock
> holders will make the lock appear taken indefinitely even if they all
> hold it for a short period.
>
> Stock lockref will give up atomics immediately and make sure to change
> the ref thanks to queueing up.
>
> Lockref as proposed in this patch won't be able to do anything as long
> as the lock trading is taking place.
>

Got it, thanks. This spinning is very simplistic, so I could see how
one task could continually get shuffled to the end of the queue.

> > > Either way, as described below, chances are decent that:
> > > 1. there is an easy way to not lockref_get/put on the parent if the
> > > file is already there, dodging the problem
> > > .. and even if that's not true
> > > 2. lockref can be ditched in favor of atomics. apart from some minor
> > > refactoring this all looks perfectly doable and I have a wip. I will
> > > try to find the time next week to sort it out
> >
> > Like I said in the earlier mail, I don't think we can stay in RCU mode
> > because of the audit_inode call. I'm definitely interested in your WIP
> > though!
> >
> well audit may be hackable so that it works in rcu most of the time,
> but that's not something i'm interested in

Audit is not my favorite area of the kernel to work in either. I don't
see a good way to make it RCU-friendly, but I haven't looked too hard
yet either. It would be nice to be able to do some of the auditing
under RCU or a spinlock.

> sorting out the lockref situation would definitely help other stuff
> (notably opening the same file RO).
>

Indeed. It's clear that the current implementation is a real
scalability problem in a lot of situations.

> anyhow one idea is to temporarily disable atomic ops with a flag in
> the counter, a fallback plan is to loosen lockref so that it can do
> transitions other than 0->1->2 with atomics, even if the lock is held.
>
> I have not looked at this in over a month, I'm going to need to
> refresh my memory on the details, I do remember there was some stuff
> to massage first.
>
> Anyhow, I expect a working WIP some time in the upcoming week.
>

Great, I'll stay tuned. Thanks!
-- 
Jeff Layton <jlayton@xxxxxxxxxx>
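
The fast-path behavior being debated above can be modelled in a few
lines of userspace C. The sketch below only illustrates the change
described in the patch (spin while the spinlock is observed held, and
retry the cmpxchg without any 100-attempt cap); it is not the kernel's
lib/lockref.c code. The bit layout, the names lockref_model, LOCK_BIT,
and lockref_model_get, and the use of C11 atomics are assumptions made
purely for illustration.

/*
 * Userspace model of the proposed CMPXCHG_LOOP behavior: lock state and
 * reference count are packed into one 64-bit word, and the increment path
 * never falls back to taking the lock.  Illustrative sketch only.
 */
#include <stdatomic.h>
#include <stdint.h>
#include <stdio.h>

#define LOCK_BIT   ((uint64_t)1)        /* models the spinlock "held" state */
#define COUNT_UNIT ((uint64_t)1 << 1)   /* count lives above the lock bit */

struct lockref_model {
	_Atomic uint64_t lock_count;    /* lock bit + reference count */
};

/* Increment the count without ever taking the lock. */
static void lockref_model_get(struct lockref_model *lr)
{
	uint64_t old = atomic_load_explicit(&lr->lock_count,
					    memory_order_relaxed);

	for (;;) {
		/* If the lock is held, wait for it to be released, then retry. */
		while (old & LOCK_BIT) {
			/* a cpu_relax()/pause hint would go here in kernel code */
			old = atomic_load_explicit(&lr->lock_count,
						   memory_order_relaxed);
		}

		uint64_t new_val = old + COUNT_UNIT;

		/* No retry limit: keep attempting the cmpxchg until it succeeds. */
		if (atomic_compare_exchange_weak_explicit(&lr->lock_count,
							  &old, new_val,
							  memory_order_acq_rel,
							  memory_order_relaxed))
			return;
		/* 'old' was refreshed by the failed compare-exchange; loop again. */
	}
}

int main(void)
{
	struct lockref_model lr = { .lock_count = 0 };

	lockref_model_get(&lr);
	lockref_model_get(&lr);
	printf("count = %llu\n",
	       (unsigned long long)(atomic_load(&lr.lock_count) >> 1));
	return 0;
}

The forward-progress objection maps directly onto this sketch: if other
CPUs keep the lock bit set back to back, the inner while loop never
observes it clear and nothing bounds how long lockref_model_get() can
spin, whereas stock lockref gives up after a bounded number of attempts
and queues on the spinlock, eventually getting its turn.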