On Mon, Jun 27, 2022 at 04:08:35PM +1000, Dave Chinner wrote:
> Hi folks,
>
> Current work to merge the XFS inode life cycle with the VFS inode
> life cycle is finding some interesting issues. If we have a path
> that hits buffer trylocks fairly hard (e.g. a non-blocking
> background inode freeing function), we end up hitting massive
> contention on the buffer cache hash locks:
>
> -   92.71%     0.05%  [kernel]  [k] xfs_inodegc_worker
>    - 92.67% xfs_inodegc_worker
>       - 92.13% xfs_inode_unlink
>          - 91.52% xfs_inactive_ifree
>             - 85.63% xfs_read_agi
>                - 85.61% xfs_trans_read_buf_map
>                   - 85.59% xfs_buf_read_map
>                      - xfs_buf_get_map
>                         - 85.55% xfs_buf_find
>                            - 72.87% _raw_spin_lock
>                               - do_raw_spin_lock
>                                    71.86% __pv_queued_spin_lock_slowpath
>                            - 8.74% xfs_buf_rele
>                               - 7.88% _raw_spin_lock
>                                  - 7.88% do_raw_spin_lock
>                                       7.63% __pv_queued_spin_lock_slowpath
>                            - 1.70% xfs_buf_trylock
>                               - 1.68% down_trylock
>                                  - 1.41% _raw_spin_lock_irqsave
>                                     - 1.39% do_raw_spin_lock
>                                          __pv_queued_spin_lock_slowpath
>                            - 0.76% _raw_spin_unlock
>                                 0.75% do_raw_spin_unlock
>
> This is basically hammering the pag->pag_buf_lock from lots of CPUs
> doing trylocks at the same time. Most of the buffer trylock
> operations ultimately fail after we've done the lookup, so we're
> really hammering the buf hash lock whilst making no progress.
>
> We can also see significant spinlock traffic on the same lock just
> under normal operation when lots of tasks are accessing metadata
> from the same AG, so let's avoid all this by creating a lookup fast
> path which leverages the rhashtable's ability to do rcu protected
> lookups.
>
> This is a rework of the initial lockless buffer lookup patch I sent
> here:
>
> https://lore.kernel.org/linux-xfs/20220328213810.1174688-1-david@xxxxxxxxxxxxx/
>
> And the alternative cleanup sent by Christoph here:
>
> https://lore.kernel.org/linux-xfs/20220403120119.235457-1-hch@xxxxxx/
>
> This version isn't quite as short as Christoph's, but it does roughly
> the same thing in killing the two-phase _xfs_buf_find() call
> mechanism. It separates the fast and slow paths a little more
> cleanly and doesn't have context dependent buffer return state from
> the slow path that the caller needs to handle. It also picks up the
> rhashtable insert optimisation that Christoph added.
>
> This series passes fstests under several different configs and does
> not cause any obvious regressions in scalability testing that has
> been performed. Hence I'm proposing this as potential 5.20 cycle
> material.
>
> Thoughts, comments?

Any chance there'll be a v3 (or just responses to the replies sent so
far) in time for 5.20?

--D

> Version 2:
> - based on 5.19-rc2
> - high speed collision of original proposals.
>
> Initial versions:
> - https://lore.kernel.org/linux-xfs/20220403120119.235457-1-hch@xxxxxx/
> - https://lore.kernel.org/linux-xfs/20220328213810.1174688-1-david@xxxxxxxxxxxxx/
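
P.S. For anyone skimming the thread: the rcu fast path being proposed
boils down to roughly the sketch below. This is an illustration only,
not the actual patch code; the function name and the exact structure
layout here are my own simplification (it assumes the per-AG
rhashtable and the atomic_t b_hold reference count that xfs_buf.c
already has), and the stale-buffer checks and the locked slow path
that inserts new buffers are omitted.

/*
 * Sketch of an rcu-protected buffer lookup fast path. Returns a held
 * buffer, or NULL if the caller must fall back to the slow path that
 * takes pag->pag_buf_lock and can insert a new buffer.
 */
static struct xfs_buf *
xfs_buf_lookup_fast(
	struct xfs_perag	*pag,
	struct xfs_buf_map	*map)
{
	struct xfs_buf		*bp;

	rcu_read_lock();
	bp = rhashtable_lookup(&pag->pag_buf_hash, map,
			xfs_buf_hash_params);
	/*
	 * A hold count of zero means the buffer is being freed, so we
	 * must not resurrect it from under rcu protection. If we can't
	 * take a hold, drop out of rcu context and let the caller
	 * retry via the locked slow path.
	 */
	if (!bp || !atomic_inc_not_zero(&bp->b_hold)) {
		rcu_read_unlock();
		return NULL;
	}
	rcu_read_unlock();
	return bp;
}

The key point is that no spinlock is taken at all on a fast path hit:
the rhashtable lookup runs under rcu_read_lock() and the reference is
taken with atomic_inc_not_zero(), so the pag_buf_lock contention shown
in the profile above only matters for misses and freeing races.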