On Mon, Jan 13, 2025 at 05:24:27AM +0100, Christoph Hellwig wrote:
> Since commit 298f34224506 ("xfs: lockless buffer lookup") the buffer
> lookup fastpath is done without a hash-wide lock (then pag_buf_lock, now
> bc_lock) and only under RCU protection.  But this means that nothing
> serializes lookups against the temporary 0 reference count for buffers
> that are added to the LRU after dropping the last regular reference,
> and a concurrent lookup would fail to find them.
>
> Fix this by doing all b_hold modifications under b_lock.  We're already
> doing this for release so this "only" ~doubles the b_lock round trips.
> We'll later look into the lockref infrastructure to optimize the number
> of lock round trips again.
>
> Signed-off-by: Christoph Hellwig <hch@xxxxxx>

Logic changes look ok, but...

> diff --git a/fs/xfs/xfs_buf.h b/fs/xfs/xfs_buf.h
> index 3d56bc7a35cc..cbf7c2a076c7 100644
> --- a/fs/xfs/xfs_buf.h
> +++ b/fs/xfs/xfs_buf.h
> @@ -172,7 +172,8 @@ struct xfs_buf {
>
>  	xfs_daddr_t		b_rhash_key;	/* buffer cache index */
>  	int			b_length;	/* size of buffer in BBs */
> -	atomic_t		b_hold;		/* reference count */
> +	spinlock_t		b_lock;		/* internal state lock */
> +	unsigned int		b_hold;		/* reference count */
>  	atomic_t		b_lru_ref;	/* lru reclaim ref count */
>  	xfs_buf_flags_t		b_flags;	/* status flags */
>  	struct semaphore	b_sema;		/* semaphore for lockables */
> @@ -182,7 +183,6 @@ struct xfs_buf {
>  	 * bt_lru_lock and not by b_sema
>  	 */
>  	struct list_head	b_lru;		/* lru list */
> -	spinlock_t		b_lock;		/* internal state lock */
>  	unsigned int		b_state;	/* internal state flags */
>  	int			b_io_error;	/* internal IO error state */
>  	wait_queue_head_t	b_waiters;	/* unpin waiters */

... I think this is misguided.

The idea behind the initial cacheline layout is that it should stay
read-only as much as possible, so that cache lookups can walk the buffer
without causing shared/exclusive cacheline contention with existing
buffer users. This was really important back in the days when the cache
used an rb-tree (i.e. the rbnode pointers dominated lookup profiles),
and it's still likely important with the rhashtable on large caches.

i.e. putting a spinlock in that first cacheline will result in lookups
and shrinker walks having cacheline contention, as the shrinker needs
exclusive access for the spin lock whilst the lookup walk needs shared
access to the b_rhash_head, b_rhash_key and b_length fields in
_xfs_buf_obj_cmp() for lockless lookup concurrency.

Hence I think it would be better to move the b_hold field to the same
cacheline as the b_state field rather than move the spinlock to the
initial cacheline that cache lookups walk...
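Something along these lines is what I have in mind - an untested sketch
only, just to illustrate where the fields would end up, not an actual
patch (exact placement within the second group doesn't matter much):

	/* first cacheline stays read-mostly for the lockless lookup walk */
	xfs_daddr_t		b_rhash_key;	/* buffer cache index */
	int			b_length;	/* size of buffer in BBs */
	atomic_t		b_lru_ref;	/* lru reclaim ref count */
	xfs_buf_flags_t		b_flags;	/* status flags */
	struct semaphore	b_sema;		/* semaphore for lockables */

	/*
	 * Frequently modified state is grouped together here, away from
	 * the fields that _xfs_buf_obj_cmp() reads during lookup.
	 */
	struct list_head	b_lru;		/* lru list */
	spinlock_t		b_lock;		/* internal state lock */
	unsigned int		b_hold;		/* reference count */
	unsigned int		b_state;	/* internal state flags */

-Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx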