On Mon, Jan 13, 2025 at 05:24:27AM +0100, Christoph Hellwig wrote:
> Since commit 298f34224506 ("xfs: lockless buffer lookup") the buffer
> lookup fastpath is done without a hash-wide lock (then pag_buf_lock, now
> bc_lock) and only under RCU protection.  But this means that nothing
> serializes lookups against the temporary 0 reference count for buffers
> that are added to the LRU after dropping the last regular reference,
> and a concurrent lookup would fail to find them.
>
> Fix this by doing all b_hold modifications under b_lock.  We're already
> doing this for release so this "only" ~doubles the b_lock round trips.
> We'll later look into the lockref infrastructure to optimize the number
> of lock round trips again.
>
> Signed-off-by: Christoph Hellwig <hch@xxxxxx>

Logic changes look ok, but...

> diff --git a/fs/xfs/xfs_buf.h b/fs/xfs/xfs_buf.h
> index 3d56bc7a35cc..cbf7c2a076c7 100644
> --- a/fs/xfs/xfs_buf.h
> +++ b/fs/xfs/xfs_buf.h
> @@ -172,7 +172,8 @@ struct xfs_buf {
>
>  	xfs_daddr_t		b_rhash_key;	/* buffer cache index */
>  	int			b_length;	/* size of buffer in BBs */
> -	atomic_t		b_hold;		/* reference count */
> +	spinlock_t		b_lock;		/* internal state lock */
> +	unsigned int		b_hold;		/* reference count */
>  	atomic_t		b_lru_ref;	/* lru reclaim ref count */
>  	xfs_buf_flags_t		b_flags;	/* status flags */
>  	struct semaphore	b_sema;		/* semaphore for lockables */
> @@ -182,7 +183,6 @@ struct xfs_buf {
>  	 * bt_lru_lock and not by b_sema
>  	 */
>  	struct list_head	b_lru;		/* lru list */
> -	spinlock_t		b_lock;		/* internal state lock */
>  	unsigned int		b_state;	/* internal state flags */
>  	int			b_io_error;	/* internal IO error state */
>  	wait_queue_head_t	b_waiters;	/* unpin waiters */

... I think this is misguided.

The idea behind the initial cacheline layout is that it should stay
read-only as much as possible, so that cache lookups can walk the buffer
without causing shared/exclusive cacheline contention with existing
buffer users. This was really important back in the days when the cache
used an rb-tree (i.e. the rbnode pointers dominated lookup profiles),
and it's still likely important with the rhashtable on large caches.

i.e. putting a spinlock in that first cacheline will result in lookups
and shrinker walks having cacheline contention, as the shrinker needs
exclusive access for the spin lock whilst the lookup walk needs shared
access to the b_rhash_head, b_rhash_key and b_length fields in
_xfs_buf_obj_cmp() for lockless lookup concurrency.

Hence I think it would be better to move the b_hold field to the same
cacheline as the b_state field rather than move the spinlock to the
initial cacheline that cache lookups walk...
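Something along these lines is what I have in mind - an untested sketch
only, just to illustrate where the fields would end up, not an actual
patch (exact placement within the second group doesn't matter much):

	/* first cacheline stays read-mostly for the lockless lookup walk */
	xfs_daddr_t		b_rhash_key;	/* buffer cache index */
	int			b_length;	/* size of buffer in BBs */
	atomic_t		b_lru_ref;	/* lru reclaim ref count */
	xfs_buf_flags_t		b_flags;	/* status flags */
	struct semaphore	b_sema;		/* semaphore for lockables */

	/*
	 * Frequently modified state is grouped together here, away from
	 * the fields that _xfs_buf_obj_cmp() reads during lookup.
	 */
	struct list_head	b_lru;		/* lru list */
	spinlock_t		b_lock;		/* internal state lock */
	unsigned int		b_hold;		/* reference count */
	unsigned int		b_state;	/* internal state flags */

-Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx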