Re: [PATCH] xfs: fix unlink vs cluster buffer instantiation race

"Darrick J. Wong" <djwong@xxxxxxxxxx> · Thu, 13 Jun 2024 08:07:41 -0700

On Thu, Jun 13, 2024 at 08:51:48AM +1000, Dave Chinner wrote:
> From: Dave Chinner <dchinner@xxxxxxxxxx>
> 
> Luis has been reporting an assert failure when freeing an inode
> cluster during inode inactivation for a while. The assert looks
> like:
> 
>  XFS: Assertion failed: bp->b_flags & XBF_DONE, file: fs/xfs/xfs_trans_buf.c, line: 241
>  ------------[ cut here ]------------
>  kernel BUG at fs/xfs/xfs_message.c:102!
>  Oops: invalid opcode: 0000 [#1] PREEMPT SMP KASAN NOPTI
>  CPU: 4 PID: 73 Comm: kworker/4:1 Not tainted 6.10.0-rc1 #4
>  Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2 04/01/2014
>  Workqueue: xfs-inodegc/loop5 xfs_inodegc_worker [xfs]
>  RIP: 0010:assfail (fs/xfs/xfs_message.c:102) xfs
>  RSP: 0018:ffff88810188f7f0 EFLAGS: 00010202
>  RAX: 0000000000000000 RBX: ffff88816e748250 RCX: 1ffffffff844b0e7
>  RDX: 0000000000000004 RSI: ffff88810188f558 RDI: ffffffffc2431fa0
>  RBP: 1ffff11020311f01 R08: 0000000042431f9f R09: ffffed1020311e9b
>  R10: ffff88810188f4df R11: ffffffffac725d70 R12: ffff88817a3f4000
>  R13: ffff88812182f000 R14: ffff88810188f998 R15: ffffffffc2423f80
>  FS:  0000000000000000(0000) GS:ffff8881c8400000(0000) knlGS:0000000000000000
>  CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>  CR2: 000055fe9d0f109c CR3: 000000014426c002 CR4: 0000000000770ef0
>  DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
>  DR3: 0000000000000000 DR6: 00000000fffe07f0 DR7: 0000000000000400
>  PKRU: 55555554
>  Call Trace:
>   <TASK>
>  xfs_trans_read_buf_map (fs/xfs/xfs_trans_buf.c:241 (discriminator 1)) xfs
>  xfs_imap_to_bp (fs/xfs/xfs_trans.h:210 fs/xfs/libxfs/xfs_inode_buf.c:138) xfs
>  xfs_inode_item_precommit (fs/xfs/xfs_inode_item.c:145) xfs
>  xfs_trans_run_precommits (fs/xfs/xfs_trans.c:931) xfs
>  __xfs_trans_commit (fs/xfs/xfs_trans.c:966) xfs
>  xfs_inactive_ifree (fs/xfs/xfs_inode.c:1811) xfs
>  xfs_inactive (fs/xfs/xfs_inode.c:2013) xfs
>  xfs_inodegc_worker (fs/xfs/xfs_icache.c:1841 fs/xfs/xfs_icache.c:1886) xfs
>  process_one_work (kernel/workqueue.c:3231)
>  worker_thread (kernel/workqueue.c:3306 (discriminator 2) kernel/workqueue.c:3393 (discriminator 2))
>  kthread (kernel/kthread.c:389)
>  ret_from_fork (arch/x86/kernel/process.c:147)
>  ret_from_fork_asm (arch/x86/entry/entry_64.S:257)
>   </TASK>
> 
> And occurs when the the inode precommit handlers is attempt to look
> up the inode cluster buffer to attach the inode for writeback.
> 
> The trail of logic that I can reconstruct is as follows.
> 
> 	1. the inode is clean when inodegc runs, so it is not
> 	   attached to a cluster buffer when precommit runs.
> 
> 	2. #1 implies the inode cluster buffer may be clean and not
> 	   pinned by dirty inodes when inodegc runs.
> 
> 	3. #2 implies that the inode cluster buffer can be reclaimed
> 	   by memory pressure at any time.
> 
> 	4. The assert failure implies that the cluster buffer was
> 	   attached to the transaction, but not marked done. It had
> 	   been accessed earlier in the transaction, but not marked
> 	   done.
> 
> 	5. #4 implies the cluster buffer has been invalidated (i.e.
> 	   marked stale).
> 
> 	6. #5 implies that the inode cluster buffer was instantiated
> 	   uninitialised in the transaction in xfs_ifree_cluster(),
> 	   which only instantiates the buffers to invalidate them
> 	   and never marks them as done.
> 
> Given factors 1-3, this issue is highly dependent on timing and
> environmental factors. Hence the issue can be very difficult to
> reproduce in some situations, but highly reliable in others. Luis
> has an environment where it can be reproduced easily by g/531 but,
> OTOH, I've reproduced it only once in ~2000 cycles of g/531.
> 
> I think the fix is to have xfs_ifree_cluster() set the XBF_DONE flag
> on the cluster buffers, even though they may not be initialised. The
> reasons why I think this is safe are:
> 
> 	1. A buffer cache lookup hit on a XBF_STALE buffer will
> 	   clear the XBF_DONE flag. Hence all future users of the
> 	   buffer know they have to re-initialise the contents
> 	   before use and mark it done themselves.
> 
> 	2. xfs_trans_binval() sets the XFS_BLI_STALE flag, which
> 	   means the buffer remains locked until the journal commit
> 	   completes and the buffer is unpinned. Hence once marked
> 	   XBF_STALE/XFS_BLI_STALE by xfs_ifree_cluster(), the only
> 	   context that can access the freed buffer is the currently
> 	   running transaction.
> 
> 	3. #2 implies that future buffer lookups in the currently
> 	   running transaction will hit the transaction match code
> 	   and not the buffer cache. Hence XBF_STALE and
> 	   XFS_BLI_STALE will not be cleared unless the transaction
> 	   initialises and logs the buffer with valid contents
> 	   again. At which point, the buffer will be marked marked
> 	   XBF_DONE again, so having XBF_DONE already set on the
> 	   stale buffer is a moot point.
> 
> 	4. #2 also implies that any concurrent access to that
> 	   cluster buffer will block waiting on the buffer lock
> 	   until the inode cluster has been fully freed and is no
> 	   longer an active inode cluster buffer.
> 
> 	5. #4 + #1 means that any future user of the disk range of
> 	   that buffer will always see the range of disk blocks
> 	   covered by the cluster buffer as not done, and hence must
> 	   initialise the contents themselves.
> 
> 	6. Setting XBF_DONE in xfs_ifree_cluster() then means the
> 	   unlinked inode precommit code will see a XBF_DONE buffer
> 	   from the transaction match as it expects. It can then
> 	   attach the stale but newly dirtied inode to the stale
> 	   but newly dirtied cluster buffer without unexpected
> 	   failures. The stale buffer will then sail through the
> 	   journal and do the right thing with the attached stale
> 	   inode during unpin.
> 
> Hence the fix is just one line of extra code. The explanation of
> why we have to set XBF_DONE in xfs_ifree_cluster, OTOH, is long and
> complex....
> 
> Fixes: 82842fee6e59 ("xfs: fix AGF vs inode cluster buffer deadlock")
> Signed-off-by: Dave Chinner <dchinner@xxxxxxxxxx>
> Tested-by: Luis Chamberlain <mcgrof@xxxxxxxxxx>

I'm pretty sure this also fixes the same symptoms that I can (very very
occasionally) reproduce.  Though, I haven't seen it in a while.

Reviewed-by: Darrick J. Wong <djwong@xxxxxxxxxx>

--D

> ---
>  fs/xfs/xfs_inode.c | 23 +++++++++++++++++++----
>  1 file changed, 19 insertions(+), 4 deletions(-)
> 
> diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
> index 58fb7a5062e1..f36091e1e7f5 100644
> --- a/fs/xfs/xfs_inode.c
> +++ b/fs/xfs/xfs_inode.c
> @@ -2548,11 +2548,26 @@ xfs_ifree_cluster(
>  		 * This buffer may not have been correctly initialised as we
>  		 * didn't read it from disk. That's not important because we are
>  		 * only using to mark the buffer as stale in the log, and to
> -		 * attach stale cached inodes on it. That means it will never be
> -		 * dispatched for IO. If it is, we want to know about it, and we
> -		 * want it to fail. We can acheive this by adding a write
> -		 * verifier to the buffer.
> +		 * attach stale cached inodes on it.
> +		 *
> +		 * For the inode that triggered the cluster freeing, this
> +		 * attachment may occur in xfs_inode_item_precommit() after we
> +		 * have marked this buffer stale.  If this buffer was not in
> +		 * memory before xfs_ifree_cluster() started, it will not be
> +		 * marked XBF_DONE and this will cause problems later in
> +		 * xfs_inode_item_precommit() when we trip over a (stale, !done)
> +		 * buffer to attached to the transaction.
> +		 *
> +		 * Hence we have to mark the buffer as XFS_DONE here. This is
> +		 * safe because we are also marking the buffer as XBF_STALE and
> +		 * XFS_BLI_STALE. That means it will never be dispatched for
> +		 * IO and it won't be unlocked until the cluster freeing has
> +		 * been committed to the journal and the buffer unpinned. If it
> +		 * is written, we want to know about it, and we want it to
> +		 * fail. We can acheive this by adding a write verifier to the
> +		 * buffer.
>  		 */
> +		bp->b_flags |= XBF_DONE;
>  		bp->b_ops = &xfs_inode_buf_ops;
>  
>  		/*
> -- 
> 2.45.1
> 
>