Re: Self-deadlock (?) in xfs_inodegc_worker / xfs_inactive ?

On Tue, Apr 05, 2022 at 09:22:04AM +1000, Dave Chinner wrote:
> On Mon, Apr 04, 2022 at 02:16:23PM +0100, Frank Hofmann wrote:
> > Hi,
> > 
> > we see machines getting stuck with a large number of backed-up
> > processes that invoke statfs() (monitoring stuff), like:
> >
> > [Sat Apr  2 09:54:32 2022] INFO: task node_exporter:244222 blocked for
> > more than 10 seconds.
> > [Sat Apr  2 09:54:32 2022]       Tainted: G           O
> > 5.15.26-cloudflare-2022.3.4 #1
> 
> Is this a vanilla kernel, or one that has been patched extensively
> by Cloudflare?
> 
> > [Sat Apr  2 09:54:32 2022] "echo 0 >
> > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> > [Sat Apr  2 09:54:32 2022] task:node_exporter   state:D stack:    0
> > pid:244222 ppid:     1 flags:0x00004000
> > [Sat Apr  2 09:54:32 2022] Call Trace:
> > [Sat Apr  2 09:54:32 2022]  <TASK>
> > [Sat Apr  2 09:54:32 2022]  __schedule+0x2cd/0x950
> > [Sat Apr  2 09:54:32 2022]  schedule+0x44/0xa0
> > [Sat Apr  2 09:54:32 2022]  schedule_timeout+0xfc/0x140
> > [Sat Apr  2 09:54:32 2022]  ? try_to_wake_up+0x338/0x4e0
> > [Sat Apr  2 09:54:32 2022]  ? __prepare_to_swait+0x4b/0x70
> > [Sat Apr  2 09:54:32 2022]  wait_for_completion+0x86/0xe0
> > [Sat Apr  2 09:54:32 2022]  flush_work+0x5c/0x80
> > [Sat Apr  2 09:54:32 2022]  ? flush_workqueue_prep_pwqs+0x110/0x110
> > [Sat Apr  2 09:54:32 2022]  xfs_inodegc_flush.part.0+0x3b/0x90
> > [Sat Apr  2 09:54:32 2022]  xfs_fs_statfs+0x29/0x1c0
> > [Sat Apr  2 09:54:32 2022]  statfs_by_dentry+0x4d/0x70
> > [Sat Apr  2 09:54:32 2022]  user_statfs+0x57/0xc0
> > [Sat Apr  2 09:54:32 2022]  __do_sys_statfs+0x20/0x50
> > [Sat Apr  2 09:54:32 2022]  do_syscall_64+0x3b/0x90
> > [Sat Apr  2 09:54:32 2022]  entry_SYSCALL_64_after_hwframe+0x44/0xae
> > [Sat Apr  2 09:54:32 2022] RIP: 0033:0x4ac9db
> 
> Waiting for background inode gc to complete.
> 
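> For context: in 5.15, xfs_fs_statfs() flushes all pending inodegc
> work before it samples the free space counters, which is why every
> statfs() caller ends up in flush_work(). A minimal sketch of that
> call path (simplified for illustration, not verbatim 5.15 code):
> 
> 	STATIC int
> 	xfs_fs_statfs(
> 		struct dentry		*dentry,
> 		struct kstatfs		*statp)
> 	{
> 		struct xfs_mount	*mp = XFS_M(dentry->d_sb);
> 
> 		/*
> 		 * Wait for all queued inode inactivation work to finish
> 		 * so the free block/inode counts are accurate. If an
> 		 * inodegc worker is stuck, every statfs() caller blocks
> 		 * here, which matches the backtrace above.
> 		 */
> 		xfs_inodegc_flush(mp);
> 		...
> 	}
> 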
> > A number of 'D'-state processes that grows linearly over time is
> > usually what alerts us to this.
> > 
> > The oldest thread found waiting appears always to be the inode gc
> > worker doing deferred inactivation:
> 
> OK.
> 
> > This is a histogram (first column: number of processes 'D'-ed on that
> > call trace) of `/proc/<PID>/stack`:
> 
> It helps greatly if you reformat the stacks back into readable form
> (s/=>/\r/g, s/^\n//, s/^ //) so they are easy to read.
> 
> > 1 stuck on AGF, holding AGI, inode and inode buffer locks
> > 
> > down+0x43/0x60 
> > xfs_buf_lock+0x29/0xa0 
> > xfs_buf_find+0x2c4/0x590
> > xfs_buf_get_map+0x46/0x390 
> > xfs_buf_read_map+0x52/0x270 
> > xfs_trans_read_buf_map+0x128/0x2a0 
> > xfs_read_agf+0x87/0x110 
> > xfs_alloc_read_agf+0x34/0x1a0 
> > xfs_alloc_fix_freelist+0x3d7/0x4f0 
> > xfs_alloc_vextent+0x22b/0x440 
> > __xfs_inobt_alloc_block.isra.0+0xc5/0x1a0 
> > __xfs_btree_split+0xf2/0x610 
> > xfs_btree_split+0x4b/0x100 
> > xfs_btree_make_block_unfull+0x193/0x1c0 
> > xfs_btree_insrec+0x4a9/0x5a0 
> > xfs_btree_insert+0xa8/0x1f0 
> > xfs_difree_finobt+0xa4/0x240 
> > xfs_difree+0x126/0x1a0 
> > xfs_ifree+0xca/0x4a0 
> > xfs_inactive_ifree.isra.0+0x9e/0x1a0 
> > xfs_inactive+0xf8/0x170 
> > xfs_inodegc_worker+0x73/0xf0 
> > process_one_work+0x1e6/0x380 
> > worker_thread+0x50/0x3a0 
> > kthread+0x127/0x150 
> > ret_from_fork+0x22/0x30
> > 
> > 1 stuck on inode buffer, holding inode lock, holding AGF
> > 
> > down+0x43/0x60 
> > xfs_buf_lock+0x29/0xa0 
> > xfs_buf_find+0x2c4/0x590
> > xfs_buf_get_map+0x46/0x390 
> > xfs_buf_read_map+0x52/0x270 
> > xfs_trans_read_buf_map+0x128/0x2a0 
> > xfs_imap_to_bp+0x4e/0x70 
> > xfs_trans_log_inode+0x1d0/0x280 
> > xfs_bmap_btalloc+0x75f/0x820 
> > xfs_bmapi_allocate+0xe4/0x310 
> > xfs_bmapi_convert_delalloc+0x273/0x490 
> > xfs_map_blocks+0x1b5/0x400
> > iomap_do_writepage+0x11d/0x820 
> > write_cache_pages+0x189/0x3e0 
> > iomap_writepages+0x1c/0x40 
> > xfs_vm_writepages+0x71/0xa0 
> > do_writepages+0xc3/0x1e0 
> > __writeback_single_inode+0x37/0x270 
> > writeback_sb_inodes+0x1ed/0x420 
> > __writeback_inodes_wb+0x4c/0xd0 
> > wb_writeback+0x1ba/0x270 
> > wb_workfn+0x292/0x4d0 
> > process_one_work+0x1e6/0x380 
> > worker_thread+0x50/0x3a0 
> > kthread+0x127/0x150 
> > ret_from_fork+0x22/0x30
> 
> That's the deadlock right there.
> 
> task 1				task 2
> lock inode A			lock inode A+1
> lock AGF B for allocation	lock inode cluster
> 				remove inode from unlinked list
> 				free inode
> 				mark inode free in finobt
> 				  insert new rec
> 				    split tree
> 				      lock AGF B for allocation
> 				      <blocks waiting for task 1>
> <allocate blocks>
> 
> xfs_bmap_finish
>   log inode
>     lock inode cluster buffer
>     <blocks waiting for task 2>
> 
> So this has nothing to do with background inode inactivation. It may
> have made it easier to hit, but it's definitely not *caused* by
> background inodegc as these two operations have always been able to
> run concurrently.
> 
> The likely cause is going to be the async memory reclaim work from
> late June 2020. Commit 298f7bec503f ("xfs: pin inode backing buffer
> to the inode log item") added the xfs_imap_to_bp() call to
> xfs_trans_log_inode() to pin the inode cluster buffer in memory when
> the inode was first dirtied.
> 
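> The hunk that commit added to xfs_trans_log_inode() looks roughly
> like this (simplified from 5.15 for illustration):
> 
> 	/*
> 	 * When the inode is first dirtied in a transaction, read and
> 	 * lock the inode cluster buffer to pin it in memory for the
> 	 * life of the log item. The xfs_buf_lock() underneath
> 	 * xfs_imap_to_bp() is where the writeback task above blocks.
> 	 */
> 	if (!iip->ili_item.li_buf) {
> 		struct xfs_buf	*bp;
> 
> 		spin_unlock(&iip->ili_lock);
> 		error = xfs_imap_to_bp(ip->i_mount, tp, &ip->i_imap, &bp);
> 		...
> 	}
> 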
> Essentially, the problem is that inode unlink list manipulations are
> not consistently ordered with inode allocation/freeing, hence not
> consistently ordered against AGI and AGF locking. I didn't realise
> that there was an AGF component to this problem, otherwise I would
> have sent this patch upstream much sooner:
> 
> https://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs.git/commit/?h=xfs-iunlink-item-2&id=17b71a2fba3549ea55e8bef764532fd42be1213a
> 
> That commit is dated August 2020 - about the same time that the
> async memory reclaim stuff was merged. What this will do is:
> 
> task 1				task 2
> lock inode A			lock inode A+1
> lock AGF B for allocation
> 				free inode
> 				mark inode free in finobt
> 				  insert new rec
> 				    split tree
> 				      lock AGF B for allocation
> 				      <blocks waiting for task 1>
> <allocate blocks>
> 
> xfs_bmap_finish
>   log inode
>     lock inode cluster buffer
>     attach inode
>     unlock inode cluster buffer
> xfs_trans_commit
>   ...
>   unlock AGF B
> 				      <unblocks, holds AGF B>
> 				      <allocate blocks>
> 				    <completes split>
> 				  <completes insert>
> 				<completes ifree>
> 				lock inode cluster buffer
> 				remove inode from unlinked list
> 				xfs_trans_commit
> 				  ...
> 				  unlock AGF B
> 				  unlock inode cluster buffer
> 
> And so the deadlock should go away.
> 
> I've attached the current patch from my local dev tree below. Can
> you try it and see if the problem goes away?
> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@xxxxxxxxxxxxx
> 
> 
> xfs: reorder iunlink remove operation in xfs_ifree
> 
> From: Dave Chinner <dchinner@xxxxxxxxxx>
> 
> The O_TMPFILE creation implementation creates a specific order of
> operations for inode allocation/freeing and unlinked list
> modification. Currently both are serialised by the AGI, so the order
> doesn't strictly matter as long as they are both in the same
> transaction.
> 
> However, if we want to move the unlinked list insertions largely
> out from under the AGI lock, then we have to be concerned about the
> order in which we do unlinked list modification operations.
> O_TMPFILE creation tells us this order is inode allocation/free,
> then unlinked list modification.
> 
> Change xfs_ifree() to use this same ordering on unlinked list
> removal. This way we always guarantee that when we enter the
> iunlinked list removal code from this path, the AGI is already
> locked, and we don't have to worry about lock nesting of AGI reads
> inside unlinked list locks because the AGI is already locked and
> attached to the transaction.
> 
> We can do this safely as the inode freeing and unlinked list removal
> are done in the same transaction and hence are atomic operations
> with respect to log recovery.
> 
> Signed-off-by: Dave Chinner <dchinner@xxxxxxxxxx>

No weird problems observed due to /this/ patch, so:

Reviewed-by: Darrick J. Wong <djwong@xxxxxxxxxx>

--D

> 
> ---
>  fs/xfs/xfs_inode.c | 24 +++++++++++++-----------
>  1 file changed, 13 insertions(+), 11 deletions(-)
> 
> diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
> index c66c9585f1ba..f8eaeb83d05e 100644
> --- a/fs/xfs/xfs_inode.c
> +++ b/fs/xfs/xfs_inode.c
> @@ -2277,14 +2277,13 @@ xfs_ifree_cluster(
>  }
>  
>  /*
> - * This is called to return an inode to the inode free list.
> - * The inode should already be truncated to 0 length and have
> - * no pages associated with it.  This routine also assumes that
> - * the inode is already a part of the transaction.
> + * This is called to return an inode to the inode free list.  The inode should
> + * already be truncated to 0 length and have no pages associated with it.  This
> + * routine also assumes that the inode is already a part of the transaction.
>   *
> - * The on-disk copy of the inode will have been added to the list
> - * of unlinked inodes in the AGI. We need to remove the inode from
> - * that list atomically with respect to freeing it here.
> + * The on-disk copy of the inode will have been added to the list of unlinked
> + * inodes in the AGI. We need to remove the inode from that list atomically with
> + * respect to freeing it here.
>   */
>  int
>  xfs_ifree(
> @@ -2306,13 +2305,16 @@ xfs_ifree(
>  	pag = xfs_perag_get(mp, XFS_INO_TO_AGNO(mp, ip->i_ino));
>  
>  	/*
> -	 * Pull the on-disk inode from the AGI unlinked list.
> +	 * Free the inode first so that we guarantee that the AGI lock is going
> +	 * to be taken before we remove the inode from the unlinked list. This
> +	 * makes the AGI lock -> unlinked list modification order the same as
> +	 * used in O_TMPFILE creation.
>  	 */
> -	error = xfs_iunlink_remove(tp, pag, ip);
> +	error = xfs_difree(tp, pag, ip->i_ino, &xic);
>  	if (error)
> -		goto out;
> +		return error;
>  
> -	error = xfs_difree(tp, pag, ip->i_ino, &xic);
> +	error = xfs_iunlink_remove(tp, pag, ip);
>  	if (error)
>  		goto out;
>  


