As original poster: Confirmed. We've not seen detrimental effects of it either. On Mon, Apr 25, 2022 at 5:17 PM Darrick J. Wong <djwong@xxxxxxxxxx> wrote: > > On Tue, Apr 05, 2022 at 09:22:04AM +1000, Dave Chinner wrote: > > On Mon, Apr 04, 2022 at 02:16:23PM +0100, Frank Hofmann wrote: > > > Hi, > > > > > > we see machines getting stuck with a large number of backed-up > > > processes that invoke statfs() (monitoring stuff), like: > > > > > > [Sat Apr 2 09:54:32 2022] INFO: task node_exporter:244222 blocked for > > > more than 10 seconds. > > > [Sat Apr 2 09:54:32 2022] Tainted: G O > > > 5.15.26-cloudflare-2022.3.4 #1 > > > > Is this a vanilla kernel, or one that has been patched extensively > > by cloudfare? > > > > > [Sat Apr 2 09:54:32 2022] "echo 0 > > > > /proc/sys/kernel/hung_task_timeout_secs" disables this message. > > > [Sat Apr 2 09:54:32 2022] task:node_exporter state:D stack: 0 > > > pid:244222 ppid: 1 flags:0x00004000 > > > [Sat Apr 2 09:54:32 2022] Call Trace: > > > [Sat Apr 2 09:54:32 2022] <TASK> > > > [Sat Apr 2 09:54:32 2022] __schedule+0x2cd/0x950 > > > [Sat Apr 2 09:54:32 2022] schedule+0x44/0xa0 > > > [Sat Apr 2 09:54:32 2022] schedule_timeout+0xfc/0x140 > > > [Sat Apr 2 09:54:32 2022] ? try_to_wake_up+0x338/0x4e0 > > > [Sat Apr 2 09:54:32 2022] ? __prepare_to_swait+0x4b/0x70 > > > [Sat Apr 2 09:54:32 2022] wait_for_completion+0x86/0xe0 > > > [Sat Apr 2 09:54:32 2022] flush_work+0x5c/0x80 > > > [Sat Apr 2 09:54:32 2022] ? flush_workqueue_prep_pwqs+0x110/0x110 > > > [Sat Apr 2 09:54:32 2022] xfs_inodegc_flush.part.0+0x3b/0x90 > > > [Sat Apr 2 09:54:32 2022] xfs_fs_statfs+0x29/0x1c0 > > > [Sat Apr 2 09:54:32 2022] statfs_by_dentry+0x4d/0x70 > > > [Sat Apr 2 09:54:32 2022] user_statfs+0x57/0xc0 > > > [Sat Apr 2 09:54:32 2022] __do_sys_statfs+0x20/0x50 > > > [Sat Apr 2 09:54:32 2022] do_syscall_64+0x3b/0x90 > > > [Sat Apr 2 09:54:32 2022] entry_SYSCALL_64_after_hwframe+0x44/0xae > > > [Sat Apr 2 09:54:32 2022] RIP: 0033:0x4ac9db > > > > Waiting for background inode gc to complete. > > > > > A linear-over-time increasing number of 'D' state processes is usually > > > what alerts us to this. > > > > > > The oldest thread found waiting appears always to be the inode gc > > > worker doing deferred inactivation: > > > > OK. > > > > > This is a histogram (first column: number of proceses 'D'-ed on that > > > call trace) of `/proc/<PID>/stack`: > > > > It helps greatly if you reformat the stacks back to a readable stack > > (s/=>/\r/g, s/^\n//, s/^ //) so the output is easily readable. > > > > > 1 stuck on AGF, holding AGI, inode and inode buffer locks > > > > > > down+0x43/0x60 > > > xfs_buf_lock+0x29/0xa0 > > > xfs_buf_find+0x2c4/0x590 > > > xfs_buf_get_map+0x46/0x390 > > > xfs_buf_read_map+0x52/0x270 > > > xfs_trans_read_buf_map+0x128/0x2a0 > > > xfs_read_agf+0x87/0x110 > > > xfs_alloc_read_agf+0x34/0x1a0 > > > xfs_alloc_fix_freelist+0x3d7/0x4f0 > > > xfs_alloc_vextent+0x22b/0x440 > > > __xfs_inobt_alloc_block.isra.0+0xc5/0x1a0 > > > __xfs_btree_split+0xf2/0x610 > > > xfs_btree_split+0x4b/0x100 > > > xfs_btree_make_block_unfull+0x193/0x1c0 > > > xfs_btree_insrec+0x4a9/0x5a0 > > > xfs_btree_insert+0xa8/0x1f0 > > > xfs_difree_finobt+0xa4/0x240 > > > xfs_difree+0x126/0x1a0 > > > xfs_ifree+0xca/0x4a0 > > > xfs_inactive_ifree.isra.0+0x9e/0x1a0 > > > xfs_inactive+0xf8/0x170 > > > xfs_inodegc_worker+0x73/0xf0 > > > process_one_work+0x1e6/0x380 > > > worker_thread+0x50/0x3a0 > > > kthread+0x127/0x150 > > > ret_from_fork+0x22/0x30 > > > > > > 1 stuck on inode buffer, holding inode lock, holding AGF > > > > > > down+0x43/0x60 > > > xfs_buf_lock+0x29/0xa0 > > > xfs_buf_find+0x2c4/0x590 > > > xfs_buf_get_map+0x46/0x390 > > > xfs_buf_read_map+0x52/0x270 > > > xfs_trans_read_buf_map+0x128/0x2a0 > > > xfs_imap_to_bp+0x4e/0x70 > > > xfs_trans_log_inode+0x1d0/0x280 > > > xfs_bmap_btalloc+0x75f/0x820 > > > xfs_bmapi_allocate+0xe4/0x310 > > > xfs_bmapi_convert_delalloc+0x273/0x490 > > > xfs_map_blocks+0x1b5/0x400 > > > iomap_do_writepage+0x11d/0x820 > > > write_cache_pages+0x189/0x3e0 > > > iomap_writepages+0x1c/0x40 > > > xfs_vm_writepages+0x71/0xa0 > > > do_writepages+0xc3/0x1e0 > > > __writeback_single_inode+0x37/0x270 > > > writeback_sb_inodes+0x1ed/0x420 > > > __writeback_inodes_wb+0x4c/0xd0 > > > wb_writeback+0x1ba/0x270 > > > wb_workfn+0x292/0x4d0 > > > process_one_work+0x1e6/0x380 > > > worker_thread+0x50/0x3a0 > > > kthread+0x127/0x150 > > > ret_from_fork+0x22/0x30 > > > > That's the deadlock right there. > > > > task 1 task 2 > > lock inode A lock inode A+1 > > lock AGF B for allocation lock inode cluster > > remove inode from unlinked list > > free inode > > mark inode free in finobt > > insert new rec > > split tree > > lock AGF B for allocation > > <blocks waiting for task 1> > > <allocate blocks> > > > > xfs_bmap_finish > > log inode > > lock inode cluster buffer > > <blocks waiting for task 2> > > > > So this has nothing to do with background inode inactivation. It may > > have made it easier to hit, but it's definitely not *caused* by > > background inodegc as these two operations have always been able to > > run concurrently. > > > > The likely cause is going to be the async memory reclaim work from > > late June 2020. Commit 298f7bec503f ("xfs: pin inode backing buffer > > to the inode log item") added the xfs_imap_to_bp() call to > > xfs_trans_log_inode() to pin the inode cluster buffer in memory when > > the inode was first dirtied. > > > > Essentially, the problem is that inode unlink list manipulations are > > not consistently ordered with inode allocation/freeing, hence not > > consistently ordered against AGI and AGF locking. I didn't realise > > that there was an AGF component to this problem, otherwise I would > > have sent this patch upstream much sooner: > > > > https://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs.git/commit/?h=xfs-iunlink-item-2&id=17b71a2fba3549ea55e8bef764532fd42be1213a > > > > That commit is dated August 2020 - about the same time that the > > async memory reclaim stuff was merged. What this will do is: > > > > task 1 task 2 > > lock inode A lock inode A+1 > > lock AGF B for allocation > > free inode > > mark inode free in finobt > > insert new rec > > split tree > > lock AGF B for allocation > > <blocks waiting for task 1> > > <allocate blocks> > > > > xfs_bmap_finish > > log inode > > lock inode cluster buffer > > attach inode > > unlock inode cluster buffer > > xfs_trans_commit > > ... > > unlock AGF B > > <unblocks, holds AGF B> > > <allocate blocks> > > <completes split> > > <completes insert> > > <completes ifree> > > lock inode cluster buffer > > remove inode from unlinked list > > xfs_trans_commit > > ... > > unlock AGF B > > unlock inode cluster buffer > > > > And so the deadlock should go away. > > > > I've attached the current patch from my local dev tree below. Can > > you try it and see if the problem goes away? > > > > Cheers, > > > > Dave. > > -- > > Dave Chinner > > david@xxxxxxxxxxxxx > > > > > > xfs: reorder iunlink remove operation in xfs_ifree > > > > From: Dave Chinner <dchinner@xxxxxxxxxx> > > > > The O_TMPFILE creation implementation creates a specific order of > > operations for inode allocation/freeing and unlinked list > > modification. Currently both are serialised by the AGI, so the order > > doesn't strictly matter as long as the are both in the same > > transaction. > > > > However, if we want to move the unlinked list insertions largely > > out from under the AGI lock, then we have to be concerned about the > > order in which we do unlinked list modification operations. > > O_TMPFILE creation tells us this order is inode allocation/free, > > then unlinked list modification. > > > > Change xfs_ifree() to use this same ordering on unlinked list > > removal. THis way we always guarantee that when we enter the > > iunlinked list removal code from this path, we have the already > > locked and we don't have to worry about lock nesting AGI reads > > inside unlink list locks because it's already locked and attached to > > the transaction. > > > > We can do this safely as the inode freeing and unlinked list removal > > are done in the same transaction and hence are atomic operations > > with resepect to log recovery. > > > > Signed-off-by: Dave Chinner <dchinner@xxxxxxxxxx> > > No weird problems observed due to /this/ patch, so: > > Reviewed-by: Darrick J. Wong <djwong@xxxxxxxxxx> > > --D > > > > > --- > > fs/xfs/xfs_inode.c | 24 +++++++++++++----------- > > 1 file changed, 13 insertions(+), 11 deletions(-) > > > > diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c > > index c66c9585f1ba..f8eaeb83d05e 100644 > > --- a/fs/xfs/xfs_inode.c > > +++ b/fs/xfs/xfs_inode.c > > @@ -2277,14 +2277,13 @@ xfs_ifree_cluster( > > } > > > > /* > > - * This is called to return an inode to the inode free list. > > - * The inode should already be truncated to 0 length and have > > - * no pages associated with it. This routine also assumes that > > - * the inode is already a part of the transaction. > > + * This is called to return an inode to the inode free list. The inode should > > + * already be truncated to 0 length and have no pages associated with it. This > > + * routine also assumes that the inode is already a part of the transaction. > > * > > - * The on-disk copy of the inode will have been added to the list > > - * of unlinked inodes in the AGI. We need to remove the inode from > > - * that list atomically with respect to freeing it here. > > + * The on-disk copy of the inode will have been added to the list of unlinked > > + * inodes in the AGI. We need to remove the inode from that list atomically with > > + * respect to freeing it here. > > */ > > int > > xfs_ifree( > > @@ -2306,13 +2305,16 @@ xfs_ifree( > > pag = xfs_perag_get(mp, XFS_INO_TO_AGNO(mp, ip->i_ino)); > > > > /* > > - * Pull the on-disk inode from the AGI unlinked list. > > + * Free the inode first so that we guarantee that the AGI lock is going > > + * to be taken before we remove the inode from the unlinked list. This > > + * makes the AGI lock -> unlinked list modification order the same as > > + * used in O_TMPFILE creation. > > */ > > - error = xfs_iunlink_remove(tp, pag, ip); > > + error = xfs_difree(tp, pag, ip->i_ino, &xic); > > if (error) > > - goto out; > > + return error; > > > > - error = xfs_difree(tp, pag, ip->i_ino, &xic); > > + error = xfs_iunlink_remove(tp, pag, ip); > > if (error) > > goto out; > >