On Fri, Apr 15, 2022 at 02:39:51PM +0200, Holger Hoffstätte wrote: > On 2022-04-05 01:22, Dave Chinner wrote: > > On Mon, Apr 04, 2022 at 02:16:23PM +0100, Frank Hofmann wrote: > > > Hi, > > > > > > we see machines getting stuck with a large number of backed-up > > > processes that invoke statfs() (monitoring stuff), like: > > > > > > [Sat Apr 2 09:54:32 2022] INFO: task node_exporter:244222 blocked for > > > more than 10 seconds. > > > [Sat Apr 2 09:54:32 2022] Tainted: G O > > > 5.15.26-cloudflare-2022.3.4 #1 > > > > Is this a vanilla kernel, or one that has been patched extensively > > by cloudfare? > > > > > [Sat Apr 2 09:54:32 2022] "echo 0 > > > > /proc/sys/kernel/hung_task_timeout_secs" disables this message. > > > [Sat Apr 2 09:54:32 2022] task:node_exporter state:D stack: 0 > > > pid:244222 ppid: 1 flags:0x00004000 > > > [Sat Apr 2 09:54:32 2022] Call Trace: > > > [Sat Apr 2 09:54:32 2022] <TASK> > > > [Sat Apr 2 09:54:32 2022] __schedule+0x2cd/0x950 > > > [Sat Apr 2 09:54:32 2022] schedule+0x44/0xa0 > > > [Sat Apr 2 09:54:32 2022] schedule_timeout+0xfc/0x140 > > > [Sat Apr 2 09:54:32 2022] ? try_to_wake_up+0x338/0x4e0 > > > [Sat Apr 2 09:54:32 2022] ? __prepare_to_swait+0x4b/0x70 > > > [Sat Apr 2 09:54:32 2022] wait_for_completion+0x86/0xe0 > > > [Sat Apr 2 09:54:32 2022] flush_work+0x5c/0x80 > > > [Sat Apr 2 09:54:32 2022] ? flush_workqueue_prep_pwqs+0x110/0x110 > > > [Sat Apr 2 09:54:32 2022] xfs_inodegc_flush.part.0+0x3b/0x90 > > > [Sat Apr 2 09:54:32 2022] xfs_fs_statfs+0x29/0x1c0 > > > [Sat Apr 2 09:54:32 2022] statfs_by_dentry+0x4d/0x70 > > > [Sat Apr 2 09:54:32 2022] user_statfs+0x57/0xc0 > > > [Sat Apr 2 09:54:32 2022] __do_sys_statfs+0x20/0x50 > > > [Sat Apr 2 09:54:32 2022] do_syscall_64+0x3b/0x90 > > > [Sat Apr 2 09:54:32 2022] entry_SYSCALL_64_after_hwframe+0x44/0xae > > > [Sat Apr 2 09:54:32 2022] RIP: 0033:0x4ac9db > > > > Waiting for background inode gc to complete. > > > > > A linear-over-time increasing number of 'D' state processes is usually > > > what alerts us to this. > > > > > > The oldest thread found waiting appears always to be the inode gc > > > worker doing deferred inactivation: > > > > OK. > > > > > This is a histogram (first column: number of proceses 'D'-ed on that > > > call trace) of `/proc/<PID>/stack`: > > > > It helps greatly if you reformat the stacks back to a readable stack > > (s/=>/\r/g, s/^\n//, s/^ //) so the output is easily readable. > > > > > 1 stuck on AGF, holding AGI, inode and inode buffer locks > > > > > > down+0x43/0x60 > > > xfs_buf_lock+0x29/0xa0 > > > xfs_buf_find+0x2c4/0x590 > > > xfs_buf_get_map+0x46/0x390 > > > xfs_buf_read_map+0x52/0x270 > > > xfs_trans_read_buf_map+0x128/0x2a0 > > > xfs_read_agf+0x87/0x110 > > > xfs_alloc_read_agf+0x34/0x1a0 > > > xfs_alloc_fix_freelist+0x3d7/0x4f0 > > > xfs_alloc_vextent+0x22b/0x440 > > > __xfs_inobt_alloc_block.isra.0+0xc5/0x1a0 > > > __xfs_btree_split+0xf2/0x610 > > > xfs_btree_split+0x4b/0x100 > > > xfs_btree_make_block_unfull+0x193/0x1c0 > > > xfs_btree_insrec+0x4a9/0x5a0 > > > xfs_btree_insert+0xa8/0x1f0 > > > xfs_difree_finobt+0xa4/0x240 > > > xfs_difree+0x126/0x1a0 > > > xfs_ifree+0xca/0x4a0 > > > xfs_inactive_ifree.isra.0+0x9e/0x1a0 > > > xfs_inactive+0xf8/0x170 > > > xfs_inodegc_worker+0x73/0xf0 > > > process_one_work+0x1e6/0x380 > > > worker_thread+0x50/0x3a0 > > > kthread+0x127/0x150 > > > ret_from_fork+0x22/0x30 > > > > > > 1 stuck on inode buffer, holding inode lock, holding AGF > > > > > > down+0x43/0x60 > > > xfs_buf_lock+0x29/0xa0 > > > xfs_buf_find+0x2c4/0x590 > > > xfs_buf_get_map+0x46/0x390 > > > xfs_buf_read_map+0x52/0x270 > > > xfs_trans_read_buf_map+0x128/0x2a0 > > > xfs_imap_to_bp+0x4e/0x70 > > > xfs_trans_log_inode+0x1d0/0x280 > > > xfs_bmap_btalloc+0x75f/0x820 > > > xfs_bmapi_allocate+0xe4/0x310 > > > xfs_bmapi_convert_delalloc+0x273/0x490 > > > xfs_map_blocks+0x1b5/0x400 > > > iomap_do_writepage+0x11d/0x820 > > > write_cache_pages+0x189/0x3e0 > > > iomap_writepages+0x1c/0x40 > > > xfs_vm_writepages+0x71/0xa0 > > > do_writepages+0xc3/0x1e0 > > > __writeback_single_inode+0x37/0x270 > > > writeback_sb_inodes+0x1ed/0x420 > > > __writeback_inodes_wb+0x4c/0xd0 > > > wb_writeback+0x1ba/0x270 > > > wb_workfn+0x292/0x4d0 > > > process_one_work+0x1e6/0x380 > > > worker_thread+0x50/0x3a0 > > > kthread+0x127/0x150 > > > ret_from_fork+0x22/0x30 > > > > That's the deadlock right there. > > > > task 1 task 2 > > lock inode A lock inode A+1 > > lock AGF B for allocation lock inode cluster > > remove inode from unlinked list > > free inode > > mark inode free in finobt > > insert new rec > > split tree > > lock AGF B for allocation > > <blocks waiting for task 1> > > <allocate blocks> > > > > xfs_bmap_finish > > log inode > > lock inode cluster buffer > > <blocks waiting for task 2> > > > > So this has nothing to do with background inode inactivation. It may > > have made it easier to hit, but it's definitely not *caused* by > > background inodegc as these two operations have always been able to > > run concurrently. > > > > The likely cause is going to be the async memory reclaim work from > > late June 2020. Commit 298f7bec503f ("xfs: pin inode backing buffer > > to the inode log item") added the xfs_imap_to_bp() call to > > xfs_trans_log_inode() to pin the inode cluster buffer in memory when > > the inode was first dirtied. > > > > Essentially, the problem is that inode unlink list manipulations are > > not consistently ordered with inode allocation/freeing, hence not > > consistently ordered against AGI and AGF locking. I didn't realise > > that there was an AGF component to this problem, otherwise I would > > have sent this patch upstream much sooner: > > > > https://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs.git/commit/?h=xfs-iunlink-item-2&id=17b71a2fba3549ea55e8bef764532fd42be1213a > > > > That commit is dated August 2020 - about the same time that the > > async memory reclaim stuff was merged. What this will do is: > > > > task 1 task 2 > > lock inode A lock inode A+1 > > lock AGF B for allocation > > free inode > > mark inode free in finobt > > insert new rec > > split tree > > lock AGF B for allocation > > <blocks waiting for task 1> > > <allocate blocks> > > > > xfs_bmap_finish > > log inode > > lock inode cluster buffer > > attach inode > > unlock inode cluster buffer > > xfs_trans_commit > > ... > > unlock AGF B > > <unblocks, holds AGF B> > > <allocate blocks> > > <completes split> > > <completes insert> > > <completes ifree> > > lock inode cluster buffer > > remove inode from unlinked list > > xfs_trans_commit > > ... > > unlock AGF B > > unlock inode cluster buffer > > > > And so the deadlock should go away. > > > > I've attached the current patch from my local dev tree below. Can > > you try it and see if the problem goes away? > > > > Cheers, > > > > Dave. > > > > Just a quick note on this. While I haven't come across the reported > self-deadlock (yet?), consistent lock ordering is obviously a Good Thing > and I'd like to see this applied, so that it can be merged into 5.15+. > It applies cleanly and has not caused any unexpected new issues during > almost two weeks of continuous use. I just hit this for the first time ever, so I guess I'll add this patch and see if anything shakes out. --D > > thanks, > Holger