On Tue, May 10, 2022 at 04:30:51PM +1000, Chris Dunlop wrote:
> On Mon, May 09, 2022 at 10:10:57PM -0700, Darrick J. Wong wrote:
> > On Tue, May 10, 2022 at 07:07:35AM +0300, Amir Goldstein wrote:
> > > On Mon, May 09, 2022 at 12:46:59PM +1000, Chris Dunlop wrote:
> > > > Is it to be expected that removing 29TB of highly reflinked and fragmented
> > > > data could take days, the entire time blocking other tasks like "rm" and
> > > > "df" on the same filesystem?
> ...
> > > From a product POV, I think what should have happened here is that
> > > freeing up the space would have taken 10 days in the background, but
> > > otherwise, filesystem should not have been blocking other processes
> > > for long periods of time.

Sure, that's obvious, and without looking at the code I know what that
is: statfs().

> > Indeed. Chris, do you happen to have the sysrq-w output handy? I'm
> > curious if the stall warning backtraces all had xfs_inodegc_flush() in
> > them, or were there other parts of the system stalling elsewhere too?
> > 50 billion updates is a lot, but there shouldn't be stall warnings.
> 
> Sure: https://file.io/25za5BNBlnU8 (6.8M)
> 
> Of the 3677 tasks in there, only 38 do NOT show xfs_inodegc_flush().

yup, 3677 tasks in statfs(). The majority are rm, df, and check_disk
processes, there's a couple of veeamagent processes stuck and an
(ostnamed) process, whatever that is... No real surprises there, and
it's not why the filesystem is taking so long to remove all the
reflink references.

There is just one background inodegc worker thread running, so there's
no excessive load being generated by inodegc, either. It's no
different to a single rm running xfs_inactive() directly on a slightly
older kernel and filling the journal:

May 06 09:49:01 d5 kernel: task:kworker/6:1 state:D stack: 0 pid:23258 ppid: 2 flags:0x00004000
May 06 09:49:01 d5 kernel: Workqueue: xfs-inodegc/dm-1 xfs_inodegc_worker [xfs]
May 06 09:49:01 d5 kernel: Call Trace:
May 06 09:49:01 d5 kernel:  <TASK>
May 06 09:49:01 d5 kernel:  __schedule+0x241/0x740
May 06 09:49:01 d5 kernel:  schedule+0x3a/0xa0
May 06 09:49:01 d5 kernel:  schedule_timeout+0x271/0x310
May 06 09:49:01 d5 kernel:  __down+0x6c/0xa0
May 06 09:49:01 d5 kernel:  down+0x3b/0x50
May 06 09:49:01 d5 kernel:  xfs_buf_lock+0x40/0xe0 [xfs]
May 06 09:49:01 d5 kernel:  xfs_buf_find.isra.32+0x3ee/0x730 [xfs]
May 06 09:49:01 d5 kernel:  xfs_buf_get_map+0x3c/0x430 [xfs]
May 06 09:49:01 d5 kernel:  xfs_buf_read_map+0x37/0x2c0 [xfs]
May 06 09:49:01 d5 kernel:  xfs_trans_read_buf_map+0x1cb/0x300 [xfs]
May 06 09:49:01 d5 kernel:  xfs_btree_read_buf_block.constprop.40+0x75/0xb0 [xfs]
May 06 09:49:01 d5 kernel:  xfs_btree_lookup_get_block+0x85/0x150 [xfs]
May 06 09:49:01 d5 kernel:  xfs_btree_overlapped_query_range+0x33c/0x3c0 [xfs]
May 06 09:49:01 d5 kernel:  xfs_btree_query_range+0xd5/0x100 [xfs]
May 06 09:49:01 d5 kernel:  xfs_rmap_query_range+0x71/0x80 [xfs]
May 06 09:49:01 d5 kernel:  xfs_rmap_lookup_le_range+0x88/0x180 [xfs]
May 06 09:49:01 d5 kernel:  xfs_rmap_unmap_shared+0x89/0x560 [xfs]
May 06 09:49:01 d5 kernel:  xfs_rmap_finish_one+0x201/0x260 [xfs]
May 06 09:49:01 d5 kernel:  xfs_rmap_update_finish_item+0x33/0x60 [xfs]
May 06 09:49:01 d5 kernel:  xfs_defer_finish_noroll+0x215/0x5a0 [xfs]
May 06 09:49:01 d5 kernel:  xfs_defer_finish+0x13/0x70 [xfs]
May 06 09:49:01 d5 kernel:  xfs_itruncate_extents_flags+0xc4/0x240 [xfs]
May 06 09:49:01 d5 kernel:  xfs_inactive_truncate+0x7f/0xc0 [xfs]
May 06 09:49:01 d5 kernel:  xfs_inactive+0x10c/0x130 [xfs]
May 06 09:49:01 d5 kernel:  xfs_inodegc_worker+0xb5/0x140 [xfs]
May 06 09:49:01 d5 kernel:  process_one_work+0x2a8/0x4c0
May 06 09:49:01 d5 kernel:  worker_thread+0x21b/0x3c0
May 06 09:49:01 d5 kernel:  kthread+0x121/0x140
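To spell out why nearly every one of those tasks is parked behind this
single worker: statfs() on XFS does a synchronous flush of the per-cpu
inodegc queues before it samples the space counters, so every caller
waits for whatever inactivation work is queued. Roughly - trimmed and
paraphrased, not the exact source:

/* Paraphrased sketch of the XFS statfs path, not the exact code. */
STATIC int
xfs_fs_statfs(
	struct dentry		*dentry,
	struct kstatfs		*statp)
{
	struct xfs_mount	*mp = XFS_M(dentry->d_sb);

	/*
	 * Wait for queued inactivations so the free space counters
	 * include recently unlinked inodes.
	 */
	xfs_inodegc_flush(mp);

	/* ... sample the mount's counters into *statp ... */
	return 0;
}

void
xfs_inodegc_flush(
	struct xfs_mount	*mp)
{
	/*
	 * flush_workqueue() waits for all pending inodegc work items,
	 * so with the 29TB reflinked truncate queued, every statfs()
	 * caller ends up stuck here until that work completes.
	 */
	xfs_inodegc_queue_all(mp);
	flush_workqueue(mp->m_inodegc_wq);
}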
There are 18 tasks in destroy_inode() blocked on a workqueue flush -
these are new unlinks that are getting throttled because that per-cpu
inodegc queue is full and work is ongoing. Not a huge deal, but maybe
we should look to hand full queues to another CPU if the neighbour CPU
has an empty queue. That would reduce unnecessary throttling, though
it may mean more long-running inodegc processes in the background....

Really, though, I don't see anything deadlocked, just a system backed
up doing a really large amount of metadata modification. Everything is
sitting on throttles or waiting on IO and making slow forwards
progress as metadata writeback allows log space to be freed.

I think we should just accept that statfs() can never really report
exactly accurate space usage to userspace and get rid of the flush.
Work around the specific fstests dependencies on rm always immediately
making unlinked inodes free space in fstests itself rather than in the
kernel code.

Cheers,

Dave.
--
Dave Chinner
david@xxxxxxxxxxxxx
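For illustration of the direction above - dropping the blocking flush
from statfs() and letting inactivation simply complete in the
background - a minimal sketch might look like the following, where
xfs_inodegc_push() is a hypothetical non-blocking helper that kicks
the queued work without waiting for it:

/* Hypothetical sketch only: expedite inodegc without waiting on it. */
void
xfs_inodegc_push(
	struct xfs_mount	*mp)
{
	/* Kick all per-cpu inodegc queues; no flush_workqueue() here. */
	xfs_inodegc_queue_all(mp);
}

STATIC int
xfs_fs_statfs(
	struct dentry		*dentry,
	struct kstatfs		*statp)
{
	struct xfs_mount	*mp = XFS_M(dentry->d_sb);

	/*
	 * Don't block on background inactivation; the space counters
	 * may lag until the queued work completes.
	 */
	xfs_inodegc_push(mp);

	/* ... fill in *statp as before ... */
	return 0;
}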