Dave Chinner <david@xxxxxxxxxxxxx> writes:

> On Sun, Apr 21, 2024 at 01:19:44PM +0530, Ritesh Harjani (IBM) wrote:
>> An async dio write to a sparse file can generate a lot of extents,
>> and when we unlink this file (using rm), the kernel can be busy
>> unmapping and freeing those extents as part of transaction processing.
>> Add cond_resched() in xfs_defer_finish_noroll() to avoid soft lockup
>> messages. Here is a call trace of one such soft lockup.
>>
>> watchdog: BUG: soft lockup - CPU#1 stuck for 23s! [rm:81335]
>> CPU: 1 PID: 81335 Comm: rm Kdump: loaded Tainted: G L X 5.14.21-150500.53-default
>
> Can you reproduce this on a current TOT kernel? 5.14 is pretty old,
> and this stack trace:

Yes, I was able to reproduce this on the upstream kernel too -

watchdog: BUG: soft lockup - CPU#1 stuck for 22s! [kworker/1:0:82435]
CPU: 1 PID: 82435 Comm: kworker/1:0 Tainted: G S L 6.9.0-rc5-0-default #1
Workqueue: xfs-inodegc/sda2 xfs_inodegc_worker
NIP [c000000000beea10] xfs_extent_busy_trim+0x100/0x290
LR [c000000000bee958] xfs_extent_busy_trim+0x48/0x290
Call Trace:
xfs_alloc_get_rec+0x54/0x1b0 (unreliable)
xfs_alloc_compute_aligned+0x5c/0x144
xfs_alloc_ag_vextent_size+0x238/0x8d4
xfs_alloc_fix_freelist+0x540/0x694
xfs_free_extent_fix_freelist+0x84/0xe0
__xfs_free_extent+0x74/0x1ec
xfs_extent_free_finish_item+0xcc/0x214
xfs_defer_finish_one+0x194/0x388
xfs_defer_finish_noroll+0x1b4/0x5c8
xfs_defer_finish+0x2c/0xc4
xfs_bunmapi_range+0xa4/0x100
xfs_itruncate_extents_flags+0x1b8/0x2f4
xfs_inactive_truncate+0xe0/0x124
xfs_inactive+0x30c/0x3e0
xfs_inodegc_worker+0x140/0x234
process_scheduled_works+0x240/0x57c
worker_thread+0x198/0x468
kthread+0x138/0x140
start_kernel_thread+0x14/0x18

>> NIP [c00800001b174768] xfs_extent_busy_trim+0xc0/0x2a0 [xfs]
>> LR [c00800001b1746f4] xfs_extent_busy_trim+0x4c/0x2a0 [xfs]
>> Call Trace:
>> 0xc0000000a8268340 (unreliable)
>> xfs_alloc_compute_aligned+0x5c/0x150 [xfs]
>> xfs_alloc_ag_vextent_size+0x1dc/0x8c0 [xfs]
>> xfs_alloc_ag_vextent+0x17c/0x1c0 [xfs]
>> xfs_alloc_fix_freelist+0x274/0x4b0 [xfs]
>> xfs_free_extent_fix_freelist+0x84/0xe0 [xfs]
>> __xfs_free_extent+0xa0/0x240 [xfs]
>> xfs_trans_free_extent+0x6c/0x140 [xfs]
>> xfs_defer_finish_noroll+0x2b0/0x650 [xfs]
>> xfs_inactive_truncate+0xe8/0x140 [xfs]
>> xfs_fs_destroy_inode+0xdc/0x320 [xfs]
>> destroy_inode+0x6c/0xc0
>
> .... doesn't exist anymore.
>
> xfs_inactive_truncate() is now done from a background inodegc thread,
> not directly in destroy_inode().
>
> I also suspect that any sort of cond_resched() should be in the top
> layer loop in xfs_bunmapi_range(), not hidden deep in the defer
> code. The problem is the number of extents being processed without
> yielding, not the time spent processing each individual deferred
> work chain to free the extent. Hence the explicit rescheduling
> should be at the top level loop where it can be easily explained and
> understood, not hidden deep inside the defer chain mechanism....

Yes, sure. I will submit a v2 with this diff then (after I verify the
fix once on the setup, just for my sanity):

diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
index 656c95a22f2e..44d5381bc66f 100644
--- a/fs/xfs/libxfs/xfs_bmap.c
+++ b/fs/xfs/libxfs/xfs_bmap.c
@@ -6354,6 +6354,7 @@ xfs_bunmapi_range(
 		error = xfs_defer_finish(tpp);
 		if (error)
 			goto out;
+		cond_resched();
 	}
 out:
 	return error;

-ritesh
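
For context, a rough sketch of the loop the hunk above lands in. This is
paraphrased rather than copied verbatim from fs/xfs/libxfs/xfs_bmap.c, so
treat the exact signature and the helper details (__xfs_bunmapi(),
XFS_ITRUNC_MAX_EXTENTS) as assumptions; the point is only that each
iteration unmaps a bounded batch of extents and then finishes the deferred
frees, so one cond_resched() per iteration caps the work done between
yields:

/* Sketch only - not the verbatim upstream function. */
int
xfs_bunmapi_range(
	struct xfs_trans	**tpp,
	struct xfs_inode	*ip,
	uint32_t		flags,
	xfs_fileoff_t		startoff,
	xfs_fileoff_t		endoff)
{
	xfs_filblks_t		unmap_len = endoff - startoff + 1;
	int			error = 0;

	while (unmap_len > 0) {
		/* Unmap a bounded batch of extents from the range... */
		error = __xfs_bunmapi(*tpp, ip, startoff, &unmap_len, flags,
				XFS_ITRUNC_MAX_EXTENTS);
		if (error)
			goto out;

		/* ...and finish the deferred extent-free work for that batch. */
		error = xfs_defer_finish(tpp);
		if (error)
			goto out;

		/* Yield between batches so a huge extent count can't soft lockup. */
		cond_resched();
	}
out:
	return error;
}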