Dave Chinner <david@xxxxxxxxxxxxx> writes:

> On Sun, Apr 21, 2024 at 01:19:44PM +0530, Ritesh Harjani (IBM) wrote:
>> An async dio write to a sparse file can generate a lot of extents,
>> and when we unlink this file (using rm), the kernel can be busy
>> unmapping and freeing those extents as part of transaction processing.
>> Add cond_resched() in xfs_defer_finish_noroll() to avoid soft lockup
>> messages. Here is a call trace of one such soft lockup.
>>
>> watchdog: BUG: soft lockup - CPU#1 stuck for 23s! [rm:81335]
>> CPU: 1 PID: 81335 Comm: rm Kdump: loaded Tainted: G L X 5.14.21-150500.53-default
>
> Can you reproduce this on a current TOT kernel? 5.14 is pretty old,
> and this stack trace:

Yes, I was able to reproduce this on the upstream kernel too -

watchdog: BUG: soft lockup - CPU#1 stuck for 22s! [kworker/1:0:82435]
CPU: 1 PID: 82435 Comm: kworker/1:0 Tainted: G S L 6.9.0-rc5-0-default #1
Workqueue: xfs-inodegc/sda2 xfs_inodegc_worker
NIP [c000000000beea10] xfs_extent_busy_trim+0x100/0x290
LR [c000000000bee958] xfs_extent_busy_trim+0x48/0x290
Call Trace:
xfs_alloc_get_rec+0x54/0x1b0 (unreliable)
xfs_alloc_compute_aligned+0x5c/0x144
xfs_alloc_ag_vextent_size+0x238/0x8d4
xfs_alloc_fix_freelist+0x540/0x694
xfs_free_extent_fix_freelist+0x84/0xe0
__xfs_free_extent+0x74/0x1ec
xfs_extent_free_finish_item+0xcc/0x214
xfs_defer_finish_one+0x194/0x388
xfs_defer_finish_noroll+0x1b4/0x5c8
xfs_defer_finish+0x2c/0xc4
xfs_bunmapi_range+0xa4/0x100
xfs_itruncate_extents_flags+0x1b8/0x2f4
xfs_inactive_truncate+0xe0/0x124
xfs_inactive+0x30c/0x3e0
xfs_inodegc_worker+0x140/0x234
process_scheduled_works+0x240/0x57c
worker_thread+0x198/0x468
kthread+0x138/0x140
start_kernel_thread+0x14/0x18

>> NIP [c00800001b174768] xfs_extent_busy_trim+0xc0/0x2a0 [xfs]
>> LR [c00800001b1746f4] xfs_extent_busy_trim+0x4c/0x2a0 [xfs]
>> Call Trace:
>> 0xc0000000a8268340 (unreliable)
>> xfs_alloc_compute_aligned+0x5c/0x150 [xfs]
>> xfs_alloc_ag_vextent_size+0x1dc/0x8c0 [xfs]
>> xfs_alloc_ag_vextent+0x17c/0x1c0 [xfs]
>> xfs_alloc_fix_freelist+0x274/0x4b0 [xfs]
>> xfs_free_extent_fix_freelist+0x84/0xe0 [xfs]
>> __xfs_free_extent+0xa0/0x240 [xfs]
>> xfs_trans_free_extent+0x6c/0x140 [xfs]
>> xfs_defer_finish_noroll+0x2b0/0x650 [xfs]
>> xfs_inactive_truncate+0xe8/0x140 [xfs]
>> xfs_fs_destroy_inode+0xdc/0x320 [xfs]
>> destroy_inode+0x6c/0xc0
>
> .... doesn't exist anymore.
>
> xfs_inactive_truncate() is now done from a background inodegc thread,
> not directly in destroy_inode().
>
> I also suspect that any sort of cond_resched() should be in the top
> layer loop in xfs_bunmapi_range(), not hidden deep in the defer
> code. The problem is the number of extents being processed without
> yielding, not the time spent processing each individual deferred
> work chain to free the extent. Hence the explicit rescheduling
> should be at the top level loop where it can be easily explained and
> understood, not hidden deep inside the defer chain mechanism....

Yes, sure. I will submit a v2 with this diff then (after I verify the
fix once on the setup, just for my sanity):

diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
index 656c95a22f2e..44d5381bc66f 100644
--- a/fs/xfs/libxfs/xfs_bmap.c
+++ b/fs/xfs/libxfs/xfs_bmap.c
@@ -6354,6 +6354,7 @@ xfs_bunmapi_range(
 		error = xfs_defer_finish(tpp);
 		if (error)
 			goto out;
+		cond_resched();
 	}
 out:
 	return error;

-ritesh
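
For context, a rough sketch of the loop the hunk above lands in. This is
paraphrased rather than copied verbatim from fs/xfs/libxfs/xfs_bmap.c, so
treat the exact signature and the helper details (__xfs_bunmapi(),
XFS_ITRUNC_MAX_EXTENTS) as assumptions; the point is only that each
iteration unmaps a bounded batch of extents and then finishes the deferred
frees, so one cond_resched() per iteration caps the work done between
yields:

/* Sketch only - not the verbatim upstream function. */
int
xfs_bunmapi_range(
	struct xfs_trans	**tpp,
	struct xfs_inode	*ip,
	uint32_t		flags,
	xfs_fileoff_t		startoff,
	xfs_fileoff_t		endoff)
{
	xfs_filblks_t		unmap_len = endoff - startoff + 1;
	int			error = 0;

	while (unmap_len > 0) {
		/* Unmap a bounded batch of extents from the range... */
		error = __xfs_bunmapi(*tpp, ip, startoff, &unmap_len, flags,
				XFS_ITRUNC_MAX_EXTENTS);
		if (error)
			goto out;

		/* ...and finish the deferred extent-free work for that batch. */
		error = xfs_defer_finish(tpp);
		if (error)
			goto out;

		/* Yield between batches so a huge extent count can't soft lockup. */
		cond_resched();
	}
out:
	return error;
}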