Re: [PATCHv2 1/1] xfs: Add cond_resched in xfs_bunmapi_range loop

Ritesh Harjani (IBM) <ritesh.list@xxxxxxxxx> · Tue, 30 Apr 2024 10:34:24 +0530

"Darrick J. Wong" <djwong@xxxxxxxxxx> writes:

> On Mon, Apr 29, 2024 at 02:14:46PM +0530, Ritesh Harjani wrote:
>> "Ritesh Harjani (IBM)" <ritesh.list@xxxxxxxxx> writes:
>> 
>> > An async dio write to a sparse file can generate a lot of extents
>> > and when we unlink this file (using rm), the kernel can be busy in umapping
>> > and freeing those extents as part of transaction processing.
>> > Add cond_resched() in xfs_bunmapi_range() to avoid soft lockups
>> > messages like these. Here is a call trace of such a soft lockup.
>> >
>> > watchdog: BUG: soft lockup - CPU#1 stuck for 22s! [kworker/1:0:82435]
>> > CPU: 1 PID: 82435 Comm: kworker/1:0 Tainted: G S  L   6.9.0-rc5-0-default #1
>> > Workqueue: xfs-inodegc/sda2 xfs_inodegc_worker
>> > NIP [c000000000beea10] xfs_extent_busy_trim+0x100/0x290
>> > LR [c000000000bee958] xfs_extent_busy_trim+0x48/0x290
>> > Call Trace:
>> >   xfs_alloc_get_rec+0x54/0x1b0 (unreliable)
>> >   xfs_alloc_compute_aligned+0x5c/0x144
>> >   xfs_alloc_ag_vextent_size+0x238/0x8d4
>> >   xfs_alloc_fix_freelist+0x540/0x694
>> >   xfs_free_extent_fix_freelist+0x84/0xe0
>> >   __xfs_free_extent+0x74/0x1ec
>> >   xfs_extent_free_finish_item+0xcc/0x214
>> >   xfs_defer_finish_one+0x194/0x388
>> >   xfs_defer_finish_noroll+0x1b4/0x5c8
>> >   xfs_defer_finish+0x2c/0xc4
>> >   xfs_bunmapi_range+0xa4/0x100
>> >   xfs_itruncate_extents_flags+0x1b8/0x2f4
>> >   xfs_inactive_truncate+0xe0/0x124
>> >   xfs_inactive+0x30c/0x3e0
>> >   xfs_inodegc_worker+0x140/0x234
>> >   process_scheduled_works+0x240/0x57c
>> >   worker_thread+0x198/0x468
>> >   kthread+0x138/0x140
>> >   start_kernel_thread+0x14/0x18
>> >
>> 
>> My v1 patch had cond_resched() in xfs_defer_finish_noroll, since I was
>> suspecting that it's a common point where we loop for many other
>> operations. And initially Dave also suggested for the same [1].
>> But I was not totally convinced given the only problematic path I
>> had till now was in unmapping extents. So this patch keeps the
>> cond_resched() in xfs_bunmapi_range() loop.
>> 
>> [1]: https://lore.kernel.org/all/ZZ8OaNnp6b%2FPJzsb@xxxxxxxxxxxxxxxxxxx/
>> 
>> However, I was able to reproduce a problem with reflink remapping path
>> both on Power (with 64k bs) and on x86 (with preempt=none and with KASAN
>> enabled). I actually noticed while I was doing regression testing of
>> some of the iomap changes with KASAN enabled. The issue was seen with
>> generic/175 for both on Power and x86.
>> 
>> Do you think we should keep the cond_resched() inside
>> xfs_defer_finish_noroll() loop like we had in v1 [2]. If yes, then I can rebase
>> v1 on the latest upstream tree and also update the commit msg with both
>> call stacks.
>
> I think there ought to be one in xfs_defer_finish_noroll (or even
> xfs_trans_roll) when we're between dirty transactions, since long
> running transaction chains can indeed stall.  That'll hopefully solve
> the problem for bunmapi and long running online repair operations.
>
> But that said, the main FICLONE loop continually creates new transaction
> chains, so you probably need one there too.

We need not right, as long as we have one in xfs_defer_finish_noroll(),
because we commit everything before returning from
xfs_reflink_remap_extent(). And we do have XFS_TRANS_PERM_LOG_RES set,
so xfs_defer_finish_noroll() should always be called. 
(Apologies, if I am not making sense. I new to XFS code)

I gave generic/175 an overnight run with cond_resched() call inside
xfs_defer_finish_noroll() loop. I couldn't reproduce the soft lockup bug.

So does it sound ok to have cond_resched() within
xfs_defer_finish_noroll(). I can submit the new version then.

This should solve the soft lockup problems in both unmapping of extents
causing large transaction chains and also reflink remapping path.

> Hence the grumblings about cond_resched whackamole.
> Thoughts?

Yes, I agree that life will be easy if we don't have to play with the
hacks of putting such calls within tight loops. But until we have such
infrastructure ready, I was hoping we could find a correct fix for such
problems for current upstream and old stable kernels.

Thanks for the review and suggestions!

-ritesh