From: Dave Chinner <dchinner@xxxxxxxxxx> If the current transaction holds a busy extent and we are trying to allocate a new extent to fix up the free list, we can deadlock if the AG is entirely empty except for the busy extent held by the transaction. This can occur at runtime processing an XEFI with multiple extents in this path: __schedule+0x22f at ffffffff81f75e8f schedule+0x46 at ffffffff81f76366 xfs_extent_busy_flush+0x69 at ffffffff81477d99 xfs_alloc_ag_vextent_size+0x16a at ffffffff8141711a xfs_alloc_ag_vextent+0x19b at ffffffff81417edb xfs_alloc_fix_freelist+0x22f at ffffffff8141896f xfs_free_extent_fix_freelist+0x6a at ffffffff8141939a __xfs_free_extent+0x99 at ffffffff81419499 xfs_trans_free_extent+0x3e at ffffffff814a6fee xfs_extent_free_finish_item+0x24 at ffffffff814a70d4 xfs_defer_finish_noroll+0x1f7 at ffffffff81441407 xfs_defer_finish+0x11 at ffffffff814417e1 xfs_itruncate_extents_flags+0x13d at ffffffff8148b7dd xfs_inactive_truncate+0xb9 at ffffffff8148bb89 xfs_inactive+0x227 at ffffffff8148c4f7 xfs_fs_destroy_inode+0xb8 at ffffffff81496898 destroy_inode+0x3b at ffffffff8127d2ab do_unlinkat+0x1d1 at ffffffff81270df1 do_syscall_64+0x40 at ffffffff81f6b5f0 entry_SYSCALL_64_after_hwframe+0x44 at ffffffff8200007c This can also happen in log recovery when processing an EFI with multiple extents through this path: context_switch() kernel/sched/core.c:3881 __schedule() kernel/sched/core.c:5111 schedule() kernel/sched/core.c:5186 xfs_extent_busy_flush() fs/xfs/xfs_extent_busy.c:598 xfs_alloc_ag_vextent_size() fs/xfs/libxfs/xfs_alloc.c:1641 xfs_alloc_ag_vextent() fs/xfs/libxfs/xfs_alloc.c:828 xfs_alloc_fix_freelist() fs/xfs/libxfs/xfs_alloc.c:2362 xfs_free_extent_fix_freelist() fs/xfs/libxfs/xfs_alloc.c:3029 __xfs_free_extent() fs/xfs/libxfs/xfs_alloc.c:3067 xfs_trans_free_extent() fs/xfs/xfs_extfree_item.c:370 xfs_efi_recover() fs/xfs/xfs_extfree_item.c:626 xlog_recover_process_efi() fs/xfs/xfs_log_recover.c:4605 xlog_recover_process_intents() fs/xfs/xfs_log_recover.c:4893 xlog_recover_finish() fs/xfs/xfs_log_recover.c:5824 xfs_log_mount_finish() fs/xfs/xfs_log.c:764 xfs_mountfs() fs/xfs/xfs_mount.c:978 xfs_fs_fill_super() fs/xfs/xfs_super.c:1908 mount_bdev() fs/super.c:1417 xfs_fs_mount() fs/xfs/xfs_super.c:1985 legacy_get_tree() fs/fs_context.c:647 vfs_get_tree() fs/super.c:1547 do_new_mount() fs/namespace.c:2843 do_mount() fs/namespace.c:3163 ksys_mount() fs/namespace.c:3372 __do_sys_mount() fs/namespace.c:3386 __se_sys_mount() fs/namespace.c:3383 __x64_sys_mount() fs/namespace.c:3383 do_syscall_64() arch/x86/entry/common.c:296 entry_SYSCALL_64() arch/x86/entry/entry_64.S:180 To avoid this deadlock, we should not block in xfs_extent_busy_flush() if we hold a busy extent in the current transaction. Now that the EFI processing code can handle requeuing a partially completed EFI, we can detect this situation in xfs_extent_busy_flush() and return -EAGAIN rather than going to sleep forever. The -EAGAIN get propagated back out to the xfs_trans_free_extent() context, where the EFD is populated and the transaction is rolled, thereby moving the busy extents into the CIL. At this point, we can retry the extent free operation again with a clean transaction. If we hit the same "all free extents are busy" situation when trying to fix up the free list, we can safely call xfs_extent_busy_flush() and wait for the busy extents to resolve and wake us. At this point, the allocation search can make progress again and we can fix up the free list. This deadlock was first reported by Chandan in mid-2021, but I couldn't make myself understood during review, and didn't have time to fix it myself. It was reported again in March 2023, and again I have found myself unable to explain the complexities of the solution needed during review. As such, I don't have hours more time to waste trying to get the fix written the way it needs to be written, so I'm just doing it myself. This patchset is largely based on Wengang Wang's last patch, but with all the unnecessary stuff removed, split up into multiple patches and cleaned up somewhat. Reported-by: Chandan Babu R <chandanrlinux@xxxxxxxxx> Reported-by: Wengang Wang <wen.gang.wang@xxxxxxxxxx> Signed-off-by: Dave Chinner <dchinner@xxxxxxxxxx> Reviewed-by: Darrick J. Wong <djwong@xxxxxxxxxx> --- fs/xfs/libxfs/xfs_alloc.c | 68 ++++++++++++++++++++++++++++----------- fs/xfs/libxfs/xfs_alloc.h | 11 ++++--- fs/xfs/xfs_extent_busy.c | 33 ++++++++++++++++--- fs/xfs/xfs_extent_busy.h | 6 ++-- 4 files changed, 88 insertions(+), 30 deletions(-) diff --git a/fs/xfs/libxfs/xfs_alloc.c b/fs/xfs/libxfs/xfs_alloc.c index 11bd0a1756a1..7c675aae0a0f 100644 --- a/fs/xfs/libxfs/xfs_alloc.c +++ b/fs/xfs/libxfs/xfs_alloc.c @@ -1556,6 +1556,8 @@ xfs_alloc_ag_vextent_near( if (args->agbno > args->max_agbno) args->agbno = args->max_agbno; + /* Retry once quickly if we find busy extents before blocking. */ + alloc_flags |= XFS_ALLOC_FLAG_TRYFLUSH; restart: len = 0; @@ -1611,9 +1613,20 @@ xfs_alloc_ag_vextent_near( */ if (!acur.len) { if (acur.busy) { + /* + * Our only valid extents must have been busy. Flush and + * retry the allocation again. If we get an -EAGAIN + * error, we're being told that a deadlock was avoided + * and the current transaction needs committing before + * the allocation can be retried. + */ trace_xfs_alloc_near_busy(args); - xfs_extent_busy_flush(args->mp, args->pag, - acur.busy_gen, alloc_flags); + error = xfs_extent_busy_flush(args->tp, args->pag, + acur.busy_gen, alloc_flags); + if (error) + goto out; + + alloc_flags &= ~XFS_ALLOC_FLAG_TRYFLUSH; goto restart; } trace_xfs_alloc_size_neither(args); @@ -1653,6 +1666,8 @@ xfs_alloc_ag_vextent_size( int error; int i; + /* Retry once quickly if we find busy extents before blocking. */ + alloc_flags |= XFS_ALLOC_FLAG_TRYFLUSH; restart: /* * Allocate and initialize a cursor for the by-size btree. @@ -1710,19 +1725,25 @@ xfs_alloc_ag_vextent_size( error = xfs_btree_increment(cnt_cur, 0, &i); if (error) goto error0; - if (i == 0) { - /* - * Our only valid extents must have been busy. - * Make it unbusy by forcing the log out and - * retrying. - */ - xfs_btree_del_cursor(cnt_cur, - XFS_BTREE_NOERROR); - trace_xfs_alloc_size_busy(args); - xfs_extent_busy_flush(args->mp, args->pag, - busy_gen, alloc_flags); - goto restart; - } + if (i) + continue; + + /* + * Our only valid extents must have been busy. Flush and + * retry the allocation again. If we get an -EAGAIN + * error, we're being told that a deadlock was avoided + * and the current transaction needs committing before + * the allocation can be retried. + */ + trace_xfs_alloc_size_busy(args); + error = xfs_extent_busy_flush(args->tp, args->pag, + busy_gen, alloc_flags); + if (error) + goto error0; + + alloc_flags &= ~XFS_ALLOC_FLAG_TRYFLUSH; + xfs_btree_del_cursor(cnt_cur, XFS_BTREE_NOERROR); + goto restart; } } @@ -1802,10 +1823,21 @@ xfs_alloc_ag_vextent_size( args->len = rlen; if (rlen < args->minlen) { if (busy) { - xfs_btree_del_cursor(cnt_cur, XFS_BTREE_NOERROR); + /* + * Our only valid extents must have been busy. Flush and + * retry the allocation again. If we get an -EAGAIN + * error, we're being told that a deadlock was avoided + * and the current transaction needs committing before + * the allocation can be retried. + */ trace_xfs_alloc_size_busy(args); - xfs_extent_busy_flush(args->mp, args->pag, busy_gen, - alloc_flags); + error = xfs_extent_busy_flush(args->tp, args->pag, + busy_gen, alloc_flags); + if (error) + goto error0; + + alloc_flags &= ~XFS_ALLOC_FLAG_TRYFLUSH; + xfs_btree_del_cursor(cnt_cur, XFS_BTREE_NOERROR); goto restart; } goto out_nominleft; diff --git a/fs/xfs/libxfs/xfs_alloc.h b/fs/xfs/libxfs/xfs_alloc.h index d1aa7c63eafe..f267842e36ba 100644 --- a/fs/xfs/libxfs/xfs_alloc.h +++ b/fs/xfs/libxfs/xfs_alloc.h @@ -19,11 +19,12 @@ unsigned int xfs_agfl_size(struct xfs_mount *mp); /* * Flags for xfs_alloc_fix_freelist. */ -#define XFS_ALLOC_FLAG_TRYLOCK 0x00000001 /* use trylock for buffer locking */ -#define XFS_ALLOC_FLAG_FREEING 0x00000002 /* indicate caller is freeing extents*/ -#define XFS_ALLOC_FLAG_NORMAP 0x00000004 /* don't modify the rmapbt */ -#define XFS_ALLOC_FLAG_NOSHRINK 0x00000008 /* don't shrink the freelist */ -#define XFS_ALLOC_FLAG_CHECK 0x00000010 /* test only, don't modify args */ +#define XFS_ALLOC_FLAG_TRYLOCK (1U << 0) /* use trylock for buffer locking */ +#define XFS_ALLOC_FLAG_FREEING (1U << 1) /* indicate caller is freeing extents*/ +#define XFS_ALLOC_FLAG_NORMAP (1U << 2) /* don't modify the rmapbt */ +#define XFS_ALLOC_FLAG_NOSHRINK (1U << 3) /* don't shrink the freelist */ +#define XFS_ALLOC_FLAG_CHECK (1U << 4) /* test only, don't modify args */ +#define XFS_ALLOC_FLAG_TRYFLUSH (1U << 5) /* don't wait in busy extent flush */ /* * Argument structure for xfs_alloc routines. diff --git a/fs/xfs/xfs_extent_busy.c b/fs/xfs/xfs_extent_busy.c index 5f44936eae1c..7c2fdc71e42d 100644 --- a/fs/xfs/xfs_extent_busy.c +++ b/fs/xfs/xfs_extent_busy.c @@ -566,10 +566,21 @@ xfs_extent_busy_clear( /* * Flush out all busy extents for this AG. + * + * If the current transaction is holding busy extents, the caller may not want + * to wait for committed busy extents to resolve. If we are being told just to + * try a flush or progress has been made since we last skipped a busy extent, + * return immediately to allow the caller to try again. + * + * If we are freeing extents, we might actually be holding the only free extents + * in the transaction busy list and the log force won't resolve that situation. + * In this case, we must return -EAGAIN to avoid a deadlock by informing the + * caller it needs to commit the busy extents it holds before retrying the + * extent free operation. */ -void +int xfs_extent_busy_flush( - struct xfs_mount *mp, + struct xfs_trans *tp, struct xfs_perag *pag, unsigned busy_gen, uint32_t alloc_flags) @@ -577,10 +588,23 @@ xfs_extent_busy_flush( DEFINE_WAIT (wait); int error; - error = xfs_log_force(mp, XFS_LOG_SYNC); + error = xfs_log_force(tp->t_mountp, XFS_LOG_SYNC); if (error) - return; + return error; + /* Avoid deadlocks on uncommitted busy extents. */ + if (!list_empty(&tp->t_busy)) { + if (alloc_flags & XFS_ALLOC_FLAG_TRYFLUSH) + return 0; + + if (busy_gen != READ_ONCE(pag->pagb_gen)) + return 0; + + if (alloc_flags & XFS_ALLOC_FLAG_FREEING) + return -EAGAIN; + } + + /* Wait for committed busy extents to resolve. */ do { prepare_to_wait(&pag->pagb_wait, &wait, TASK_KILLABLE); if (busy_gen != READ_ONCE(pag->pagb_gen)) @@ -589,6 +613,7 @@ xfs_extent_busy_flush( } while (1); finish_wait(&pag->pagb_wait, &wait); + return 0; } void diff --git a/fs/xfs/xfs_extent_busy.h b/fs/xfs/xfs_extent_busy.h index 7a82c689bfa4..c37bf87e6781 100644 --- a/fs/xfs/xfs_extent_busy.h +++ b/fs/xfs/xfs_extent_busy.h @@ -51,9 +51,9 @@ bool xfs_extent_busy_trim(struct xfs_alloc_arg *args, xfs_agblock_t *bno, xfs_extlen_t *len, unsigned *busy_gen); -void -xfs_extent_busy_flush(struct xfs_mount *mp, struct xfs_perag *pag, - unsigned busy_gen, uint32_t alloc_flags); +int +xfs_extent_busy_flush(struct xfs_trans *tp, struct xfs_perag *pag, + unsigned busy_gen, uint32_t alloc_flags); void xfs_extent_busy_wait_all(struct xfs_mount *mp); -- 2.40.1