On Tue, Jun 20, 2023 at 10:20:20 AM +1000, Dave Chinner wrote: > From: Dave Chinner <dchinner@xxxxxxxxxx> > > If the current transaction holds a busy extent and we are trying to > allocate a new extent to fix up the free list, we can deadlock if > the AG is entirely empty except for the busy extent held by the > transaction. > > This can occur at runtime processing an XEFI with multiple extents > in this path: > > __schedule+0x22f at ffffffff81f75e8f > schedule+0x46 at ffffffff81f76366 > xfs_extent_busy_flush+0x69 at ffffffff81477d99 > xfs_alloc_ag_vextent_size+0x16a at ffffffff8141711a > xfs_alloc_ag_vextent+0x19b at ffffffff81417edb > xfs_alloc_fix_freelist+0x22f at ffffffff8141896f > xfs_free_extent_fix_freelist+0x6a at ffffffff8141939a > __xfs_free_extent+0x99 at ffffffff81419499 > xfs_trans_free_extent+0x3e at ffffffff814a6fee > xfs_extent_free_finish_item+0x24 at ffffffff814a70d4 > xfs_defer_finish_noroll+0x1f7 at ffffffff81441407 > xfs_defer_finish+0x11 at ffffffff814417e1 > xfs_itruncate_extents_flags+0x13d at ffffffff8148b7dd > xfs_inactive_truncate+0xb9 at ffffffff8148bb89 > xfs_inactive+0x227 at ffffffff8148c4f7 > xfs_fs_destroy_inode+0xb8 at ffffffff81496898 > destroy_inode+0x3b at ffffffff8127d2ab > do_unlinkat+0x1d1 at ffffffff81270df1 > do_syscall_64+0x40 at ffffffff81f6b5f0 > entry_SYSCALL_64_after_hwframe+0x44 at ffffffff8200007c > > This can also happen in log recovery when processing an EFI > with multiple extents through this path: > > context_switch() kernel/sched/core.c:3881 > __schedule() kernel/sched/core.c:5111 > schedule() kernel/sched/core.c:5186 > xfs_extent_busy_flush() fs/xfs/xfs_extent_busy.c:598 > xfs_alloc_ag_vextent_size() fs/xfs/libxfs/xfs_alloc.c:1641 > xfs_alloc_ag_vextent() fs/xfs/libxfs/xfs_alloc.c:828 > xfs_alloc_fix_freelist() fs/xfs/libxfs/xfs_alloc.c:2362 > xfs_free_extent_fix_freelist() fs/xfs/libxfs/xfs_alloc.c:3029 > __xfs_free_extent() fs/xfs/libxfs/xfs_alloc.c:3067 > xfs_trans_free_extent() fs/xfs/xfs_extfree_item.c:370 > xfs_efi_recover() fs/xfs/xfs_extfree_item.c:626 > xlog_recover_process_efi() fs/xfs/xfs_log_recover.c:4605 > xlog_recover_process_intents() fs/xfs/xfs_log_recover.c:4893 > xlog_recover_finish() fs/xfs/xfs_log_recover.c:5824 > xfs_log_mount_finish() fs/xfs/xfs_log.c:764 > xfs_mountfs() fs/xfs/xfs_mount.c:978 > xfs_fs_fill_super() fs/xfs/xfs_super.c:1908 > mount_bdev() fs/super.c:1417 > xfs_fs_mount() fs/xfs/xfs_super.c:1985 > legacy_get_tree() fs/fs_context.c:647 > vfs_get_tree() fs/super.c:1547 > do_new_mount() fs/namespace.c:2843 > do_mount() fs/namespace.c:3163 > ksys_mount() fs/namespace.c:3372 > __do_sys_mount() fs/namespace.c:3386 > __se_sys_mount() fs/namespace.c:3383 > __x64_sys_mount() fs/namespace.c:3383 > do_syscall_64() arch/x86/entry/common.c:296 > entry_SYSCALL_64() arch/x86/entry/entry_64.S:180 > > To avoid this deadlock, we should not block in > xfs_extent_busy_flush() if we hold a busy extent in the current > transaction. > > Now that the EFI processing code can handle requeuing a partially > completed EFI, we can detect this situation in > xfs_extent_busy_flush() and return -EAGAIN rather than going to > sleep forever. The -EAGAIN get propagated back out to the > xfs_trans_free_extent() context, where the EFD is populated and the > transaction is rolled, thereby moving the busy extents into the CIL. > > At this point, we can retry the extent free operation again with a > clean transaction. If we hit the same "all free extents are busy" > situation when trying to fix up the free list, we can safely call > xfs_extent_busy_flush() and wait for the busy extents to resolve > and wake us. At this point, the allocation search can make progress > again and we can fix up the free list. > > This deadlock was first reported by Chandan in mid-2021, but I > couldn't make myself understood during review, and didn't have time > to fix it myself. > > It was reported again in March 2023, and again I have found myself > unable to explain the complexities of the solution needed during > review. > > As such, I don't have hours more time to waste trying to get the > fix written the way it needs to be written, so I'm just doing it > myself. This patchset is largely based on Wengang Wang's last patch, > but with all the unnecessary stuff removed, split up into multiple > patches and cleaned up somewhat. > > Reported-by: Chandan Babu R <chandanrlinux@xxxxxxxxx> > Reported-by: Wengang Wang <wen.gang.wang@xxxxxxxxxx> > Signed-off-by: Dave Chinner <dchinner@xxxxxxxxxx> > Reviewed-by: Darrick J. Wong <djwong@xxxxxxxxxx> > --- > fs/xfs/libxfs/xfs_alloc.c | 68 ++++++++++++++++++++++++++++----------- > fs/xfs/libxfs/xfs_alloc.h | 11 ++++--- > fs/xfs/xfs_extent_busy.c | 33 ++++++++++++++++--- > fs/xfs/xfs_extent_busy.h | 6 ++-- > 4 files changed, 88 insertions(+), 30 deletions(-) > > diff --git a/fs/xfs/libxfs/xfs_alloc.c b/fs/xfs/libxfs/xfs_alloc.c > index 11bd0a1756a1..7c675aae0a0f 100644 > --- a/fs/xfs/libxfs/xfs_alloc.c > +++ b/fs/xfs/libxfs/xfs_alloc.c > @@ -1556,6 +1556,8 @@ xfs_alloc_ag_vextent_near( > if (args->agbno > args->max_agbno) > args->agbno = args->max_agbno; > > + /* Retry once quickly if we find busy extents before blocking. */ > + alloc_flags |= XFS_ALLOC_FLAG_TRYFLUSH; > restart: > len = 0; > > @@ -1611,9 +1613,20 @@ xfs_alloc_ag_vextent_near( > */ > if (!acur.len) { > if (acur.busy) { > + /* > + * Our only valid extents must have been busy. Flush and > + * retry the allocation again. If we get an -EAGAIN > + * error, we're being told that a deadlock was avoided > + * and the current transaction needs committing before > + * the allocation can be retried. > + */ > trace_xfs_alloc_near_busy(args); > - xfs_extent_busy_flush(args->mp, args->pag, > - acur.busy_gen, alloc_flags); > + error = xfs_extent_busy_flush(args->tp, args->pag, > + acur.busy_gen, alloc_flags); > + if (error) > + goto out; > + > + alloc_flags &= ~XFS_ALLOC_FLAG_TRYFLUSH; > goto restart; > } > trace_xfs_alloc_size_neither(args); > @@ -1653,6 +1666,8 @@ xfs_alloc_ag_vextent_size( > int error; > int i; > > + /* Retry once quickly if we find busy extents before blocking. */ > + alloc_flags |= XFS_ALLOC_FLAG_TRYFLUSH; > restart: > /* > * Allocate and initialize a cursor for the by-size btree. > @@ -1710,19 +1725,25 @@ xfs_alloc_ag_vextent_size( > error = xfs_btree_increment(cnt_cur, 0, &i); > if (error) > goto error0; > - if (i == 0) { > - /* > - * Our only valid extents must have been busy. > - * Make it unbusy by forcing the log out and > - * retrying. > - */ > - xfs_btree_del_cursor(cnt_cur, > - XFS_BTREE_NOERROR); > - trace_xfs_alloc_size_busy(args); > - xfs_extent_busy_flush(args->mp, args->pag, > - busy_gen, alloc_flags); > - goto restart; > - } > + if (i) > + continue; > + > + /* > + * Our only valid extents must have been busy. Flush and > + * retry the allocation again. If we get an -EAGAIN > + * error, we're being told that a deadlock was avoided > + * and the current transaction needs committing before > + * the allocation can be retried. > + */ > + trace_xfs_alloc_size_busy(args); > + error = xfs_extent_busy_flush(args->tp, args->pag, > + busy_gen, alloc_flags); > + if (error) > + goto error0; > + > + alloc_flags &= ~XFS_ALLOC_FLAG_TRYFLUSH; > + xfs_btree_del_cursor(cnt_cur, XFS_BTREE_NOERROR); > + goto restart; > } > } > > @@ -1802,10 +1823,21 @@ xfs_alloc_ag_vextent_size( > args->len = rlen; > if (rlen < args->minlen) { > if (busy) { > - xfs_btree_del_cursor(cnt_cur, XFS_BTREE_NOERROR); > + /* > + * Our only valid extents must have been busy. Flush and > + * retry the allocation again. If we get an -EAGAIN > + * error, we're being told that a deadlock was avoided > + * and the current transaction needs committing before > + * the allocation can be retried. > + */ > trace_xfs_alloc_size_busy(args); > - xfs_extent_busy_flush(args->mp, args->pag, busy_gen, > - alloc_flags); > + error = xfs_extent_busy_flush(args->tp, args->pag, > + busy_gen, alloc_flags); > + if (error) > + goto error0; > + > + alloc_flags &= ~XFS_ALLOC_FLAG_TRYFLUSH; > + xfs_btree_del_cursor(cnt_cur, XFS_BTREE_NOERROR); > goto restart; > } > goto out_nominleft; > diff --git a/fs/xfs/libxfs/xfs_alloc.h b/fs/xfs/libxfs/xfs_alloc.h > index d1aa7c63eafe..f267842e36ba 100644 > --- a/fs/xfs/libxfs/xfs_alloc.h > +++ b/fs/xfs/libxfs/xfs_alloc.h > @@ -19,11 +19,12 @@ unsigned int xfs_agfl_size(struct xfs_mount *mp); > /* > * Flags for xfs_alloc_fix_freelist. > */ > -#define XFS_ALLOC_FLAG_TRYLOCK 0x00000001 /* use trylock for buffer locking */ > -#define XFS_ALLOC_FLAG_FREEING 0x00000002 /* indicate caller is freeing extents*/ > -#define XFS_ALLOC_FLAG_NORMAP 0x00000004 /* don't modify the rmapbt */ > -#define XFS_ALLOC_FLAG_NOSHRINK 0x00000008 /* don't shrink the freelist */ > -#define XFS_ALLOC_FLAG_CHECK 0x00000010 /* test only, don't modify args */ > +#define XFS_ALLOC_FLAG_TRYLOCK (1U << 0) /* use trylock for buffer locking */ > +#define XFS_ALLOC_FLAG_FREEING (1U << 1) /* indicate caller is freeing extents*/ > +#define XFS_ALLOC_FLAG_NORMAP (1U << 2) /* don't modify the rmapbt */ > +#define XFS_ALLOC_FLAG_NOSHRINK (1U << 3) /* don't shrink the freelist */ > +#define XFS_ALLOC_FLAG_CHECK (1U << 4) /* test only, don't modify args */ > +#define XFS_ALLOC_FLAG_TRYFLUSH (1U << 5) /* don't wait in busy extent flush */ > > /* > * Argument structure for xfs_alloc routines. > diff --git a/fs/xfs/xfs_extent_busy.c b/fs/xfs/xfs_extent_busy.c > index 5f44936eae1c..7c2fdc71e42d 100644 > --- a/fs/xfs/xfs_extent_busy.c > +++ b/fs/xfs/xfs_extent_busy.c > @@ -566,10 +566,21 @@ xfs_extent_busy_clear( > > /* > * Flush out all busy extents for this AG. > + * > + * If the current transaction is holding busy extents, the caller may not want > + * to wait for committed busy extents to resolve. If we are being told just to > + * try a flush or progress has been made since we last skipped a busy extent, > + * return immediately to allow the caller to try again. > + * > + * If we are freeing extents, we might actually be holding the only free extents > + * in the transaction busy list and the log force won't resolve that situation. > + * In this case, we must return -EAGAIN to avoid a deadlock by informing the > + * caller it needs to commit the busy extents it holds before retrying the > + * extent free operation. > */ > -void > +int > xfs_extent_busy_flush( > - struct xfs_mount *mp, > + struct xfs_trans *tp, > struct xfs_perag *pag, > unsigned busy_gen, > uint32_t alloc_flags) > @@ -577,10 +588,23 @@ xfs_extent_busy_flush( > DEFINE_WAIT (wait); > int error; > > - error = xfs_log_force(mp, XFS_LOG_SYNC); > + error = xfs_log_force(tp->t_mountp, XFS_LOG_SYNC); > if (error) > - return; > + return error; > > + /* Avoid deadlocks on uncommitted busy extents. */ > + if (!list_empty(&tp->t_busy)) { > + if (alloc_flags & XFS_ALLOC_FLAG_TRYFLUSH) > + return 0; > + > + if (busy_gen != READ_ONCE(pag->pagb_gen)) > + return 0; > + > + if (alloc_flags & XFS_ALLOC_FLAG_FREEING) > + return -EAGAIN; > + } In the case where a task is freeing an ondisk inode, an ifree transaction can invoke __xfs_inobt_free_block() twice; Once to free the inobt's leaf block and the next call to free its immediate parent block. The first call to __xfs_inobt_free_block() adds the freed extent into the transaction's busy list and also into the per-ag rb tree tracking the busy extent. Freeing the second inobt block could lead to the following sequence of function calls, __xfs_free_extent() => xfs_free_extent_fix_freelist() => xfs_alloc_fix_freelist() => xfs_alloc_ag_vextent_size() Here, xfs_extent_busy_flush() is invoked with XFS_ALLOC_FLAG_TRYFLUSH flag set. xfs_extent_busy_flush() flushes the CIL and returns to xfs_alloc_ag_vextent_size() since XFS_ALLOC_FLAG_TRYFLUSH was set. xfs_alloc_ag_vextent_size() invokes xfs_extent_busy_flush() once again; albeit without XFS_ALLOC_FLAG_TRYFLUSH flag set. This time xfs_extent_busy_flush() returns -EAGAIN since XFS_ALLOC_FLAG_FREEING flag is set. This will ultimate cause xfs_inactive_ifree() to shutdown the filesystem and cancel a dirty transaction. Please note that the above narrative could be incorrect. However, I have browsed the source code multiple times and am yet to find any evidence to the contrary. > + > + /* Wait for committed busy extents to resolve. */ > do { > prepare_to_wait(&pag->pagb_wait, &wait, TASK_KILLABLE); > if (busy_gen != READ_ONCE(pag->pagb_gen)) > @@ -589,6 +613,7 @@ xfs_extent_busy_flush( > } while (1); > > finish_wait(&pag->pagb_wait, &wait); > + return 0; > } > > void > diff --git a/fs/xfs/xfs_extent_busy.h b/fs/xfs/xfs_extent_busy.h > index 7a82c689bfa4..c37bf87e6781 100644 > --- a/fs/xfs/xfs_extent_busy.h > +++ b/fs/xfs/xfs_extent_busy.h > @@ -51,9 +51,9 @@ bool > xfs_extent_busy_trim(struct xfs_alloc_arg *args, xfs_agblock_t *bno, > xfs_extlen_t *len, unsigned *busy_gen); > > -void > -xfs_extent_busy_flush(struct xfs_mount *mp, struct xfs_perag *pag, > - unsigned busy_gen, uint32_t alloc_flags); > +int > +xfs_extent_busy_flush(struct xfs_trans *tp, struct xfs_perag *pag, > + unsigned busy_gen, uint32_t alloc_flags); > > void > xfs_extent_busy_wait_all(struct xfs_mount *mp); -- chandan