> On May 22, 2023, at 5:31 PM, Darrick J. Wong <djwong@xxxxxxxxxx> wrote:
>
> On Mon, May 22, 2023 at 05:25:01PM +0000, Wengang Wang wrote:
>>
>>
>>> On May 20, 2023, at 2:56 PM, Darrick J. Wong <djwong@xxxxxxxxxx> wrote:
>>>
>>> On Fri, May 19, 2023 at 10:18:29AM -0700, Wengang Wang wrote:
>>>> The following call trace is seen:
>>>> #0 context_switch() kernel/sched/core.c:3881
>>>> #1 __schedule() kernel/sched/core.c:5111
>>>> #2 schedule() kernel/sched/core.c:5186
>>>> #3 xfs_extent_busy_flush() fs/xfs/xfs_extent_busy.c:598
>>>> #4 xfs_alloc_ag_vextent_size() fs/xfs/libxfs/xfs_alloc.c:1641
>>>> #5 xfs_alloc_ag_vextent() fs/xfs/libxfs/xfs_alloc.c:828
>>>> #6 xfs_alloc_fix_freelist() fs/xfs/libxfs/xfs_alloc.c:2362
>>>> #7 xfs_free_extent_fix_freelist() fs/xfs/libxfs/xfs_alloc.c:3029
>>>> #8 __xfs_free_extent() fs/xfs/libxfs/xfs_alloc.c:3067
>>>> #9 xfs_trans_free_extent() fs/xfs/xfs_extfree_item.c:370
>>>> #10 xfs_efi_recover() fs/xfs/xfs_extfree_item.c:626
>>>> #11 xlog_recover_process_efi() fs/xfs/xfs_log_recover.c:4605
>>>> #12 xlog_recover_process_intents() fs/xfs/xfs_log_recover.c:4893
>>>> #13 xlog_recover_finish() fs/xfs/xfs_log_recover.c:5824
>>>> #14 xfs_log_mount_finish() fs/xfs/xfs_log.c:764
>>>> #15 xfs_mountfs() fs/xfs/xfs_mount.c:978
>>>> #16 xfs_fs_fill_super() fs/xfs/xfs_super.c:1908
>>>> #17 mount_bdev() fs/super.c:1417
>>>> #18 xfs_fs_mount() fs/xfs/xfs_super.c:1985
>>>> #19 legacy_get_tree() fs/fs_context.c:647
>>>> #20 vfs_get_tree() fs/super.c:1547
>>>> #21 do_new_mount() fs/namespace.c:2843
>>>> #22 do_mount() fs/namespace.c:3163
>>>> #23 ksys_mount() fs/namespace.c:3372
>>>> #24 __do_sys_mount() fs/namespace.c:3386
>>>> #25 __se_sys_mount() fs/namespace.c:3383
>>>> #26 __x64_sys_mount() fs/namespace.c:3383
>>>> #27 do_syscall_64() arch/x86/entry/common.c:296
>>>> #28 entry_SYSCALL_64() arch/x86/entry/entry_64.S:180
>>>>
>>>> The deadlock happens while processing the second and subsequent records
>>>> of an EFI. The task is waiting for the busy blocks to be cleaned, but
>>>> the only busy extent is still held in the current xfs_trans->t_busy.
>>>> That busy extent was added when processing the previous EFI record, and
>>>> because it has not been committed yet, it can't be cleaned.
>>>>
>>>> To avoid the above deadlock, don't block in xfs_extent_busy_flush() when
>>>> allocating AGFL blocks; instead return -EAGAIN. On receiving -EAGAIN we
>>>> retry that EFI record with a new transaction after committing the old
>>>> one. With the old transaction committed, the busy extent attached to it
>>>> gets the chance to be cleaned. On the retry there are no pre-existing
>>>> busy extents in the new transaction, thus no deadlock.
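(For illustration: the retry flow described above boils down to roughly the
sketch below. This is a simplified summary only, not the literal code; the
actual change is the xfs_efi_item_recover() hunk later in the patch, and
error handling is trimmed here.)

	/* walk the recovered EFI records */
	for (i = 0; i < efip->efi_format.efi_nextents; i++) {
		/* set up 'fake' from the current EFI record, as in the patch */
		error = xfs_trans_free_extent(tp, efdp, &fake);
		if (error == -EAGAIN) {
			/*
			 * The AGFL fixup hit a busy extent held by this very
			 * transaction. Defer the record so it is retried in a
			 * new transaction once the current one has committed.
			 */
			xfs_free_extent_later(tp, fake.xefi_startblock,
					fake.xefi_blockcount,
					&XFS_RMAP_OINFO_ANY_OWNER);
			error = 0;
			continue;
		}
	}
	/* committing this transaction lets its busy extents be cleaned */
	return xfs_defer_ops_capture_and_commit(tp, capture_list);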
>>>>
>>>> Signed-off-by: Wengang Wang <wen.gang.wang@xxxxxxxxxx>
>>>> ---
>>>> fs/xfs/libxfs/xfs_alloc.c | 30 ++++++++++++++++++++++++------
>>>> fs/xfs/libxfs/xfs_alloc.h | 2 ++
>>>> fs/xfs/scrub/repair.c | 4 ++--
>>>> fs/xfs/xfs_extent_busy.c | 34 +++++++++++++++++++++++++++++-----
>>>> fs/xfs/xfs_extent_busy.h | 6 +++---
>>>> fs/xfs/xfs_extfree_item.c | 37 ++++++++++++++++++++++++++++++++++++-
>>>> fs/xfs/xfs_log_recover.c | 23 ++++++++++-------------
>>>> fs/xfs/xfs_trans_ail.c | 2 +-
>>>> fs/xfs/xfs_trans_priv.h | 1 +
>>>> 9 files changed, 108 insertions(+), 31 deletions(-)
>>>>
>>>> diff --git a/fs/xfs/libxfs/xfs_alloc.c b/fs/xfs/libxfs/xfs_alloc.c
>>>> index 203f16c48c19..abfd2acb3053 100644
>>>> --- a/fs/xfs/libxfs/xfs_alloc.c
>>>> +++ b/fs/xfs/libxfs/xfs_alloc.c
>>>> @@ -1491,6 +1491,7 @@ STATIC int
>>>> xfs_alloc_ag_vextent_near(
>>>> struct xfs_alloc_arg *args)
>>>> {
>>>> + int flags = args->flags | XFS_ALLOC_FLAG_TRYFLUSH;
>>>> struct xfs_alloc_cur acur = {};
>>>> int error; /* error code */
>>>> int i; /* result code, temporary */
>>>> @@ -1564,8 +1565,11 @@ xfs_alloc_ag_vextent_near(
>>>> if (!acur.len) {
>>>> if (acur.busy) {
>>>> trace_xfs_alloc_near_busy(args);
>>>> - xfs_extent_busy_flush(args->mp, args->pag,
>>>> - acur.busy_gen);
>>>> + error = xfs_extent_busy_flush(args->tp, args->pag,
>>>> + acur.busy_gen, flags);
>>>> + if (error)
>>>> + goto out;
>>>> + flags &= ~XFS_ALLOC_FLAG_TRYFLUSH;
>>>> goto restart;
>>>> }
>>>> trace_xfs_alloc_size_neither(args);
>>>> @@ -1592,6 +1596,7 @@ STATIC int /* error */
>>>> xfs_alloc_ag_vextent_size(
>>>> xfs_alloc_arg_t *args) /* allocation argument structure */
>>>> {
>>>> + int flags = args->flags | XFS_ALLOC_FLAG_TRYFLUSH;
>>>
>>> This variable holds XFS_ALLOC_FLAG_* values that are passed only to
>>> xfs_extent_busy_flush, right? Could this be named "busyflags" or
>>> similar to make that clearer?
>>>
>>> (Clearer to 6hr^W6mo from now Darrick who won't remember this at all.)
>>
>> Yes, the ‘flags’ is only passed to xfs_extent_busy_flush(). It will be
>> ‘busyflags’ in the next drop.
>
> Ok.
>
>>>
>>>> struct xfs_agf *agf = args->agbp->b_addr;
>>>> struct xfs_btree_cur *bno_cur; /* cursor for bno btree */
>>>> struct xfs_btree_cur *cnt_cur; /* cursor for cnt btree */
>>>> @@ -1670,8 +1675,13 @@ xfs_alloc_ag_vextent_size(
>>>> xfs_btree_del_cursor(cnt_cur,
>>>> XFS_BTREE_NOERROR);
>>>> trace_xfs_alloc_size_busy(args);
>>>> - xfs_extent_busy_flush(args->mp,
>>>> - args->pag, busy_gen);
>>>> + error = xfs_extent_busy_flush(args->tp, args->pag,
>>>> + busy_gen, flags);
>>>> + if (error) {
>>>> + cnt_cur = NULL;
>>>> + goto error0;
>>>> + }
>>>> + flags &= ~XFS_ALLOC_FLAG_TRYFLUSH;
>>>> goto restart;
>>>> }
>>>> }
>>>> @@ -1755,7 +1765,13 @@ xfs_alloc_ag_vextent_size(
>>>> if (busy) {
>>>> xfs_btree_del_cursor(cnt_cur, XFS_BTREE_NOERROR);
>>>> trace_xfs_alloc_size_busy(args);
>>>> - xfs_extent_busy_flush(args->mp, args->pag, busy_gen);
>>>> + error = xfs_extent_busy_flush(args->tp, args->pag,
>>>> + busy_gen, flags);
>>>> + if (error) {
>>>> + cnt_cur = NULL;
>>>> + goto error0;
>>>> + }
>>>> + flags &= ~XFS_ALLOC_FLAG_TRYFLUSH;
>>>> goto restart;
>>>> }
>>>> goto out_nominleft;
>>>> @@ -2629,6 +2645,7 @@ xfs_alloc_fix_freelist(
>>>> targs.agno = args->agno;
>>>> targs.alignment = targs.minlen = targs.prod = 1;
>>>> targs.pag = pag;
>>>> + targs.flags = args->flags & XFS_ALLOC_FLAG_FREEING;
>>>
>>> Hmm, I guess this propagates FREEING into the allocation that lengthens
>>> the AGFL. Right?
>>
>> Yes, the original call path is like this:
>> xfs_free_extent_fix_freelist()
>> xfs_alloc_fix_freelist(args, XFS_ALLOC_FLAG_FREEING as flags)
>> xfs_alloc_ag_vextent_size(targs)
>> xfs_extent_busy_flush(tp, pag, busy_gen, flags)
>>
>> We don’t have a ‘flags’ parameter to xfs_alloc_ag_vextent_size().
>> Since the args itself describes the allocation, the ‘flags’ serves the
>> same purpose, so I moved ‘flags’ into args (struct xfs_alloc_arg). With
>> that, xfs_alloc_fix_freelist() no longer needs its ‘flags’ parameter,
>> because the flags are already in ‘args’. In my draft code (not posted to
>> this list) I removed the ‘flags’ parameter from xfs_alloc_fix_freelist().
>> But when I reviewed the change I found that ‘flags’ is used heavily in
>> that function, and accessing ‘args->flags’ instead of a local ‘flags’ may
>> cost some extra CPU cycles. So in the patch posted for review I kept the
>> ‘flags’ parameter, which makes the code ugly :D
>
> I prefer either passing @flags around as a function parameter like we
> always have; or stuffing it in xfs_alloc_arg. Not both, that makes it
> much harder to understand.
>

Sure, so I will use the flags in xfs_alloc_arg.

>>>
>>>> error = xfs_alloc_read_agfl(pag, tp, &agflbp);
>>>> if (error)
>>>> goto out_agbp_relse;
>>>> @@ -3572,6 +3589,7 @@ xfs_free_extent_fix_freelist(
>>>> args.mp = tp->t_mountp;
>>>> args.agno = pag->pag_agno;
>>>> args.pag = pag;
>>>> + args.flags = XFS_ALLOC_FLAG_FREEING;
>>>>
>>>> /*
>>>> * validate that the block number is legal - the enables us to detect
>>>> @@ -3580,7 +3598,7 @@
>>>> if (args.agno >= args.mp->m_sb.sb_agcount)
>>>> return -EFSCORRUPTED;
>>>>
>>>> - error = xfs_alloc_fix_freelist(&args, XFS_ALLOC_FLAG_FREEING);
>>>> + error = xfs_alloc_fix_freelist(&args, args.flags);
>>>
>>> Hm. Why stuff XFS_ALLOC_FLAG_FREEING into args.flags here? I /think/
>>> in this case it doesn't matter because nothing under
>>> xfs_alloc_fix_freelist accesses args->flags...?
>>>
>>> AFAICT in xfs_alloc_fix_freelist, the difference between @args->flags
>>> and @flags is that @args is the allocation that we're trying to do,
>>> whereas @flags are for any allocation that might need to be done to fix
>>> the freelist. IOWS, the only allocation we're doing *is* to fix the
>>> freelist, so they are one and the same. Maybe? Otherwise I can't tell
>>> why we'd convey two sets of XFS_ALLOC flags to xfs_alloc_fix_freelist.
>>>
>>> So while I think this isn't incorrect, the flag mixing bothers me a bit
>>> because setting fields in a struct can have weird, not so obvious side
>>> effects.
>>>
>>> (Still trying to make sense of the addition of a @flags field to struct
>>> xfs_alloc_arg.)
>>
>> Right, see my explanation above.
>>
>>>
>>>> if (error)
>>>> return error;
>>>>
>>>> diff --git a/fs/xfs/libxfs/xfs_alloc.h b/fs/xfs/libxfs/xfs_alloc.h
>>>> index 2b246d74c189..5038fba87784 100644
>>>> --- a/fs/xfs/libxfs/xfs_alloc.h
>>>> +++ b/fs/xfs/libxfs/xfs_alloc.h
>>>> @@ -24,6 +24,7 @@ unsigned int xfs_agfl_size(struct xfs_mount *mp);
>>>> #define XFS_ALLOC_FLAG_NORMAP 0x00000004 /* don't modify the rmapbt */
>>>> #define XFS_ALLOC_FLAG_NOSHRINK 0x00000008 /* don't shrink the freelist */
>>>> #define XFS_ALLOC_FLAG_CHECK 0x00000010 /* test only, don't modify args */
>>>> +#define XFS_ALLOC_FLAG_TRYFLUSH 0x00000020 /* don't block in busy extent flush */
>>>>
>>>> /*
>>>> * Argument structure for xfs_alloc routines.
>>>> @@ -57,6 +58,7 @@ typedef struct xfs_alloc_arg {
>>>> #ifdef DEBUG
>>>> bool alloc_minlen_only; /* allocate exact minlen extent */
>>>> #endif
>>>> + int flags; /* XFS_ALLOC_FLAG_* */
>>>> } xfs_alloc_arg_t;
>>>>
>>>> /*
>>>> diff --git a/fs/xfs/scrub/repair.c b/fs/xfs/scrub/repair.c
>>>> index 1b71174ec0d6..2ba28e4257fe 100644
>>>> --- a/fs/xfs/scrub/repair.c
>>>> +++ b/fs/xfs/scrub/repair.c
>>>> @@ -496,9 +496,9 @@ xrep_fix_freelist(
>>>> args.agno = sc->sa.pag->pag_agno;
>>>> args.alignment = 1;
>>>> args.pag = sc->sa.pag;
>>>> + args.flags = can_shrink ? 0 : XFS_ALLOC_FLAG_NOSHRINK;
>>>>
>>>> - return xfs_alloc_fix_freelist(&args,
>>>> - can_shrink ? 0 : XFS_ALLOC_FLAG_NOSHRINK);
>>>> + return xfs_alloc_fix_freelist(&args, args.flags);
>>>> }
>>>>
>>>> /*
>>>> diff --git a/fs/xfs/xfs_extent_busy.c b/fs/xfs/xfs_extent_busy.c
>>>> index f3d328e4a440..ea1c1857bf5b 100644
>>>> --- a/fs/xfs/xfs_extent_busy.c
>>>> +++ b/fs/xfs/xfs_extent_busy.c
>>>> @@ -567,18 +567,41 @@ xfs_extent_busy_clear(
>>>> /*
>>>> * Flush out all busy extents for this AG.
>>>> */
>>>> -void
>>>> +int
>>>> xfs_extent_busy_flush(
>>>> - struct xfs_mount *mp,
>>>> + struct xfs_trans *tp,
>>>> struct xfs_perag *pag,
>>>> - unsigned busy_gen)
>>>> + unsigned busy_gen,
>>>> + int flags)
>>>> {
>>>> DEFINE_WAIT(wait);
>>>> int error;
>>>>
>>>> - error = xfs_log_force(mp, XFS_LOG_SYNC);
>>>> + error = xfs_log_force(tp->t_mountp, XFS_LOG_SYNC);
>>>> if (error)
>>>> - return;
>>>> + return error;
>>>> +
>>>> + /*
>>>> + * If we are holding busy extents, the caller may not want to block
>>>> + * straight away. If we are being told just to try a flush or progress
>>>> + * has been made since we last skipped a busy extent, return
>>>> + * immediately to allow the caller to try again. If we are freeing
>>>> + * extents, we might actually be holding the only free extents in the
>>>> + * transaction busy list and the log force won't resolve that
>>>> + * situation. In that case, return -EAGAIN to tell the caller it
>>>> + * needs to commit the busy extents it holds before retrying the
>>>> + * extent free operation.
>>>> + */
>>>> + if (!list_empty(&tp->t_busy)) {
>>>> + if (flags & XFS_ALLOC_FLAG_TRYFLUSH)
>>>> + return 0;
>>>> +
>>>> + if (busy_gen != READ_ONCE(pag->pagb_gen))
>>>> + return 0;
>>>> +
>>>> + if (flags & XFS_ALLOC_FLAG_FREEING)
>>>> + return -EAGAIN;
>>>> + }
>>>>
>>>> do {
>>>> prepare_to_wait(&pag->pagb_wait, &wait, TASK_KILLABLE);
>>>> @@ -588,6 +611,7 @@ xfs_extent_busy_flush(
>>>> } while (1);
>>>>
>>>> finish_wait(&pag->pagb_wait, &wait);
>>>> + return 0;
>>>
>>> This part looks good.
>>>
>>>> }
>>>>
>>>> void
>>>> diff --git a/fs/xfs/xfs_extent_busy.h b/fs/xfs/xfs_extent_busy.h
>>>> index 4a118131059f..edeedb92e0df 100644
>>>> --- a/fs/xfs/xfs_extent_busy.h
>>>> +++ b/fs/xfs/xfs_extent_busy.h
>>>> @@ -51,9 +51,9 @@ bool
>>>> xfs_extent_busy_trim(struct xfs_alloc_arg *args, xfs_agblock_t *bno,
>>>> xfs_extlen_t *len, unsigned *busy_gen);
>>>>
>>>> -void
>>>> -xfs_extent_busy_flush(struct xfs_mount *mp, struct xfs_perag *pag,
>>>> - unsigned busy_gen);
>>>> +int
>>>> +xfs_extent_busy_flush(struct xfs_trans *tp, struct xfs_perag *pag,
>>>> + unsigned busy_gen, int flags);
>>>>
>>>> void
>>>> xfs_extent_busy_wait_all(struct xfs_mount *mp);
>>>> diff --git a/fs/xfs/xfs_extfree_item.c b/fs/xfs/xfs_extfree_item.c
>>>> index 011b50469301..3c5a9e9952ec 100644
>>>> --- a/fs/xfs/xfs_extfree_item.c
>>>> +++ b/fs/xfs/xfs_extfree_item.c
>>>> @@ -336,6 +336,25 @@ xfs_trans_get_efd(
>>>> return efdp;
>>>> }
>>>>
>>>> +/*
>>>> + * Fill the EFD with all extents from the EFI and set the counter.
>>>> + * Note: the EFD should contain at least one extent already.
>>>> + */
>>>> +static void xfs_fill_efd_with_efi(struct xfs_efd_log_item *efdp)
>>>> +{
>>>> + struct xfs_efi_log_item *efip = efdp->efd_efip;
>>>> + uint i;
>>>> +
>>>> + if (efdp->efd_next_extent == efip->efi_format.efi_nextents)
>>>> + return;
>>>> +
>>>> + for (i = 0; i < efip->efi_format.efi_nextents; i++) {
>>>> + efdp->efd_format.efd_extents[i] =
>>>> + efip->efi_format.efi_extents[i];
>>>> + }
>>>> + efdp->efd_next_extent = efip->efi_format.efi_nextents;
>>>> +}
>>>> +
>>>> /*
>>>> * Free an extent and log it to the EFD. Note that the transaction is marked
>>>> * dirty regardless of whether the extent free succeeds or fails to support the
>>>> @@ -369,6 +388,10 @@ xfs_trans_free_extent(
>>>> error = __xfs_free_extent(tp, xefi->xefi_startblock,
>>>> xefi->xefi_blockcount, &oinfo, XFS_AG_RESV_NONE,
>>>> xefi->xefi_flags & XFS_EFI_SKIP_DISCARD);
>>>> + if (error == -EAGAIN) {
>>>> + xfs_fill_efd_with_efi(efdp);
>>>> + return error;
>>>> + }
>>>> /*
>>>> * Mark the transaction dirty, even on error. This ensures the
>>>> * transaction is aborted, which:
>>>> @@ -476,7 +499,8 @@ xfs_extent_free_finish_item(
>>>> xefi = container_of(item, struct xfs_extent_free_item, xefi_list);
>>>>
>>>> error = xfs_trans_free_extent(tp, EFD_ITEM(done), xefi);
>>>> - kmem_cache_free(xfs_extfree_item_cache, xefi);
>>>> + if (error != -EAGAIN)
>>>> + kmem_cache_free(xfs_extfree_item_cache, xefi);
>>>> return error;
>>>> }
>>>>
>>>> @@ -633,6 +657,17 @@ xfs_efi_item_recover(
>>>> fake.xefi_blockcount = extp->ext_len;
>>>>
>>>> error = xfs_trans_free_extent(tp, efdp, &fake);
>>>> + if (error == -EAGAIN) {
>>>
>>> Curious, is this patch based off of 6.4-rc? There should be a
>>> xfs_extent_free_get_group / xfs_extent_free_put_group here.
>>
>> It’s still 6.3.0-rc6, the newest when I started working on the deadlock
>> issue. I will rebase to the current version for the next drop.
>
> Ok.
>
>>>
>>>> + xfs_free_extent_later(tp, fake.xefi_startblock,
>>>> + fake.xefi_blockcount, &XFS_RMAP_OINFO_ANY_OWNER);
>>>> + /*
>>>> + * try to free as many extents as possible with the current
>>>> + * transaction
>>>> + */
>>>> + error = 0;
>>>> + continue;
>>>> + };
>>>
>>> If we get EAGAIN, why not let the unfinished parts of the recovered EFI
>>> work get turned into a new EFI?
>>
>> That is an optimization. I wanted to finish more records with the current
>> transaction when possible.
>> Say we hit the first EAGAIN on (AG1, bno2, len2) in this EFI:
>> EFI {(AG1, bno1, len1), (AG1, bno2, len2), (AG2, bno3, len3), (AG2, bno4, len4)}
>> With the current patch, we need 1 extra EFI:
>> orig EFI {(AG1, bno1, len1), (AG2, bno3, len3)} # records shown are the ones actually done
>> new EFI1 {(AG1, bno2, len2), (AG2, bno4, len4)}
>>
>> Otherwise, if we move all the remaining records (from the current one on)
>> to a new EFI on the first EAGAIN, we would need:
>> orig EFI {(AG1, bno1, len1)}
>> new EFI1 {(AG1, bno2, len2), (AG2, bno3, len3)}
>> new EFI2 {(AG2, bno4, len4)}
>>
>> In the above case, we save one EFI with the optimization.
>
> The trouble here is that now we're performing updates during log
> recovery in a different order than they would have been done by the
> regular fs code. I /think/ for EFIs this doesn't matter (it definitely
> matters for the other intent items!) but we (maintainers) prefer not to
> have to think about two different orderings.

OK.

>
>>>
>>>> +
>>>> if (error == -EFSCORRUPTED)
>>>> XFS_CORRUPTION_ERROR(__func__, XFS_ERRLEVEL_LOW, mp,
>>>> extp, sizeof(*extp));
>>>> diff --git a/fs/xfs/xfs_log_recover.c b/fs/xfs/xfs_log_recover.c
>>>> index 322eb2ee6c55..00bfe9683fa8 100644
>>>> --- a/fs/xfs/xfs_log_recover.c
>>>> +++ b/fs/xfs/xfs_log_recover.c
>>>> @@ -2540,30 +2540,27 @@ xlog_recover_process_intents(
>>>> struct xfs_log_item *lip;
>>>> struct xfs_ail *ailp;
>>>> int error = 0;
>>>> -#if defined(DEBUG) || defined(XFS_WARN)
>>>> - xfs_lsn_t last_lsn;
>>>> -#endif
>>>> + xfs_lsn_t threshold_lsn;
>>>>
>>>> ailp = log->l_ailp;
>>>> + threshold_lsn = xfs_ail_max_lsn(ailp);
>>>> spin_lock(&ailp->ail_lock);
>>>> -#if defined(DEBUG) || defined(XFS_WARN)
>>>> - last_lsn = xlog_assign_lsn(log->l_curr_cycle, log->l_curr_block);
>>>> -#endif
>>>> +
>>>> for (lip = xfs_trans_ail_cursor_first(ailp, &cur, 0);
>>>> lip != NULL;
>>>> lip = xfs_trans_ail_cursor_next(ailp, &cur)) {
>>>> const struct xfs_item_ops *ops;
>>>> + /*
>>>> + * The original redo EFI could be split into new EFIs. Those
>>>> + * new EFIs are supposed to be processed via capture_list.
>>>> + * Stop here when the original redo intents are done.
>>>> + */
>>>> + if (XFS_LSN_CMP(threshold_lsn, lip->li_lsn) < 0)
>>>> + break;
>>>
>>> I'm not sure what this change accomplishes? Is it because of the
>>> continue; statement in xfs_efi_item_recover?
>>
>> The function xlog_recover_process_intents() iterates over all the intent
>> items in the AIL and processes them. According to the comments:
>>
>> 2519 * When this is called, all of the log intent items which did not have
>> 2520 * corresponding log done items should be in the AIL. What we do now is update
>> 2521 * the data structures associated with each one.
>>
>> I am thinking xlog_recover_process_intents() should iterate over only the
>> ones which are already in the AIL. If we don’t set a threshold here, it
>> will go on to process new intents that are added while we are processing
>> the existing ones. In this case, we add new EFIs, and they are added to
>> the AIL by the commit callbacks quickly enough to show up before we finish
>> the iteration here. The new EFIs are then picked up by
>> xlog_recover_process_intents(). If xlog_recover_process_intents()
>> continues to process these new EFIs, the extents in them get freed twice,
>> once here and once via xlog_finish_defer_ops(), which causes problems.
>
> Each ->iop_recover function is supposed to do at least one unit of work,
> which should generate new non-intent log items.

I am not quite sure about the above.
There are cases where new intents are added in iop_recover(); for example,
xfs_attri_item_recover():

632         error = xfs_xattri_finish_update(attr, done_item);
633         if (error == -EAGAIN) {
634                 /*
635                  * There's more work to do, so add the intent item to this
636                  * transaction so that we can continue it later.
637                  */
638                 xfs_defer_add(tp, XFS_DEFER_OPS_TYPE_ATTR, &attr->xattri_list);
639                 error = xfs_defer_ops_capture_and_commit(tp, capture_list);
640                 if (error)
641                         goto out_unlock;
642
643                 xfs_iunlock(ip, XFS_ILOCK_EXCL);
644                 xfs_irele(ip);
645                 return 0;
646         }

I am thinking lines 638 and 639 are doing so.

> These new items are
> attached to the end of the AIL's list before any new intent items
> created to handle an EAGAIN return. Hence the list order ought to be:
>
> <recovered EFI> -> <buffer log items for bnobt updates> ->
> <new EFI to continue because busy flush didn't move busy gen>
>
> Are you seeing an order other than this?

The order in the log looks good. It just seems that we should fix
xlog_recover_process_intents() not to run iop_recover() for the items whose
lsn is equal to or greater than last_lsn, to avoid the double free issue.
Please also see my reply to Dave about the double free.

thanks,
wengang
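P.S. To illustrate what I mean by the last_lsn check, here is a rough sketch
of the xlog_recover_process_intents() change I have in mind. It is untested
and only shows the idea: keep last_lsn unconditionally (today it is only
computed for DEBUG/XFS_WARN builds, as the hunk above shows) and skip items
at or beyond it instead of recovering them.

	xfs_lsn_t		last_lsn;

	ailp = log->l_ailp;
	spin_lock(&ailp->ail_lock);
	last_lsn = xlog_assign_lsn(log->l_curr_cycle, log->l_curr_block);
	for (lip = xfs_trans_ail_cursor_first(ailp, &cur, 0);
	     lip != NULL;
	     lip = xfs_trans_ail_cursor_next(ailp, &cur)) {
		/*
		 * Intents added while recovering other intents (e.g. the new
		 * EFIs created on -EAGAIN) have an lsn at or beyond last_lsn.
		 * They are finished via capture_list, so don't run
		 * ->iop_recover() on them here. The AIL is ordered by lsn,
		 * so we can stop at the first such item.
		 */
		if (XFS_LSN_CMP(lip->li_lsn, last_lsn) >= 0)
			break;

		/* existing per-intent recovery continues here */
	}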