Hi Darrick, do you have further comments? thanks, wengang > On May 12, 2023, at 3:20 PM, Wengang Wang <wen.gang.wang@xxxxxxxxxx> wrote: > > > >> On May 12, 2023, at 2:13 PM, Darrick J. Wong <djwong@xxxxxxxxxx> wrote: >> >> On Fri, May 12, 2023 at 07:55:28PM +0000, Wengang Wang wrote: >>> Hi Darrick, >>> >>>> On May 12, 2023, at 11:24 AM, Darrick J. Wong <djwong@xxxxxxxxxx> wrote: >>>> >>>> Sorry for the slow response, I thought Dave would've responded by now. >>>> >>>> On Mon, Apr 24, 2023 at 03:51:02PM -0700, Wengang Wang wrote: >>>>> To avoid possible deadlock when allocating AGFL blocks, change the behaviour. >>>>> The orignal hehaviour for freeing extents in an EFI is that the extents in >>>>> the EFI were processed one by one in the same transaction. When the second >>>>> and subsequent extents are being processed, we have produced busy extents for >>>>> previous extents. If the second and subsequent extents are from the same AG >>>>> as the busy extents are, we have the risk of deadlock when allocating AGFL >>>>> blocks. A typical calltrace for the deadlock is like this: >>>> >>>> It's been a few weeks, so let me try to remember what this is all about. >>>> You have one EFI containing multiple extents to free: >>>> >>>> {agno: 3, agbno: X, ... } >>>> {agno: 3, agbno: Y, ... } >>>> >>>> AG 3 is nearly full, so we free the first extent (3/X), which adds it to >>>> the bnobt, records the newly freed extent in the extentbusylist, and >>>> attaches the record to the transaction's busy extent list. >>>> >>>> Then we move on to the second record in the EFI item. We want to free >>>> (3/Y), but whoops the AGFL isn't big enough to handle maximal expansion >>>> of the bnobt/cntbt btrees. >>>> >>>> So we try to fix the AG 3 AGFL to be long enough. We can find the space >>>> in the bnobt, but the only space available is the (3/X) that we put >>>> there during the last step. That space is still(!) busy and still >>>> attached to the transaction, so the CIL cannot clear it. The AGFL fixer >>>> sees that the space is busy, so it waits for it... and now we're dead in >>>> the water. >>>> >>>> The key to this deadlock is a thread waiting on its own uncommitted busy >>>> extent, right? >>>> >>> >>> Yep, above is all correct. >>> >>>>> >>>>> #0 context_switch() kernel/sched/core.c:3881 >>>>> #1 __schedule() kernel/sched/core.c:5111 >>>>> #2 schedule() kernel/sched/core.c:5186 >>>>> #3 xfs_extent_busy_flush() fs/xfs/xfs_extent_busy.c:598 >>>>> #4 xfs_alloc_ag_vextent_size() fs/xfs/libxfs/xfs_alloc.c:1641 >>>>> #5 xfs_alloc_ag_vextent() fs/xfs/libxfs/xfs_alloc.c:828 >>>>> #6 xfs_alloc_fix_freelist() fs/xfs/libxfs/xfs_alloc.c:2362 >>>>> #7 xfs_free_extent_fix_freelist() fs/xfs/libxfs/xfs_alloc.c:3029 >>>>> #8 __xfs_free_extent() fs/xfs/libxfs/xfs_alloc.c:3067 >>>>> #9 xfs_trans_free_extent() fs/xfs/xfs_extfree_item.c:370 >>>>> #10 xfs_efi_recover() fs/xfs/xfs_extfree_item.c:626 >>>>> #11 xlog_recover_process_efi() fs/xfs/xfs_log_recover.c:4605 >>>>> #12 xlog_recover_process_intents() fs/xfs/xfs_log_recover.c:4893 >>>>> #13 xlog_recover_finish() fs/xfs/xfs_log_recover.c:5824 >>>>> #14 xfs_log_mount_finish() fs/xfs/xfs_log.c:764 >>>>> #15 xfs_mountfs() fs/xfs/xfs_mount.c:978 >>>>> #16 xfs_fs_fill_super() fs/xfs/xfs_super.c:1908 >>>>> #17 mount_bdev() fs/super.c:1417 >>>>> #18 xfs_fs_mount() fs/xfs/xfs_super.c:1985 >>>>> #19 legacy_get_tree() fs/fs_context.c:647 >>>>> #20 vfs_get_tree() fs/super.c:1547 >>>>> #21 do_new_mount() fs/namespace.c:2843 >>>>> #22 do_mount() fs/namespace.c:3163 >>>>> #23 ksys_mount() fs/namespace.c:3372 >>>>> #24 __do_sys_mount() fs/namespace.c:3386 >>>>> #25 __se_sys_mount() fs/namespace.c:3383 >>>>> #26 __x64_sys_mount() fs/namespace.c:3383 >>>>> #27 do_syscall_64() arch/x86/entry/common.c:296 >>>>> #28 entry_SYSCALL_64() arch/x86/entry/entry_64.S:180 >>>>> >>>>> The deadlock could happen at both IO time and log recover time. >>>>> >>>>> To avoid above deadlock, this patch changes the extent free procedure. >>>>> 1) it always let the first extent from the EFI go (no change). >>>>> 2) increase the (new) AG counter when it let a extent go. >>>>> 3) for the 2nd+ extents, if the owning AGs ready have pending extents >>>>> don't let the extent go with current transaction. Instead, move the >>>>> extent in question and subsequent extents to a new EFI and try the new >>>>> EFI again with new transaction (by rolling current transaction). >>>>> 4) for the EFD to orginal EFI, fill it with all the extents from the original >>>>> EFI. >>>>> 5) though the new EFI is placed after original EFD, it's safe as they are in >>>>> same in-memory transaction. >>>>> 6) The new AG counter for pending extent freeings is decremented after the >>>>> log items in in-memory transaction is committed to CIL. >>>> >>>> So the solution you're proposing is a per-AG counter of however many >>>> threads have started an EFI free. If the second (or subsequent) EFI >>>> free encounters an AG with a nonzero counter, it'll return EAGAIN to >>>> cycle the transaction. The cycling is critical to getting the busy >>>> extent to the CIL so the log can process it and make it unbusy. If the >>>> AG wasn't contended (pag_nr_pending_extents was 0), it proceeds with the >>>> freeing, having bumped the per-AG counter. Right? >>> >>> Right. >>> >>>> >>>> Therefore, the above deadlock sequence now becomes: >>>> >>>> 1. Free (3/X), placing (3/X) on the transaction's busy list. >>>> >>>> 2. Start trying to free (3/Y), notice that AG 3 has elevated >>>> pag_nr_pending_extents, return EAGAIN >>>> >>>> 3. Roll transaction, which moves (3/X) to the CIL and gets us a fresh >>>> transaction >>>> >>>> 4. Try to free (3/Y) again >>>> >>>> At step 4, if pag_nr_pending_extents is still elevated, does that >>>> mean we return EAGAIN and keep rolling the transaction until it's not >>>> elevated? Shouldn't we be able to proceed at step 4, even if we end up >>>> waiting on the log to clear (3/X) from the busy list? >>> >>> The pag_nr_pending_extents is expected to be decremented in step 3. >> >> OH. I missed (and you pointed out below) the subtlety that step 3 gives >> us a fresh transaction /and/ a fresh EFI, so after the roll, (3/Y) is >> the "first" EFI and never has to wait. > > Yes exactly. > >> >>> The changes in xlog_cil_insert_items() is supposed to do so. So at >>> step 4, for this cass {3/x, 3/y}, pag_nr_pending_extents should be >>> decremented to zero, thus step 4 can go (after that busy extent 3/x is >>> cleared). >> >> So even if the log is being slow and the CIL hasn't cleared the busy >> extent (3/X) by the time we finish rolling the tx and relogging the EFI, >> the relogged EFI will start with (3/Y) and move forward without waiting. >> Got it. > > Correct. There is no waiting while we start to process the EFI records, > The point here is that we process the records in the existing transaction or > in new transactions (rolling current one). > >> >>>> >>>> What happens if the log actually clears (3/X) from the busy list after >>>> step 3 but then some other thread starts processing an EFI for (3/Z)? >>> >>> There is a rule in the patch that the first record in an EFI can always go (bumping >>> the per AG counter). That’s because the freeing the first record never has >>> the deadlock issue. >>> >>> The 3/Y would go because now the 3/Y is the first record of the new EFI >>> containing only record of 3/Y. >> >> Oh, ok. I missed that. >> >>> For the other thread for 3/Z, 3/Z can go too because it’s the first record in >>> that EFI (bumping the per AG counter, now the counter could be 3 at most). >> >> <nod> >> >>> >>>> Does that cause this thread to roll repeatedly waiting for another >>>> thread's EFI to clear? >>> >>> Nope. As I said, the first record can always go. For the 2nd and subsequent >>> record, on retry, they would become the first record in the new FEI and it goes. >> >> <nod> >> >>>> >>>> IOWs, I'm not sure we can always make forward progress with this scheme, >>>> and I'd prefer to avoid introducing more global state. >>> >>> See above explanation. >>> >>>> >>>>> Signed-off-by: Wengang Wang <wen.gang.wang@xxxxxxxxxx> >>>>> --- >>>>> fs/xfs/libxfs/xfs_ag.c | 1 + >>>>> fs/xfs/libxfs/xfs_ag.h | 5 ++ >>>>> fs/xfs/xfs_extfree_item.c | 111 +++++++++++++++++++++++++++++++++++++- >>>>> fs/xfs/xfs_log_cil.c | 24 ++++++++- >>>>> 4 files changed, 138 insertions(+), 3 deletions(-) >>>>> >>>>> diff --git a/fs/xfs/libxfs/xfs_ag.c b/fs/xfs/libxfs/xfs_ag.c >>>>> index 86696a1c6891..61ef61e05668 100644 >>>>> --- a/fs/xfs/libxfs/xfs_ag.c >>>>> +++ b/fs/xfs/libxfs/xfs_ag.c >>>>> @@ -378,6 +378,7 @@ xfs_initialize_perag( >>>>> pag->pagb_tree = RB_ROOT; >>>>> #endif /* __KERNEL__ */ >>>>> >>>>> + atomic_set(&pag->pag_nr_pending_extents, 0); >>>>> error = xfs_buf_hash_init(pag); >>>>> if (error) >>>>> goto out_remove_pag; >>>>> diff --git a/fs/xfs/libxfs/xfs_ag.h b/fs/xfs/libxfs/xfs_ag.h >>>>> index 5e18536dfdce..5950bc36a32c 100644 >>>>> --- a/fs/xfs/libxfs/xfs_ag.h >>>>> +++ b/fs/xfs/libxfs/xfs_ag.h >>>>> @@ -82,6 +82,11 @@ struct xfs_perag { >>>>> uint16_t pag_sick; >>>>> spinlock_t pag_state_lock; >>>>> >>>>> + /* >>>>> + * Number of concurrent extent freeings (not committed to CIL yet) >>>>> + * on this AG. >>>>> + */ >>>>> + atomic_t pag_nr_pending_extents; >>>>> spinlock_t pagb_lock; /* lock for pagb_tree */ >>>>> struct rb_root pagb_tree; /* ordered tree of busy extents */ >>>>> unsigned int pagb_gen; /* generation count for pagb_tree */ >>>>> diff --git a/fs/xfs/xfs_extfree_item.c b/fs/xfs/xfs_extfree_item.c >>>>> index 011b50469301..1dbf36d9c1c9 100644 >>>>> --- a/fs/xfs/xfs_extfree_item.c >>>>> +++ b/fs/xfs/xfs_extfree_item.c >>>>> @@ -336,6 +336,75 @@ xfs_trans_get_efd( >>>>> return efdp; >>>>> } >>>>> >>>>> +/* >>>>> + * Fill the EFD with all extents from the EFI and set the counter. >>>>> + * Note: the EFD should comtain at least one extents already. >>>>> + */ >>>>> +static void xfs_fill_efd_with_efi(struct xfs_efd_log_item *efdp) >>>>> +{ >>>>> + struct xfs_efi_log_item *efip = efdp->efd_efip; >>>>> + uint i; >>>>> + >>>>> + i = efdp->efd_next_extent; >>>>> + ASSERT(i > 0); >>>>> + for (; i < efip->efi_format.efi_nextents; i++) { >>>>> + efdp->efd_format.efd_extents[i] = >>>>> + efip->efi_format.efi_extents[i]; >>>> >>>> Why is it necessary to fill the EFD mapping structures? Nobody /ever/ >>>> looks at those; the only part that matters to log recovery is matching >>>> efd.efd_efi_id to efi.efi_id. >>> >>> Yes, that’s for EFI/EFD matching for log recovery. >>> The records to be retried would be in a new EFI. >>> >>>> >>>> But, I guess it looks funny to requeue an EFI and have the EFD for the >>>> old EFI missing a bunch of fields. >>> >>> The EFD to the original EFI is not missing anything, it is expected >>> to be exact the same as if all the records are processed. >>> By this way we move records to new EFI(s) to avoid the deadlock. >>> >>>> >>>>> + } >>>>> + efdp->efd_next_extent = i; >>>>> +} >>>>> + >>>>> +/* >>>>> + * Check if xefi is the first in the efip. >>>>> + * Returns true if so, ad false otherwise >>>>> + */ >>>>> +static bool xfs_is_first_extent_in_efi(struct xfs_efi_log_item *efip, >>>>> + struct xfs_extent_free_item *xefi) >>>>> +{ >>>>> + return efip->efi_format.efi_extents[0].ext_start == >>>>> + xefi->xefi_startblock; >>>>> +} >>>>> + >>>>> +/* >>>>> + * Check if the xefi needs to be in a new transaction. >>>>> + * efip is the containing EFI of xefi. >>>>> + * Return true if so, false otherwise. >>>>> + */ >>>>> +static bool xfs_extent_free_need_new_trans(struct xfs_mount *mp, >>>>> + struct xfs_efi_log_item *efip, >>>>> + struct xfs_extent_free_item *xefi) >>>>> +{ >>>>> + bool ret = true; >>>>> + int nr_pre; >>>>> + xfs_agnumber_t agno; >>>>> + struct xfs_perag *pag; >>>>> + >>>>> + agno = XFS_FSB_TO_AGNO(mp, xefi->xefi_startblock); >>>>> + pag = xfs_perag_get(mp, agno); >>>>> + /* The first extent in EFI is always OK to go */ >>>>> + if (xfs_is_first_extent_in_efi(efip, xefi)) { >>>>> + atomic_inc(&pag->pag_nr_pending_extents); >>>>> + ret = false; >>>>> + goto out_put; >>>>> + } >>>>> + >>>>> + /* >>>>> + * Now the extent is the 2nd or subsequent in the efip. We need >>>>> + * new transaction if the AG already has busy extents pending. >>>>> + */ >>>>> + nr_pre = atomic_inc_return(&pag->pag_nr_pending_extents) - 1; >>>>> + /* No prevoius pending extent freeing to this AG */ >>>>> + if (nr_pre == 0) { >>>>> + ret = false; >>>>> + goto out_put; >>>>> + } >>>>> + >>>>> + atomic_dec(&pag->pag_nr_pending_extents); >>>>> +out_put: >>>>> + xfs_perag_put(pag); >>>>> + return ret; >>>>> +} >>>>> + >>>>> /* >>>>> * Free an extent and log it to the EFD. Note that the transaction is marked >>>>> * dirty regardless of whether the extent free succeeds or fails to support the >>>>> @@ -356,6 +425,28 @@ xfs_trans_free_extent( >>>>> xfs_agblock_t agbno = XFS_FSB_TO_AGBNO(mp, >>>>> xefi->xefi_startblock); >>>>> int error; >>>>> + struct xfs_efi_log_item *efip = efdp->efd_efip; >>>>> + >>>>> + if (xfs_extent_free_need_new_trans(mp, efip, xefi)) { >>>>> + /* >>>>> + * This should be the 2nd+ extent, we don't have to mark the >>>>> + * transaction and efd dirty, those are already done with the >>>>> + * first extent. >>>>> + */ >>>>> + ASSERT(tp->t_flags & XFS_TRANS_DIRTY); >>>>> + ASSERT(tp->t_flags & XFS_TRANS_HAS_INTENT_DONE); >>>>> + ASSERT(test_bit(XFS_LI_DIRTY, &efdp->efd_item.li_flags)); >>>>> + >>>>> + xfs_fill_efd_with_efi(efdp); >>>>> + >>>>> + /* >>>>> + * A preious extent in same AG is processed with the current >>>>> + * transaction. To avoid possible AGFL allocation deadlock, >>>>> + * we roll the transaction and then restart with this extent >>>>> + * with new transaction. >>>>> + */ >>>>> + return -EAGAIN; >>>>> + } >>>>> >>>>> oinfo.oi_owner = xefi->xefi_owner; >>>>> if (xefi->xefi_flags & XFS_EFI_ATTR_FORK) >>>>> @@ -369,6 +460,13 @@ xfs_trans_free_extent( >>>>> error = __xfs_free_extent(tp, xefi->xefi_startblock, >>>>> xefi->xefi_blockcount, &oinfo, XFS_AG_RESV_NONE, >>>>> xefi->xefi_flags & XFS_EFI_SKIP_DISCARD); >>>>> + if (error) { >>>>> + struct xfs_perag *pag; >>>>> + >>>>> + pag = xfs_perag_get(mp, agno); >>>>> + atomic_dec(&pag->pag_nr_pending_extents); >>>>> + xfs_perag_put(pag); >>>>> + } >>>>> /* >>>>> * Mark the transaction dirty, even on error. This ensures the >>>>> * transaction is aborted, which: >>>>> @@ -476,7 +574,8 @@ xfs_extent_free_finish_item( >>>> >>>> This function comes with an unused **state variable: >>>> >>> >>> I don’t see why we need to take different action according to >>> the ‘state’. >>> >>>> STATIC int >>>> xfs_extent_free_finish_item( >>>> struct xfs_trans *tp, >>>> struct xfs_log_item *done, >>>> struct list_head *item, >>>> struct xfs_btree_cur **state) >>>> >>>> What if, after each xfs_trans_free_extent call, we stuff *state with >>>> xefi_startblock's AG? >>>> >>> >>> see Above. >>> >>>> Then, upon entry to xfs_extent_free_finish_item, we compare *state with >>>> the xefi item and return EAGAIN if we're processing an EFI from the same >>>> AG as the previous call to xfs_extent_free_finish_item? Something like >>>> this (totally untested): >>>> >>>> STATIC int >>>> xfs_extent_free_finish_item( >>>> struct xfs_trans *tp, >>>> struct xfs_log_item *done, >>>> struct list_head *item, >>>> struct xfs_btree_cur **state) >>>> { >>>> struct xfs_extent_free_item *xefi; >>>> int error; >>>> >>>> xefi = container_of(item, struct xfs_extent_free_item, xefi_list); >>>> >>>> if (*state) { >>>> struct xfs_perag *oldpag = *(struct xfs_perag **)state; >>>> >>>> /* >>>> * If we're freeing two extents from the same AG, we >>>> * must roll the transaction between the two extents to >>>> * avoid a livelock resulting from AGFL fixing waiting >>>> * on the extent that we just freed to become unbusy. >>>> */ >>>> if (oldpag == xefi->xefi_pag) { >>>> xfs_fill_efd_with_efi(EFD_ITEM(done)); >>>> xfs_perag_put(oldpag); >>>> return -EAGAIN; >>>> } >>> >>> As I said the first record would let go, we don’t care previous AG. >>> >>>> xfs_perag_put(oldpag); >>>> } >>>> >>>> error = xfs_trans_free_extent(tp, EFD_ITEM(done), xefi); >>>> if (!error) >>>> *state = (void *)xfs_perag_hold(xefi->xefi_pag); >>>> >>>> xfs_extent_free_put_group(xefi); >>>> kmem_cache_free(xfs_extfree_item_cache, xefi); >>>> return error; >>>> } >>>> >>>> Obviously, you now need a ->finish_cleanup function to drop the perag >>>> reference that may be stashed in @state. This (I think) avoids the >>>> livelock by ensuring that xfs_trans_free_extent always starts with a >>>> transaction that does not contain any busy extents from the same AG that >>>> it's acting on, right? >>> >>> Nope, we don’t need to pass perag in state. Still the rule: >>> >>> The first record in an EFI (no matter it’s the original EFI or new EFIs) just >>> goes (because it won’t hit the deadlock). >>> >>> For the 2nd or subsequent records, we need to see if there are any pending >>> busy_extent to the same AG as the record belongs to. “pending” means >>> those are still in xfs_trans->t_busy (not committed to xlog_cil_pcp->busy_extents >>> yet). We know that by checking the newly introduced per AG counter >>> pag_nr_pending_extents. If there are, we retry the record in a new EFI >>> in a new transaction after committing the original transaction. With in the new >>> EFI, the 2nd record in the original EFI become the 1st record, and it just can >>> go (because it won’t hit the deadlock). >> >> <nod> So in other words, the pag_nr_pending_extents counter buys us the >> performance advantage that we don't need to add the extra roll between >> (3/X) and (3/Y) if the log is running fast enough to clear the busy >> extent before we even start evaluating the relogged EFI (with 3/Y in >> it). > > I’d think there needs a transaction roll between (3/X) and (3/Y). If we don’t do that > we would have a “pending” busy extent in current xfs_trans->t_busy while processing > (3/Y). And that busy extent had no chance to be cleared because it’s not committed > to xlog_cil_pcp->busy_extents yet and CIL flushing won’t take care of busy extents > which are not in busy_extents. So if no transaction roll in (3/X) and (3/Y), we still > have the risk to the dead lock when allocating blocks for AGFL. > >> >> How often does it happen that we have to roll the transaction and relog >> the EFI? > > It could depend on the use case. This may happen when truncating files. > > >> >>> We are using per AG counter here to know if there are pending busy extents >>> rather than comparing with the involved previous AGs in the same EFI to avoid >>> also the potential ABBA dead lock: >>> thread 1: {3/X, 4/Y} >>> thread 2: {4/Z, 3/T} >>> >>>> >>>> A simpler version of the above would make it so that we always roll >>>> between EFI records and don't have to manage perag references: >>>> >>>> #define XFS_EFI_STATE_ROLL ((struct xfs_btree_cur *)1) >>>> STATIC int >>>> xfs_extent_free_finish_item( >>>> struct xfs_trans *tp, >>>> struct xfs_log_item *done, >>>> struct list_head *item, >>>> struct xfs_btree_cur **state) >>>> { >>>> struct xfs_extent_free_item *xefi; >>>> int error; >>>> >>>> /* >>>> * If we're freeing two extents, roll the transaction between >>>> * the two extents to avoid a livelock resulting from AGFL >>>> * fixing waiting on the extent that we just freed to become >>>> * unbusy. It's not necessary to roll if the previous and >>>> * current EFI record point to different AGs, but this >>>> * simplifies the state tracking. >>>> */ >>>> if (*state == XFS_EFI_STATE_ROLL) { >>>> xfs_fill_efd_with_efi(EFD_ITEM(done)); >>>> *state = NULL; >>>> return -EAGAIN; >>>> } >>>> >>>> xefi = container_of(item, struct xfs_extent_free_item, xefi_list); >>>> >>>> error = xfs_trans_free_extent(tp, EFD_ITEM(done), xefi); >>>> if (!error) >>>> *state = XFS_EFI_STATE_ROLL; >>>> >>>> xfs_extent_free_put_group(xefi); >>>> kmem_cache_free(xfs_extfree_item_cache, xefi); >>>> return error; >>>> } >>>> >>> >>> By this way, it looks to me that the original EFI is always splitted into several ones >>> each including only records. It’s the same result as we change >>> XFS_EFI_MAX_FAST_EXTENTS to 1. Dave doesn’t like that. >> >> Yes. >> >>> My implantation split them only when necessary as Dave suggested. >>> >>>> (Did you and Dave already talk about this?) >>> >>> There many discussion in this two threads: >>> [PATCH 2/2] xfs: log recovery stage split EFIs with multiple extents >>> [PATCH 1/2] xfs: IO time one extent per EFI >> >> Ok, I'll go reread all that. > > thanks, > wengang