On Fri, May 12, 2023 at 07:55:28PM +0000, Wengang Wang wrote:
> Hi Darrick,
> 
> > On May 12, 2023, at 11:24 AM, Darrick J. Wong <djwong@xxxxxxxxxx> wrote:
> > 
> > Sorry for the slow response, I thought Dave would've responded by now.
> > 
> > On Mon, Apr 24, 2023 at 03:51:02PM -0700, Wengang Wang wrote:
> >> To avoid possible deadlock when allocating AGFL blocks, change the behaviour.
> >> The original behaviour for freeing extents in an EFI is that the extents in
> >> the EFI were processed one by one in the same transaction. When the second
> >> and subsequent extents are being processed, we have produced busy extents for
> >> previous extents. If the second and subsequent extents are from the same AG
> >> as the busy extents are, we have the risk of deadlock when allocating AGFL
> >> blocks. A typical calltrace for the deadlock is like this:
> > 
> > It's been a few weeks, so let me try to remember what this is all about.
> > You have one EFI containing multiple extents to free:
> > 
> > {agno: 3, agbno: X, ... }
> > {agno: 3, agbno: Y, ... }
> > 
> > AG 3 is nearly full, so we free the first extent (3/X), which adds it to
> > the bnobt, records the newly freed extent in the extent busy list, and
> > attaches the record to the transaction's busy extent list.
> > 
> > Then we move on to the second record in the EFI item. We want to free
> > (3/Y), but whoops, the AGFL isn't big enough to handle maximal expansion
> > of the bnobt/cntbt btrees.
> > 
> > So we try to fix the AG 3 AGFL to be long enough. We can find the space
> > in the bnobt, but the only space available is the (3/X) that we put
> > there during the last step. That space is still(!) busy and still
> > attached to the transaction, so the CIL cannot clear it. The AGFL fixer
> > sees that the space is busy, so it waits for it... and now we're dead in
> > the water.
> > 
> > The key to this deadlock is a thread waiting on its own uncommitted busy
> > extent, right?
> 
> Yep, above is all correct.
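To make that failure mode concrete, here is a tiny stand-alone model of it
(plain user-space C, not XFS code: "xact", free_extent() and fix_freelist()
are invented stand-ins for xfs_trans, xfs_trans_free_extent() and
xfs_alloc_fix_freelist(), and it assumes AG 3 is so full that the only
candidate free space is what this same transaction just freed):

/*
 * Stand-alone model of the hang described above -- not XFS code.
 */
#include <stdbool.h>
#include <stdio.h>

#define MAX_BUSY        8

struct xact {
        /* extents freed by this transaction; they stay busy until it commits */
        unsigned int    busy_agno[MAX_BUSY];
        unsigned int    busy_agbno[MAX_BUSY];
        int             nr_busy;
};

/* Freeing an extent leaves it busy and attached to the transaction. */
static void free_extent(struct xact *tp, unsigned int agno, unsigned int agbno)
{
        tp->busy_agno[tp->nr_busy] = agno;
        tp->busy_agbno[tp->nr_busy] = agbno;
        tp->nr_busy++;
}

/*
 * AGFL fixing wants a free block in @agno.  If the only free space is an
 * extent on this transaction's own busy list, waiting for it to become
 * unbusy can never succeed: the CIL cannot clear it until the transaction
 * commits, and the transaction cannot commit while we are waiting here.
 */
static bool fix_freelist(struct xact *tp, unsigned int agno)
{
        for (int i = 0; i < tp->nr_busy; i++) {
                if (tp->busy_agno[i] == agno) {
                        printf("AG %u: would wait on own busy extent %u -> deadlock\n",
                               agno, tp->busy_agbno[i]);
                        return false;
                }
        }
        return true;
}

int main(void)
{
        struct xact tp = { .nr_busy = 0 };

        free_extent(&tp, 3, 100);       /* EFI record 1: (3/X) */
        if (!fix_freelist(&tp, 3))      /* record 2 (3/Y) needs the AGFL fixed */
                return 1;
        free_extent(&tp, 3, 200);       /* never reached in this model */
        return 0;
}

The only point of the model is that nothing this transaction can do will ever
satisfy the wait in fix_freelist(), because the busy extent it waits on cannot
reach the CIL before the transaction commits.
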
> >> 
> >> #0 context_switch() kernel/sched/core.c:3881
> >> #1 __schedule() kernel/sched/core.c:5111
> >> #2 schedule() kernel/sched/core.c:5186
> >> #3 xfs_extent_busy_flush() fs/xfs/xfs_extent_busy.c:598
> >> #4 xfs_alloc_ag_vextent_size() fs/xfs/libxfs/xfs_alloc.c:1641
> >> #5 xfs_alloc_ag_vextent() fs/xfs/libxfs/xfs_alloc.c:828
> >> #6 xfs_alloc_fix_freelist() fs/xfs/libxfs/xfs_alloc.c:2362
> >> #7 xfs_free_extent_fix_freelist() fs/xfs/libxfs/xfs_alloc.c:3029
> >> #8 __xfs_free_extent() fs/xfs/libxfs/xfs_alloc.c:3067
> >> #9 xfs_trans_free_extent() fs/xfs/xfs_extfree_item.c:370
> >> #10 xfs_efi_recover() fs/xfs/xfs_extfree_item.c:626
> >> #11 xlog_recover_process_efi() fs/xfs/xfs_log_recover.c:4605
> >> #12 xlog_recover_process_intents() fs/xfs/xfs_log_recover.c:4893
> >> #13 xlog_recover_finish() fs/xfs/xfs_log_recover.c:5824
> >> #14 xfs_log_mount_finish() fs/xfs/xfs_log.c:764
> >> #15 xfs_mountfs() fs/xfs/xfs_mount.c:978
> >> #16 xfs_fs_fill_super() fs/xfs/xfs_super.c:1908
> >> #17 mount_bdev() fs/super.c:1417
> >> #18 xfs_fs_mount() fs/xfs/xfs_super.c:1985
> >> #19 legacy_get_tree() fs/fs_context.c:647
> >> #20 vfs_get_tree() fs/super.c:1547
> >> #21 do_new_mount() fs/namespace.c:2843
> >> #22 do_mount() fs/namespace.c:3163
> >> #23 ksys_mount() fs/namespace.c:3372
> >> #24 __do_sys_mount() fs/namespace.c:3386
> >> #25 __se_sys_mount() fs/namespace.c:3383
> >> #26 __x64_sys_mount() fs/namespace.c:3383
> >> #27 do_syscall_64() arch/x86/entry/common.c:296
> >> #28 entry_SYSCALL_64() arch/x86/entry/entry_64.S:180
> >> 
> >> The deadlock could happen at both IO time and log recovery time.
> >> 
> >> To avoid the above deadlock, this patch changes the extent free procedure.
> >> 1) it always lets the first extent from the EFI go (no change).
> >> 2) increase the (new) AG counter when it lets an extent go.
> >> 3) for the 2nd+ extents, if the owning AGs already have pending extents,
> >> don't let the extent go with the current transaction. Instead, move the
> >> extent in question and subsequent extents to a new EFI and try the new
> >> EFI again with a new transaction (by rolling the current transaction).
> >> 4) for the EFD of the original EFI, fill it with all the extents from the
> >> original EFI.
> >> 5) though the new EFI is placed after the original EFD, it's safe as they
> >> are in the same in-memory transaction.
> >> 6) The new AG counter for pending extent freeings is decremented after the
> >> log items in the in-memory transaction are committed to the CIL.
> > 
> > So the solution you're proposing is a per-AG counter of however many
> > threads have started an EFI free. If the second (or subsequent) EFI
> > free encounters an AG with a nonzero counter, it'll return EAGAIN to
> > cycle the transaction. The cycling is critical to getting the busy
> > extent to the CIL so the log can process it and make it unbusy. If the
> > AG wasn't contended (pag_nr_pending_extents was 0), it proceeds with the
> > freeing, having bumped the per-AG counter. Right?
> 
> Right.
> 
> > Therefore, the above deadlock sequence now becomes:
> > 
> > 1. Free (3/X), placing (3/X) on the transaction's busy list.
> > 
> > 2. Start trying to free (3/Y), notice that AG 3 has elevated
> > pag_nr_pending_extents, return EAGAIN
> > 
> > 3. Roll transaction, which moves (3/X) to the CIL and gets us a fresh
> > transaction
> > 
> > 4. Try to free (3/Y) again
> > 
> > At step 4, if pag_nr_pending_extents is still elevated, does that
> > mean we return EAGAIN and keep rolling the transaction until it's not
> > elevated? Shouldn't we be able to proceed at step 4, even if we end up
> > waiting on the log to clear (3/X) from the busy list?
> 
> The pag_nr_pending_extents is expected to be decremented in step 3.

OH. I missed (and you pointed out below) the subtlety that step 3 gives us a
fresh transaction /and/ a fresh EFI, so after the roll, (3/Y) is the "first"
EFI and never has to wait.

> The changes in xlog_cil_insert_items() are supposed to do so. So at
> step 4, for this case {3/X, 3/Y}, pag_nr_pending_extents should be
> decremented to zero, thus step 4 can go (after that busy extent 3/X is
> cleared).

So even if the log is being slow and the CIL hasn't cleared the busy extent
(3/X) by the time we finish rolling the tx and relogging the EFI, the relogged
EFI will start with (3/Y) and move forward without waiting. Got it.
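Just to convince myself I have the lifecycle right, here's a rough stand-alone
model of the scheme as I understand it (plain C, not the patch itself:
pending[] stands in for pag_nr_pending_extents, gate_record() for
xfs_extent_free_need_new_trans(), and commit_to_cil() for the roll that moves
the busy extents to the CIL and drops the counts):

#include <errno.h>
#include <stdatomic.h>
#include <stdio.h>

#define NR_AGS  4

/* Stand-in for pag_nr_pending_extents: uncommitted frees per AG. */
static atomic_int pending[NR_AGS];

struct efi {
        unsigned int    agno[8];        /* AG of each extent record */
        int             first;          /* first record of the current EFI */
        int             nr;
};

/* Return 0 if record @i may be freed now, -EAGAIN if we must roll first. */
static int gate_record(struct efi *efi, int i)
{
        unsigned int agno = efi->agno[i];

        /* The first record of an EFI is always allowed to go. */
        if (i == efi->first) {
                atomic_fetch_add(&pending[agno], 1);
                return 0;
        }
        /* 2nd+ record: go only if nothing uncommitted pends on this AG. */
        if (atomic_fetch_add(&pending[agno], 1) == 0)
                return 0;
        atomic_fetch_sub(&pending[agno], 1);
        return -EAGAIN;
}

/* Commit/roll: busy extents reach the CIL, so drop their pending counts. */
static void commit_to_cil(struct efi *efi, int upto)
{
        for (int i = efi->first; i < upto; i++)
                atomic_fetch_sub(&pending[efi->agno[i]], 1);
        efi->first = upto;              /* remaining records form a "new" EFI */
}

int main(void)
{
        struct efi efi = { .agno = { 3, 3 }, .first = 0, .nr = 2 };

        for (int i = 0; i < efi.nr; ) {
                if (gate_record(&efi, i)) {
                        printf("record %d: AG %u contended, roll and retry\n",
                               i, efi.agno[i]);
                        commit_to_cil(&efi, i);
                        continue;
                }
                printf("record %d: free extent in AG %u\n", i, efi.agno[i]);
                i++;
        }
        commit_to_cil(&efi, efi.nr);
        return 0;
}

Fed the {3/X, 3/Y} case, it frees the first record, reports the second as
contended, rolls, and then frees the second record immediately as the first
record of the "new" EFI, which matches the sequence above.
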
> > What happens if the log actually clears (3/X) from the busy list after
> > step 3 but then some other thread starts processing an EFI for (3/Z)?
> 
> There is a rule in the patch that the first record in an EFI can always go
> (bumping the per-AG counter). That's because freeing the first record never
> has the deadlock issue.
> 
> The 3/Y would go because now 3/Y is the first record of the new EFI
> containing only the 3/Y record.

Oh, ok. I missed that.

> For the other thread for 3/Z, 3/Z can go too because it's the first record in
> that EFI (bumping the per-AG counter, now the counter could be 3 at most).

<nod>

> > Does that cause this thread to roll repeatedly waiting for another
> > thread's EFI to clear?
> 
> Nope. As I said, the first record can always go. For the 2nd and subsequent
> records, on retry, each becomes the first record in the new EFI and it goes.

<nod>

> > IOWs, I'm not sure we can always make forward progress with this scheme,
> > and I'd prefer to avoid introducing more global state.
> 
> See above explanation.
> 
> >> Signed-off-by: Wengang Wang <wen.gang.wang@xxxxxxxxxx>
> >> ---
> >> fs/xfs/libxfs/xfs_ag.c    |   1 +
> >> fs/xfs/libxfs/xfs_ag.h    |   5 ++
> >> fs/xfs/xfs_extfree_item.c | 111 +++++++++++++++++++++++++++++++++++++-
> >> fs/xfs/xfs_log_cil.c      |  24 ++++++++-
> >> 4 files changed, 138 insertions(+), 3 deletions(-)
> >> 
> >> diff --git a/fs/xfs/libxfs/xfs_ag.c b/fs/xfs/libxfs/xfs_ag.c
> >> index 86696a1c6891..61ef61e05668 100644
> >> --- a/fs/xfs/libxfs/xfs_ag.c
> >> +++ b/fs/xfs/libxfs/xfs_ag.c
> >> @@ -378,6 +378,7 @@ xfs_initialize_perag(
> >> 	pag->pagb_tree = RB_ROOT;
> >> #endif /* __KERNEL__ */
> >> 
> >> +	atomic_set(&pag->pag_nr_pending_extents, 0);
> >> 	error = xfs_buf_hash_init(pag);
> >> 	if (error)
> >> 		goto out_remove_pag;
> >> diff --git a/fs/xfs/libxfs/xfs_ag.h b/fs/xfs/libxfs/xfs_ag.h
> >> index 5e18536dfdce..5950bc36a32c 100644
> >> --- a/fs/xfs/libxfs/xfs_ag.h
> >> +++ b/fs/xfs/libxfs/xfs_ag.h
> >> @@ -82,6 +82,11 @@ struct xfs_perag {
> >> 	uint16_t pag_sick;
> >> 	spinlock_t pag_state_lock;
> >> 
> >> +	/*
> >> +	 * Number of concurrent extent freeings (not committed to CIL yet)
> >> +	 * on this AG.
> >> +	 */
> >> +	atomic_t pag_nr_pending_extents;
> >> 	spinlock_t pagb_lock;	/* lock for pagb_tree */
> >> 	struct rb_root pagb_tree;	/* ordered tree of busy extents */
> >> 	unsigned int pagb_gen;	/* generation count for pagb_tree */
> >> diff --git a/fs/xfs/xfs_extfree_item.c b/fs/xfs/xfs_extfree_item.c
> >> index 011b50469301..1dbf36d9c1c9 100644
> >> --- a/fs/xfs/xfs_extfree_item.c
> >> +++ b/fs/xfs/xfs_extfree_item.c
> >> @@ -336,6 +336,75 @@ xfs_trans_get_efd(
> >> 	return efdp;
> >> }
> >> 
> >> +/*
> >> + * Fill the EFD with all extents from the EFI and set the counter.
> >> + * Note: the EFD should contain at least one extent already.
> >> + */
> >> +static void xfs_fill_efd_with_efi(struct xfs_efd_log_item *efdp)
> >> +{
> >> +	struct xfs_efi_log_item *efip = efdp->efd_efip;
> >> +	uint i;
> >> +
> >> +	i = efdp->efd_next_extent;
> >> +	ASSERT(i > 0);
> >> +	for (; i < efip->efi_format.efi_nextents; i++) {
> >> +		efdp->efd_format.efd_extents[i] =
> >> +			efip->efi_format.efi_extents[i];
> > 
> > Why is it necessary to fill the EFD mapping structures? Nobody /ever/
> > looks at those; the only part that matters to log recovery is matching
> > efd.efd_efi_id to efi.efi_id.
> 
> Yes, that's for EFI/EFD matching for log recovery.
> The records to be retried would be in a new EFI.
> 
> > But, I guess it looks funny to requeue an EFI and have the EFD for the
> > old EFI missing a bunch of fields.
> 
> The EFD for the original EFI is not missing anything; it is expected
> to be exactly the same as if all the records were processed.
> This way we move records to new EFI(s) to avoid the deadlock.
> 
> >> +	}
> >> +	efdp->efd_next_extent = i;
> >> +}
> >> +
> >> +/*
> >> + * Check if xefi is the first in the efip.
> >> + * Returns true if so, and false otherwise.
> >> + */
> >> +static bool xfs_is_first_extent_in_efi(struct xfs_efi_log_item *efip,
> >> +		struct xfs_extent_free_item *xefi)
> >> +{
> >> +	return efip->efi_format.efi_extents[0].ext_start ==
> >> +		xefi->xefi_startblock;
> >> +}
> >> +
> >> +/*
> >> + * Check if the xefi needs to be in a new transaction.
> >> + * efip is the containing EFI of xefi.
> >> + * Return true if so, false otherwise.
> >> + */
> >> +static bool xfs_extent_free_need_new_trans(struct xfs_mount *mp,
> >> +		struct xfs_efi_log_item *efip,
> >> +		struct xfs_extent_free_item *xefi)
> >> +{
> >> +	bool ret = true;
> >> +	int nr_pre;
> >> +	xfs_agnumber_t agno;
> >> +	struct xfs_perag *pag;
> >> +
> >> +	agno = XFS_FSB_TO_AGNO(mp, xefi->xefi_startblock);
> >> +	pag = xfs_perag_get(mp, agno);
> >> +	/* The first extent in the EFI is always OK to go */
> >> +	if (xfs_is_first_extent_in_efi(efip, xefi)) {
> >> +		atomic_inc(&pag->pag_nr_pending_extents);
> >> +		ret = false;
> >> +		goto out_put;
> >> +	}
> >> +
> >> +	/*
> >> +	 * Now the extent is the 2nd or subsequent in the efip. We need a
> >> +	 * new transaction if the AG already has busy extents pending.
> >> +	 */
> >> +	nr_pre = atomic_inc_return(&pag->pag_nr_pending_extents) - 1;
> >> +	/* No previous pending extent freeing to this AG */
> >> +	if (nr_pre == 0) {
> >> +		ret = false;
> >> +		goto out_put;
> >> +	}
> >> +
> >> +	atomic_dec(&pag->pag_nr_pending_extents);
> >> +out_put:
> >> +	xfs_perag_put(pag);
> >> +	return ret;
> >> +}
> >> +
> >> /*
> >>  * Free an extent and log it to the EFD.
> >>  * Note that the transaction is marked
> >>  * dirty regardless of whether the extent free succeeds or fails to support the
> >> @@ -356,6 +425,28 @@ xfs_trans_free_extent(
> >> 	xfs_agblock_t agbno = XFS_FSB_TO_AGBNO(mp,
> >> 			xefi->xefi_startblock);
> >> 	int error;
> >> +	struct xfs_efi_log_item *efip = efdp->efd_efip;
> >> +
> >> +	if (xfs_extent_free_need_new_trans(mp, efip, xefi)) {
> >> +		/*
> >> +		 * This should be the 2nd+ extent; we don't have to mark the
> >> +		 * transaction and efd dirty, those are already done with the
> >> +		 * first extent.
> >> +		 */
> >> +		ASSERT(tp->t_flags & XFS_TRANS_DIRTY);
> >> +		ASSERT(tp->t_flags & XFS_TRANS_HAS_INTENT_DONE);
> >> +		ASSERT(test_bit(XFS_LI_DIRTY, &efdp->efd_item.li_flags));
> >> +
> >> +		xfs_fill_efd_with_efi(efdp);
> >> +
> >> +		/*
> >> +		 * A previous extent in the same AG is processed with the
> >> +		 * current transaction. To avoid possible AGFL allocation
> >> +		 * deadlock, we roll the transaction and then restart with
> >> +		 * this extent in a new transaction.
> >> +		 */
> >> +		return -EAGAIN;
> >> +	}
> >> 
> >> 	oinfo.oi_owner = xefi->xefi_owner;
> >> 	if (xefi->xefi_flags & XFS_EFI_ATTR_FORK)
> >> @@ -369,6 +460,13 @@ xfs_trans_free_extent(
> >> 	error = __xfs_free_extent(tp, xefi->xefi_startblock,
> >> 			xefi->xefi_blockcount, &oinfo, XFS_AG_RESV_NONE,
> >> 			xefi->xefi_flags & XFS_EFI_SKIP_DISCARD);
> >> +	if (error) {
> >> +		struct xfs_perag *pag;
> >> +
> >> +		pag = xfs_perag_get(mp, agno);
> >> +		atomic_dec(&pag->pag_nr_pending_extents);
> >> +		xfs_perag_put(pag);
> >> +	}
> >> 	/*
> >> 	 * Mark the transaction dirty, even on error. This ensures the
> >> 	 * transaction is aborted, which:
> >> @@ -476,7 +574,8 @@ xfs_extent_free_finish_item(
> > 
> > This function comes with an unused **state variable:
> 
> I don't see why we need to take different action according to
> the 'state'.
> 
> > STATIC int
> > xfs_extent_free_finish_item(
> > 	struct xfs_trans		*tp,
> > 	struct xfs_log_item		*done,
> > 	struct list_head		*item,
> > 	struct xfs_btree_cur		**state)
> > 
> > What if, after each xfs_trans_free_extent call, we stuff *state with
> > xefi_startblock's AG?
> 
> See above.
> 
> > Then, upon entry to xfs_extent_free_finish_item, we compare *state with
> > the xefi item and return EAGAIN if we're processing an EFI from the same
> > AG as the previous call to xfs_extent_free_finish_item? Something like
> > this (totally untested):
> > 
> > STATIC int
> > xfs_extent_free_finish_item(
> > 	struct xfs_trans		*tp,
> > 	struct xfs_log_item		*done,
> > 	struct list_head		*item,
> > 	struct xfs_btree_cur		**state)
> > {
> > 	struct xfs_extent_free_item	*xefi;
> > 	int				error;
> > 
> > 	xefi = container_of(item, struct xfs_extent_free_item, xefi_list);
> > 
> > 	if (*state) {
> > 		struct xfs_perag	*oldpag = *(struct xfs_perag **)state;
> > 
> > 		/*
> > 		 * If we're freeing two extents from the same AG, we
> > 		 * must roll the transaction between the two extents to
> > 		 * avoid a livelock resulting from AGFL fixing waiting
> > 		 * on the extent that we just freed to become unbusy.
> > 		 */
> > 		if (oldpag == xefi->xefi_pag) {
> > 			xfs_fill_efd_with_efi(EFD_ITEM(done));
> > 			xfs_perag_put(oldpag);
> > 			return -EAGAIN;
> > 		}
> 
> As I said, the first record is always let go; we don't care about the
> previous AG.
> 
> > 		xfs_perag_put(oldpag);
> > 	}
> > 
> > 	error = xfs_trans_free_extent(tp, EFD_ITEM(done), xefi);
> > 	if (!error)
> > 		*state = (void *)xfs_perag_hold(xefi->xefi_pag);
> > 
> > 	xfs_extent_free_put_group(xefi);
> > 	kmem_cache_free(xfs_extfree_item_cache, xefi);
> > 	return error;
> > }
> > 
> > Obviously, you now need a ->finish_cleanup function to drop the perag
> > reference that may be stashed in @state. This (I think) avoids the
> > livelock by ensuring that xfs_trans_free_extent always starts with a
> > transaction that does not contain any busy extents from the same AG that
> > it's acting on, right?
> 
> Nope, we don't need to pass the perag in state. The rule is still the same:
> 
> The first record in an EFI (no matter whether it's the original EFI or a new
> one) just goes (because it won't hit the deadlock).
> 
> For the 2nd or subsequent records, we need to see if there are any pending
> busy extents in the same AG the record belongs to. "Pending" means
> those are still in xfs_trans->t_busy (not committed to xlog_cil_pcp->busy_extents
> yet). We know that by checking the newly introduced per-AG counter
> pag_nr_pending_extents. If there are, we retry the record in a new EFI
> in a new transaction after committing the original transaction. Within the new
> EFI, the 2nd record of the original EFI becomes the 1st record, and it can
> just go (because it won't hit the deadlock).

<nod> So in other words, the pag_nr_pending_extents counter buys us the
performance advantage that we don't need to add the extra roll between (3/X)
and (3/Y) if the log is running fast enough to clear the busy extent before we
even start evaluating the relogged EFI (with 3/Y in it).

How often does it happen that we have to roll the transaction and relog the
EFI?

> We are using a per-AG counter here to know if there are pending busy extents,
> rather than comparing with the involved previous AGs in the same EFI, also to
> avoid the potential ABBA deadlock:
> thread 1: {3/X, 4/Y}
> thread 2: {4/Z, 3/T}
> 
> > A simpler version of the above would make it so that we always roll
> > between EFI records and don't have to manage perag references:
> > 
> > #define XFS_EFI_STATE_ROLL	((struct xfs_btree_cur *)1)
> > STATIC int
> > xfs_extent_free_finish_item(
> > 	struct xfs_trans		*tp,
> > 	struct xfs_log_item		*done,
> > 	struct list_head		*item,
> > 	struct xfs_btree_cur		**state)
> > {
> > 	struct xfs_extent_free_item	*xefi;
> > 	int				error;
> > 
> > 	/*
> > 	 * If we're freeing two extents, roll the transaction between
> > 	 * the two extents to avoid a livelock resulting from AGFL
> > 	 * fixing waiting on the extent that we just freed to become
> > 	 * unbusy. It's not necessary to roll if the previous and
> > 	 * current EFI record point to different AGs, but this
> > 	 * simplifies the state tracking.
> > 	 */
> > 	if (*state == XFS_EFI_STATE_ROLL) {
> > 		xfs_fill_efd_with_efi(EFD_ITEM(done));
> > 		*state = NULL;
> > 		return -EAGAIN;
> > 	}
> > 
> > 	xefi = container_of(item, struct xfs_extent_free_item, xefi_list);
> > 
> > 	error = xfs_trans_free_extent(tp, EFD_ITEM(done), xefi);
> > 	if (!error)
> > 		*state = XFS_EFI_STATE_ROLL;
> > 
> > 	xfs_extent_free_put_group(xefi);
> > 	kmem_cache_free(xfs_extfree_item_cache, xefi);
> > 	return error;
> > }
> 
> This way, it looks to me that the original EFI is always split into several
> EFIs, each including only one record. It's the same result as changing
> XFS_EFI_MAX_FAST_EXTENTS to 1. Dave doesn't like that.

Yes.

> My implementation splits them only when necessary, as Dave suggested.
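To put a number on "only when necessary", here's a rough stand-alone
comparison of how many extra rolls a single EFI would see under each policy
(illustration only, not code from either proposal; it assumes a single thread
and counters that start at zero, so the gated policy rolls exactly when a
record repeats an AG already seen since the last roll):

#include <stdio.h>

/* The simpler alternative: roll between every pair of records. */
static int rolls_always(int nr_records)
{
        return nr_records > 1 ? nr_records - 1 : 0;
}

/*
 * Counter-gated policy: roll only when a record's AG already has an
 * uncommitted free from this EFI, i.e. a repeated AG since the last roll.
 */
static int rolls_counter_gated(const unsigned int *agno, int nr_records)
{
        int rolls = 0;

        for (int start = 0, i = 1; i < nr_records; i++) {
                for (int j = start; j < i; j++) {
                        if (agno[j] == agno[i]) {
                                rolls++;        /* everything before i commits */
                                start = i;
                                break;
                        }
                }
        }
        return rolls;
}

int main(void)
{
        const unsigned int same_ag[]   = { 3, 3, 3, 3 };
        const unsigned int spread_ag[] = { 3, 4, 5, 6 };

        printf("same AG:   always=%d gated=%d\n",
               rolls_always(4), rolls_counter_gated(same_ag, 4));
        printf("spread AG: always=%d gated=%d\n",
               rolls_always(4), rolls_counter_gated(spread_ag, 4));
        return 0;
}

With records spread across different AGs the gated policy never rolls, while
always-rolling pays nr_records - 1 rolls; when every record lives in the same
AG the two behave identically.
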
> > (Did you and Dave already talk about this?)
> 
> There was a lot of discussion in these two threads:
> [PATCH 2/2] xfs: log recovery stage split EFIs with multiple extents
> [PATCH 1/2] xfs: IO time one extent per EFI

Ok, I'll go reread all that.

--D

> thanks,
> wengang
> 