Re: [PATCH] xfs: Don't block in xfs_extent_busy_flush

Dave Chinner <david@xxxxxxxxxxxxx> · Tue, 23 May 2023 08:40:37 +1000

On Mon, May 22, 2023 at 06:20:11PM +0000, Wengang Wang wrote:
> > On May 21, 2023, at 6:17 PM, Dave Chinner <david@xxxxxxxxxxxxx> wrote:
> > On Fri, May 19, 2023 at 10:18:29AM -0700, Wengang Wang wrote:
> >> diff --git a/fs/xfs/xfs_extfree_item.c b/fs/xfs/xfs_extfree_item.c
> >> index 011b50469301..3c5a9e9952ec 100644
> >> --- a/fs/xfs/xfs_extfree_item.c
> >> +++ b/fs/xfs/xfs_extfree_item.c
> >> @@ -336,6 +336,25 @@ xfs_trans_get_efd(
> >> return efdp;
> >> }
> >> 
> >> +/*
> >> + * Fill the EFD with all extents from the EFI and set the counter.
> >> + * Note: the EFD should comtain at least one extents already.
> >> + */
> >> +static void xfs_fill_efd_with_efi(struct xfs_efd_log_item *efdp)
> >> +{
> >> + struct xfs_efi_log_item *efip = efdp->efd_efip;
> >> + uint                    i;
> >> +
> >> + if (efdp->efd_next_extent == efip->efi_format.efi_nextents)
> >> + return;
> >> +
> >> + for (i = 0; i < efip->efi_format.efi_nextents; i++) {
> >> +        efdp->efd_format.efd_extents[i] =
> >> +        efip->efi_format.efi_extents[i];
> >> + }
> >> + efdp->efd_next_extent = efip->efi_format.efi_nextents;
> >> +}
> >> +
> > 
> > Ok, but it doesn't dirty the transaction or the EFD, which means....
> 
> Actually EAGAIN shouldn’t happen with the first record in EFIs because
> the trans->t_busy is empty in AGFL block allocation for the first record.
> So the dirtying work should already done with the first one.

You're assuming that the only thing we are going to want to return
-EAGAIN for freeing attamps for is busy extents. Being able to
restart btree operations by "commit and retry" opens up a
a whole new set of performance optimisations we can make to the
btree code.

IOWs, I want this functionality to be generic in nature, not
tailored specifically to one situation where an -EAGAIN needs to be
returned to trigger a commit an retry.

> >> @@ -369,6 +388,10 @@ xfs_trans_free_extent(
> >> error = __xfs_free_extent(tp, xefi->xefi_startblock,
> >> xefi->xefi_blockcount, &oinfo, XFS_AG_RESV_NONE,
> >> xefi->xefi_flags & XFS_EFI_SKIP_DISCARD);
> >> + if (error == -EAGAIN) {
> >> + xfs_fill_efd_with_efi(efdp);
> >> + return error;
> >> + }
> > 
> > .... this is incorrectly placed.
> > 
> > The very next lines say:
> > 
> >> /*
> >>  * Mark the transaction dirty, even on error. This ensures the
> >>  * transaction is aborted, which:
> > 
> > i.e. we have to make the transaction and EFD log item dirty even if
> > we have an error. In this case, the error is not fatal, but we still
> > have to ensure that we commit the EFD when we roll the transaction.
> > Hence the transaction and EFD still need to be dirtied on -EAGAIN...
> 
> see above.

See above :)

> >> diff --git a/fs/xfs/xfs_log_recover.c b/fs/xfs/xfs_log_recover.c
> >> index 322eb2ee6c55..00bfe9683fa8 100644
> >> --- a/fs/xfs/xfs_log_recover.c
> >> +++ b/fs/xfs/xfs_log_recover.c
> >> @@ -2540,30 +2540,27 @@ xlog_recover_process_intents(
> >> struct xfs_log_item *lip;
> >> struct xfs_ail *ailp;
> >> int error = 0;
> >> -#if defined(DEBUG) || defined(XFS_WARN)
> >> - xfs_lsn_t last_lsn;
> >> -#endif
> >> + xfs_lsn_t threshold_lsn;
> >> 
> >> ailp = log->l_ailp;
> >> + threshold_lsn = xfs_ail_max_lsn(ailp);
> >> spin_lock(&ailp->ail_lock);
> >> -#if defined(DEBUG) || defined(XFS_WARN)
> >> - last_lsn = xlog_assign_lsn(log->l_curr_cycle, log->l_curr_block);
> > 
> > xfs_ail_max_lsn() and l_curr_cycle/l_curr_block are not the same
> > thing.  max_lsn points to the lsn of the last entry in the AIL (in
> > memory state), whilst curr_cycle/block points to the current
> > physical location of the log head in the on-disk journal.
> > 
> 
> Yes, I intended to use the lsn of the last entry in the AIL.

Again, they are not the same thing: using the last entry in the
AIL here is incorrect. We want to replay all the items in the AIL
that were active in the log, not up to the last item in the AIL. The
actively recovered log region ends at last_lsn as per above, whilst
xfs_ail_max_lsn() is not guaranteed to be less than last_lsn before
we start walking it.

> For the problem with xlog_recover_process_intents(), please see my reply to
> Darrick. On seeing the problem, my first try was to use “last_lsn” to stop
> the iteration but that didn’t help.  last_lsn was found quite bigger than even
> the new EFI lsn. While use xfs_ail_max_lsn() it solved the problem.

In what case are we queuing a *new* intent into the AIL that has a
LSN less than xlog_assign_lsn(log->l_curr_cycle, log->l_curr_block)?
If we are doing that *anywhere*, then we have a likely journal
corruption bug in the code because it indicates we committed that
item to the journal over something in the log we are currently
replaying.

> > In this case, we can't use in-memory state to determine where to
> > stop the initial intent replay - recovery of other items may have
> > inserted new intents beyond the end of the physical region being
> > recovered, in which case using xfs_ail_max_lsn() will result in
> > incorrect behaviour here.
> 
> Yes, this patch is one of those (if some exist) introduce new intents (EFIs here).
> We add the new intents to the transaction first (xfs_defer_create_intent()), add
> the deferred operations to ‘capture_list’. And finally the deferred options in
> ‘capture_list’ is processed after the intent-iteration on the AIL.

The changes made in that transaction, including the newly logged
EFI, get committed before the rest of the work gets deferred via
xfs_defer_ops_capture_and_commit(). That commits the new efi (along
with all the changes that have already been made in the transaction)
to the CIL, and eventually the journal checkpoints and the new EFI
gets inserted into the AIL at the LSN of the checkpoint.

The LSN of the checkpoint is curr_cycle/block - the log head -
because that's where the start record of the checkpoint is
physically written.  As each iclog is filled, the log head moves
forward - it always points at the location that the next journal
write will be written to. At the end of a checkpoint, the LSN of the
start record is used for AIL insertion.

Hence if a new log item created by recovery has a LSN less than
last_lsn, then we have a serious bug somewhere that needs to be
found and fixed. The use of last_lsn tells us something has gone
badly wrong during recovery, the use of xfs_ail_max_lsn() removes
the detection of the issue and now we don't know that something has
gone badly wrong...

> For existing other cases (if there are) where new intents are added,
> they don’t use the capture_list for delayed operations? Do you have example then? 
> if so I think we should follow their way instead of adding the defer operations
> (but reply on the intents on AIL).

All of the intent recovery stuff uses
xfs_defer_ops_capture_and_commit() to commit the intent being
replayed and cause all further new intent processing in that chain
to be defered until after all the intents recovered from the journal
have been iterated. All those new intents end up in the AIL at a LSN
index >= last_lsn.

Cheers,

Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx