Re: [PATCH] xfs: Don't block in xfs_extent_busy_flush

"Darrick J. Wong" <djwong@xxxxxxxxxx> · Tue, 23 May 2023 17:27:43 -0700

On Tue, May 23, 2023 at 02:59:40AM +0000, Wengang Wang wrote:
> 
> 
> > On May 22, 2023, at 3:40 PM, Dave Chinner <david@xxxxxxxxxxxxx> wrote:
> > 
> > On Mon, May 22, 2023 at 06:20:11PM +0000, Wengang Wang wrote:
> >>> On May 21, 2023, at 6:17 PM, Dave Chinner <david@xxxxxxxxxxxxx> wrote:
> >>> On Fri, May 19, 2023 at 10:18:29AM -0700, Wengang Wang wrote:
> >>>> diff --git a/fs/xfs/xfs_extfree_item.c b/fs/xfs/xfs_extfree_item.c
> >>>> index 011b50469301..3c5a9e9952ec 100644
> >>>> --- a/fs/xfs/xfs_extfree_item.c
> >>>> +++ b/fs/xfs/xfs_extfree_item.c
> >>>> @@ -336,6 +336,25 @@ xfs_trans_get_efd(
> >>>> return efdp;
> >>>> }
> >>>> 
> >>>> +/*
> >>>> + * Fill the EFD with all extents from the EFI and set the counter.
> >>>> + * Note: the EFD should comtain at least one extents already.
> >>>> + */
> >>>> +static void xfs_fill_efd_with_efi(struct xfs_efd_log_item *efdp)
> >>>> +{
> >>>> + struct xfs_efi_log_item *efip = efdp->efd_efip;
> >>>> + uint                    i;
> >>>> +
> >>>> + if (efdp->efd_next_extent == efip->efi_format.efi_nextents)
> >>>> + return;
> >>>> +
> >>>> + for (i = 0; i < efip->efi_format.efi_nextents; i++) {
> >>>> +        efdp->efd_format.efd_extents[i] =
> >>>> +        efip->efi_format.efi_extents[i];
> >>>> + }
> >>>> + efdp->efd_next_extent = efip->efi_format.efi_nextents;
> >>>> +}
> >>>> +
> >>> 
> >>> Ok, but it doesn't dirty the transaction or the EFD, which means....
> >> 
> >> Actually EAGAIN shouldn’t happen with the first record in EFIs because
> >> the trans->t_busy is empty in AGFL block allocation for the first record.
> >> So the dirtying work should already done with the first one.
> > 
> > You're assuming that the only thing we are going to want to return
> > -EAGAIN for freeing attamps for is busy extents. Being able to
> > restart btree operations by "commit and retry" opens up a
> > a whole new set of performance optimisations we can make to the
> > btree code.
> > 
> > IOWs, I want this functionality to be generic in nature, not
> > tailored specifically to one situation where an -EAGAIN needs to be
> > returned to trigger a commit an retry.
> 
> Yes, I assumed that because I didn’t see relevant EAGAIN handlers
> in existing code. It’s reasonable to make it generic for existing or planed
> EAGAINs.
> 
> > 
> >>>> @@ -369,6 +388,10 @@ xfs_trans_free_extent(
> >>>> error = __xfs_free_extent(tp, xefi->xefi_startblock,
> >>>> xefi->xefi_blockcount, &oinfo, XFS_AG_RESV_NONE,
> >>>> xefi->xefi_flags & XFS_EFI_SKIP_DISCARD);
> >>>> + if (error == -EAGAIN) {
> >>>> + xfs_fill_efd_with_efi(efdp);
> >>>> + return error;
> >>>> + }
> >>> 
> >>> .... this is incorrectly placed.
> >>> 
> >>> The very next lines say:
> >>> 
> >>>> /*
> >>>> * Mark the transaction dirty, even on error. This ensures the
> >>>> * transaction is aborted, which:
> >>> 
> >>> i.e. we have to make the transaction and EFD log item dirty even if
> >>> we have an error. In this case, the error is not fatal, but we still
> >>> have to ensure that we commit the EFD when we roll the transaction.
> >>> Hence the transaction and EFD still need to be dirtied on -EAGAIN...
> >> 
> >> see above.
> > 
> > See above :)
> > 
> >>>> diff --git a/fs/xfs/xfs_log_recover.c b/fs/xfs/xfs_log_recover.c
> >>>> index 322eb2ee6c55..00bfe9683fa8 100644
> >>>> --- a/fs/xfs/xfs_log_recover.c
> >>>> +++ b/fs/xfs/xfs_log_recover.c
> >>>> @@ -2540,30 +2540,27 @@ xlog_recover_process_intents(
> >>>> struct xfs_log_item *lip;
> >>>> struct xfs_ail *ailp;
> >>>> int error = 0;
> >>>> -#if defined(DEBUG) || defined(XFS_WARN)
> >>>> - xfs_lsn_t last_lsn;
> >>>> -#endif
> >>>> + xfs_lsn_t threshold_lsn;
> >>>> 
> >>>> ailp = log->l_ailp;
> >>>> + threshold_lsn = xfs_ail_max_lsn(ailp);
> >>>> spin_lock(&ailp->ail_lock);
> >>>> -#if defined(DEBUG) || defined(XFS_WARN)
> >>>> - last_lsn = xlog_assign_lsn(log->l_curr_cycle, log->l_curr_block);
> >>> 
> >>> xfs_ail_max_lsn() and l_curr_cycle/l_curr_block are not the same
> >>> thing.  max_lsn points to the lsn of the last entry in the AIL (in
> >>> memory state), whilst curr_cycle/block points to the current
> >>> physical location of the log head in the on-disk journal.
> >>> 
> >> 
> >> Yes, I intended to use the lsn of the last entry in the AIL.
> > 
> > Again, they are not the same thing: using the last entry in the
> > AIL here is incorrect. We want to replay all the items in the AIL
> > that were active in the log, not up to the last item in the AIL. The
> > actively recovered log region ends at last_lsn as per above, whilst
> > xfs_ail_max_lsn() is not guaranteed to be less than last_lsn before
> > we start walking it.
> 
> OK, got it.
> 
> > 
> >> For the problem with xlog_recover_process_intents(), please see my reply to
> >> Darrick. On seeing the problem, my first try was to use “last_lsn” to stop
> >> the iteration but that didn’t help.  last_lsn was found quite bigger than even
> >> the new EFI lsn. While use xfs_ail_max_lsn() it solved the problem.
> > 
> > In what case are we queuing a *new* intent into the AIL that has a
> > LSN less than xlog_assign_lsn(log->l_curr_cycle, log->l_curr_block)?
> > If we are doing that *anywhere*, then we have a likely journal
> > corruption bug in the code because it indicates we committed that
> > item to the journal over something in the log we are currently
> > replaying.
> 
> I made a mistake to say last_lsn is quite bigger than the new EFI lsn,
> it’s actually much bigger than the max_lsn (because cycle increased).
> The lsn of the new EFI is exactly same as last_lsn.
> 
> > 
> >>> In this case, we can't use in-memory state to determine where to
> >>> stop the initial intent replay - recovery of other items may have
> >>> inserted new intents beyond the end of the physical region being
> >>> recovered, in which case using xfs_ail_max_lsn() will result in
> >>> incorrect behaviour here.
> >> 
> >> Yes, this patch is one of those (if some exist) introduce new intents (EFIs here).
> >> We add the new intents to the transaction first (xfs_defer_create_intent()), add
> >> the deferred operations to ‘capture_list’. And finally the deferred options in
> >> ‘capture_list’ is processed after the intent-iteration on the AIL.
> > 
> > The changes made in that transaction, including the newly logged
> > EFI, get committed before the rest of the work gets deferred via
> > xfs_defer_ops_capture_and_commit(). That commits the new efi (along
> > with all the changes that have already been made in the transaction)
> > to the CIL, and eventually the journal checkpoints and the new EFI
> > gets inserted into the AIL at the LSN of the checkpoint.
> > 
> > The LSN of the checkpoint is curr_cycle/block - the log head -
> > because that's where the start record of the checkpoint is
> > physically written.  As each iclog is filled, the log head moves
> > forward - it always points at the location that the next journal
> > write will be written to. At the end of a checkpoint, the LSN of the
> > start record is used for AIL insertion.
> > 
> 
> Thanks for explanation!
> 
> > Hence if a new log item created by recovery has a LSN less than
> > last_lsn, then we have a serious bug somewhere that needs to be
> > found and fixed. The use of last_lsn tells us something has gone
> > badly wrong during recovery, the use of xfs_ail_max_lsn() removes
> > the detection of the issue and now we don't know that something has
> > gone badly wrong...
> > 
> 
> I made a mistake.. the (first) new EFI lsn is the same as last_lsn, sorry
> for confusing.
> 
> >> For existing other cases (if there are) where new intents are added,
> >> they don’t use the capture_list for delayed operations? Do you have example then? 
> >> if so I think we should follow their way instead of adding the defer operations
> >> (but reply on the intents on AIL).
> > 
> > All of the intent recovery stuff uses
> > xfs_defer_ops_capture_and_commit() to commit the intent being
> > replayed and cause all further new intent processing in that chain
> > to be defered until after all the intents recovered from the journal
> > have been iterated. All those new intents end up in the AIL at a LSN
> > index >= last_lsn.
> 
> Yes. So we break the AIL iteration on seeing an intent with lsn equal to
> or bigger than last_lsn and skip the iop_recover() for that item?
> and shall we put this change to another separated patch as it is to fix
> an existing problem (not introduced by my patch)?

Intent replay creates non-intent log items (like buffers or inodes) that
are added to the AIL with an LSN higher than last_lsn.  I suppose it
would be possible to break log recovery if an intent's iop_recover
method immediately logged a new intent and returned EAGAIN to roll the
transaction, but none of them do that; and I think the ASSERT you moved
would detect such a thing.

--D

> thanks,
> wengang