Re: [PATCH] xfs: Don't block in xfs_extent_busy_flush

Wengang Wang <wen.gang.wang@xxxxxxxxxx> · Wed, 24 May 2023 16:27:19 +0000

> On May 23, 2023, at 5:27 PM, Darrick J. Wong <djwong@xxxxxxxxxx> wrote:
> 
> On Tue, May 23, 2023 at 02:59:40AM +0000, Wengang Wang wrote:
>> 
>> 
>>> On May 22, 2023, at 3:40 PM, Dave Chinner <david@xxxxxxxxxxxxx> wrote:
>>> 
>>> On Mon, May 22, 2023 at 06:20:11PM +0000, Wengang Wang wrote:
>>>>> On May 21, 2023, at 6:17 PM, Dave Chinner <david@xxxxxxxxxxxxx> wrote:
>>>>> On Fri, May 19, 2023 at 10:18:29AM -0700, Wengang Wang wrote:
>>>>>> diff --git a/fs/xfs/xfs_extfree_item.c b/fs/xfs/xfs_extfree_item.c
>>>>>> index 011b50469301..3c5a9e9952ec 100644
>>>>>> --- a/fs/xfs/xfs_extfree_item.c
>>>>>> +++ b/fs/xfs/xfs_extfree_item.c
>>>>>> @@ -336,6 +336,25 @@ xfs_trans_get_efd(
>>>>>> return efdp;
>>>>>> }
>>>>>> 
>>>>>> +/*
>>>>>> + * Fill the EFD with all extents from the EFI and set the counter.
>>>>>> + * Note: the EFD should comtain at least one extents already.
>>>>>> + */
>>>>>> +static void xfs_fill_efd_with_efi(struct xfs_efd_log_item *efdp)
>>>>>> +{
>>>>>> + struct xfs_efi_log_item *efip = efdp->efd_efip;
>>>>>> + uint                    i;
>>>>>> +
>>>>>> + if (efdp->efd_next_extent == efip->efi_format.efi_nextents)
>>>>>> + return;
>>>>>> +
>>>>>> + for (i = 0; i < efip->efi_format.efi_nextents; i++) {
>>>>>> +        efdp->efd_format.efd_extents[i] =
>>>>>> +        efip->efi_format.efi_extents[i];
>>>>>> + }
>>>>>> + efdp->efd_next_extent = efip->efi_format.efi_nextents;
>>>>>> +}
>>>>>> +
>>>>> 
>>>>> Ok, but it doesn't dirty the transaction or the EFD, which means....
>>>> 
>>>> Actually EAGAIN shouldn’t happen with the first record in EFIs because
>>>> the trans->t_busy is empty in AGFL block allocation for the first record.
>>>> So the dirtying work should already done with the first one.
>>> 
>>> You're assuming that the only thing we are going to want to return
>>> -EAGAIN for freeing attamps for is busy extents. Being able to
>>> restart btree operations by "commit and retry" opens up a
>>> a whole new set of performance optimisations we can make to the
>>> btree code.
>>> 
>>> IOWs, I want this functionality to be generic in nature, not
>>> tailored specifically to one situation where an -EAGAIN needs to be
>>> returned to trigger a commit an retry.
>> 
>> Yes, I assumed that because I didn’t see relevant EAGAIN handlers
>> in existing code. It’s reasonable to make it generic for existing or planed
>> EAGAINs.
>> 
>>> 
>>>>>> @@ -369,6 +388,10 @@ xfs_trans_free_extent(
>>>>>> error = __xfs_free_extent(tp, xefi->xefi_startblock,
>>>>>> xefi->xefi_blockcount, &oinfo, XFS_AG_RESV_NONE,
>>>>>> xefi->xefi_flags & XFS_EFI_SKIP_DISCARD);
>>>>>> + if (error == -EAGAIN) {
>>>>>> + xfs_fill_efd_with_efi(efdp);
>>>>>> + return error;
>>>>>> + }
>>>>> 
>>>>> .... this is incorrectly placed.
>>>>> 
>>>>> The very next lines say:
>>>>> 
>>>>>> /*
>>>>>> * Mark the transaction dirty, even on error. This ensures the
>>>>>> * transaction is aborted, which:
>>>>> 
>>>>> i.e. we have to make the transaction and EFD log item dirty even if
>>>>> we have an error. In this case, the error is not fatal, but we still
>>>>> have to ensure that we commit the EFD when we roll the transaction.
>>>>> Hence the transaction and EFD still need to be dirtied on -EAGAIN...
>>>> 
>>>> see above.
>>> 
>>> See above :)
>>> 
>>>>>> diff --git a/fs/xfs/xfs_log_recover.c b/fs/xfs/xfs_log_recover.c
>>>>>> index 322eb2ee6c55..00bfe9683fa8 100644
>>>>>> --- a/fs/xfs/xfs_log_recover.c
>>>>>> +++ b/fs/xfs/xfs_log_recover.c
>>>>>> @@ -2540,30 +2540,27 @@ xlog_recover_process_intents(
>>>>>> struct xfs_log_item *lip;
>>>>>> struct xfs_ail *ailp;
>>>>>> int error = 0;
>>>>>> -#if defined(DEBUG) || defined(XFS_WARN)
>>>>>> - xfs_lsn_t last_lsn;
>>>>>> -#endif
>>>>>> + xfs_lsn_t threshold_lsn;
>>>>>> 
>>>>>> ailp = log->l_ailp;
>>>>>> + threshold_lsn = xfs_ail_max_lsn(ailp);
>>>>>> spin_lock(&ailp->ail_lock);
>>>>>> -#if defined(DEBUG) || defined(XFS_WARN)
>>>>>> - last_lsn = xlog_assign_lsn(log->l_curr_cycle, log->l_curr_block);
>>>>> 
>>>>> xfs_ail_max_lsn() and l_curr_cycle/l_curr_block are not the same
>>>>> thing.  max_lsn points to the lsn of the last entry in the AIL (in
>>>>> memory state), whilst curr_cycle/block points to the current
>>>>> physical location of the log head in the on-disk journal.
>>>>> 
>>>> 
>>>> Yes, I intended to use the lsn of the last entry in the AIL.
>>> 
>>> Again, they are not the same thing: using the last entry in the
>>> AIL here is incorrect. We want to replay all the items in the AIL
>>> that were active in the log, not up to the last item in the AIL. The
>>> actively recovered log region ends at last_lsn as per above, whilst
>>> xfs_ail_max_lsn() is not guaranteed to be less than last_lsn before
>>> we start walking it.
>> 
>> OK, got it.
>> 
>>> 
>>>> For the problem with xlog_recover_process_intents(), please see my reply to
>>>> Darrick. On seeing the problem, my first try was to use “last_lsn” to stop
>>>> the iteration but that didn’t help.  last_lsn was found quite bigger than even
>>>> the new EFI lsn. While use xfs_ail_max_lsn() it solved the problem.
>>> 
>>> In what case are we queuing a *new* intent into the AIL that has a
>>> LSN less than xlog_assign_lsn(log->l_curr_cycle, log->l_curr_block)?
>>> If we are doing that *anywhere*, then we have a likely journal
>>> corruption bug in the code because it indicates we committed that
>>> item to the journal over something in the log we are currently
>>> replaying.
>> 
>> I made a mistake to say last_lsn is quite bigger than the new EFI lsn,
>> it’s actually much bigger than the max_lsn (because cycle increased).
>> The lsn of the new EFI is exactly same as last_lsn.
>> 
>>> 
>>>>> In this case, we can't use in-memory state to determine where to
>>>>> stop the initial intent replay - recovery of other items may have
>>>>> inserted new intents beyond the end of the physical region being
>>>>> recovered, in which case using xfs_ail_max_lsn() will result in
>>>>> incorrect behaviour here.
>>>> 
>>>> Yes, this patch is one of those (if some exist) introduce new intents (EFIs here).
>>>> We add the new intents to the transaction first (xfs_defer_create_intent()), add
>>>> the deferred operations to ‘capture_list’. And finally the deferred options in
>>>> ‘capture_list’ is processed after the intent-iteration on the AIL.
>>> 
>>> The changes made in that transaction, including the newly logged
>>> EFI, get committed before the rest of the work gets deferred via
>>> xfs_defer_ops_capture_and_commit(). That commits the new efi (along
>>> with all the changes that have already been made in the transaction)
>>> to the CIL, and eventually the journal checkpoints and the new EFI
>>> gets inserted into the AIL at the LSN of the checkpoint.
>>> 
>>> The LSN of the checkpoint is curr_cycle/block - the log head -
>>> because that's where the start record of the checkpoint is
>>> physically written.  As each iclog is filled, the log head moves
>>> forward - it always points at the location that the next journal
>>> write will be written to. At the end of a checkpoint, the LSN of the
>>> start record is used for AIL insertion.
>>> 
>> 
>> Thanks for explanation!
>> 
>>> Hence if a new log item created by recovery has a LSN less than
>>> last_lsn, then we have a serious bug somewhere that needs to be
>>> found and fixed. The use of last_lsn tells us something has gone
>>> badly wrong during recovery, the use of xfs_ail_max_lsn() removes
>>> the detection of the issue and now we don't know that something has
>>> gone badly wrong...
>>> 
>> 
>> I made a mistake.. the (first) new EFI lsn is the same as last_lsn, sorry
>> for confusing.
>> 
>>>> For existing other cases (if there are) where new intents are added,
>>>> they don’t use the capture_list for delayed operations? Do you have example then? 
>>>> if so I think we should follow their way instead of adding the defer operations
>>>> (but reply on the intents on AIL).
>>> 
>>> All of the intent recovery stuff uses
>>> xfs_defer_ops_capture_and_commit() to commit the intent being
>>> replayed and cause all further new intent processing in that chain
>>> to be defered until after all the intents recovered from the journal
>>> have been iterated. All those new intents end up in the AIL at a LSN
>>> index >= last_lsn.
>> 
>> Yes. So we break the AIL iteration on seeing an intent with lsn equal to
>> or bigger than last_lsn and skip the iop_recover() for that item?
>> and shall we put this change to another separated patch as it is to fix
>> an existing problem (not introduced by my patch)?
> 
> Intent replay creates non-intent log items (like buffers or inodes) that
> are added to the AIL with an LSN higher than last_lsn.  I suppose it
> would be possible to break log recovery if an intent's iop_recover
> method immediately logged a new intent and returned EAGAIN to roll the
> transaction, but none of them do that;

I am not quite sure for above. There are cases that new intents are added
in iop_recover(), for example xfs_attri_item_recover():

632         error = xfs_xattri_finish_update(attr, done_item);
633         if (error == -EAGAIN) {
634                 /*
635                  * There's more work to do, so add the intent item to this
636                  * transaction so that we can continue it later.
637                  */
638                 xfs_defer_add(tp, XFS_DEFER_OPS_TYPE_ATTR, &attr->xattri_list);
639                 error = xfs_defer_ops_capture_and_commit(tp, capture_list);
640                 if (error)
641                         goto out_unlock;
642
643                 xfs_iunlock(ip, XFS_ILOCK_EXCL);
644                 xfs_irele(ip);
645                 return 0;
646         }

I am thinking line 638 and 639 are doing so.

> and I think the ASSERT you moved
> would detect such a thing.

ASSERT is nothing in production kernel, so it has less chance to detect things.
And because this is log recovery, the iop_recover() quite depends on the active log
and FS usage to log new intents. I am suspecting that’s why the problem
in xlog_recover_process_intents() wasn’t detected previously.
But looking at the existing source code, I believe that
xlog_recover_process_intents() would do double-process with new intent(s) which
are added by xfs_attri_item_recover() when it ran into the -EAGAIN case with the
original active log ATTRI.  The first process is within the AIL intents iteration, the other
is with the capture_list.

thanks,
wengang