> On May 23, 2023, at 5:27 PM, Darrick J. Wong <djwong@xxxxxxxxxx> wrote: > > On Tue, May 23, 2023 at 02:59:40AM +0000, Wengang Wang wrote: >> >> >>> On May 22, 2023, at 3:40 PM, Dave Chinner <david@xxxxxxxxxxxxx> wrote: >>> >>> On Mon, May 22, 2023 at 06:20:11PM +0000, Wengang Wang wrote: >>>>> On May 21, 2023, at 6:17 PM, Dave Chinner <david@xxxxxxxxxxxxx> wrote: >>>>> On Fri, May 19, 2023 at 10:18:29AM -0700, Wengang Wang wrote: >>>>>> diff --git a/fs/xfs/xfs_extfree_item.c b/fs/xfs/xfs_extfree_item.c >>>>>> index 011b50469301..3c5a9e9952ec 100644 >>>>>> --- a/fs/xfs/xfs_extfree_item.c >>>>>> +++ b/fs/xfs/xfs_extfree_item.c >>>>>> @@ -336,6 +336,25 @@ xfs_trans_get_efd( >>>>>> return efdp; >>>>>> } >>>>>> >>>>>> +/* >>>>>> + * Fill the EFD with all extents from the EFI and set the counter. >>>>>> + * Note: the EFD should comtain at least one extents already. >>>>>> + */ >>>>>> +static void xfs_fill_efd_with_efi(struct xfs_efd_log_item *efdp) >>>>>> +{ >>>>>> + struct xfs_efi_log_item *efip = efdp->efd_efip; >>>>>> + uint i; >>>>>> + >>>>>> + if (efdp->efd_next_extent == efip->efi_format.efi_nextents) >>>>>> + return; >>>>>> + >>>>>> + for (i = 0; i < efip->efi_format.efi_nextents; i++) { >>>>>> + efdp->efd_format.efd_extents[i] = >>>>>> + efip->efi_format.efi_extents[i]; >>>>>> + } >>>>>> + efdp->efd_next_extent = efip->efi_format.efi_nextents; >>>>>> +} >>>>>> + >>>>> >>>>> Ok, but it doesn't dirty the transaction or the EFD, which means.... >>>> >>>> Actually EAGAIN shouldn’t happen with the first record in EFIs because >>>> the trans->t_busy is empty in AGFL block allocation for the first record. >>>> So the dirtying work should already done with the first one. >>> >>> You're assuming that the only thing we are going to want to return >>> -EAGAIN for freeing attamps for is busy extents. Being able to >>> restart btree operations by "commit and retry" opens up a >>> a whole new set of performance optimisations we can make to the >>> btree code. >>> >>> IOWs, I want this functionality to be generic in nature, not >>> tailored specifically to one situation where an -EAGAIN needs to be >>> returned to trigger a commit an retry. >> >> Yes, I assumed that because I didn’t see relevant EAGAIN handlers >> in existing code. It’s reasonable to make it generic for existing or planed >> EAGAINs. >> >>> >>>>>> @@ -369,6 +388,10 @@ xfs_trans_free_extent( >>>>>> error = __xfs_free_extent(tp, xefi->xefi_startblock, >>>>>> xefi->xefi_blockcount, &oinfo, XFS_AG_RESV_NONE, >>>>>> xefi->xefi_flags & XFS_EFI_SKIP_DISCARD); >>>>>> + if (error == -EAGAIN) { >>>>>> + xfs_fill_efd_with_efi(efdp); >>>>>> + return error; >>>>>> + } >>>>> >>>>> .... this is incorrectly placed. >>>>> >>>>> The very next lines say: >>>>> >>>>>> /* >>>>>> * Mark the transaction dirty, even on error. This ensures the >>>>>> * transaction is aborted, which: >>>>> >>>>> i.e. we have to make the transaction and EFD log item dirty even if >>>>> we have an error. In this case, the error is not fatal, but we still >>>>> have to ensure that we commit the EFD when we roll the transaction. >>>>> Hence the transaction and EFD still need to be dirtied on -EAGAIN... >>>> >>>> see above. >>> >>> See above :) >>> >>>>>> diff --git a/fs/xfs/xfs_log_recover.c b/fs/xfs/xfs_log_recover.c >>>>>> index 322eb2ee6c55..00bfe9683fa8 100644 >>>>>> --- a/fs/xfs/xfs_log_recover.c >>>>>> +++ b/fs/xfs/xfs_log_recover.c >>>>>> @@ -2540,30 +2540,27 @@ xlog_recover_process_intents( >>>>>> struct xfs_log_item *lip; >>>>>> struct xfs_ail *ailp; >>>>>> int error = 0; >>>>>> -#if defined(DEBUG) || defined(XFS_WARN) >>>>>> - xfs_lsn_t last_lsn; >>>>>> -#endif >>>>>> + xfs_lsn_t threshold_lsn; >>>>>> >>>>>> ailp = log->l_ailp; >>>>>> + threshold_lsn = xfs_ail_max_lsn(ailp); >>>>>> spin_lock(&ailp->ail_lock); >>>>>> -#if defined(DEBUG) || defined(XFS_WARN) >>>>>> - last_lsn = xlog_assign_lsn(log->l_curr_cycle, log->l_curr_block); >>>>> >>>>> xfs_ail_max_lsn() and l_curr_cycle/l_curr_block are not the same >>>>> thing. max_lsn points to the lsn of the last entry in the AIL (in >>>>> memory state), whilst curr_cycle/block points to the current >>>>> physical location of the log head in the on-disk journal. >>>>> >>>> >>>> Yes, I intended to use the lsn of the last entry in the AIL. >>> >>> Again, they are not the same thing: using the last entry in the >>> AIL here is incorrect. We want to replay all the items in the AIL >>> that were active in the log, not up to the last item in the AIL. The >>> actively recovered log region ends at last_lsn as per above, whilst >>> xfs_ail_max_lsn() is not guaranteed to be less than last_lsn before >>> we start walking it. >> >> OK, got it. >> >>> >>>> For the problem with xlog_recover_process_intents(), please see my reply to >>>> Darrick. On seeing the problem, my first try was to use “last_lsn” to stop >>>> the iteration but that didn’t help. last_lsn was found quite bigger than even >>>> the new EFI lsn. While use xfs_ail_max_lsn() it solved the problem. >>> >>> In what case are we queuing a *new* intent into the AIL that has a >>> LSN less than xlog_assign_lsn(log->l_curr_cycle, log->l_curr_block)? >>> If we are doing that *anywhere*, then we have a likely journal >>> corruption bug in the code because it indicates we committed that >>> item to the journal over something in the log we are currently >>> replaying. >> >> I made a mistake to say last_lsn is quite bigger than the new EFI lsn, >> it’s actually much bigger than the max_lsn (because cycle increased). >> The lsn of the new EFI is exactly same as last_lsn. >> >>> >>>>> In this case, we can't use in-memory state to determine where to >>>>> stop the initial intent replay - recovery of other items may have >>>>> inserted new intents beyond the end of the physical region being >>>>> recovered, in which case using xfs_ail_max_lsn() will result in >>>>> incorrect behaviour here. >>>> >>>> Yes, this patch is one of those (if some exist) introduce new intents (EFIs here). >>>> We add the new intents to the transaction first (xfs_defer_create_intent()), add >>>> the deferred operations to ‘capture_list’. And finally the deferred options in >>>> ‘capture_list’ is processed after the intent-iteration on the AIL. >>> >>> The changes made in that transaction, including the newly logged >>> EFI, get committed before the rest of the work gets deferred via >>> xfs_defer_ops_capture_and_commit(). That commits the new efi (along >>> with all the changes that have already been made in the transaction) >>> to the CIL, and eventually the journal checkpoints and the new EFI >>> gets inserted into the AIL at the LSN of the checkpoint. >>> >>> The LSN of the checkpoint is curr_cycle/block - the log head - >>> because that's where the start record of the checkpoint is >>> physically written. As each iclog is filled, the log head moves >>> forward - it always points at the location that the next journal >>> write will be written to. At the end of a checkpoint, the LSN of the >>> start record is used for AIL insertion. >>> >> >> Thanks for explanation! >> >>> Hence if a new log item created by recovery has a LSN less than >>> last_lsn, then we have a serious bug somewhere that needs to be >>> found and fixed. The use of last_lsn tells us something has gone >>> badly wrong during recovery, the use of xfs_ail_max_lsn() removes >>> the detection of the issue and now we don't know that something has >>> gone badly wrong... >>> >> >> I made a mistake.. the (first) new EFI lsn is the same as last_lsn, sorry >> for confusing. >> >>>> For existing other cases (if there are) where new intents are added, >>>> they don’t use the capture_list for delayed operations? Do you have example then? >>>> if so I think we should follow their way instead of adding the defer operations >>>> (but reply on the intents on AIL). >>> >>> All of the intent recovery stuff uses >>> xfs_defer_ops_capture_and_commit() to commit the intent being >>> replayed and cause all further new intent processing in that chain >>> to be defered until after all the intents recovered from the journal >>> have been iterated. All those new intents end up in the AIL at a LSN >>> index >= last_lsn. >> >> Yes. So we break the AIL iteration on seeing an intent with lsn equal to >> or bigger than last_lsn and skip the iop_recover() for that item? >> and shall we put this change to another separated patch as it is to fix >> an existing problem (not introduced by my patch)? > > Intent replay creates non-intent log items (like buffers or inodes) that > are added to the AIL with an LSN higher than last_lsn. I suppose it > would be possible to break log recovery if an intent's iop_recover > method immediately logged a new intent and returned EAGAIN to roll the > transaction, but none of them do that; I am not quite sure for above. There are cases that new intents are added in iop_recover(), for example xfs_attri_item_recover(): 632 error = xfs_xattri_finish_update(attr, done_item); 633 if (error == -EAGAIN) { 634 /* 635 * There's more work to do, so add the intent item to this 636 * transaction so that we can continue it later. 637 */ 638 xfs_defer_add(tp, XFS_DEFER_OPS_TYPE_ATTR, &attr->xattri_list); 639 error = xfs_defer_ops_capture_and_commit(tp, capture_list); 640 if (error) 641 goto out_unlock; 642 643 xfs_iunlock(ip, XFS_ILOCK_EXCL); 644 xfs_irele(ip); 645 return 0; 646 } I am thinking line 638 and 639 are doing so. > and I think the ASSERT you moved > would detect such a thing. ASSERT is nothing in production kernel, so it has less chance to detect things. And because this is log recovery, the iop_recover() quite depends on the active log and FS usage to log new intents. I am suspecting that’s why the problem in xlog_recover_process_intents() wasn’t detected previously. But looking at the existing source code, I believe that xlog_recover_process_intents() would do double-process with new intent(s) which are added by xfs_attri_item_recover() when it ran into the -EAGAIN case with the original active log ATTRI. The first process is within the AIL intents iteration, the other is with the capture_list. thanks, wengang