Hi Dave and Darrick, What do you think? thanks, wengang > On May 26, 2023, at 2:31 PM, Wengang Wang <wen.gang.wang@xxxxxxxxxx> wrote: > > > >> On May 26, 2023, at 11:59 AM, Wengang Wang <wen.gang.wang@xxxxxxxxxx> wrote: >> >> >> >>> On May 25, 2023, at 5:26 PM, Dave Chinner <david@xxxxxxxxxxxxx> wrote: >>> >>> On Thu, May 25, 2023 at 05:13:41PM +0000, Wengang Wang wrote: >>>> >>>> >>>>> On May 24, 2023, at 7:20 PM, Dave Chinner <david@xxxxxxxxxxxxx> wrote: >>>>> >>>>> On Wed, May 24, 2023 at 04:27:19PM +0000, Wengang Wang wrote: >>>>>>> On May 23, 2023, at 5:27 PM, Darrick J. Wong <djwong@xxxxxxxxxx> wrote: >>>>>>> On Tue, May 23, 2023 at 02:59:40AM +0000, Wengang Wang wrote: >>>>>>>>> On May 22, 2023, at 3:40 PM, Dave Chinner <david@xxxxxxxxxxxxx> wrote: >>>>>>>>> On Mon, May 22, 2023 at 06:20:11PM +0000, Wengang Wang wrote: >>>>>>>>>>> On May 21, 2023, at 6:17 PM, Dave Chinner <david@xxxxxxxxxxxxx> wrote: >>>>>>>>>>> On Fri, May 19, 2023 at 10:18:29AM -0700, Wengang Wang wrote: >>>>>>>>>>>> diff --git a/fs/xfs/xfs_extfree_item.c b/fs/xfs/xfs_extfree_item.c >>>>>>>>>> For existing other cases (if there are) where new intents are added, >>>>>>>>>> they don’t use the capture_list for delayed operations? Do you have example then? >>>>>>>>>> if so I think we should follow their way instead of adding the defer operations >>>>>>>>>> (but reply on the intents on AIL). >>>>>>>>> >>>>>>>>> All of the intent recovery stuff uses >>>>>>>>> xfs_defer_ops_capture_and_commit() to commit the intent being >>>>>>>>> replayed and cause all further new intent processing in that chain >>>>>>>>> to be defered until after all the intents recovered from the journal >>>>>>>>> have been iterated. All those new intents end up in the AIL at a LSN >>>>>>>>> index >= last_lsn. >>>>>>>> >>>>>>>> Yes. So we break the AIL iteration on seeing an intent with lsn equal to >>>>>>>> or bigger than last_lsn and skip the iop_recover() for that item? >>>>>>>> and shall we put this change to another separated patch as it is to fix >>>>>>>> an existing problem (not introduced by my patch)? >>>>>>> >>>>>>> Intent replay creates non-intent log items (like buffers or inodes) that >>>>>>> are added to the AIL with an LSN higher than last_lsn. I suppose it >>>>>>> would be possible to break log recovery if an intent's iop_recover >>>>>>> method immediately logged a new intent and returned EAGAIN to roll the >>>>>>> transaction, but none of them do that; >>>>>> >>>>>> I am not quite sure for above. There are cases that new intents are added >>>>>> in iop_recover(), for example xfs_attri_item_recover(): >>>>>> >>>>>> 632 error = xfs_xattri_finish_update(attr, done_item); >>>>>> 633 if (error == -EAGAIN) { >>>>>> 634 /* >>>>>> 635 * There's more work to do, so add the intent item to this >>>>>> 636 * transaction so that we can continue it later. >>>>>> 637 */ >>>>>> 638 xfs_defer_add(tp, XFS_DEFER_OPS_TYPE_ATTR, &attr->xattri_list); >>>>>> 639 error = xfs_defer_ops_capture_and_commit(tp, capture_list); >>>>>> 640 if (error) >>>>>> 641 goto out_unlock; >>>>>> 642 >>>>>> 643 xfs_iunlock(ip, XFS_ILOCK_EXCL); >>>>>> 644 xfs_irele(ip); >>>>>> 645 return 0; >>>>>> 646 } >>>>>> >>>>>> I am thinking line 638 and 639 are doing so. >>>>> >>>>> I don't think so. @attrip is the attribute information recovered >>>>> from the original intent - we allocated @attr in >>>>> xfs_attri_item_recover() to enable the operation indicated by >>>>> @attrip to be executed. We copy stuff from @attrip (the item we are >>>>> recovering) to the new @attr structure (the work we need to do) and >>>>> run it through the attr modification state machine. If there's more >>>>> work to be done, we then add @attr to the defer list. >>>> >>>> Assuming “if there’s more work to be done” is the -EAGAIN case... >>>> >>>>> >>>>> If there is deferred work to be continued, we still need a new >>>>> intent to indicate what attr operation needs to be performed next in >>>>> the chain. When we commit the transaction (639), it will create a >>>>> new intent log item for the deferred attribute operation in the >>>>> ->create_intent callback before the transaction is committed. >>>> >>>> Yes, that’s true. But it has no conflict with what I was saying. The thing I wanted >>>> to say is that: >>>> The new intent log item (introduced by xfs_attri_item_recover()) could appear on >>>> AIL before the AIL iteration ends in xlog_recover_process_intents(). >>>> Do you agree with above? >>> >>> Recovery of intents has *always* been able to do this. What I don't >>> understand is why you think this is a problem. >> >> Adding new intent is not a problem. >> The problem I am thinking is that for the NEW intents, >> we should’d run both: > I meant “shouldn’t” > > thanks, > wengang > >> 1. iop_recover() during the AIL intents iteration and >> 2. the ops->finish_item() in xlog_finish_defer_ops()’s sub calls() >> >> I am thinking the iop_recover() and the finish_item() are doing same thing. >> I am not quite sure about ATTRI but I am sure about EFI. For EFI, >> both xfs_efi_item_recover() (as iop_recover()) and xfs_extent_free_finish_item() >> (as finish_item()) calls xfs_trans_free_extent() to free the same extent(s). >> >> I have a metadump with which I can reproduce the original AGFL deadlock problem >> during log recovery. When I tested my patch (without the changes in >> xlog_recover_process_intents()), I found the records in the NEW EFI are freed twice, >> one is during the AIL intents iteration with iop_recover(), the other is inside >> xlog_finish_defer_ops() with finish_item(). >> >> That needs to be fixed. The best way to fix that I thought is to fix xlog_recover_process_intents(). >> As log recovery, I am thinking the iop_cover() should only applied the original on disk redo intents >> only. Those original redo intents has no deferred option attached. And for the NEW intents >> introduced during the processing of the original intents, I don’t think iop_recover() should >> run for them becuase >> 1. they are NEW ones, not the original on disk redo intents and >> 2. they have deferred operation attached which process the new intents later. >> >> And I am also thinking we also depend xlog_recover_process_intents() to run iop_recover() >> on the new intents. That’s because the new intents could be already inserted to AIL >> before the AIL intents iteration ends, in that case have the change to run iop_recover() on the >> new ones. but the new intents also could be inserted to AIL after the AIL intents iteration ends, >> in that case we have no change to run iop_recover() on the new intents. — that’s an unstable thing. >> >> >>> >>>> In the following I will talk about the new intent log item. >>>> >>>> The iop_recover() is then executed on the new intent during the AIL intents iteration. >>>> I meant this loop: >>>> >>>> 2548 for (lip = xfs_trans_ail_cursor_first(ailp, &cur, 0); >>>> 2549 lip != NULL; >>>> 2550 lip = xfs_trans_ail_cursor_next(ailp, &cur)) { >>>> …. >>>> 2584 } >>>> Do you agree with above? >>> >>> Repeatedly asserting that I must agree that your logic is correct >>> comes across as unecessarily aggressive and combative and is not >>> acceptible behaviour. >>> >>> Aggressively asserting that you are right doesn't make your logic >>> any more compelling. It just makes you look really bad when, >>> inevitably, you get politely told that you've made yet another >>> mistake. >> >> Nope, I didn’t want to be aggressive. Since this is already discussed a lot >> but we still don’t have an agreement. So I wanted to know where is the disagreement, >> little by little. I thought it would be a good way to make things clear, >> if it made you unhappy, sorry, and I didn’t mean that. >> >>> >>>> And the lsn for the new intent is equal to or bigger than last_lsn. >>>> Do you agree with above? >>>> >>>> In above case the iop_recover() is xfs_attri_item_recover(). The later >>>> creates a xfs_attr_intent, attr1, copying things from the new ATTRI to attr1 >>>> and process attr1. >>>> Do you agree with above? >>>> >>>> Above xfs_attri_item_recover() runs successfully and the AIL intent iteration ends. >>> >>> We ran xfs_xattri_finish_update(attr, done_item) which returned >>> -EAGAIN. This means we ran xfs_xattri_finish_update() and modified >>> -something- before we logged the new intent. That, however, doesn't >>> dictate what we need to log in the new intent - that is determined >>> by the overall intent chain and recovery architecture the subsystem >>> uses. And the attribute modification architecture is extremely >>> subtle, complex and full of unexpected loops. >> >> I am a bit confused here. >> I know that xfs_efi_item_recover() and xfs_extent_free_finish_item() are doing same thing >> (free the extent specified in the EFI record(s)). So whatever the new would be logged int >> the new intents, I am thinking the for the same new ATTRI intent, xfs_attri_item_recover() >> and xfs_attr_finish_item() should do the same thing. If they don’t not, ATTIR has inconsistent >> behavior with EFI. >> >>> >>>> Now it’s time to process the capture_list with xlog_finish_defer_ops(). The >>>> capture_list contains the deferred operation which was added at line 639 with type >>>> XFS_DEFER_OPS_TYPE_ATTR. >>>> Do you agree with above? >>>> >>>> In xlog_finish_defer_ops(), a new transaction is created by xfs_trans_alloc(). >>>> the deferred XFS_DEFER_OPS_TYPE_ATTR operation is attached to that new >>>> transaction by xfs_defer_ops_continue(). Then the new transaction is committed by >>>> xfs_trans_commit(). >>>> Do you agree with above? >>> >>> So at this point it we are logging the same intent *information* >>> again. But it's not the same intent - we've retired the original >>> intent and we have a new intent chain for the operations we are >>> going to perform. But why are we logging the same information in a >>> new intent? >>> >>> Well, the attr modification state machine was -carefully- designed >>> to work this way: if we crash part way through an intent chain, we >>> always need to restart it in recovery with a "garbage collection" >>> step that undoes the partially complete changes that had been made >>> before we crashed. Only once we've done that garbage collection step >>> can we restart the original operation the intent described. >>> >>> Indeed, the operations we perform recovering an attr intent are >>> actually different to the operations we perform with those intents >>> at runtime. Recovery has to perform garbage collection as a first >>> step, yet runtime does not. >>> >>> IOWs, xfs_attri_item_recover() -> xfs_xattri_finish_update() is >>> typically performing garbage collection in the first recovery >>> transaction. If it does perform a gc operation, then it is >>> guaranteed to return -EAGAIN so that the actual operation recovery >>> needs to perform can have new intents logged, be deferred and then >>> eventually replayed. >>> >>> If we don't need to perform gc, then the operation we perform may >>> still require multiple transactions to complete and we get -EAGAIN >>> in that case, too. When that happens, we need to ensure if we crash >>> before recovery completes that the next mount will perform the >>> correct GC steps before it replays the intent from scratch. So we >>> always need to log an intent for the xattr operation that will >>> trigger the GC step on recovery. >>> >>> Either way, we need to log new intents that are identical to the >>> original one at each step of the process so that log recovery will >>> always start the intent recovery process with the correct garbage >>> collection operation.... >>> >>>> During the xfs_trans_commit(), xfs_defer_finish_noroll() is called. and >>>> xfs_defer_finish_one() is called inside xfs_defer_finish_noroll(). For the >>>> deferred XFS_DEFER_OPS_TYPE_ATTR operation, xfs_attr_finish_item() >>>> is called. >>>> Do you agree with above? >>>> >>>> In xfs_attr_finish_item(), xfs_attr_intent (attr2) is taken out from above new ATTRI >>>> and is processed. attr2 and attr1 contain exactly same thing because they are both >>>> from the new ATTRI. So the processing on attr2 is pure a duplication of the processing >>>> on attr1. So actually the new ATTRI are double-processed. >>>> Do you agree with above? >>> >>> No, that's wrong. There is now "attr1 and attr2" structures - they >>> are one and the same. i.e. the xfs_attr_intent used in >>> xfs_attr_finish_item() is the same one created in >>> xfs_attri_item_recover() that we called >>> xfs_defer_add(&attr->xattri_list) with way back at line 638. >>> >>> xfs_attr_finish_item( >>> struct xfs_trans *tp, >>> struct xfs_log_item *done, >>> struct list_head *item, >>> struct xfs_btree_cur **state) >>> { >>> struct xfs_attr_intent *attr; >>> struct xfs_attrd_log_item *done_item = NULL; >>> int error; >>> >>>>>>>> attr = container_of(item, struct xfs_attr_intent, xattri_list); >>> if (done) >>> done_item = ATTRD_ITEM(done); >>> >>> xfs_attr_finish_item() extracts the @attr structure from the >>> listhead that is linking it into the deferred op chain, and we >>> run xfs_attr_set_iter() again to run the next step(s) in the xattr >>> modification state machine. If this returns -EAGAIN (and it can) >>> we go around the "defer(attr->attri_list), create new intent, >>> commit, ... ->finish_item() -> xfs_attr_finish_item()" loop again. >>> >>> From this, I am guessing that you are confusing the per-defer-cycle >>> xfs_attri_log_item with the overall xfs_attr_intent context. i.e. We >>> allocate a new xfs_attri_log_item (and xfs_attrd_log_item) every >>> time we create new intents in the deferred operation chain. Each >>> xfs_attri_log_item is created from the state that is carried >>> across entire the attr recovery chain in the xfs_attr_intent >>> structure.... >> >> OK, ATTRI really looks complex. And it looks like the >> xfs_attri_item_recover() and xfs_attr_finish_item() can do different things >> according the current state. Thus running both xfs_attri_item_recover() >> and xfs_attr_finish_item() has no problem. I didn’t take a good example :(. >> >> But, the original >> ASSERT(XFS_LSN_CMP(last_lsn, lip->li_lsn) >= 0); >> in xlog_recover_process_intents() is incorrect, right? For example if we >> introduced two different new ATTRIs and the 2nd should have li_lsn bigger >> than last_lsn? >> >> Anyway, let’s go back to the EFI. EFI is different from ATTRI, it has no state machine. >> xfs_efi_item_recover() and xfs_extent_free_finish_item() are doing the same thing — >> free the extent specified by the EFI record(s). For the new EFI, the >> xfs_efi_item_recover() goes well, but xfs_extent_free_finish_item() fails because it >> want to free the extent which is already freed by the xfs_efi_item_recover(). >> We pass some information (indicating it already done) from xfs_efi_item_recover() to >> xfs_extent_free_finish_item() somehow so that xfs_extent_free_finish_item() won’t do >> the 2nd free as a workaround? What’s your idea? >> >> thanks, >> wengang