> On May 26, 2023, at 11:59 AM, Wengang Wang <wen.gang.wang@xxxxxxxxxx> wrote: > > > >> On May 25, 2023, at 5:26 PM, Dave Chinner <david@xxxxxxxxxxxxx> wrote: >> >> On Thu, May 25, 2023 at 05:13:41PM +0000, Wengang Wang wrote: >>> >>> >>>> On May 24, 2023, at 7:20 PM, Dave Chinner <david@xxxxxxxxxxxxx> wrote: >>>> >>>> On Wed, May 24, 2023 at 04:27:19PM +0000, Wengang Wang wrote: >>>>>> On May 23, 2023, at 5:27 PM, Darrick J. Wong <djwong@xxxxxxxxxx> wrote: >>>>>> On Tue, May 23, 2023 at 02:59:40AM +0000, Wengang Wang wrote: >>>>>>>> On May 22, 2023, at 3:40 PM, Dave Chinner <david@xxxxxxxxxxxxx> wrote: >>>>>>>> On Mon, May 22, 2023 at 06:20:11PM +0000, Wengang Wang wrote: >>>>>>>>>> On May 21, 2023, at 6:17 PM, Dave Chinner <david@xxxxxxxxxxxxx> wrote: >>>>>>>>>> On Fri, May 19, 2023 at 10:18:29AM -0700, Wengang Wang wrote: >>>>>>>>>>> diff --git a/fs/xfs/xfs_extfree_item.c b/fs/xfs/xfs_extfree_item.c >>>>>>>>> For existing other cases (if there are) where new intents are added, >>>>>>>>> they don’t use the capture_list for delayed operations? Do you have example then? >>>>>>>>> if so I think we should follow their way instead of adding the defer operations >>>>>>>>> (but reply on the intents on AIL). >>>>>>>> >>>>>>>> All of the intent recovery stuff uses >>>>>>>> xfs_defer_ops_capture_and_commit() to commit the intent being >>>>>>>> replayed and cause all further new intent processing in that chain >>>>>>>> to be defered until after all the intents recovered from the journal >>>>>>>> have been iterated. All those new intents end up in the AIL at a LSN >>>>>>>> index >= last_lsn. >>>>>>> >>>>>>> Yes. So we break the AIL iteration on seeing an intent with lsn equal to >>>>>>> or bigger than last_lsn and skip the iop_recover() for that item? >>>>>>> and shall we put this change to another separated patch as it is to fix >>>>>>> an existing problem (not introduced by my patch)? >>>>>> >>>>>> Intent replay creates non-intent log items (like buffers or inodes) that >>>>>> are added to the AIL with an LSN higher than last_lsn. I suppose it >>>>>> would be possible to break log recovery if an intent's iop_recover >>>>>> method immediately logged a new intent and returned EAGAIN to roll the >>>>>> transaction, but none of them do that; >>>>> >>>>> I am not quite sure for above. There are cases that new intents are added >>>>> in iop_recover(), for example xfs_attri_item_recover(): >>>>> >>>>> 632 error = xfs_xattri_finish_update(attr, done_item); >>>>> 633 if (error == -EAGAIN) { >>>>> 634 /* >>>>> 635 * There's more work to do, so add the intent item to this >>>>> 636 * transaction so that we can continue it later. >>>>> 637 */ >>>>> 638 xfs_defer_add(tp, XFS_DEFER_OPS_TYPE_ATTR, &attr->xattri_list); >>>>> 639 error = xfs_defer_ops_capture_and_commit(tp, capture_list); >>>>> 640 if (error) >>>>> 641 goto out_unlock; >>>>> 642 >>>>> 643 xfs_iunlock(ip, XFS_ILOCK_EXCL); >>>>> 644 xfs_irele(ip); >>>>> 645 return 0; >>>>> 646 } >>>>> >>>>> I am thinking line 638 and 639 are doing so. >>>> >>>> I don't think so. @attrip is the attribute information recovered >>>> from the original intent - we allocated @attr in >>>> xfs_attri_item_recover() to enable the operation indicated by >>>> @attrip to be executed. We copy stuff from @attrip (the item we are >>>> recovering) to the new @attr structure (the work we need to do) and >>>> run it through the attr modification state machine. If there's more >>>> work to be done, we then add @attr to the defer list. >>> >>> Assuming “if there’s more work to be done” is the -EAGAIN case... >>> >>>> >>>> If there is deferred work to be continued, we still need a new >>>> intent to indicate what attr operation needs to be performed next in >>>> the chain. When we commit the transaction (639), it will create a >>>> new intent log item for the deferred attribute operation in the >>>> ->create_intent callback before the transaction is committed. >>> >>> Yes, that’s true. But it has no conflict with what I was saying. The thing I wanted >>> to say is that: >>> The new intent log item (introduced by xfs_attri_item_recover()) could appear on >>> AIL before the AIL iteration ends in xlog_recover_process_intents(). >>> Do you agree with above? >> >> Recovery of intents has *always* been able to do this. What I don't >> understand is why you think this is a problem. > > Adding new intent is not a problem. > The problem I am thinking is that for the NEW intents, > we should’d run both: I meant “shouldn’t” thanks, wengang > 1. iop_recover() during the AIL intents iteration and > 2. the ops->finish_item() in xlog_finish_defer_ops()’s sub calls() > > I am thinking the iop_recover() and the finish_item() are doing same thing. > I am not quite sure about ATTRI but I am sure about EFI. For EFI, > both xfs_efi_item_recover() (as iop_recover()) and xfs_extent_free_finish_item() > (as finish_item()) calls xfs_trans_free_extent() to free the same extent(s). > > I have a metadump with which I can reproduce the original AGFL deadlock problem > during log recovery. When I tested my patch (without the changes in > xlog_recover_process_intents()), I found the records in the NEW EFI are freed twice, > one is during the AIL intents iteration with iop_recover(), the other is inside > xlog_finish_defer_ops() with finish_item(). > > That needs to be fixed. The best way to fix that I thought is to fix xlog_recover_process_intents(). > As log recovery, I am thinking the iop_cover() should only applied the original on disk redo intents > only. Those original redo intents has no deferred option attached. And for the NEW intents > introduced during the processing of the original intents, I don’t think iop_recover() should > run for them becuase > 1. they are NEW ones, not the original on disk redo intents and > 2. they have deferred operation attached which process the new intents later. > > And I am also thinking we also depend xlog_recover_process_intents() to run iop_recover() > on the new intents. That’s because the new intents could be already inserted to AIL > before the AIL intents iteration ends, in that case have the change to run iop_recover() on the > new ones. but the new intents also could be inserted to AIL after the AIL intents iteration ends, > in that case we have no change to run iop_recover() on the new intents. — that’s an unstable thing. > > >> >>> In the following I will talk about the new intent log item. >>> >>> The iop_recover() is then executed on the new intent during the AIL intents iteration. >>> I meant this loop: >>> >>> 2548 for (lip = xfs_trans_ail_cursor_first(ailp, &cur, 0); >>> 2549 lip != NULL; >>> 2550 lip = xfs_trans_ail_cursor_next(ailp, &cur)) { >>> …. >>> 2584 } >>> Do you agree with above? >> >> Repeatedly asserting that I must agree that your logic is correct >> comes across as unecessarily aggressive and combative and is not >> acceptible behaviour. >> >> Aggressively asserting that you are right doesn't make your logic >> any more compelling. It just makes you look really bad when, >> inevitably, you get politely told that you've made yet another >> mistake. > > Nope, I didn’t want to be aggressive. Since this is already discussed a lot > but we still don’t have an agreement. So I wanted to know where is the disagreement, > little by little. I thought it would be a good way to make things clear, > if it made you unhappy, sorry, and I didn’t mean that. > >> >>> And the lsn for the new intent is equal to or bigger than last_lsn. >>> Do you agree with above? >>> >>> In above case the iop_recover() is xfs_attri_item_recover(). The later >>> creates a xfs_attr_intent, attr1, copying things from the new ATTRI to attr1 >>> and process attr1. >>> Do you agree with above? >>> >>> Above xfs_attri_item_recover() runs successfully and the AIL intent iteration ends. >> >> We ran xfs_xattri_finish_update(attr, done_item) which returned >> -EAGAIN. This means we ran xfs_xattri_finish_update() and modified >> -something- before we logged the new intent. That, however, doesn't >> dictate what we need to log in the new intent - that is determined >> by the overall intent chain and recovery architecture the subsystem >> uses. And the attribute modification architecture is extremely >> subtle, complex and full of unexpected loops. > > I am a bit confused here. > I know that xfs_efi_item_recover() and xfs_extent_free_finish_item() are doing same thing > (free the extent specified in the EFI record(s)). So whatever the new would be logged int > the new intents, I am thinking the for the same new ATTRI intent, xfs_attri_item_recover() > and xfs_attr_finish_item() should do the same thing. If they don’t not, ATTIR has inconsistent > behavior with EFI. > >> >>> Now it’s time to process the capture_list with xlog_finish_defer_ops(). The >>> capture_list contains the deferred operation which was added at line 639 with type >>> XFS_DEFER_OPS_TYPE_ATTR. >>> Do you agree with above? >>> >>> In xlog_finish_defer_ops(), a new transaction is created by xfs_trans_alloc(). >>> the deferred XFS_DEFER_OPS_TYPE_ATTR operation is attached to that new >>> transaction by xfs_defer_ops_continue(). Then the new transaction is committed by >>> xfs_trans_commit(). >>> Do you agree with above? >> >> So at this point it we are logging the same intent *information* >> again. But it's not the same intent - we've retired the original >> intent and we have a new intent chain for the operations we are >> going to perform. But why are we logging the same information in a >> new intent? >> >> Well, the attr modification state machine was -carefully- designed >> to work this way: if we crash part way through an intent chain, we >> always need to restart it in recovery with a "garbage collection" >> step that undoes the partially complete changes that had been made >> before we crashed. Only once we've done that garbage collection step >> can we restart the original operation the intent described. >> >> Indeed, the operations we perform recovering an attr intent are >> actually different to the operations we perform with those intents >> at runtime. Recovery has to perform garbage collection as a first >> step, yet runtime does not. >> >> IOWs, xfs_attri_item_recover() -> xfs_xattri_finish_update() is >> typically performing garbage collection in the first recovery >> transaction. If it does perform a gc operation, then it is >> guaranteed to return -EAGAIN so that the actual operation recovery >> needs to perform can have new intents logged, be deferred and then >> eventually replayed. >> >> If we don't need to perform gc, then the operation we perform may >> still require multiple transactions to complete and we get -EAGAIN >> in that case, too. When that happens, we need to ensure if we crash >> before recovery completes that the next mount will perform the >> correct GC steps before it replays the intent from scratch. So we >> always need to log an intent for the xattr operation that will >> trigger the GC step on recovery. >> >> Either way, we need to log new intents that are identical to the >> original one at each step of the process so that log recovery will >> always start the intent recovery process with the correct garbage >> collection operation.... >> >>> During the xfs_trans_commit(), xfs_defer_finish_noroll() is called. and >>> xfs_defer_finish_one() is called inside xfs_defer_finish_noroll(). For the >>> deferred XFS_DEFER_OPS_TYPE_ATTR operation, xfs_attr_finish_item() >>> is called. >>> Do you agree with above? >>> >>> In xfs_attr_finish_item(), xfs_attr_intent (attr2) is taken out from above new ATTRI >>> and is processed. attr2 and attr1 contain exactly same thing because they are both >>> from the new ATTRI. So the processing on attr2 is pure a duplication of the processing >>> on attr1. So actually the new ATTRI are double-processed. >>> Do you agree with above? >> >> No, that's wrong. There is now "attr1 and attr2" structures - they >> are one and the same. i.e. the xfs_attr_intent used in >> xfs_attr_finish_item() is the same one created in >> xfs_attri_item_recover() that we called >> xfs_defer_add(&attr->xattri_list) with way back at line 638. >> >> xfs_attr_finish_item( >> struct xfs_trans *tp, >> struct xfs_log_item *done, >> struct list_head *item, >> struct xfs_btree_cur **state) >> { >> struct xfs_attr_intent *attr; >> struct xfs_attrd_log_item *done_item = NULL; >> int error; >> >>>>>>> attr = container_of(item, struct xfs_attr_intent, xattri_list); >> if (done) >> done_item = ATTRD_ITEM(done); >> >> xfs_attr_finish_item() extracts the @attr structure from the >> listhead that is linking it into the deferred op chain, and we >> run xfs_attr_set_iter() again to run the next step(s) in the xattr >> modification state machine. If this returns -EAGAIN (and it can) >> we go around the "defer(attr->attri_list), create new intent, >> commit, ... ->finish_item() -> xfs_attr_finish_item()" loop again. >> >> From this, I am guessing that you are confusing the per-defer-cycle >> xfs_attri_log_item with the overall xfs_attr_intent context. i.e. We >> allocate a new xfs_attri_log_item (and xfs_attrd_log_item) every >> time we create new intents in the deferred operation chain. Each >> xfs_attri_log_item is created from the state that is carried >> across entire the attr recovery chain in the xfs_attr_intent >> structure.... > > OK, ATTRI really looks complex. And it looks like the > xfs_attri_item_recover() and xfs_attr_finish_item() can do different things > according the current state. Thus running both xfs_attri_item_recover() > and xfs_attr_finish_item() has no problem. I didn’t take a good example :(. > > But, the original > ASSERT(XFS_LSN_CMP(last_lsn, lip->li_lsn) >= 0); > in xlog_recover_process_intents() is incorrect, right? For example if we > introduced two different new ATTRIs and the 2nd should have li_lsn bigger > than last_lsn? > > Anyway, let’s go back to the EFI. EFI is different from ATTRI, it has no state machine. > xfs_efi_item_recover() and xfs_extent_free_finish_item() are doing the same thing — > free the extent specified by the EFI record(s). For the new EFI, the > xfs_efi_item_recover() goes well, but xfs_extent_free_finish_item() fails because it > want to free the extent which is already freed by the xfs_efi_item_recover(). > We pass some information (indicating it already done) from xfs_efi_item_recover() to > xfs_extent_free_finish_item() somehow so that xfs_extent_free_finish_item() won’t do > the 2nd free as a workaround? What’s your idea? > > thanks, > wengang