Re: [PATCH] xfs: Don't block in xfs_extent_busy_flush

Wengang Wang <wen.gang.wang@xxxxxxxxxx> · Fri, 26 May 2023 21:31:19 +0000



> On May 26, 2023, at 11:59 AM, Wengang Wang <wen.gang.wang@xxxxxxxxxx> wrote:
> 
> 
> 
>> On May 25, 2023, at 5:26 PM, Dave Chinner <david@xxxxxxxxxxxxx> wrote:
>> 
>> On Thu, May 25, 2023 at 05:13:41PM +0000, Wengang Wang wrote:
>>> 
>>> 
>>>> On May 24, 2023, at 7:20 PM, Dave Chinner <david@xxxxxxxxxxxxx> wrote:
>>>> 
>>>> On Wed, May 24, 2023 at 04:27:19PM +0000, Wengang Wang wrote:
>>>>>> On May 23, 2023, at 5:27 PM, Darrick J. Wong <djwong@xxxxxxxxxx> wrote:
>>>>>> On Tue, May 23, 2023 at 02:59:40AM +0000, Wengang Wang wrote:
>>>>>>>> On May 22, 2023, at 3:40 PM, Dave Chinner <david@xxxxxxxxxxxxx> wrote:
>>>>>>>> On Mon, May 22, 2023 at 06:20:11PM +0000, Wengang Wang wrote:
>>>>>>>>>> On May 21, 2023, at 6:17 PM, Dave Chinner <david@xxxxxxxxxxxxx> wrote:
>>>>>>>>>> On Fri, May 19, 2023 at 10:18:29AM -0700, Wengang Wang wrote:
>>>>>>>>>>> diff --git a/fs/xfs/xfs_extfree_item.c b/fs/xfs/xfs_extfree_item.c
>>>>>>>>> For existing other cases (if there are) where new intents are added,
>>>>>>>>> they don’t use the capture_list for delayed operations? Do you have example then? 
>>>>>>>>> if so I think we should follow their way instead of adding the defer operations
>>>>>>>>> (but reply on the intents on AIL).
>>>>>>>> 
>>>>>>>> All of the intent recovery stuff uses
>>>>>>>> xfs_defer_ops_capture_and_commit() to commit the intent being
>>>>>>>> replayed and cause all further new intent processing in that chain
>>>>>>>> to be defered until after all the intents recovered from the journal
>>>>>>>> have been iterated. All those new intents end up in the AIL at a LSN
>>>>>>>> index >= last_lsn.
>>>>>>> 
>>>>>>> Yes. So we break the AIL iteration on seeing an intent with lsn equal to
>>>>>>> or bigger than last_lsn and skip the iop_recover() for that item?
>>>>>>> and shall we put this change to another separated patch as it is to fix
>>>>>>> an existing problem (not introduced by my patch)?
>>>>>> 
>>>>>> Intent replay creates non-intent log items (like buffers or inodes) that
>>>>>> are added to the AIL with an LSN higher than last_lsn.  I suppose it
>>>>>> would be possible to break log recovery if an intent's iop_recover
>>>>>> method immediately logged a new intent and returned EAGAIN to roll the
>>>>>> transaction, but none of them do that;
>>>>> 
>>>>> I am not quite sure for above. There are cases that new intents are added
>>>>> in iop_recover(), for example xfs_attri_item_recover():
>>>>> 
>>>>> 632         error = xfs_xattri_finish_update(attr, done_item);
>>>>> 633         if (error == -EAGAIN) {
>>>>> 634                 /*
>>>>> 635                  * There's more work to do, so add the intent item to this
>>>>> 636                  * transaction so that we can continue it later.
>>>>> 637                  */
>>>>> 638                 xfs_defer_add(tp, XFS_DEFER_OPS_TYPE_ATTR, &attr->xattri_list);
>>>>> 639                 error = xfs_defer_ops_capture_and_commit(tp, capture_list);
>>>>> 640                 if (error)
>>>>> 641                         goto out_unlock;
>>>>> 642
>>>>> 643                 xfs_iunlock(ip, XFS_ILOCK_EXCL);
>>>>> 644                 xfs_irele(ip);
>>>>> 645                 return 0;
>>>>> 646         }
>>>>> 
>>>>> I am thinking line 638 and 639 are doing so.
>>>> 
>>>> I don't think so. @attrip is the attribute information recovered
>>>> from the original intent - we allocated @attr in
>>>> xfs_attri_item_recover() to enable the operation indicated by
>>>> @attrip to be executed.  We copy stuff from @attrip (the item we are
>>>> recovering) to the new @attr structure (the work we need to do) and
>>>> run it through the attr modification state machine. If there's more
>>>> work to be done, we then add @attr to the defer list.
>>> 
>>> Assuming “if there’s more work to be done” is the -EAGAIN case...
>>> 
>>>> 
>>>> If there is deferred work to be continued, we still need a new
>>>> intent to indicate what attr operation needs to be performed next in
>>>> the chain.  When we commit the transaction (639), it will create a
>>>> new intent log item for the deferred attribute operation in the
>>>> ->create_intent callback before the transaction is committed.
>>> 
>>> Yes, that’s true. But it has no conflict with what I was saying. The thing I wanted
>>> to say is that:
>>> The new intent log item (introduced by xfs_attri_item_recover()) could appear on
>>> AIL before the AIL iteration ends in xlog_recover_process_intents(). 
>>> Do you agree with above?
>> 
>> Recovery of intents has *always* been able to do this. What I don't
>> understand is why you think this is a problem.
> 
> Adding new intent is not a problem. 
> The problem I am thinking is that for the NEW intents,
> we should’d run both:
I meant “shouldn’t”

thanks,
wengang

> 1. iop_recover() during the AIL intents iteration and
> 2. the ops->finish_item() in xlog_finish_defer_ops()’s sub calls()
> 
> I am thinking the iop_recover() and the finish_item() are doing same thing.
> I am not quite sure about ATTRI but  I am sure about EFI. For EFI,
> both xfs_efi_item_recover() (as iop_recover()) and xfs_extent_free_finish_item()
> (as finish_item()) calls xfs_trans_free_extent() to free the same extent(s).
> 
> I have a metadump with which I can reproduce the original AGFL deadlock problem
> during log recovery.  When I tested my patch (without the changes in
> xlog_recover_process_intents()), I found the records in the NEW EFI are freed twice,
> one is during the AIL intents iteration with iop_recover(), the other is inside
> xlog_finish_defer_ops() with finish_item().
> 
> That needs to be fixed. The best way to fix that I thought is to fix xlog_recover_process_intents().
> As log recovery, I am thinking the iop_cover() should only applied the original on disk redo intents
> only. Those original redo intents has no deferred option attached.  And for the NEW intents
> introduced during the processing of the original intents, I don’t think iop_recover() should
> run for them becuase
> 1. they are NEW ones, not the original on disk redo intents and 
> 2. they have deferred operation attached which process the new intents later.
> 
> And I am also thinking we also depend  xlog_recover_process_intents() to run iop_recover()
> on the new intents. That’s because the new intents could be already inserted to AIL
> before the AIL intents iteration ends, in that case have the change to run iop_recover() on the
> new ones.  but the new intents also could be inserted to AIL after the AIL intents iteration ends,
> in that case we have no change to run iop_recover() on the new intents. — that’s an unstable thing.
> 
> 
>> 
>>> In the following I will talk about the new intent log item. 
>>> 
>>> The iop_recover() is then executed on the new intent during the AIL intents iteration.
>>> I meant this loop:
>>> 
>>> 2548         for (lip = xfs_trans_ail_cursor_first(ailp, &cur, 0);
>>> 2549              lip != NULL;
>>> 2550              lip = xfs_trans_ail_cursor_next(ailp, &cur)) {
>>> ….
>>> 2584         }
>>> Do you agree with above?
>> 
>> Repeatedly asserting that I must agree that your logic is correct
>> comes across as unecessarily aggressive and combative and is not
>> acceptible behaviour.
>> 
>> Aggressively asserting that you are right doesn't make your logic
>> any more compelling. It just makes you look really bad when,
>> inevitably, you get politely told that you've made yet another
>> mistake.
> 
> Nope, I didn’t want to be aggressive. Since this is already discussed a lot
> but we still don’t have an agreement. So I wanted to know where is the disagreement,
> little by little. I thought it would be a good way to make things clear,
> if it made you unhappy, sorry, and I didn’t mean that.
> 
>> 
>>> And the lsn for the new intent is equal to or bigger than last_lsn.
>>> Do you agree with above?
>>> 
>>> In above case the iop_recover() is xfs_attri_item_recover(). The later
>>> creates a xfs_attr_intent, attr1, copying things from the new ATTRI to attr1
>>> and process attr1.
>>> Do you agree with above?
>>> 
>>> Above xfs_attri_item_recover() runs successfully and the AIL intent iteration ends.
>> 
>> We ran xfs_xattri_finish_update(attr, done_item) which returned
>> -EAGAIN. This means we ran xfs_xattri_finish_update() and modified
>> -something- before we logged the new intent. That, however, doesn't
>> dictate what we need to log in the new intent - that is determined
>> by the overall intent chain and recovery architecture the subsystem
>> uses. And the attribute modification architecture is extremely
>> subtle, complex and full of unexpected loops.
> 
> I am a bit confused here. 
> I know that xfs_efi_item_recover() and xfs_extent_free_finish_item() are doing same thing
> (free the extent specified in the EFI record(s)). So whatever the new would be logged int
> the new intents, I am thinking the for the same new ATTRI intent, xfs_attri_item_recover()
> and xfs_attr_finish_item() should do the same thing. If they don’t not,  ATTIR has inconsistent
> behavior with EFI.
> 
>> 
>>> Now it’s time to process the capture_list with xlog_finish_defer_ops(). The
>>> capture_list contains the deferred operation which was added at line 639 with type
>>> XFS_DEFER_OPS_TYPE_ATTR. 
>>> Do you agree with above?
>>> 
>>> In xlog_finish_defer_ops(), a new transaction is created by xfs_trans_alloc().
>>> the deferred XFS_DEFER_OPS_TYPE_ATTR operation is attached to that new
>>> transaction by xfs_defer_ops_continue(). Then the new transaction is committed by
>>> xfs_trans_commit().
>>> Do you agree with above?
>> 
>> So at this point it we are logging the same intent *information*
>> again. But it's not the same intent - we've retired the original
>> intent and we have a new intent chain for the operations we are
>> going to perform. But why are we logging the same information in a
>> new intent?
>> 
>> Well, the attr modification state machine was -carefully- designed
>> to work this way: if we crash part way through an intent chain, we
>> always need to restart it in recovery with a "garbage collection"
>> step that undoes the partially complete changes that had been made
>> before we crashed. Only once we've done that garbage collection step
>> can we restart the original operation the intent described.
>> 
>> Indeed, the operations we perform recovering an attr intent are
>> actually different to the operations we perform with those intents
>> at runtime. Recovery has to perform garbage collection as a first
>> step, yet runtime does not.
>> 
>> IOWs, xfs_attri_item_recover() -> xfs_xattri_finish_update() is
>> typically performing garbage collection in the first recovery
>> transaction.  If it does perform a gc operation, then it is
>> guaranteed to return -EAGAIN so that the actual operation recovery
>> needs to perform can have new intents logged, be deferred and then
>> eventually replayed.
>> 
>> If we don't need to perform gc, then the operation we perform may
>> still require multiple transactions to complete and we get -EAGAIN
>> in that case, too. When that happens, we need to ensure if we crash
>> before recovery completes that the next mount will perform the
>> correct GC steps before it replays the intent from scratch. So we
>> always need to log an intent for the xattr operation that will
>> trigger the GC step on recovery.
>> 
>> Either way, we need to log new intents that are identical to the
>> original one at each step of the process so that log recovery will
>> always start the intent recovery process with the correct garbage
>> collection operation....
>> 
>>> During the xfs_trans_commit(), xfs_defer_finish_noroll() is called. and
>>> xfs_defer_finish_one() is called inside xfs_defer_finish_noroll(). For the
>>> deferred XFS_DEFER_OPS_TYPE_ATTR operation, xfs_attr_finish_item()
>>> is called.
>>> Do you agree with above?
>>> 
>>> In xfs_attr_finish_item(), xfs_attr_intent (attr2) is taken out from above new ATTRI
>>> and is processed.  attr2 and attr1 contain exactly same thing because they are both
>>> from the new ATTRI.  So the processing on attr2 is pure a duplication of the processing
>>> on attr1.  So actually the new ATTRI are double-processed.
>>> Do you agree with above? 
>> 
>> No, that's wrong. There is now "attr1 and attr2" structures - they
>> are one and the same. i.e. the xfs_attr_intent used in
>> xfs_attr_finish_item() is the same one created in
>> xfs_attri_item_recover() that we called
>> xfs_defer_add(&attr->xattri_list) with way back at line 638. 
>> 
>> xfs_attr_finish_item(
>>       struct xfs_trans                *tp,
>>       struct xfs_log_item             *done,
>>       struct list_head                *item,
>>       struct xfs_btree_cur            **state)
>> {
>>       struct xfs_attr_intent          *attr;
>>       struct xfs_attrd_log_item       *done_item = NULL;
>>       int                             error;
>> 
>>>>>>> attr = container_of(item, struct xfs_attr_intent, xattri_list);
>>       if (done)
>>               done_item = ATTRD_ITEM(done);
>> 
>> xfs_attr_finish_item() extracts the @attr structure from the
>> listhead that is linking it into the deferred op chain, and we
>> run xfs_attr_set_iter() again to run the next step(s) in the xattr
>> modification state machine. If this returns -EAGAIN (and it can)
>> we go around the "defer(attr->attri_list), create new intent,
>> commit, ... ->finish_item() -> xfs_attr_finish_item()" loop again.
>> 
>> From this, I am guessing that you are confusing the per-defer-cycle
>> xfs_attri_log_item with the overall xfs_attr_intent context. i.e. We
>> allocate a new xfs_attri_log_item (and xfs_attrd_log_item) every
>> time we create new intents in the deferred operation chain. Each
>> xfs_attri_log_item is created from the state that is carried
>> across entire the attr recovery chain in the xfs_attr_intent
>> structure....
> 
> OK, ATTRI really looks complex. And it looks like the
> xfs_attri_item_recover() and xfs_attr_finish_item() can do different things
> according the current state. Thus running both xfs_attri_item_recover()
> and xfs_attr_finish_item() has no problem.  I didn’t take a good example :(. 
> 
> But, the original 
> ASSERT(XFS_LSN_CMP(last_lsn, lip->li_lsn) >= 0);
> in xlog_recover_process_intents() is incorrect, right? For example if we
> introduced two different new ATTRIs and the 2nd should have li_lsn bigger
> than last_lsn?
> 
> Anyway, let’s go back to the EFI. EFI is different from ATTRI, it has no state machine.
> xfs_efi_item_recover() and xfs_extent_free_finish_item() are doing the same thing —
> free the extent specified by the EFI record(s).  For the new EFI, the
> xfs_efi_item_recover() goes well, but xfs_extent_free_finish_item() fails because it
> want to free the extent which is already freed by the xfs_efi_item_recover(). 
> We pass some information (indicating it already done) from xfs_efi_item_recover() to
> xfs_extent_free_finish_item() somehow so that xfs_extent_free_finish_item() won’t do
> the 2nd free as a workaround? What’s your idea?
> 
> thanks,
> wengang