Re: [PATCH] xfs: Don't block in xfs_extent_busy_flush

Wengang Wang <wen.gang.wang@xxxxxxxxxx> · Sat, 3 Jun 2023 00:26:28 +0000

Hi Dave and Darrick,

What do you think?

thanks,
wengang

> On May 26, 2023, at 2:31 PM, Wengang Wang <wen.gang.wang@xxxxxxxxxx> wrote:
> 
> 
> 
>> On May 26, 2023, at 11:59 AM, Wengang Wang <wen.gang.wang@xxxxxxxxxx> wrote:
>> 
>> 
>> 
>>> On May 25, 2023, at 5:26 PM, Dave Chinner <david@xxxxxxxxxxxxx> wrote:
>>> 
>>> On Thu, May 25, 2023 at 05:13:41PM +0000, Wengang Wang wrote:
>>>> 
>>>> 
>>>>> On May 24, 2023, at 7:20 PM, Dave Chinner <david@xxxxxxxxxxxxx> wrote:
>>>>> 
>>>>> On Wed, May 24, 2023 at 04:27:19PM +0000, Wengang Wang wrote:
>>>>>>> On May 23, 2023, at 5:27 PM, Darrick J. Wong <djwong@xxxxxxxxxx> wrote:
>>>>>>> On Tue, May 23, 2023 at 02:59:40AM +0000, Wengang Wang wrote:
>>>>>>>>> On May 22, 2023, at 3:40 PM, Dave Chinner <david@xxxxxxxxxxxxx> wrote:
>>>>>>>>> On Mon, May 22, 2023 at 06:20:11PM +0000, Wengang Wang wrote:
>>>>>>>>>>> On May 21, 2023, at 6:17 PM, Dave Chinner <david@xxxxxxxxxxxxx> wrote:
>>>>>>>>>>> On Fri, May 19, 2023 at 10:18:29AM -0700, Wengang Wang wrote:
>>>>>>>>>>>> diff --git a/fs/xfs/xfs_extfree_item.c b/fs/xfs/xfs_extfree_item.c
>>>>>>>>>> For existing other cases (if there are) where new intents are added,
>>>>>>>>>> they don’t use the capture_list for delayed operations? Do you have example then? 
>>>>>>>>>> if so I think we should follow their way instead of adding the defer operations
>>>>>>>>>> (but reply on the intents on AIL).
>>>>>>>>> 
>>>>>>>>> All of the intent recovery stuff uses
>>>>>>>>> xfs_defer_ops_capture_and_commit() to commit the intent being
>>>>>>>>> replayed and cause all further new intent processing in that chain
>>>>>>>>> to be defered until after all the intents recovered from the journal
>>>>>>>>> have been iterated. All those new intents end up in the AIL at a LSN
>>>>>>>>> index >= last_lsn.
>>>>>>>> 
>>>>>>>> Yes. So we break the AIL iteration on seeing an intent with lsn equal to
>>>>>>>> or bigger than last_lsn and skip the iop_recover() for that item?
>>>>>>>> and shall we put this change to another separated patch as it is to fix
>>>>>>>> an existing problem (not introduced by my patch)?
>>>>>>> 
>>>>>>> Intent replay creates non-intent log items (like buffers or inodes) that
>>>>>>> are added to the AIL with an LSN higher than last_lsn.  I suppose it
>>>>>>> would be possible to break log recovery if an intent's iop_recover
>>>>>>> method immediately logged a new intent and returned EAGAIN to roll the
>>>>>>> transaction, but none of them do that;
>>>>>> 
>>>>>> I am not quite sure for above. There are cases that new intents are added
>>>>>> in iop_recover(), for example xfs_attri_item_recover():
>>>>>> 
>>>>>> 632         error = xfs_xattri_finish_update(attr, done_item);
>>>>>> 633         if (error == -EAGAIN) {
>>>>>> 634                 /*
>>>>>> 635                  * There's more work to do, so add the intent item to this
>>>>>> 636                  * transaction so that we can continue it later.
>>>>>> 637                  */
>>>>>> 638                 xfs_defer_add(tp, XFS_DEFER_OPS_TYPE_ATTR, &attr->xattri_list);
>>>>>> 639                 error = xfs_defer_ops_capture_and_commit(tp, capture_list);
>>>>>> 640                 if (error)
>>>>>> 641                         goto out_unlock;
>>>>>> 642
>>>>>> 643                 xfs_iunlock(ip, XFS_ILOCK_EXCL);
>>>>>> 644                 xfs_irele(ip);
>>>>>> 645                 return 0;
>>>>>> 646         }
>>>>>> 
>>>>>> I am thinking line 638 and 639 are doing so.
>>>>> 
>>>>> I don't think so. @attrip is the attribute information recovered
>>>>> from the original intent - we allocated @attr in
>>>>> xfs_attri_item_recover() to enable the operation indicated by
>>>>> @attrip to be executed.  We copy stuff from @attrip (the item we are
>>>>> recovering) to the new @attr structure (the work we need to do) and
>>>>> run it through the attr modification state machine. If there's more
>>>>> work to be done, we then add @attr to the defer list.
>>>> 
>>>> Assuming “if there’s more work to be done” is the -EAGAIN case...
>>>> 
>>>>> 
>>>>> If there is deferred work to be continued, we still need a new
>>>>> intent to indicate what attr operation needs to be performed next in
>>>>> the chain.  When we commit the transaction (639), it will create a
>>>>> new intent log item for the deferred attribute operation in the
>>>>> ->create_intent callback before the transaction is committed.
>>>> 
>>>> Yes, that’s true. But it has no conflict with what I was saying. The thing I wanted
>>>> to say is that:
>>>> The new intent log item (introduced by xfs_attri_item_recover()) could appear on
>>>> AIL before the AIL iteration ends in xlog_recover_process_intents(). 
>>>> Do you agree with above?
>>> 
>>> Recovery of intents has *always* been able to do this. What I don't
>>> understand is why you think this is a problem.
>> 
>> Adding new intent is not a problem. 
>> The problem I am thinking is that for the NEW intents,
>> we should’d run both:
> I meant “shouldn’t”
> 
> thanks,
> wengang
> 
>> 1. iop_recover() during the AIL intents iteration and
>> 2. the ops->finish_item() in xlog_finish_defer_ops()’s sub calls()
>> 
>> I am thinking the iop_recover() and the finish_item() are doing same thing.
>> I am not quite sure about ATTRI but  I am sure about EFI. For EFI,
>> both xfs_efi_item_recover() (as iop_recover()) and xfs_extent_free_finish_item()
>> (as finish_item()) calls xfs_trans_free_extent() to free the same extent(s).
>> 
>> I have a metadump with which I can reproduce the original AGFL deadlock problem
>> during log recovery.  When I tested my patch (without the changes in
>> xlog_recover_process_intents()), I found the records in the NEW EFI are freed twice,
>> one is during the AIL intents iteration with iop_recover(), the other is inside
>> xlog_finish_defer_ops() with finish_item().
>> 
>> That needs to be fixed. The best way to fix that I thought is to fix xlog_recover_process_intents().
>> As log recovery, I am thinking the iop_cover() should only applied the original on disk redo intents
>> only. Those original redo intents has no deferred option attached.  And for the NEW intents
>> introduced during the processing of the original intents, I don’t think iop_recover() should
>> run for them becuase
>> 1. they are NEW ones, not the original on disk redo intents and 
>> 2. they have deferred operation attached which process the new intents later.
>> 
>> And I am also thinking we also depend  xlog_recover_process_intents() to run iop_recover()
>> on the new intents. That’s because the new intents could be already inserted to AIL
>> before the AIL intents iteration ends, in that case have the change to run iop_recover() on the
>> new ones.  but the new intents also could be inserted to AIL after the AIL intents iteration ends,
>> in that case we have no change to run iop_recover() on the new intents. — that’s an unstable thing.
>> 
>> 
>>> 
>>>> In the following I will talk about the new intent log item. 
>>>> 
>>>> The iop_recover() is then executed on the new intent during the AIL intents iteration.
>>>> I meant this loop:
>>>> 
>>>> 2548         for (lip = xfs_trans_ail_cursor_first(ailp, &cur, 0);
>>>> 2549              lip != NULL;
>>>> 2550              lip = xfs_trans_ail_cursor_next(ailp, &cur)) {
>>>> ….
>>>> 2584         }
>>>> Do you agree with above?
>>> 
>>> Repeatedly asserting that I must agree that your logic is correct
>>> comes across as unecessarily aggressive and combative and is not
>>> acceptible behaviour.
>>> 
>>> Aggressively asserting that you are right doesn't make your logic
>>> any more compelling. It just makes you look really bad when,
>>> inevitably, you get politely told that you've made yet another
>>> mistake.
>> 
>> Nope, I didn’t want to be aggressive. Since this is already discussed a lot
>> but we still don’t have an agreement. So I wanted to know where is the disagreement,
>> little by little. I thought it would be a good way to make things clear,
>> if it made you unhappy, sorry, and I didn’t mean that.
>> 
>>> 
>>>> And the lsn for the new intent is equal to or bigger than last_lsn.
>>>> Do you agree with above?
>>>> 
>>>> In above case the iop_recover() is xfs_attri_item_recover(). The later
>>>> creates a xfs_attr_intent, attr1, copying things from the new ATTRI to attr1
>>>> and process attr1.
>>>> Do you agree with above?
>>>> 
>>>> Above xfs_attri_item_recover() runs successfully and the AIL intent iteration ends.
>>> 
>>> We ran xfs_xattri_finish_update(attr, done_item) which returned
>>> -EAGAIN. This means we ran xfs_xattri_finish_update() and modified
>>> -something- before we logged the new intent. That, however, doesn't
>>> dictate what we need to log in the new intent - that is determined
>>> by the overall intent chain and recovery architecture the subsystem
>>> uses. And the attribute modification architecture is extremely
>>> subtle, complex and full of unexpected loops.
>> 
>> I am a bit confused here. 
>> I know that xfs_efi_item_recover() and xfs_extent_free_finish_item() are doing same thing
>> (free the extent specified in the EFI record(s)). So whatever the new would be logged int
>> the new intents, I am thinking the for the same new ATTRI intent, xfs_attri_item_recover()
>> and xfs_attr_finish_item() should do the same thing. If they don’t not,  ATTIR has inconsistent
>> behavior with EFI.
>> 
>>> 
>>>> Now it’s time to process the capture_list with xlog_finish_defer_ops(). The
>>>> capture_list contains the deferred operation which was added at line 639 with type
>>>> XFS_DEFER_OPS_TYPE_ATTR. 
>>>> Do you agree with above?
>>>> 
>>>> In xlog_finish_defer_ops(), a new transaction is created by xfs_trans_alloc().
>>>> the deferred XFS_DEFER_OPS_TYPE_ATTR operation is attached to that new
>>>> transaction by xfs_defer_ops_continue(). Then the new transaction is committed by
>>>> xfs_trans_commit().
>>>> Do you agree with above?
>>> 
>>> So at this point it we are logging the same intent *information*
>>> again. But it's not the same intent - we've retired the original
>>> intent and we have a new intent chain for the operations we are
>>> going to perform. But why are we logging the same information in a
>>> new intent?
>>> 
>>> Well, the attr modification state machine was -carefully- designed
>>> to work this way: if we crash part way through an intent chain, we
>>> always need to restart it in recovery with a "garbage collection"
>>> step that undoes the partially complete changes that had been made
>>> before we crashed. Only once we've done that garbage collection step
>>> can we restart the original operation the intent described.
>>> 
>>> Indeed, the operations we perform recovering an attr intent are
>>> actually different to the operations we perform with those intents
>>> at runtime. Recovery has to perform garbage collection as a first
>>> step, yet runtime does not.
>>> 
>>> IOWs, xfs_attri_item_recover() -> xfs_xattri_finish_update() is
>>> typically performing garbage collection in the first recovery
>>> transaction.  If it does perform a gc operation, then it is
>>> guaranteed to return -EAGAIN so that the actual operation recovery
>>> needs to perform can have new intents logged, be deferred and then
>>> eventually replayed.
>>> 
>>> If we don't need to perform gc, then the operation we perform may
>>> still require multiple transactions to complete and we get -EAGAIN
>>> in that case, too. When that happens, we need to ensure if we crash
>>> before recovery completes that the next mount will perform the
>>> correct GC steps before it replays the intent from scratch. So we
>>> always need to log an intent for the xattr operation that will
>>> trigger the GC step on recovery.
>>> 
>>> Either way, we need to log new intents that are identical to the
>>> original one at each step of the process so that log recovery will
>>> always start the intent recovery process with the correct garbage
>>> collection operation....
>>> 
>>>> During the xfs_trans_commit(), xfs_defer_finish_noroll() is called. and
>>>> xfs_defer_finish_one() is called inside xfs_defer_finish_noroll(). For the
>>>> deferred XFS_DEFER_OPS_TYPE_ATTR operation, xfs_attr_finish_item()
>>>> is called.
>>>> Do you agree with above?
>>>> 
>>>> In xfs_attr_finish_item(), xfs_attr_intent (attr2) is taken out from above new ATTRI
>>>> and is processed.  attr2 and attr1 contain exactly same thing because they are both
>>>> from the new ATTRI.  So the processing on attr2 is pure a duplication of the processing
>>>> on attr1.  So actually the new ATTRI are double-processed.
>>>> Do you agree with above? 
>>> 
>>> No, that's wrong. There is now "attr1 and attr2" structures - they
>>> are one and the same. i.e. the xfs_attr_intent used in
>>> xfs_attr_finish_item() is the same one created in
>>> xfs_attri_item_recover() that we called
>>> xfs_defer_add(&attr->xattri_list) with way back at line 638. 
>>> 
>>> xfs_attr_finish_item(
>>>      struct xfs_trans                *tp,
>>>      struct xfs_log_item             *done,
>>>      struct list_head                *item,
>>>      struct xfs_btree_cur            **state)
>>> {
>>>      struct xfs_attr_intent          *attr;
>>>      struct xfs_attrd_log_item       *done_item = NULL;
>>>      int                             error;
>>> 
>>>>>>>> attr = container_of(item, struct xfs_attr_intent, xattri_list);
>>>      if (done)
>>>              done_item = ATTRD_ITEM(done);
>>> 
>>> xfs_attr_finish_item() extracts the @attr structure from the
>>> listhead that is linking it into the deferred op chain, and we
>>> run xfs_attr_set_iter() again to run the next step(s) in the xattr
>>> modification state machine. If this returns -EAGAIN (and it can)
>>> we go around the "defer(attr->attri_list), create new intent,
>>> commit, ... ->finish_item() -> xfs_attr_finish_item()" loop again.
>>> 
>>> From this, I am guessing that you are confusing the per-defer-cycle
>>> xfs_attri_log_item with the overall xfs_attr_intent context. i.e. We
>>> allocate a new xfs_attri_log_item (and xfs_attrd_log_item) every
>>> time we create new intents in the deferred operation chain. Each
>>> xfs_attri_log_item is created from the state that is carried
>>> across entire the attr recovery chain in the xfs_attr_intent
>>> structure....
>> 
>> OK, ATTRI really looks complex. And it looks like the
>> xfs_attri_item_recover() and xfs_attr_finish_item() can do different things
>> according the current state. Thus running both xfs_attri_item_recover()
>> and xfs_attr_finish_item() has no problem.  I didn’t take a good example :(. 
>> 
>> But, the original 
>> ASSERT(XFS_LSN_CMP(last_lsn, lip->li_lsn) >= 0);
>> in xlog_recover_process_intents() is incorrect, right? For example if we
>> introduced two different new ATTRIs and the 2nd should have li_lsn bigger
>> than last_lsn?
>> 
>> Anyway, let’s go back to the EFI. EFI is different from ATTRI, it has no state machine.
>> xfs_efi_item_recover() and xfs_extent_free_finish_item() are doing the same thing —
>> free the extent specified by the EFI record(s).  For the new EFI, the
>> xfs_efi_item_recover() goes well, but xfs_extent_free_finish_item() fails because it
>> want to free the extent which is already freed by the xfs_efi_item_recover(). 
>> We pass some information (indicating it already done) from xfs_efi_item_recover() to
>> xfs_extent_free_finish_item() somehow so that xfs_extent_free_finish_item() won’t do
>> the 2nd free as a workaround? What’s your idea?
>> 
>> thanks,
>> wengang