Re: [PATCH] xfs: Don't block in xfs_extent_busy_flush

Wengang Wang <wen.gang.wang@xxxxxxxxxx> · Fri, 26 May 2023 18:59:41 +0000

> On May 25, 2023, at 5:26 PM, Dave Chinner <david@xxxxxxxxxxxxx> wrote:
> 
> On Thu, May 25, 2023 at 05:13:41PM +0000, Wengang Wang wrote:
>> 
>> 
>>> On May 24, 2023, at 7:20 PM, Dave Chinner <david@xxxxxxxxxxxxx> wrote:
>>> 
>>> On Wed, May 24, 2023 at 04:27:19PM +0000, Wengang Wang wrote:
>>>>> On May 23, 2023, at 5:27 PM, Darrick J. Wong <djwong@xxxxxxxxxx> wrote:
>>>>> On Tue, May 23, 2023 at 02:59:40AM +0000, Wengang Wang wrote:
>>>>>>> On May 22, 2023, at 3:40 PM, Dave Chinner <david@xxxxxxxxxxxxx> wrote:
>>>>>>> On Mon, May 22, 2023 at 06:20:11PM +0000, Wengang Wang wrote:
>>>>>>>>> On May 21, 2023, at 6:17 PM, Dave Chinner <david@xxxxxxxxxxxxx> wrote:
>>>>>>>>> On Fri, May 19, 2023 at 10:18:29AM -0700, Wengang Wang wrote:
>>>>>>>>>> diff --git a/fs/xfs/xfs_extfree_item.c b/fs/xfs/xfs_extfree_item.c
>>>>>>>> For existing other cases (if there are) where new intents are added,
>>>>>>>> they don’t use the capture_list for delayed operations? Do you have example then? 
>>>>>>>> if so I think we should follow their way instead of adding the defer operations
>>>>>>>> (but reply on the intents on AIL).
>>>>>>> 
>>>>>>> All of the intent recovery stuff uses
>>>>>>> xfs_defer_ops_capture_and_commit() to commit the intent being
>>>>>>> replayed and cause all further new intent processing in that chain
>>>>>>> to be defered until after all the intents recovered from the journal
>>>>>>> have been iterated. All those new intents end up in the AIL at a LSN
>>>>>>> index >= last_lsn.
>>>>>> 
>>>>>> Yes. So we break the AIL iteration on seeing an intent with lsn equal to
>>>>>> or bigger than last_lsn and skip the iop_recover() for that item?
>>>>>> and shall we put this change to another separated patch as it is to fix
>>>>>> an existing problem (not introduced by my patch)?
>>>>> 
>>>>> Intent replay creates non-intent log items (like buffers or inodes) that
>>>>> are added to the AIL with an LSN higher than last_lsn.  I suppose it
>>>>> would be possible to break log recovery if an intent's iop_recover
>>>>> method immediately logged a new intent and returned EAGAIN to roll the
>>>>> transaction, but none of them do that;
>>>> 
>>>> I am not quite sure for above. There are cases that new intents are added
>>>> in iop_recover(), for example xfs_attri_item_recover():
>>>> 
>>>> 632         error = xfs_xattri_finish_update(attr, done_item);
>>>> 633         if (error == -EAGAIN) {
>>>> 634                 /*
>>>> 635                  * There's more work to do, so add the intent item to this
>>>> 636                  * transaction so that we can continue it later.
>>>> 637                  */
>>>> 638                 xfs_defer_add(tp, XFS_DEFER_OPS_TYPE_ATTR, &attr->xattri_list);
>>>> 639                 error = xfs_defer_ops_capture_and_commit(tp, capture_list);
>>>> 640                 if (error)
>>>> 641                         goto out_unlock;
>>>> 642
>>>> 643                 xfs_iunlock(ip, XFS_ILOCK_EXCL);
>>>> 644                 xfs_irele(ip);
>>>> 645                 return 0;
>>>> 646         }
>>>> 
>>>> I am thinking line 638 and 639 are doing so.
>>> 
>>> I don't think so. @attrip is the attribute information recovered
>>> from the original intent - we allocated @attr in
>>> xfs_attri_item_recover() to enable the operation indicated by
>>> @attrip to be executed.  We copy stuff from @attrip (the item we are
>>> recovering) to the new @attr structure (the work we need to do) and
>>> run it through the attr modification state machine. If there's more
>>> work to be done, we then add @attr to the defer list.
>> 
>> Assuming “if there’s more work to be done” is the -EAGAIN case...
>> 
>>> 
>>> If there is deferred work to be continued, we still need a new
>>> intent to indicate what attr operation needs to be performed next in
>>> the chain.  When we commit the transaction (639), it will create a
>>> new intent log item for the deferred attribute operation in the
>>> ->create_intent callback before the transaction is committed.
>> 
>> Yes, that’s true. But it has no conflict with what I was saying. The thing I wanted
>> to say is that:
>> The new intent log item (introduced by xfs_attri_item_recover()) could appear on
>> AIL before the AIL iteration ends in xlog_recover_process_intents(). 
>> Do you agree with above?
> 
> Recovery of intents has *always* been able to do this. What I don't
> understand is why you think this is a problem.

Adding new intent is not a problem. 
The problem I am thinking is that for the NEW intents,
we should’d run both:
1. iop_recover() during the AIL intents iteration and
2. the ops->finish_item() in xlog_finish_defer_ops()’s sub calls()

I am thinking the iop_recover() and the finish_item() are doing same thing.
I am not quite sure about ATTRI but  I am sure about EFI. For EFI,
both xfs_efi_item_recover() (as iop_recover()) and xfs_extent_free_finish_item()
(as finish_item()) calls xfs_trans_free_extent() to free the same extent(s).

I have a metadump with which I can reproduce the original AGFL deadlock problem
during log recovery.  When I tested my patch (without the changes in
xlog_recover_process_intents()), I found the records in the NEW EFI are freed twice,
one is during the AIL intents iteration with iop_recover(), the other is inside
xlog_finish_defer_ops() with finish_item().

That needs to be fixed. The best way to fix that I thought is to fix xlog_recover_process_intents().
As log recovery, I am thinking the iop_cover() should only applied the original on disk redo intents
only. Those original redo intents has no deferred option attached.  And for the NEW intents
introduced during the processing of the original intents, I don’t think iop_recover() should
run for them becuase
1. they are NEW ones, not the original on disk redo intents and 
2. they have deferred operation attached which process the new intents later.

And I am also thinking we also depend  xlog_recover_process_intents() to run iop_recover()
on the new intents. That’s because the new intents could be already inserted to AIL
before the AIL intents iteration ends, in that case have the change to run iop_recover() on the
new ones.  but the new intents also could be inserted to AIL after the AIL intents iteration ends,
in that case we have no change to run iop_recover() on the new intents. — that’s an unstable thing.

> 
>> In the following I will talk about the new intent log item. 
>> 
>> The iop_recover() is then executed on the new intent during the AIL intents iteration.
>> I meant this loop:
>> 
>> 2548         for (lip = xfs_trans_ail_cursor_first(ailp, &cur, 0);
>> 2549              lip != NULL;
>> 2550              lip = xfs_trans_ail_cursor_next(ailp, &cur)) {
>> ….
>> 2584         }
>> Do you agree with above?
> 
> Repeatedly asserting that I must agree that your logic is correct
> comes across as unecessarily aggressive and combative and is not
> acceptible behaviour.
> 
> Aggressively asserting that you are right doesn't make your logic
> any more compelling. It just makes you look really bad when,
> inevitably, you get politely told that you've made yet another
> mistake.

Nope, I didn’t want to be aggressive. Since this is already discussed a lot
but we still don’t have an agreement. So I wanted to know where is the disagreement,
little by little. I thought it would be a good way to make things clear,
if it made you unhappy, sorry, and I didn’t mean that.

> 
>> And the lsn for the new intent is equal to or bigger than last_lsn.
>> Do you agree with above?
>> 
>> In above case the iop_recover() is xfs_attri_item_recover(). The later
>> creates a xfs_attr_intent, attr1, copying things from the new ATTRI to attr1
>> and process attr1.
>> Do you agree with above?
>> 
>> Above xfs_attri_item_recover() runs successfully and the AIL intent iteration ends.
> 
> We ran xfs_xattri_finish_update(attr, done_item) which returned
> -EAGAIN. This means we ran xfs_xattri_finish_update() and modified
> -something- before we logged the new intent. That, however, doesn't
> dictate what we need to log in the new intent - that is determined
> by the overall intent chain and recovery architecture the subsystem
> uses. And the attribute modification architecture is extremely
> subtle, complex and full of unexpected loops.

I am a bit confused here. 
I know that xfs_efi_item_recover() and xfs_extent_free_finish_item() are doing same thing
(free the extent specified in the EFI record(s)). So whatever the new would be logged int
the new intents, I am thinking the for the same new ATTRI intent, xfs_attri_item_recover()
and xfs_attr_finish_item() should do the same thing. If they don’t not,  ATTIR has inconsistent
behavior with EFI.

> 
>> Now it’s time to process the capture_list with xlog_finish_defer_ops(). The
>> capture_list contains the deferred operation which was added at line 639 with type
>> XFS_DEFER_OPS_TYPE_ATTR. 
>> Do you agree with above?
>> 
>> In xlog_finish_defer_ops(), a new transaction is created by xfs_trans_alloc().
>> the deferred XFS_DEFER_OPS_TYPE_ATTR operation is attached to that new
>> transaction by xfs_defer_ops_continue(). Then the new transaction is committed by
>> xfs_trans_commit().
>> Do you agree with above?
> 
> So at this point it we are logging the same intent *information*
> again. But it's not the same intent - we've retired the original
> intent and we have a new intent chain for the operations we are
> going to perform. But why are we logging the same information in a
> new intent?
> 
> Well, the attr modification state machine was -carefully- designed
> to work this way: if we crash part way through an intent chain, we
> always need to restart it in recovery with a "garbage collection"
> step that undoes the partially complete changes that had been made
> before we crashed. Only once we've done that garbage collection step
> can we restart the original operation the intent described.
> 
> Indeed, the operations we perform recovering an attr intent are
> actually different to the operations we perform with those intents
> at runtime. Recovery has to perform garbage collection as a first
> step, yet runtime does not.
> 
> IOWs, xfs_attri_item_recover() -> xfs_xattri_finish_update() is
> typically performing garbage collection in the first recovery
> transaction.  If it does perform a gc operation, then it is
> guaranteed to return -EAGAIN so that the actual operation recovery
> needs to perform can have new intents logged, be deferred and then
> eventually replayed.
> 
> If we don't need to perform gc, then the operation we perform may
> still require multiple transactions to complete and we get -EAGAIN
> in that case, too. When that happens, we need to ensure if we crash
> before recovery completes that the next mount will perform the
> correct GC steps before it replays the intent from scratch. So we
> always need to log an intent for the xattr operation that will
> trigger the GC step on recovery.
> 
> Either way, we need to log new intents that are identical to the
> original one at each step of the process so that log recovery will
> always start the intent recovery process with the correct garbage
> collection operation....
> 
>> During the xfs_trans_commit(), xfs_defer_finish_noroll() is called. and
>> xfs_defer_finish_one() is called inside xfs_defer_finish_noroll(). For the
>> deferred XFS_DEFER_OPS_TYPE_ATTR operation, xfs_attr_finish_item()
>> is called.
>> Do you agree with above?
>> 
>> In xfs_attr_finish_item(), xfs_attr_intent (attr2) is taken out from above new ATTRI
>> and is processed.  attr2 and attr1 contain exactly same thing because they are both
>> from the new ATTRI.  So the processing on attr2 is pure a duplication of the processing
>> on attr1.  So actually the new ATTRI are double-processed.
>> Do you agree with above? 
> 
> No, that's wrong. There is now "attr1 and attr2" structures - they
> are one and the same. i.e. the xfs_attr_intent used in
> xfs_attr_finish_item() is the same one created in
> xfs_attri_item_recover() that we called
> xfs_defer_add(&attr->xattri_list) with way back at line 638. 
> 
> xfs_attr_finish_item(
>        struct xfs_trans                *tp,
>        struct xfs_log_item             *done,
>        struct list_head                *item,
>        struct xfs_btree_cur            **state)
> {
>        struct xfs_attr_intent          *attr;
>        struct xfs_attrd_log_item       *done_item = NULL;
>        int                             error;
> 
>>>>>>  attr = container_of(item, struct xfs_attr_intent, xattri_list);
>        if (done)
>                done_item = ATTRD_ITEM(done);
> 
> xfs_attr_finish_item() extracts the @attr structure from the
> listhead that is linking it into the deferred op chain, and we
> run xfs_attr_set_iter() again to run the next step(s) in the xattr
> modification state machine. If this returns -EAGAIN (and it can)
> we go around the "defer(attr->attri_list), create new intent,
> commit, ... ->finish_item() -> xfs_attr_finish_item()" loop again.
> 
> From this, I am guessing that you are confusing the per-defer-cycle
> xfs_attri_log_item with the overall xfs_attr_intent context. i.e. We
> allocate a new xfs_attri_log_item (and xfs_attrd_log_item) every
> time we create new intents in the deferred operation chain. Each
> xfs_attri_log_item is created from the state that is carried
> across entire the attr recovery chain in the xfs_attr_intent
> structure....

OK, ATTRI really looks complex. And it looks like the
xfs_attri_item_recover() and xfs_attr_finish_item() can do different things
according the current state. Thus running both xfs_attri_item_recover()
and xfs_attr_finish_item() has no problem.  I didn’t take a good example :(. 

But, the original 
ASSERT(XFS_LSN_CMP(last_lsn, lip->li_lsn) >= 0);
in xlog_recover_process_intents() is incorrect, right? For example if we
introduced two different new ATTRIs and the 2nd should have li_lsn bigger
than last_lsn?

Anyway, let’s go back to the EFI. EFI is different from ATTRI, it has no state machine.
xfs_efi_item_recover() and xfs_extent_free_finish_item() are doing the same thing —
free the extent specified by the EFI record(s).  For the new EFI, the
xfs_efi_item_recover() goes well, but xfs_extent_free_finish_item() fails because it
want to free the extent which is already freed by the xfs_efi_item_recover(). 
We pass some information (indicating it already done) from xfs_efi_item_recover() to
xfs_extent_free_finish_item() somehow so that xfs_extent_free_finish_item() won’t do
the 2nd free as a workaround? What’s your idea?

thanks,
wengang