Re: [PATCH] xfs: Don't block in xfs_extent_busy_flush

Wengang Wang <wen.gang.wang@xxxxxxxxxx> · Thu, 25 May 2023 17:13:41 +0000

> On May 24, 2023, at 7:20 PM, Dave Chinner <david@xxxxxxxxxxxxx> wrote:
> 
> On Wed, May 24, 2023 at 04:27:19PM +0000, Wengang Wang wrote:
>>> On May 23, 2023, at 5:27 PM, Darrick J. Wong <djwong@xxxxxxxxxx> wrote:
>>> On Tue, May 23, 2023 at 02:59:40AM +0000, Wengang Wang wrote:
>>>>> On May 22, 2023, at 3:40 PM, Dave Chinner <david@xxxxxxxxxxxxx> wrote:
>>>>> On Mon, May 22, 2023 at 06:20:11PM +0000, Wengang Wang wrote:
>>>>>>> On May 21, 2023, at 6:17 PM, Dave Chinner <david@xxxxxxxxxxxxx> wrote:
>>>>>>> On Fri, May 19, 2023 at 10:18:29AM -0700, Wengang Wang wrote:
>>>>>>>> diff --git a/fs/xfs/xfs_extfree_item.c b/fs/xfs/xfs_extfree_item.c
>>>>>> For existing other cases (if there are) where new intents are added,
>>>>>> they don’t use the capture_list for delayed operations? Do you have example then? 
>>>>>> if so I think we should follow their way instead of adding the defer operations
>>>>>> (but reply on the intents on AIL).
>>>>> 
>>>>> All of the intent recovery stuff uses
>>>>> xfs_defer_ops_capture_and_commit() to commit the intent being
>>>>> replayed and cause all further new intent processing in that chain
>>>>> to be defered until after all the intents recovered from the journal
>>>>> have been iterated. All those new intents end up in the AIL at a LSN
>>>>> index >= last_lsn.
>>>> 
>>>> Yes. So we break the AIL iteration on seeing an intent with lsn equal to
>>>> or bigger than last_lsn and skip the iop_recover() for that item?
>>>> and shall we put this change to another separated patch as it is to fix
>>>> an existing problem (not introduced by my patch)?
>>> 
>>> Intent replay creates non-intent log items (like buffers or inodes) that
>>> are added to the AIL with an LSN higher than last_lsn.  I suppose it
>>> would be possible to break log recovery if an intent's iop_recover
>>> method immediately logged a new intent and returned EAGAIN to roll the
>>> transaction, but none of them do that;
>> 
>> I am not quite sure for above. There are cases that new intents are added
>> in iop_recover(), for example xfs_attri_item_recover():
>> 
>> 632         error = xfs_xattri_finish_update(attr, done_item);
>> 633         if (error == -EAGAIN) {
>> 634                 /*
>> 635                  * There's more work to do, so add the intent item to this
>> 636                  * transaction so that we can continue it later.
>> 637                  */
>> 638                 xfs_defer_add(tp, XFS_DEFER_OPS_TYPE_ATTR, &attr->xattri_list);
>> 639                 error = xfs_defer_ops_capture_and_commit(tp, capture_list);
>> 640                 if (error)
>> 641                         goto out_unlock;
>> 642
>> 643                 xfs_iunlock(ip, XFS_ILOCK_EXCL);
>> 644                 xfs_irele(ip);
>> 645                 return 0;
>> 646         }
>> 
>> I am thinking line 638 and 639 are doing so.
> 
> I don't think so. @attrip is the attribute information recovered
> from the original intent - we allocated @attr in
> xfs_attri_item_recover() to enable the operation indicated by
> @attrip to be executed.  We copy stuff from @attrip (the item we are
> recovering) to the new @attr structure (the work we need to do) and
> run it through the attr modification state machine. If there's more
> work to be done, we then add @attr to the defer list.

Assuming “if there’s more work to be done” is the -EAGAIN case...

> 
> If there is deferred work to be continued, we still need a new
> intent to indicate what attr operation needs to be performed next in
> the chain.  When we commit the transaction (639), it will create a
> new intent log item for the deferred attribute operation in the
> ->create_intent callback before the transaction is committed.

Yes, that’s true. But it has no conflict with what I was saying. The thing I wanted
to say is that:
The new intent log item (introduced by xfs_attri_item_recover()) could appear on
AIL before the AIL iteration ends in xlog_recover_process_intents(). 
Do you agree with above?

In the following I will talk about the new intent log item. 

The iop_recover() is then executed on the new intent during the AIL intents iteration.
I meant this loop:

2548         for (lip = xfs_trans_ail_cursor_first(ailp, &cur, 0);
2549              lip != NULL;
2550              lip = xfs_trans_ail_cursor_next(ailp, &cur)) {
….
2584         }
Do you agree with above?

And the lsn for the new intent is equal to or bigger than last_lsn.
Do you agree with above?

In above case the iop_recover() is xfs_attri_item_recover(). The later
creates a xfs_attr_intent, attr1, copying things from the new ATTRI to attr1
and process attr1.
Do you agree with above?

Above xfs_attri_item_recover() runs successfully and the AIL intent iteration ends.
Now it’s time to process the capture_list with xlog_finish_defer_ops(). The
capture_list contains the deferred operation which was added at line 639 with type
XFS_DEFER_OPS_TYPE_ATTR. 
Do you agree with above?

In xlog_finish_defer_ops(), a new transaction is created by xfs_trans_alloc().
the deferred XFS_DEFER_OPS_TYPE_ATTR operation is attached to that new
transaction by xfs_defer_ops_continue(). Then the new transaction is committed by
xfs_trans_commit().
Do you agree with above?

During the xfs_trans_commit(), xfs_defer_finish_noroll() is called. and
xfs_defer_finish_one() is called inside xfs_defer_finish_noroll(). For the
deferred XFS_DEFER_OPS_TYPE_ATTR operation, xfs_attr_finish_item()
is called.
Do you agree with above?

In xfs_attr_finish_item(), xfs_attr_intent (attr2) is taken out from above new ATTRI
and is processed.  attr2 and attr1 contain exactly same thing because they are both
from the new ATTRI.  So the processing on attr2 is pure a duplication of the processing
on attr1.  So actually the new ATTRI are double-processed.
Do you agree with above? 

> 
> IOWs, we never re-use the incoming intent item that we are
> recovering. The new log item will end up in the AIL at/beyond
> last_lsn when the CIL is committed. It does not get further
> processing work done until all the intents in the log that need
> recovery have had their initial processing performed and the log
> space they consume has been freed up.
> 
>>> and I think the ASSERT you moved would detect such a thing.
>> 
>> ASSERT is nothing in production kernel, so it has less chance to
>> detect things.
> 
> Please understand that we do actually know how the ASSERT
> infrastructure works and we utilise it to our advantage in many
> ways.  We often use asserts to document design/code constraints and
> use debug kernels to perform runtime design rule violation
> detection.
> 
> Indeed, we really don't want design/code constraints being checked on
> production kernels, largely because we know they are never going to
> be tripped in normal production system operation. IOWs, these checks
> are unnecessary in production systems because we've already done all
> the constraint/correctness checking of the code on debug kernels
> before we release the software to production.
> 
> If a particular situation is a production concern, we code it as an
> error check, not an ASSERT. If it's a design or implementation
> constraint check, it's an ASSERT. The last_lsn ASSERT is checking
> that the code is behaving according to design constraints (i.e.
> CIL/AIL ordering has not been screwed up, intent recovery has not
> changed behaviour, and new objects always appear at >= last_lsn).
> None of these things should ever occur in a production system - if
> any of them occur do then we'll have much, much worse problems to
> address than log recovery maybe running an intent too early.

Thanks for the explain on the ASSERT, I have no double on it. I was trying to to say
why it didn’t capture the problem in xlog_recover_process_intents().

thanks,
wengang