Re: [PATCH 1/2] xfs: IO time one extent per EFI

Wengang Wang <wen.gang.wang@xxxxxxxxxx> · Mon, 24 Apr 2023 15:53:08 +0000

> On Apr 21, 2023, at 8:22 PM, Wengang Wang <wen.gang.wang@xxxxxxxxxx> wrote:
> 
> 
> 
>> On Apr 21, 2023, at 11:23 AM, Wengang Wang <wen.gang.wang@xxxxxxxxxx> wrote:
>> 
>> 
>> 
>>> On Apr 21, 2023, at 2:34 AM, Dave Chinner <david@xxxxxxxxxxxxx> wrote:
>>> 
>>> On Fri, Apr 21, 2023 at 12:24:49AM +0000, Wengang Wang wrote:
>>>>> On Apr 20, 2023, at 4:22 PM, Dave Chinner <david@xxxxxxxxxxxxx> wrote:
>>>>> We don't do that anymore for partially processed multi-extent
>>>>> intents anymore. Instead, we use deferred ops to chain updates. i.e.
>>>>> we log a complete intent done items alongside a new intent
>>>>> containing the remaining work to be done in the same transaction.
>>>>> This cancels the original intent and atomically replaces it with a
>>>>> new intent containing the remaining work to be done.
>>>>> 
>>>>> So when I say "update the EFD" I'm using historic terminology for
>>>>> processing and recovering multi-extent intents. In modern terms,
>>>>> what I mean is "update the deferred work intent chain to reflect the
>>>>> work remaining to be done".
>>>> 
>>>> OK. so let’s see the difference between your implementation from mine.
>>>> Say, there are three extents to free.
>>>> 
>>>> My patch would result in:
>>>> 
>>>> EFI 1  with extent1
>>>> free extent1
>>>> EFD 1 with extent1
>>>> transaction roll
>>>> EFI 2 with extent2
>>>> free extent2
>>>> EFD 2 with extent2
>>>> transaction roll
>>>> EFI 3 with extent3
>>>> free extent3
>>>> EFD3 with extent3
>>>> transaction commit
>>> 
>>> No, it wouldn't. This isn't how the deferred work processes work
>>> items on the transaction. A work item with multiple extents on it
>>> would result in this:
>>> 
>>> xfs_defer_finish(tp)  # tp contains three xefi work items 
>>> xfs_defer_finish_noroll
>>>  xfs_defer_create_intents()
>>>    list_for_each_defer_pending
>>>      xfs_defer_create_intent(dfp)
>>> ops->create_intent(tp, &dfp->dfp_work, dfp->dfp_count, sort);
>>>  xfs_extent_free_create_intent()
>>>    <create EFI>
>>>    list_for_each_xefi
>>>      xfs_extent_free_log_item(xefi)
>>> <adds extent to current EFI>
>>> 
>>> xfs_defer_trans_roll()
>>>  <save>
>>>  xfs_trans_roll()
>>>    xfs_trans_dup()
>>>    xfs_trans_commit()
>>>  <restore>
>>> 
>>> At this point we have this committed to the CIL
>>> 
>>> EFI 1 with extent1
>>> EFI 2 with extent2
>>> EFI 3 with extent3
>>> 
>>> And xfs_defer_finish_noroll() continues with
>>> 
>>> <grabs first work item>
>>> xfs_defer_finish_one(dfp)
>>>  ->create_done(EFI 1)
>>>    xfs_extent_free_create_done
>>> <create EFD>
>>>  list_for_each_xefi
>>>    ops->finish_item(tp, dfp->dfp_done, li, &state);
>>>      xfs_extent_free_finish_item()
>>> xfs_trans_free_extent
>>>  __xfs_free_extent
>>>    <adds extent to EFD>
>>> 
>>> And once the processing of the single work item is done we loop
>>> back to the start of the xfs_defer_finish_noroll() loop. We don't
>>> have any new intents, so xfs_defer_create_intents() returns false,
>>> but we completed a dfp work item, so we run:
>>> 
>>> xfs_defer_trans_roll()
>>>  <save>
>>>  xfs_trans_roll()
>>>    xfs_trans_dup()
>>>    xfs_trans_commit()
>>>  <restore>
>>> 
>>> At this point we have this committed to the CIL
>>> 
>>> EFI 1 with extent1
>>> EFI 2 with extent2
>>> EFI 3 with extent3
>>> <AGF, AGFL, free space btree block mods>
>>> EFD 1 with extent1
>>> 
>>> Then we run xfs_defer_finish_one() on EFI 2, commit, then run
>>> xfs_defer_finish_one() on EFI 3. At this point, we have in the log:
>>> 
>>> EFI 1 with extent1
>>> EFI 2 with extent2
>>> EFI 3 with extent3
>>> <AGF, AGFL, free space btree block mods>
>>> EFD 1 with extent1
>>> <AGF, AGFL, free space btree block mods>
>>> EFD 2 with extent2
>>> 
>>> But we have not committed the final extent free or EFD 3 - that's in
>>> the last transaction context we pass back to the _xfs_trans_commit()
>>> context for it to finalise and close off the rolling transaction
>>> chain. At this point, we finally have this in the CIL:
>>> 
>>> EFI 1 with extent1
>>> EFI 2 with extent2
>>> EFI 3 with extent3
>>> <AGF, AGFL, free space btree block mods>
>>> EFD 1 with extent1
>>> <AGF, AGFL, free space btree block mods>
>>> EFD 2 with extent2
>>> <AGF, AGFL, free space btree block mods>
>>> EFD 3 with extent3
>> 
>> 
>> Yes, correct. thanks for so much details! I mis-read xfs_defer_create_intents()
>> thinking it only create intent for the first in tp->t_dfops.
>> 
>>> 
>>>> The EFI/EFD log item pairs should not be written to log as they appear in same checkpoint.
>>> 
>>> I *always* ignore CIL intent whiteouts when thinking about
>>> transaction chains and intents. That is purely a journal efficiency
>>> optimisation, not something that is necessary for correct operation.
>> 
>> OK.
>> 
>>> 
>>>> Your idea yields this:
>>>> 
>>>> EFI 1 with extent1 extent2 extent3
>>>> free extent1
>>>> EFI 2 with extent2 extent3
>>>> EFD 1 with extent1 extent2 extent3
>>>> transaction commit
>>>> create transaction
>>>> free extent2
>>>> EFI 3 with extent3
>>>> EFD 2 with extent extent2 extent3
>>>> transaction commit
>>>> create transaction
>>>> free extent3
>>>> EFD 3 with extent3
>>>> transaction commit.
>>> 
>>> The EFI/EFD contents are correct, but the rest of it is not - I am
>>> not suggesting open coding transaction rolling like that. Everything
>>> I am suggesting works through the same defer ops mechanism as you
>>> are describing. So if we start with the initial journal contents
>>> looks like this:
>>> 
>>> EFI 1 with extent1 extent2 extent3.
>>> 
>>> Then we run xfs_defer_finish_one() on that work, 
>>> 
>>> xfs_defer_finish_one(dfp)
>>>  ->create_done(EFI 1)
>>>    xfs_extent_free_create_done
>>> <create EFD>
>>>  list_for_each_xefi
>>>    ops->finish_item(tp, dfp->dfp_done, li, &state);
>>>      xfs_extent_free_finish_item()
>>> xfs_trans_free_extent
>>>  __xfs_free_extent
>>>    <adds extent to EFD>
>>> 
>>> But now we have 3 xefis on the work to process. So on success of
>>> the first call to xfs_trans_free_extent(), we want it to return
>>> -EAGAIN to trigger the existing relogging code to create the new
>>> EFI. How this works is describe in the section "Requesting a
>>> Fresh Transaction while Finishing Deferred Work" in
>>> fs/xfs/libxfs/xfs_defer.c, no point me duplicating that here.
>>> 
>>> The result is that the deferred work infrastructure does the work
>>> of updaing the done intent and creating the new intents for the work
>>> remaining. Hence after the next transaction roll, we have in the CIL
>>> 
>>> EFI 1 with extent1 extent2 extent3.
>>> <AGF, AGFL, free space btree block mods>
>>> EFD 1 with extent1 extent2 extent3.
>>> EFI 2 with extent2 extent3.
>>> 
>> 
>> Taking transaction rolls into account (also adding up to EFD3), above would be:
>> 
>> EFI 1 with extent1 extent2 extent3.
>> transaction roll
>> <AGF, AGFL, free space btree block mods>  for extent 1
>> EFD 1 with extent1 extent2 extent3.
>> EFI 2 with extent2 extent3.
>> transaction roll
>> free extent 2
>> EFD 2 with extent2 extent3
>> EFI 3 with extent3
>> transaction roll
>> free extent 3
>> EFD 3 with extent3
>> 
>> Because EFD N is always be with EFI N+1 in the SAME transaction, so EFI N+1 doesn’t have to be placed before EFD N.
>> I got it.
>> 
>>> And so the loop goes until there is no more work remaining.
>>> 
>>>> Your implementation also includes three EFI/EFD pairs, that’s the same as mine.
>>>> So actually it’s still one extent per EFI per transaction. Though you are not changing
>>>> XFS_EFI_MAX_FAST_EXTENTS.
>>> 
>>> The difference is that what I'm suggesting is that the extent free
>>> code can decide when it needs a transaction to be rolled. It isn't
>>> forced to always run a single free per transaction, it can decide
>>> that it can free multiple extents per transaction if there is no
>>> risk of deadlocks (e.g. extents are in different AGs).  Forcing
>>> everything to be limited to a transaction per extent free even when
>>> there is no risk of deadlocks freeing multiple extents in a single
>>> transaction is unnecessary.
>>> 
>>> Indeed, if the second extent is in a different AG, we don't risk
>>> busy extents causing us issues, so we could do:
>>> 
>>> EFI 1 with extent1 extent2 extent3.
>>> <AGF 1, AGFL 1, free space btree block mods>
>>> <AGF 2, AGFL 2, free space btree block mods>
>>> EFD 1 with extent1 extent2 extent3.
>>> EFI 2 with extent3.
>>> .....
>>> 
>> 
>> My thumb is up.
> 
> Well, I am wondering if there is ABBA deadlock instead of self deadlock then.
> Say process 1 produced busy extent 1 on AG 0 and now blocking at AGFL
> allocation on AG 1. And at the same time, process 2 produced busy extent 2 on AG 1
> and now blocking at AGFL allocation on AG 0. Anything prevents that from happening?
> 
> Shall we introduce a per FS list, say named “pending_busy_AGs", where the AG number
> are stored there. For those AGs
> they have pending busy extents in in-memory transactions (not committed to CIL).
> We add the AG number to pending_busy_AGs at the start of xfs_trans_free_extent()
> if it’s not there, and continue to free the extent. Otherwise if the AG number is already
> there in pending_busy_AGs, we return -EAGAIN. The AG number is removed when
> the busy extents are committed to the xfs_cil_ctx.
> Well, wondering if the pending busy extents would remain in pending state long before
> committed to CIL and that become a bottle neck for freeing extents.  

Thinking more, we could add per AG counter indicating current pending busy extents on the AG.
For the 2nd+ extent in the EFI, return -EAGAIN on seeing the counter set (>0). As the first extent
is always OK to go, there shouldn’t be a bottle neck.

thanks,
wengang

> 
> thanks,
> wengang
> 
>> 
>>> The difference is that limiting the number of xefi items per
>>> deferred work item means we can only process a single extent per
>>> work item regardless of the current context.
>>> 
>>> Using the existing defered work "on demand transaction rolling"
>>> mechanism I'm suggesting we use doesn't put any artificial
>>> constrains on how we log and process the intents. This allows us to
>>> aggregate multiple extent frees into a single transaction when there
>>> is no risk associated with doing so and we have sufficient
>>> transaction reservations remaining to guarantee we can free the
>>> extent. It's far more efficient, and the infrastructure we have in
>>> place already supports this sort of functionality....
>>> 
>>> When we go back to the original problem of the AGFL allocation
>>> needing to roll the transaction to get busy extents processed, then
>>> we could have it return -EAGAIN all the way back up to the defer-ops
>>> level and we simply then roll the transaction and retry the same
>>> extent free operation again. i.e. where extent freeing needs to
>>> block waiting on stuff like busy extents, we could now have these
>>> commit the current transaction, push the current work item to the
>>> back of the current context's queue and come back to it later.
>>> 
>>> At this point, I think the "single extent per transaction"
>>> constraint that is needed to avoid the busy extent deadlock goes
>>> away, and with it the need for limiting EFI processing to a single
>>> extent goes away too....
>> 
>> I am pretty clear now.
>>> 
>>>> And your implementation may use more log space than mine in case the EFI
>>>> (and EFD pair) can’t be cancelled.  :D
>>> 
>>> True, but it's really not a concern.  Don't conflate "intent
>>> recovery intent amplification can cause log space deadlocks" with
>>> "intent size is a problem". :)
>>> 
>> 
>> Got it.
>>>> The only difference if that you use transaction commit and I am using transaction roll
>>>> which is not safe as you said. 
>>>> 
>>>> Is my understanding correct?
>>> 
>>> I think I understand where we are misunderstanding each other :)
>>> Perhaps it is now obvious to you as well from the analysis above.
>>> If so, ignore the rest of this :)
>>> 
>>> What does "transaction roll" actually mean?
>>> 
>>> XFS has a concept of "rolling transactions". These are made up of a
>>> series of individual transaction contexts that are linked together
>>> by passing a single log reservation ticket between transaction
>>> contexts.
>>> 
>>> xfs_trans_roll() effectively does:
>>> 
>>> ntp = xfs_trans_dup(tp)
>>> ....
>>> xfs_trans_commit(tp)
>>> 
>>> return ntp;
>>> 
>>> i.e. it duplicates the current transaction handle, takes the unused
>>> block reservation from it, grabs the log reservation ticket from it,
>>> moves the defered ops from the old to the new transaction context,
>>> then commits the old transaction context and returns the new one.
>>> 
>>> tl;dr: a rolling transaction is a series of linked independent
>>> transactions that share a common set of block and log reservations.
>>> 
>>> To make a rolling transaction chain an atomic operation on a
>>> specific object (e.g. an inode) that object has to remain locked and
>>> be logged in every transaction in the chain of rolling transactions.
>>> This keeps it moving forward in the log and prevents it from pinning
>>> the tail of the log and causing log space deadlocks.
>>> 
>>> Fundamentally, though, each individual transaction in a rolling
>>> transaction is an independent atomic change set. So when you say
>>> "roll the transaction", what you are actually saying is "commit the
>>> current transaction and start a new transaction using the
>>> reservations the current transaction already holds."
>>> 
>>> When I say "transaction commit" I am only refering to the process
>>> that closes off the current transaction. If this is in the middle of
>>> a rolling transaction, then what I am talking about is
>>> xfs_trans_roll() calling xfs_trans_commit() after it has duplicated
>>> the current transaction context.
>>> 
>>> Transaction contexts running defered operations, intent chains, etc
>>> are *always* rolling transactions, and each time we roll the
>>> transaction we commit it.
>>> 
>>> IOWS, when I say "transaction commit" and you say "transaction roll"
>>> we are talking about exactly the same thing - transaction commit is
>>> the operation that roll performs to finish off the current change
>>> set...
>>> 
>>> Ideally, nobody should be calling xfs_trans_roll() directly anymore.
>>> All rolling transactions should be implemented using deferred ops
>>> nowdays - xfs_trans_roll() is the historic method of running rolling
>>> transactions. e.g. see the recent rework of the attribute code to
>>> use deferred ops rather than explicit calls to xfs_trans_roll() so
>>> we can use intents for running attr operations.
>>> 
>>> These days the transaction model is based around chains of deferred
>>> operations. We attach deferred work to the transaction, and then
>>> when we call xfs_trans_commit() it goes off into the deferred work
>>> infrastructure that creates intents and manages the transaction
>>> context for processing the intents itself.
>>> 
>>> This is the primary reason we are trying to move away from high
>>> level code using transaction rolling - we can't easily change the
>>> way we process operations when the high level code controls
>>> transaction contexts. Using deferred intent chains gives us fine
>>> grained control over transaction context in the deferred ops
>>> processing code - we can roll to a new transaction whenever we need
>>> to....
>>> 
>> 
>> Above is really helpful.
>> I when I mention transaction roll, I meant
>> 1) a new transaction is created, but its self does’t reserve any resource.
>> Instead, it inherits all the unused resources from the old transaction.
>> 2) commit the old transaction.
>> 3) don’t un-reserve unused resources from old transaction, the new transaction
>> will inherits them.
>> 4) use the new transaction for further log items.
>> 
>> As I understand, the key is that the new transaction its self doesn’t reserve new resources.
>> 
>> And when I read your “transaction commit” I understood it as this:
>> 1). commit old transaction
>> 2) un-reserve unused resources from old transaction
>> 3) create new transaction with all resources reserved.
>> 
>> Thus in my understand your “transaction commit” would have no risk of lack of log space to put
>> the extra EFI/EFDs.  But reading above, I’d think you meant “transaction roll” actually.
>> 
>>>> One question is that if only one EFI is safe per transaction, how
>>>> we ensure that there is only one EFI per transaction in case there
>>>> are more than 16 (current XFS_EFI_MAX_FAST_EXTENTS) extents to
>>>> free in current code?
>>> 
>>> That will handled exactly the same way it does with your change - it
>>> will simply split up the work items so there are multiple intents
>>> logged.
>> 
>> I’d like to make it more clear. You were saying that during log recover stage, there may be no enough
>> log space for the extra EFI/EFD (splitted from original multiple extents EFI). But here (non-recovery stage),
>> seems you have no concern logging more EFI/EFDs.  So why they are different?
>> 
>> thanks,
>> wengang