Re: [PATCH 1/2] xfs: IO time one extent per EFI

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 




> On Apr 21, 2023, at 11:23 AM, Wengang Wang <wen.gang.wang@xxxxxxxxxx> wrote:
> 
> 
> 
>> On Apr 21, 2023, at 2:34 AM, Dave Chinner <david@xxxxxxxxxxxxx> wrote:
>> 
>> On Fri, Apr 21, 2023 at 12:24:49AM +0000, Wengang Wang wrote:
>>>> On Apr 20, 2023, at 4:22 PM, Dave Chinner <david@xxxxxxxxxxxxx> wrote:
>>>> We don't do that anymore for partially processed multi-extent
>>>> intents anymore. Instead, we use deferred ops to chain updates. i.e.
>>>> we log a complete intent done items alongside a new intent
>>>> containing the remaining work to be done in the same transaction.
>>>> This cancels the original intent and atomically replaces it with a
>>>> new intent containing the remaining work to be done.
>>>> 
>>>> So when I say "update the EFD" I'm using historic terminology for
>>>> processing and recovering multi-extent intents. In modern terms,
>>>> what I mean is "update the deferred work intent chain to reflect the
>>>> work remaining to be done".
>>> 
>>> OK. so let’s see the difference between your implementation from mine.
>>> Say, there are three extents to free.
>>> 
>>> My patch would result in:
>>> 
>>> EFI 1  with extent1
>>> free extent1
>>> EFD 1 with extent1
>>> transaction roll
>>> EFI 2 with extent2
>>> free extent2
>>> EFD 2 with extent2
>>> transaction roll
>>> EFI 3 with extent3
>>> free extent3
>>> EFD3 with extent3
>>> transaction commit
>> 
>> No, it wouldn't. This isn't how the deferred work processes work
>> items on the transaction. A work item with multiple extents on it
>> would result in this:
>> 
>> xfs_defer_finish(tp)  # tp contains three xefi work items 
>> xfs_defer_finish_noroll
>>   xfs_defer_create_intents()
>>     list_for_each_defer_pending
>>       xfs_defer_create_intent(dfp)
>> ops->create_intent(tp, &dfp->dfp_work, dfp->dfp_count, sort);
>>   xfs_extent_free_create_intent()
>>     <create EFI>
>>     list_for_each_xefi
>>       xfs_extent_free_log_item(xefi)
>> <adds extent to current EFI>
>> 
>> xfs_defer_trans_roll()
>>   <save>
>>   xfs_trans_roll()
>>     xfs_trans_dup()
>>     xfs_trans_commit()
>>   <restore>
>> 
>> At this point we have this committed to the CIL
>> 
>> EFI 1 with extent1
>> EFI 2 with extent2
>> EFI 3 with extent3
>> 
>> And xfs_defer_finish_noroll() continues with
>> 
>> <grabs first work item>
>> xfs_defer_finish_one(dfp)
>>   ->create_done(EFI 1)
>>     xfs_extent_free_create_done
>> <create EFD>
>>   list_for_each_xefi
>>     ops->finish_item(tp, dfp->dfp_done, li, &state);
>>       xfs_extent_free_finish_item()
>> xfs_trans_free_extent
>>   __xfs_free_extent
>>     <adds extent to EFD>
>> 
>> And once the processing of the single work item is done we loop
>> back to the start of the xfs_defer_finish_noroll() loop. We don't
>> have any new intents, so xfs_defer_create_intents() returns false,
>> but we completed a dfp work item, so we run:
>> 
>> xfs_defer_trans_roll()
>>   <save>
>>   xfs_trans_roll()
>>     xfs_trans_dup()
>>     xfs_trans_commit()
>>   <restore>
>> 
>> At this point we have this committed to the CIL
>> 
>> EFI 1 with extent1
>> EFI 2 with extent2
>> EFI 3 with extent3
>> <AGF, AGFL, free space btree block mods>
>> EFD 1 with extent1
>> 
>> Then we run xfs_defer_finish_one() on EFI 2, commit, then run
>> xfs_defer_finish_one() on EFI 3. At this point, we have in the log:
>> 
>> EFI 1 with extent1
>> EFI 2 with extent2
>> EFI 3 with extent3
>> <AGF, AGFL, free space btree block mods>
>> EFD 1 with extent1
>> <AGF, AGFL, free space btree block mods>
>> EFD 2 with extent2
>> 
>> But we have not committed the final extent free or EFD 3 - that's in
>> the last transaction context we pass back to the _xfs_trans_commit()
>> context for it to finalise and close off the rolling transaction
>> chain. At this point, we finally have this in the CIL:
>> 
>> EFI 1 with extent1
>> EFI 2 with extent2
>> EFI 3 with extent3
>> <AGF, AGFL, free space btree block mods>
>> EFD 1 with extent1
>> <AGF, AGFL, free space btree block mods>
>> EFD 2 with extent2
>> <AGF, AGFL, free space btree block mods>
>> EFD 3 with extent3
> 
> 
> Yes, correct. thanks for so much details! I mis-read xfs_defer_create_intents()
> thinking it only create intent for the first in tp->t_dfops.
> 
>> 
>>> The EFI/EFD log item pairs should not be written to log as they appear in same checkpoint.
>> 
>> I *always* ignore CIL intent whiteouts when thinking about
>> transaction chains and intents. That is purely a journal efficiency
>> optimisation, not something that is necessary for correct operation.
> 
> OK.
> 
>> 
>>> Your idea yields this:
>>> 
>>> EFI 1 with extent1 extent2 extent3
>>> free extent1
>>> EFI 2 with extent2 extent3
>>> EFD 1 with extent1 extent2 extent3
>>> transaction commit
>>> create transaction
>>> free extent2
>>> EFI 3 with extent3
>>> EFD 2 with extent extent2 extent3
>>> transaction commit
>>> create transaction
>>> free extent3
>>> EFD 3 with extent3
>>> transaction commit.
>> 
>> The EFI/EFD contents are correct, but the rest of it is not - I am
>> not suggesting open coding transaction rolling like that. Everything
>> I am suggesting works through the same defer ops mechanism as you
>> are describing. So if we start with the initial journal contents
>> looks like this:
>> 
>> EFI 1 with extent1 extent2 extent3.
>> 
>> Then we run xfs_defer_finish_one() on that work, 
>> 
>> xfs_defer_finish_one(dfp)
>>   ->create_done(EFI 1)
>>     xfs_extent_free_create_done
>> <create EFD>
>>   list_for_each_xefi
>>     ops->finish_item(tp, dfp->dfp_done, li, &state);
>>       xfs_extent_free_finish_item()
>> xfs_trans_free_extent
>>   __xfs_free_extent
>>     <adds extent to EFD>
>> 
>> But now we have 3 xefis on the work to process. So on success of
>> the first call to xfs_trans_free_extent(), we want it to return
>> -EAGAIN to trigger the existing relogging code to create the new
>> EFI. How this works is describe in the section "Requesting a
>> Fresh Transaction while Finishing Deferred Work" in
>> fs/xfs/libxfs/xfs_defer.c, no point me duplicating that here.
>> 
>> The result is that the deferred work infrastructure does the work
>> of updaing the done intent and creating the new intents for the work
>> remaining. Hence after the next transaction roll, we have in the CIL
>> 
>> EFI 1 with extent1 extent2 extent3.
>> <AGF, AGFL, free space btree block mods>
>> EFD 1 with extent1 extent2 extent3.
>> EFI 2 with extent2 extent3.
>> 
> 
> Taking transaction rolls into account (also adding up to EFD3), above would be:
> 
> EFI 1 with extent1 extent2 extent3.
> transaction roll
> <AGF, AGFL, free space btree block mods>  for extent 1
> EFD 1 with extent1 extent2 extent3.
> EFI 2 with extent2 extent3.
> transaction roll
> free extent 2
> EFD 2 with extent2 extent3
> EFI 3 with extent3
> transaction roll
> free extent 3
> EFD 3 with extent3
> 
> Because EFD N is always be with EFI N+1 in the SAME transaction, so EFI N+1 doesn’t have to be placed before EFD N.
> I got it.
> 
>> And so the loop goes until there is no more work remaining.
>> 
>>> Your implementation also includes three EFI/EFD pairs, that’s the same as mine.
>>> So actually it’s still one extent per EFI per transaction. Though you are not changing
>>> XFS_EFI_MAX_FAST_EXTENTS.
>> 
>> The difference is that what I'm suggesting is that the extent free
>> code can decide when it needs a transaction to be rolled. It isn't
>> forced to always run a single free per transaction, it can decide
>> that it can free multiple extents per transaction if there is no
>> risk of deadlocks (e.g. extents are in different AGs).  Forcing
>> everything to be limited to a transaction per extent free even when
>> there is no risk of deadlocks freeing multiple extents in a single
>> transaction is unnecessary.
>> 
>> Indeed, if the second extent is in a different AG, we don't risk
>> busy extents causing us issues, so we could do:
>> 
>> EFI 1 with extent1 extent2 extent3.
>> <AGF 1, AGFL 1, free space btree block mods>
>> <AGF 2, AGFL 2, free space btree block mods>
>> EFD 1 with extent1 extent2 extent3.
>> EFI 2 with extent3.
>> .....
>> 
> 
> My thumb is up.

Well, I am wondering if there is ABBA deadlock instead of self deadlock then.
Say process 1 produced busy extent 1 on AG 0 and now blocking at AGFL
allocation on AG 1. And at the same time, process 2 produced busy extent 2 on AG 1
and now blocking at AGFL allocation on AG 0. Anything prevents that from happening?

Shall we introduce a per FS list, say named “pending_busy_AGs", where the AG number
are stored there. For those AGs
they have pending busy extents in in-memory transactions (not committed to CIL).
We add the AG number to pending_busy_AGs at the start of xfs_trans_free_extent()
if it’s not there, and continue to free the extent. Otherwise if the AG number is already
there in pending_busy_AGs, we return -EAGAIN. The AG number is removed when
the busy extents are committed to the xfs_cil_ctx.
Well, wondering if the pending busy extents would remain in pending state long before
committed to CIL and that become a bottle neck for freeing extents.  

thanks,
wengang

> 
>> The difference is that limiting the number of xefi items per
>> deferred work item means we can only process a single extent per
>> work item regardless of the current context.
>> 
>> Using the existing defered work "on demand transaction rolling"
>> mechanism I'm suggesting we use doesn't put any artificial
>> constrains on how we log and process the intents. This allows us to
>> aggregate multiple extent frees into a single transaction when there
>> is no risk associated with doing so and we have sufficient
>> transaction reservations remaining to guarantee we can free the
>> extent. It's far more efficient, and the infrastructure we have in
>> place already supports this sort of functionality....
>> 
>> When we go back to the original problem of the AGFL allocation
>> needing to roll the transaction to get busy extents processed, then
>> we could have it return -EAGAIN all the way back up to the defer-ops
>> level and we simply then roll the transaction and retry the same
>> extent free operation again. i.e. where extent freeing needs to
>> block waiting on stuff like busy extents, we could now have these
>> commit the current transaction, push the current work item to the
>> back of the current context's queue and come back to it later.
>> 
>> At this point, I think the "single extent per transaction"
>> constraint that is needed to avoid the busy extent deadlock goes
>> away, and with it the need for limiting EFI processing to a single
>> extent goes away too....
> 
> I am pretty clear now.
>> 
>>> And your implementation may use more log space than mine in case the EFI
>>> (and EFD pair) can’t be cancelled.  :D
>> 
>> True, but it's really not a concern.  Don't conflate "intent
>> recovery intent amplification can cause log space deadlocks" with
>> "intent size is a problem". :)
>> 
> 
> Got it.
>>> The only difference if that you use transaction commit and I am using transaction roll
>>> which is not safe as you said. 
>>> 
>>> Is my understanding correct?
>> 
>> I think I understand where we are misunderstanding each other :)
>> Perhaps it is now obvious to you as well from the analysis above.
>> If so, ignore the rest of this :)
>> 
>> What does "transaction roll" actually mean?
>> 
>> XFS has a concept of "rolling transactions". These are made up of a
>> series of individual transaction contexts that are linked together
>> by passing a single log reservation ticket between transaction
>> contexts.
>> 
>> xfs_trans_roll() effectively does:
>> 
>> ntp = xfs_trans_dup(tp)
>> ....
>> xfs_trans_commit(tp)
>> 
>> return ntp;
>> 
>> i.e. it duplicates the current transaction handle, takes the unused
>> block reservation from it, grabs the log reservation ticket from it,
>> moves the defered ops from the old to the new transaction context,
>> then commits the old transaction context and returns the new one.
>> 
>> tl;dr: a rolling transaction is a series of linked independent
>> transactions that share a common set of block and log reservations.
>> 
>> To make a rolling transaction chain an atomic operation on a
>> specific object (e.g. an inode) that object has to remain locked and
>> be logged in every transaction in the chain of rolling transactions.
>> This keeps it moving forward in the log and prevents it from pinning
>> the tail of the log and causing log space deadlocks.
>> 
>> Fundamentally, though, each individual transaction in a rolling
>> transaction is an independent atomic change set. So when you say
>> "roll the transaction", what you are actually saying is "commit the
>> current transaction and start a new transaction using the
>> reservations the current transaction already holds."
>> 
>> When I say "transaction commit" I am only refering to the process
>> that closes off the current transaction. If this is in the middle of
>> a rolling transaction, then what I am talking about is
>> xfs_trans_roll() calling xfs_trans_commit() after it has duplicated
>> the current transaction context.
>> 
>> Transaction contexts running defered operations, intent chains, etc
>> are *always* rolling transactions, and each time we roll the
>> transaction we commit it.
>> 
>> IOWS, when I say "transaction commit" and you say "transaction roll"
>> we are talking about exactly the same thing - transaction commit is
>> the operation that roll performs to finish off the current change
>> set...
>> 
>> Ideally, nobody should be calling xfs_trans_roll() directly anymore.
>> All rolling transactions should be implemented using deferred ops
>> nowdays - xfs_trans_roll() is the historic method of running rolling
>> transactions. e.g. see the recent rework of the attribute code to
>> use deferred ops rather than explicit calls to xfs_trans_roll() so
>> we can use intents for running attr operations.
>> 
>> These days the transaction model is based around chains of deferred
>> operations. We attach deferred work to the transaction, and then
>> when we call xfs_trans_commit() it goes off into the deferred work
>> infrastructure that creates intents and manages the transaction
>> context for processing the intents itself.
>> 
>> This is the primary reason we are trying to move away from high
>> level code using transaction rolling - we can't easily change the
>> way we process operations when the high level code controls
>> transaction contexts. Using deferred intent chains gives us fine
>> grained control over transaction context in the deferred ops
>> processing code - we can roll to a new transaction whenever we need
>> to....
>> 
> 
> Above is really helpful.
> I when I mention transaction roll, I meant
> 1) a new transaction is created, but its self does’t reserve any resource.
> Instead, it inherits all the unused resources from the old transaction.
> 2) commit the old transaction.
> 3) don’t un-reserve unused resources from old transaction, the new transaction
> will inherits them.
> 4) use the new transaction for further log items.
> 
> As I understand, the key is that the new transaction its self doesn’t reserve new resources.
> 
> And when I read your “transaction commit” I understood it as this:
> 1). commit old transaction
> 2) un-reserve unused resources from old transaction
> 3) create new transaction with all resources reserved.
> 
> Thus in my understand your “transaction commit” would have no risk of lack of log space to put
> the extra EFI/EFDs.  But reading above, I’d think you meant “transaction roll” actually.
> 
>>> One question is that if only one EFI is safe per transaction, how
>>> we ensure that there is only one EFI per transaction in case there
>>> are more than 16 (current XFS_EFI_MAX_FAST_EXTENTS) extents to
>>> free in current code?
>> 
>> That will handled exactly the same way it does with your change - it
>> will simply split up the work items so there are multiple intents
>> logged.
> 
> I’d like to make it more clear. You were saying that during log recover stage, there may be no enough
> log space for the extra EFI/EFD (splitted from original multiple extents EFI). But here (non-recovery stage),
> seems you have no concern logging more EFI/EFDs.  So why they are different?
> 
> thanks,
> wengang






[Index of Archives]     [XFS Filesystem Development (older mail)]     [Linux Filesystem Development]     [Linux Audio Users]     [Yosemite Trails]     [Linux Kernel]     [Linux RAID]     [Linux SCSI]


  Powered by Linux