> On Apr 21, 2023, at 8:22 PM, Wengang Wang <wen.gang.wang@xxxxxxxxxx> wrote: > > > >> On Apr 21, 2023, at 11:23 AM, Wengang Wang <wen.gang.wang@xxxxxxxxxx> wrote: >> >> >> >>> On Apr 21, 2023, at 2:34 AM, Dave Chinner <david@xxxxxxxxxxxxx> wrote: >>> >>> On Fri, Apr 21, 2023 at 12:24:49AM +0000, Wengang Wang wrote: >>>>> On Apr 20, 2023, at 4:22 PM, Dave Chinner <david@xxxxxxxxxxxxx> wrote: >>>>> We don't do that anymore for partially processed multi-extent >>>>> intents anymore. Instead, we use deferred ops to chain updates. i.e. >>>>> we log a complete intent done items alongside a new intent >>>>> containing the remaining work to be done in the same transaction. >>>>> This cancels the original intent and atomically replaces it with a >>>>> new intent containing the remaining work to be done. >>>>> >>>>> So when I say "update the EFD" I'm using historic terminology for >>>>> processing and recovering multi-extent intents. In modern terms, >>>>> what I mean is "update the deferred work intent chain to reflect the >>>>> work remaining to be done". >>>> >>>> OK. so let’s see the difference between your implementation from mine. >>>> Say, there are three extents to free. >>>> >>>> My patch would result in: >>>> >>>> EFI 1 with extent1 >>>> free extent1 >>>> EFD 1 with extent1 >>>> transaction roll >>>> EFI 2 with extent2 >>>> free extent2 >>>> EFD 2 with extent2 >>>> transaction roll >>>> EFI 3 with extent3 >>>> free extent3 >>>> EFD3 with extent3 >>>> transaction commit >>> >>> No, it wouldn't. This isn't how the deferred work processes work >>> items on the transaction. A work item with multiple extents on it >>> would result in this: >>> >>> xfs_defer_finish(tp) # tp contains three xefi work items >>> xfs_defer_finish_noroll >>> xfs_defer_create_intents() >>> list_for_each_defer_pending >>> xfs_defer_create_intent(dfp) >>> ops->create_intent(tp, &dfp->dfp_work, dfp->dfp_count, sort); >>> xfs_extent_free_create_intent() >>> <create EFI> >>> list_for_each_xefi >>> xfs_extent_free_log_item(xefi) >>> <adds extent to current EFI> >>> >>> xfs_defer_trans_roll() >>> <save> >>> xfs_trans_roll() >>> xfs_trans_dup() >>> xfs_trans_commit() >>> <restore> >>> >>> At this point we have this committed to the CIL >>> >>> EFI 1 with extent1 >>> EFI 2 with extent2 >>> EFI 3 with extent3 >>> >>> And xfs_defer_finish_noroll() continues with >>> >>> <grabs first work item> >>> xfs_defer_finish_one(dfp) >>> ->create_done(EFI 1) >>> xfs_extent_free_create_done >>> <create EFD> >>> list_for_each_xefi >>> ops->finish_item(tp, dfp->dfp_done, li, &state); >>> xfs_extent_free_finish_item() >>> xfs_trans_free_extent >>> __xfs_free_extent >>> <adds extent to EFD> >>> >>> And once the processing of the single work item is done we loop >>> back to the start of the xfs_defer_finish_noroll() loop. We don't >>> have any new intents, so xfs_defer_create_intents() returns false, >>> but we completed a dfp work item, so we run: >>> >>> xfs_defer_trans_roll() >>> <save> >>> xfs_trans_roll() >>> xfs_trans_dup() >>> xfs_trans_commit() >>> <restore> >>> >>> At this point we have this committed to the CIL >>> >>> EFI 1 with extent1 >>> EFI 2 with extent2 >>> EFI 3 with extent3 >>> <AGF, AGFL, free space btree block mods> >>> EFD 1 with extent1 >>> >>> Then we run xfs_defer_finish_one() on EFI 2, commit, then run >>> xfs_defer_finish_one() on EFI 3. At this point, we have in the log: >>> >>> EFI 1 with extent1 >>> EFI 2 with extent2 >>> EFI 3 with extent3 >>> <AGF, AGFL, free space btree block mods> >>> EFD 1 with extent1 >>> <AGF, AGFL, free space btree block mods> >>> EFD 2 with extent2 >>> >>> But we have not committed the final extent free or EFD 3 - that's in >>> the last transaction context we pass back to the _xfs_trans_commit() >>> context for it to finalise and close off the rolling transaction >>> chain. At this point, we finally have this in the CIL: >>> >>> EFI 1 with extent1 >>> EFI 2 with extent2 >>> EFI 3 with extent3 >>> <AGF, AGFL, free space btree block mods> >>> EFD 1 with extent1 >>> <AGF, AGFL, free space btree block mods> >>> EFD 2 with extent2 >>> <AGF, AGFL, free space btree block mods> >>> EFD 3 with extent3 >> >> >> Yes, correct. thanks for so much details! I mis-read xfs_defer_create_intents() >> thinking it only create intent for the first in tp->t_dfops. >> >>> >>>> The EFI/EFD log item pairs should not be written to log as they appear in same checkpoint. >>> >>> I *always* ignore CIL intent whiteouts when thinking about >>> transaction chains and intents. That is purely a journal efficiency >>> optimisation, not something that is necessary for correct operation. >> >> OK. >> >>> >>>> Your idea yields this: >>>> >>>> EFI 1 with extent1 extent2 extent3 >>>> free extent1 >>>> EFI 2 with extent2 extent3 >>>> EFD 1 with extent1 extent2 extent3 >>>> transaction commit >>>> create transaction >>>> free extent2 >>>> EFI 3 with extent3 >>>> EFD 2 with extent extent2 extent3 >>>> transaction commit >>>> create transaction >>>> free extent3 >>>> EFD 3 with extent3 >>>> transaction commit. >>> >>> The EFI/EFD contents are correct, but the rest of it is not - I am >>> not suggesting open coding transaction rolling like that. Everything >>> I am suggesting works through the same defer ops mechanism as you >>> are describing. So if we start with the initial journal contents >>> looks like this: >>> >>> EFI 1 with extent1 extent2 extent3. >>> >>> Then we run xfs_defer_finish_one() on that work, >>> >>> xfs_defer_finish_one(dfp) >>> ->create_done(EFI 1) >>> xfs_extent_free_create_done >>> <create EFD> >>> list_for_each_xefi >>> ops->finish_item(tp, dfp->dfp_done, li, &state); >>> xfs_extent_free_finish_item() >>> xfs_trans_free_extent >>> __xfs_free_extent >>> <adds extent to EFD> >>> >>> But now we have 3 xefis on the work to process. So on success of >>> the first call to xfs_trans_free_extent(), we want it to return >>> -EAGAIN to trigger the existing relogging code to create the new >>> EFI. How this works is describe in the section "Requesting a >>> Fresh Transaction while Finishing Deferred Work" in >>> fs/xfs/libxfs/xfs_defer.c, no point me duplicating that here. >>> >>> The result is that the deferred work infrastructure does the work >>> of updaing the done intent and creating the new intents for the work >>> remaining. Hence after the next transaction roll, we have in the CIL >>> >>> EFI 1 with extent1 extent2 extent3. >>> <AGF, AGFL, free space btree block mods> >>> EFD 1 with extent1 extent2 extent3. >>> EFI 2 with extent2 extent3. >>> >> >> Taking transaction rolls into account (also adding up to EFD3), above would be: >> >> EFI 1 with extent1 extent2 extent3. >> transaction roll >> <AGF, AGFL, free space btree block mods> for extent 1 >> EFD 1 with extent1 extent2 extent3. >> EFI 2 with extent2 extent3. >> transaction roll >> free extent 2 >> EFD 2 with extent2 extent3 >> EFI 3 with extent3 >> transaction roll >> free extent 3 >> EFD 3 with extent3 >> >> Because EFD N is always be with EFI N+1 in the SAME transaction, so EFI N+1 doesn’t have to be placed before EFD N. >> I got it. >> >>> And so the loop goes until there is no more work remaining. >>> >>>> Your implementation also includes three EFI/EFD pairs, that’s the same as mine. >>>> So actually it’s still one extent per EFI per transaction. Though you are not changing >>>> XFS_EFI_MAX_FAST_EXTENTS. >>> >>> The difference is that what I'm suggesting is that the extent free >>> code can decide when it needs a transaction to be rolled. It isn't >>> forced to always run a single free per transaction, it can decide >>> that it can free multiple extents per transaction if there is no >>> risk of deadlocks (e.g. extents are in different AGs). Forcing >>> everything to be limited to a transaction per extent free even when >>> there is no risk of deadlocks freeing multiple extents in a single >>> transaction is unnecessary. >>> >>> Indeed, if the second extent is in a different AG, we don't risk >>> busy extents causing us issues, so we could do: >>> >>> EFI 1 with extent1 extent2 extent3. >>> <AGF 1, AGFL 1, free space btree block mods> >>> <AGF 2, AGFL 2, free space btree block mods> >>> EFD 1 with extent1 extent2 extent3. >>> EFI 2 with extent3. >>> ..... >>> >> >> My thumb is up. > > Well, I am wondering if there is ABBA deadlock instead of self deadlock then. > Say process 1 produced busy extent 1 on AG 0 and now blocking at AGFL > allocation on AG 1. And at the same time, process 2 produced busy extent 2 on AG 1 > and now blocking at AGFL allocation on AG 0. Anything prevents that from happening? > > Shall we introduce a per FS list, say named “pending_busy_AGs", where the AG number > are stored there. For those AGs > they have pending busy extents in in-memory transactions (not committed to CIL). > We add the AG number to pending_busy_AGs at the start of xfs_trans_free_extent() > if it’s not there, and continue to free the extent. Otherwise if the AG number is already > there in pending_busy_AGs, we return -EAGAIN. The AG number is removed when > the busy extents are committed to the xfs_cil_ctx. > Well, wondering if the pending busy extents would remain in pending state long before > committed to CIL and that become a bottle neck for freeing extents. Thinking more, we could add per AG counter indicating current pending busy extents on the AG. For the 2nd+ extent in the EFI, return -EAGAIN on seeing the counter set (>0). As the first extent is always OK to go, there shouldn’t be a bottle neck. thanks, wengang > > thanks, > wengang > >> >>> The difference is that limiting the number of xefi items per >>> deferred work item means we can only process a single extent per >>> work item regardless of the current context. >>> >>> Using the existing defered work "on demand transaction rolling" >>> mechanism I'm suggesting we use doesn't put any artificial >>> constrains on how we log and process the intents. This allows us to >>> aggregate multiple extent frees into a single transaction when there >>> is no risk associated with doing so and we have sufficient >>> transaction reservations remaining to guarantee we can free the >>> extent. It's far more efficient, and the infrastructure we have in >>> place already supports this sort of functionality.... >>> >>> When we go back to the original problem of the AGFL allocation >>> needing to roll the transaction to get busy extents processed, then >>> we could have it return -EAGAIN all the way back up to the defer-ops >>> level and we simply then roll the transaction and retry the same >>> extent free operation again. i.e. where extent freeing needs to >>> block waiting on stuff like busy extents, we could now have these >>> commit the current transaction, push the current work item to the >>> back of the current context's queue and come back to it later. >>> >>> At this point, I think the "single extent per transaction" >>> constraint that is needed to avoid the busy extent deadlock goes >>> away, and with it the need for limiting EFI processing to a single >>> extent goes away too.... >> >> I am pretty clear now. >>> >>>> And your implementation may use more log space than mine in case the EFI >>>> (and EFD pair) can’t be cancelled. :D >>> >>> True, but it's really not a concern. Don't conflate "intent >>> recovery intent amplification can cause log space deadlocks" with >>> "intent size is a problem". :) >>> >> >> Got it. >>>> The only difference if that you use transaction commit and I am using transaction roll >>>> which is not safe as you said. >>>> >>>> Is my understanding correct? >>> >>> I think I understand where we are misunderstanding each other :) >>> Perhaps it is now obvious to you as well from the analysis above. >>> If so, ignore the rest of this :) >>> >>> What does "transaction roll" actually mean? >>> >>> XFS has a concept of "rolling transactions". These are made up of a >>> series of individual transaction contexts that are linked together >>> by passing a single log reservation ticket between transaction >>> contexts. >>> >>> xfs_trans_roll() effectively does: >>> >>> ntp = xfs_trans_dup(tp) >>> .... >>> xfs_trans_commit(tp) >>> >>> return ntp; >>> >>> i.e. it duplicates the current transaction handle, takes the unused >>> block reservation from it, grabs the log reservation ticket from it, >>> moves the defered ops from the old to the new transaction context, >>> then commits the old transaction context and returns the new one. >>> >>> tl;dr: a rolling transaction is a series of linked independent >>> transactions that share a common set of block and log reservations. >>> >>> To make a rolling transaction chain an atomic operation on a >>> specific object (e.g. an inode) that object has to remain locked and >>> be logged in every transaction in the chain of rolling transactions. >>> This keeps it moving forward in the log and prevents it from pinning >>> the tail of the log and causing log space deadlocks. >>> >>> Fundamentally, though, each individual transaction in a rolling >>> transaction is an independent atomic change set. So when you say >>> "roll the transaction", what you are actually saying is "commit the >>> current transaction and start a new transaction using the >>> reservations the current transaction already holds." >>> >>> When I say "transaction commit" I am only refering to the process >>> that closes off the current transaction. If this is in the middle of >>> a rolling transaction, then what I am talking about is >>> xfs_trans_roll() calling xfs_trans_commit() after it has duplicated >>> the current transaction context. >>> >>> Transaction contexts running defered operations, intent chains, etc >>> are *always* rolling transactions, and each time we roll the >>> transaction we commit it. >>> >>> IOWS, when I say "transaction commit" and you say "transaction roll" >>> we are talking about exactly the same thing - transaction commit is >>> the operation that roll performs to finish off the current change >>> set... >>> >>> Ideally, nobody should be calling xfs_trans_roll() directly anymore. >>> All rolling transactions should be implemented using deferred ops >>> nowdays - xfs_trans_roll() is the historic method of running rolling >>> transactions. e.g. see the recent rework of the attribute code to >>> use deferred ops rather than explicit calls to xfs_trans_roll() so >>> we can use intents for running attr operations. >>> >>> These days the transaction model is based around chains of deferred >>> operations. We attach deferred work to the transaction, and then >>> when we call xfs_trans_commit() it goes off into the deferred work >>> infrastructure that creates intents and manages the transaction >>> context for processing the intents itself. >>> >>> This is the primary reason we are trying to move away from high >>> level code using transaction rolling - we can't easily change the >>> way we process operations when the high level code controls >>> transaction contexts. Using deferred intent chains gives us fine >>> grained control over transaction context in the deferred ops >>> processing code - we can roll to a new transaction whenever we need >>> to.... >>> >> >> Above is really helpful. >> I when I mention transaction roll, I meant >> 1) a new transaction is created, but its self does’t reserve any resource. >> Instead, it inherits all the unused resources from the old transaction. >> 2) commit the old transaction. >> 3) don’t un-reserve unused resources from old transaction, the new transaction >> will inherits them. >> 4) use the new transaction for further log items. >> >> As I understand, the key is that the new transaction its self doesn’t reserve new resources. >> >> And when I read your “transaction commit” I understood it as this: >> 1). commit old transaction >> 2) un-reserve unused resources from old transaction >> 3) create new transaction with all resources reserved. >> >> Thus in my understand your “transaction commit” would have no risk of lack of log space to put >> the extra EFI/EFDs. But reading above, I’d think you meant “transaction roll” actually. >> >>>> One question is that if only one EFI is safe per transaction, how >>>> we ensure that there is only one EFI per transaction in case there >>>> are more than 16 (current XFS_EFI_MAX_FAST_EXTENTS) extents to >>>> free in current code? >>> >>> That will handled exactly the same way it does with your change - it >>> will simply split up the work items so there are multiple intents >>> logged. >> >> I’d like to make it more clear. You were saying that during log recover stage, there may be no enough >> log space for the extra EFI/EFD (splitted from original multiple extents EFI). But here (non-recovery stage), >> seems you have no concern logging more EFI/EFDs. So why they are different? >> >> thanks, >> wengang