> On Apr 21, 2023, at 11:23 AM, Wengang Wang <wen.gang.wang@xxxxxxxxxx> wrote: > > > >> On Apr 21, 2023, at 2:34 AM, Dave Chinner <david@xxxxxxxxxxxxx> wrote: >> >> On Fri, Apr 21, 2023 at 12:24:49AM +0000, Wengang Wang wrote: >>>> On Apr 20, 2023, at 4:22 PM, Dave Chinner <david@xxxxxxxxxxxxx> wrote: >>>> We don't do that anymore for partially processed multi-extent >>>> intents anymore. Instead, we use deferred ops to chain updates. i.e. >>>> we log a complete intent done items alongside a new intent >>>> containing the remaining work to be done in the same transaction. >>>> This cancels the original intent and atomically replaces it with a >>>> new intent containing the remaining work to be done. >>>> >>>> So when I say "update the EFD" I'm using historic terminology for >>>> processing and recovering multi-extent intents. In modern terms, >>>> what I mean is "update the deferred work intent chain to reflect the >>>> work remaining to be done". >>> >>> OK. so let’s see the difference between your implementation from mine. >>> Say, there are three extents to free. >>> >>> My patch would result in: >>> >>> EFI 1 with extent1 >>> free extent1 >>> EFD 1 with extent1 >>> transaction roll >>> EFI 2 with extent2 >>> free extent2 >>> EFD 2 with extent2 >>> transaction roll >>> EFI 3 with extent3 >>> free extent3 >>> EFD3 with extent3 >>> transaction commit >> >> No, it wouldn't. This isn't how the deferred work processes work >> items on the transaction. A work item with multiple extents on it >> would result in this: >> >> xfs_defer_finish(tp) # tp contains three xefi work items >> xfs_defer_finish_noroll >> xfs_defer_create_intents() >> list_for_each_defer_pending >> xfs_defer_create_intent(dfp) >> ops->create_intent(tp, &dfp->dfp_work, dfp->dfp_count, sort); >> xfs_extent_free_create_intent() >> <create EFI> >> list_for_each_xefi >> xfs_extent_free_log_item(xefi) >> <adds extent to current EFI> >> >> xfs_defer_trans_roll() >> <save> >> xfs_trans_roll() >> xfs_trans_dup() >> xfs_trans_commit() >> <restore> >> >> At this point we have this committed to the CIL >> >> EFI 1 with extent1 >> EFI 2 with extent2 >> EFI 3 with extent3 >> >> And xfs_defer_finish_noroll() continues with >> >> <grabs first work item> >> xfs_defer_finish_one(dfp) >> ->create_done(EFI 1) >> xfs_extent_free_create_done >> <create EFD> >> list_for_each_xefi >> ops->finish_item(tp, dfp->dfp_done, li, &state); >> xfs_extent_free_finish_item() >> xfs_trans_free_extent >> __xfs_free_extent >> <adds extent to EFD> >> >> And once the processing of the single work item is done we loop >> back to the start of the xfs_defer_finish_noroll() loop. We don't >> have any new intents, so xfs_defer_create_intents() returns false, >> but we completed a dfp work item, so we run: >> >> xfs_defer_trans_roll() >> <save> >> xfs_trans_roll() >> xfs_trans_dup() >> xfs_trans_commit() >> <restore> >> >> At this point we have this committed to the CIL >> >> EFI 1 with extent1 >> EFI 2 with extent2 >> EFI 3 with extent3 >> <AGF, AGFL, free space btree block mods> >> EFD 1 with extent1 >> >> Then we run xfs_defer_finish_one() on EFI 2, commit, then run >> xfs_defer_finish_one() on EFI 3. At this point, we have in the log: >> >> EFI 1 with extent1 >> EFI 2 with extent2 >> EFI 3 with extent3 >> <AGF, AGFL, free space btree block mods> >> EFD 1 with extent1 >> <AGF, AGFL, free space btree block mods> >> EFD 2 with extent2 >> >> But we have not committed the final extent free or EFD 3 - that's in >> the last transaction context we pass back to the _xfs_trans_commit() >> context for it to finalise and close off the rolling transaction >> chain. At this point, we finally have this in the CIL: >> >> EFI 1 with extent1 >> EFI 2 with extent2 >> EFI 3 with extent3 >> <AGF, AGFL, free space btree block mods> >> EFD 1 with extent1 >> <AGF, AGFL, free space btree block mods> >> EFD 2 with extent2 >> <AGF, AGFL, free space btree block mods> >> EFD 3 with extent3 > > > Yes, correct. thanks for so much details! I mis-read xfs_defer_create_intents() > thinking it only create intent for the first in tp->t_dfops. > >> >>> The EFI/EFD log item pairs should not be written to log as they appear in same checkpoint. >> >> I *always* ignore CIL intent whiteouts when thinking about >> transaction chains and intents. That is purely a journal efficiency >> optimisation, not something that is necessary for correct operation. > > OK. > >> >>> Your idea yields this: >>> >>> EFI 1 with extent1 extent2 extent3 >>> free extent1 >>> EFI 2 with extent2 extent3 >>> EFD 1 with extent1 extent2 extent3 >>> transaction commit >>> create transaction >>> free extent2 >>> EFI 3 with extent3 >>> EFD 2 with extent extent2 extent3 >>> transaction commit >>> create transaction >>> free extent3 >>> EFD 3 with extent3 >>> transaction commit. >> >> The EFI/EFD contents are correct, but the rest of it is not - I am >> not suggesting open coding transaction rolling like that. Everything >> I am suggesting works through the same defer ops mechanism as you >> are describing. So if we start with the initial journal contents >> looks like this: >> >> EFI 1 with extent1 extent2 extent3. >> >> Then we run xfs_defer_finish_one() on that work, >> >> xfs_defer_finish_one(dfp) >> ->create_done(EFI 1) >> xfs_extent_free_create_done >> <create EFD> >> list_for_each_xefi >> ops->finish_item(tp, dfp->dfp_done, li, &state); >> xfs_extent_free_finish_item() >> xfs_trans_free_extent >> __xfs_free_extent >> <adds extent to EFD> >> >> But now we have 3 xefis on the work to process. So on success of >> the first call to xfs_trans_free_extent(), we want it to return >> -EAGAIN to trigger the existing relogging code to create the new >> EFI. How this works is describe in the section "Requesting a >> Fresh Transaction while Finishing Deferred Work" in >> fs/xfs/libxfs/xfs_defer.c, no point me duplicating that here. >> >> The result is that the deferred work infrastructure does the work >> of updaing the done intent and creating the new intents for the work >> remaining. Hence after the next transaction roll, we have in the CIL >> >> EFI 1 with extent1 extent2 extent3. >> <AGF, AGFL, free space btree block mods> >> EFD 1 with extent1 extent2 extent3. >> EFI 2 with extent2 extent3. >> > > Taking transaction rolls into account (also adding up to EFD3), above would be: > > EFI 1 with extent1 extent2 extent3. > transaction roll > <AGF, AGFL, free space btree block mods> for extent 1 > EFD 1 with extent1 extent2 extent3. > EFI 2 with extent2 extent3. > transaction roll > free extent 2 > EFD 2 with extent2 extent3 > EFI 3 with extent3 > transaction roll > free extent 3 > EFD 3 with extent3 > > Because EFD N is always be with EFI N+1 in the SAME transaction, so EFI N+1 doesn’t have to be placed before EFD N. > I got it. > >> And so the loop goes until there is no more work remaining. >> >>> Your implementation also includes three EFI/EFD pairs, that’s the same as mine. >>> So actually it’s still one extent per EFI per transaction. Though you are not changing >>> XFS_EFI_MAX_FAST_EXTENTS. >> >> The difference is that what I'm suggesting is that the extent free >> code can decide when it needs a transaction to be rolled. It isn't >> forced to always run a single free per transaction, it can decide >> that it can free multiple extents per transaction if there is no >> risk of deadlocks (e.g. extents are in different AGs). Forcing >> everything to be limited to a transaction per extent free even when >> there is no risk of deadlocks freeing multiple extents in a single >> transaction is unnecessary. >> >> Indeed, if the second extent is in a different AG, we don't risk >> busy extents causing us issues, so we could do: >> >> EFI 1 with extent1 extent2 extent3. >> <AGF 1, AGFL 1, free space btree block mods> >> <AGF 2, AGFL 2, free space btree block mods> >> EFD 1 with extent1 extent2 extent3. >> EFI 2 with extent3. >> ..... >> > > My thumb is up. Well, I am wondering if there is ABBA deadlock instead of self deadlock then. Say process 1 produced busy extent 1 on AG 0 and now blocking at AGFL allocation on AG 1. And at the same time, process 2 produced busy extent 2 on AG 1 and now blocking at AGFL allocation on AG 0. Anything prevents that from happening? Shall we introduce a per FS list, say named “pending_busy_AGs", where the AG number are stored there. For those AGs they have pending busy extents in in-memory transactions (not committed to CIL). We add the AG number to pending_busy_AGs at the start of xfs_trans_free_extent() if it’s not there, and continue to free the extent. Otherwise if the AG number is already there in pending_busy_AGs, we return -EAGAIN. The AG number is removed when the busy extents are committed to the xfs_cil_ctx. Well, wondering if the pending busy extents would remain in pending state long before committed to CIL and that become a bottle neck for freeing extents. thanks, wengang > >> The difference is that limiting the number of xefi items per >> deferred work item means we can only process a single extent per >> work item regardless of the current context. >> >> Using the existing defered work "on demand transaction rolling" >> mechanism I'm suggesting we use doesn't put any artificial >> constrains on how we log and process the intents. This allows us to >> aggregate multiple extent frees into a single transaction when there >> is no risk associated with doing so and we have sufficient >> transaction reservations remaining to guarantee we can free the >> extent. It's far more efficient, and the infrastructure we have in >> place already supports this sort of functionality.... >> >> When we go back to the original problem of the AGFL allocation >> needing to roll the transaction to get busy extents processed, then >> we could have it return -EAGAIN all the way back up to the defer-ops >> level and we simply then roll the transaction and retry the same >> extent free operation again. i.e. where extent freeing needs to >> block waiting on stuff like busy extents, we could now have these >> commit the current transaction, push the current work item to the >> back of the current context's queue and come back to it later. >> >> At this point, I think the "single extent per transaction" >> constraint that is needed to avoid the busy extent deadlock goes >> away, and with it the need for limiting EFI processing to a single >> extent goes away too.... > > I am pretty clear now. >> >>> And your implementation may use more log space than mine in case the EFI >>> (and EFD pair) can’t be cancelled. :D >> >> True, but it's really not a concern. Don't conflate "intent >> recovery intent amplification can cause log space deadlocks" with >> "intent size is a problem". :) >> > > Got it. >>> The only difference if that you use transaction commit and I am using transaction roll >>> which is not safe as you said. >>> >>> Is my understanding correct? >> >> I think I understand where we are misunderstanding each other :) >> Perhaps it is now obvious to you as well from the analysis above. >> If so, ignore the rest of this :) >> >> What does "transaction roll" actually mean? >> >> XFS has a concept of "rolling transactions". These are made up of a >> series of individual transaction contexts that are linked together >> by passing a single log reservation ticket between transaction >> contexts. >> >> xfs_trans_roll() effectively does: >> >> ntp = xfs_trans_dup(tp) >> .... >> xfs_trans_commit(tp) >> >> return ntp; >> >> i.e. it duplicates the current transaction handle, takes the unused >> block reservation from it, grabs the log reservation ticket from it, >> moves the defered ops from the old to the new transaction context, >> then commits the old transaction context and returns the new one. >> >> tl;dr: a rolling transaction is a series of linked independent >> transactions that share a common set of block and log reservations. >> >> To make a rolling transaction chain an atomic operation on a >> specific object (e.g. an inode) that object has to remain locked and >> be logged in every transaction in the chain of rolling transactions. >> This keeps it moving forward in the log and prevents it from pinning >> the tail of the log and causing log space deadlocks. >> >> Fundamentally, though, each individual transaction in a rolling >> transaction is an independent atomic change set. So when you say >> "roll the transaction", what you are actually saying is "commit the >> current transaction and start a new transaction using the >> reservations the current transaction already holds." >> >> When I say "transaction commit" I am only refering to the process >> that closes off the current transaction. If this is in the middle of >> a rolling transaction, then what I am talking about is >> xfs_trans_roll() calling xfs_trans_commit() after it has duplicated >> the current transaction context. >> >> Transaction contexts running defered operations, intent chains, etc >> are *always* rolling transactions, and each time we roll the >> transaction we commit it. >> >> IOWS, when I say "transaction commit" and you say "transaction roll" >> we are talking about exactly the same thing - transaction commit is >> the operation that roll performs to finish off the current change >> set... >> >> Ideally, nobody should be calling xfs_trans_roll() directly anymore. >> All rolling transactions should be implemented using deferred ops >> nowdays - xfs_trans_roll() is the historic method of running rolling >> transactions. e.g. see the recent rework of the attribute code to >> use deferred ops rather than explicit calls to xfs_trans_roll() so >> we can use intents for running attr operations. >> >> These days the transaction model is based around chains of deferred >> operations. We attach deferred work to the transaction, and then >> when we call xfs_trans_commit() it goes off into the deferred work >> infrastructure that creates intents and manages the transaction >> context for processing the intents itself. >> >> This is the primary reason we are trying to move away from high >> level code using transaction rolling - we can't easily change the >> way we process operations when the high level code controls >> transaction contexts. Using deferred intent chains gives us fine >> grained control over transaction context in the deferred ops >> processing code - we can roll to a new transaction whenever we need >> to.... >> > > Above is really helpful. > I when I mention transaction roll, I meant > 1) a new transaction is created, but its self does’t reserve any resource. > Instead, it inherits all the unused resources from the old transaction. > 2) commit the old transaction. > 3) don’t un-reserve unused resources from old transaction, the new transaction > will inherits them. > 4) use the new transaction for further log items. > > As I understand, the key is that the new transaction its self doesn’t reserve new resources. > > And when I read your “transaction commit” I understood it as this: > 1). commit old transaction > 2) un-reserve unused resources from old transaction > 3) create new transaction with all resources reserved. > > Thus in my understand your “transaction commit” would have no risk of lack of log space to put > the extra EFI/EFDs. But reading above, I’d think you meant “transaction roll” actually. > >>> One question is that if only one EFI is safe per transaction, how >>> we ensure that there is only one EFI per transaction in case there >>> are more than 16 (current XFS_EFI_MAX_FAST_EXTENTS) extents to >>> free in current code? >> >> That will handled exactly the same way it does with your change - it >> will simply split up the work items so there are multiple intents >> logged. > > I’d like to make it more clear. You were saying that during log recover stage, there may be no enough > log space for the extra EFI/EFD (splitted from original multiple extents EFI). But here (non-recovery stage), > seems you have no concern logging more EFI/EFDs. So why they are different? > > thanks, > wengang