On Fri, Apr 21, 2023 at 12:24:49AM +0000, Wengang Wang wrote: > > On Apr 20, 2023, at 4:22 PM, Dave Chinner <david@xxxxxxxxxxxxx> wrote: > > We don't do that anymore for partially processed multi-extent > > intents anymore. Instead, we use deferred ops to chain updates. i.e. > > we log a complete intent done items alongside a new intent > > containing the remaining work to be done in the same transaction. > > This cancels the original intent and atomically replaces it with a > > new intent containing the remaining work to be done. > > > > So when I say "update the EFD" I'm using historic terminology for > > processing and recovering multi-extent intents. In modern terms, > > what I mean is "update the deferred work intent chain to reflect the > > work remaining to be done". > > OK. so let’s see the difference between your implementation from mine. > Say, there are three extents to free. > > My patch would result in: > > EFI 1 with extent1 > free extent1 > EFD 1 with extent1 > transaction roll > EFI 2 with extent2 > free extent2 > EFD 2 with extent2 > transaction roll > EFI 3 with extent3 > free extent3 > EFD3 with extent3 > transaction commit No, it wouldn't. This isn't how the deferred work processes work items on the transaction. A work item with multiple extents on it would result in this: xfs_defer_finish(tp) # tp contains three xefi work items xfs_defer_finish_noroll xfs_defer_create_intents() list_for_each_defer_pending xfs_defer_create_intent(dfp) ops->create_intent(tp, &dfp->dfp_work, dfp->dfp_count, sort); xfs_extent_free_create_intent() <create EFI> list_for_each_xefi xfs_extent_free_log_item(xefi) <adds extent to current EFI> xfs_defer_trans_roll() <save> xfs_trans_roll() xfs_trans_dup() xfs_trans_commit() <restore> At this point we have this committed to the CIL EFI 1 with extent1 EFI 2 with extent2 EFI 3 with extent3 And xfs_defer_finish_noroll() continues with <grabs first work item> xfs_defer_finish_one(dfp) ->create_done(EFI 1) xfs_extent_free_create_done <create EFD> list_for_each_xefi ops->finish_item(tp, dfp->dfp_done, li, &state); xfs_extent_free_finish_item() xfs_trans_free_extent __xfs_free_extent <adds extent to EFD> And once the processing of the single work item is done we loop back to the start of the xfs_defer_finish_noroll() loop. We don't have any new intents, so xfs_defer_create_intents() returns false, but we completed a dfp work item, so we run: xfs_defer_trans_roll() <save> xfs_trans_roll() xfs_trans_dup() xfs_trans_commit() <restore> At this point we have this committed to the CIL EFI 1 with extent1 EFI 2 with extent2 EFI 3 with extent3 <AGF, AGFL, free space btree block mods> EFD 1 with extent1 Then we run xfs_defer_finish_one() on EFI 2, commit, then run xfs_defer_finish_one() on EFI 3. At this point, we have in the log: EFI 1 with extent1 EFI 2 with extent2 EFI 3 with extent3 <AGF, AGFL, free space btree block mods> EFD 1 with extent1 <AGF, AGFL, free space btree block mods> EFD 2 with extent2 But we have not committed the final extent free or EFD 3 - that's in the last transaction context we pass back to the _xfs_trans_commit() context for it to finalise and close off the rolling transaction chain. At this point, we finally have this in the CIL: EFI 1 with extent1 EFI 2 with extent2 EFI 3 with extent3 <AGF, AGFL, free space btree block mods> EFD 1 with extent1 <AGF, AGFL, free space btree block mods> EFD 2 with extent2 <AGF, AGFL, free space btree block mods> EFD 3 with extent3 > The EFI/EFD log item pairs should not be written to log as they appear in same checkpoint. I *always* ignore CIL intent whiteouts when thinking about transaction chains and intents. That is purely a journal efficiency optimisation, not something that is necessary for correct operation. > Your idea yields this: > > EFI 1 with extent1 extent2 extent3 > free extent1 > EFI 2 with extent2 extent3 > EFD 1 with extent1 extent2 extent3 > transaction commit > create transaction > free extent2 > EFI 3 with extent3 > EFD 2 with extent extent2 extent3 > transaction commit > create transaction > free extent3 > EFD 3 with extent3 > transaction commit. The EFI/EFD contents are correct, but the rest of it is not - I am not suggesting open coding transaction rolling like that. Everything I am suggesting works through the same defer ops mechanism as you are describing. So if we start with the initial journal contents looks like this: EFI 1 with extent1 extent2 extent3. Then we run xfs_defer_finish_one() on that work, xfs_defer_finish_one(dfp) ->create_done(EFI 1) xfs_extent_free_create_done <create EFD> list_for_each_xefi ops->finish_item(tp, dfp->dfp_done, li, &state); xfs_extent_free_finish_item() xfs_trans_free_extent __xfs_free_extent <adds extent to EFD> But now we have 3 xefis on the work to process. So on success of the first call to xfs_trans_free_extent(), we want it to return -EAGAIN to trigger the existing relogging code to create the new EFI. How this works is describe in the section "Requesting a Fresh Transaction while Finishing Deferred Work" in fs/xfs/libxfs/xfs_defer.c, no point me duplicating that here. The result is that the deferred work infrastructure does the work of updaing the done intent and creating the new intents for the work remaining. Hence after the next transaction roll, we have in the CIL EFI 1 with extent1 extent2 extent3. <AGF, AGFL, free space btree block mods> EFD 1 with extent1 extent2 extent3. EFI 2 with extent2 extent3. And so the loop goes until there is no more work remaining. > Your implementation also includes three EFI/EFD pairs, that’s the same as mine. > So actually it’s still one extent per EFI per transaction. Though you are not changing > XFS_EFI_MAX_FAST_EXTENTS. The difference is that what I'm suggesting is that the extent free code can decide when it needs a transaction to be rolled. It isn't forced to always run a single free per transaction, it can decide that it can free multiple extents per transaction if there is no risk of deadlocks (e.g. extents are in different AGs). Forcing everything to be limited to a transaction per extent free even when there is no risk of deadlocks freeing multiple extents in a single transaction is unnecessary. Indeed, if the second extent is in a different AG, we don't risk busy extents causing us issues, so we could do: EFI 1 with extent1 extent2 extent3. <AGF 1, AGFL 1, free space btree block mods> <AGF 2, AGFL 2, free space btree block mods> EFD 1 with extent1 extent2 extent3. EFI 2 with extent3. ..... The difference is that limiting the number of xefi items per deferred work item means we can only process a single extent per work item regardless of the current context. Using the existing defered work "on demand transaction rolling" mechanism I'm suggesting we use doesn't put any artificial constrains on how we log and process the intents. This allows us to aggregate multiple extent frees into a single transaction when there is no risk associated with doing so and we have sufficient transaction reservations remaining to guarantee we can free the extent. It's far more efficient, and the infrastructure we have in place already supports this sort of functionality.... When we go back to the original problem of the AGFL allocation needing to roll the transaction to get busy extents processed, then we could have it return -EAGAIN all the way back up to the defer-ops level and we simply then roll the transaction and retry the same extent free operation again. i.e. where extent freeing needs to block waiting on stuff like busy extents, we could now have these commit the current transaction, push the current work item to the back of the current context's queue and come back to it later. At this point, I think the "single extent per transaction" constraint that is needed to avoid the busy extent deadlock goes away, and with it the need for limiting EFI processing to a single extent goes away too.... > And your implementation may use more log space than mine in case the EFI > (and EFD pair) can’t be cancelled. :D True, but it's really not a concern. Don't conflate "intent recovery intent amplification can cause log space deadlocks" with "intent size is a problem". :) > The only difference if that you use transaction commit and I am using transaction roll > which is not safe as you said. > > Is my understanding correct? I think I understand where we are misunderstanding each other :) Perhaps it is now obvious to you as well from the analysis above. If so, ignore the rest of this :) What does "transaction roll" actually mean? XFS has a concept of "rolling transactions". These are made up of a series of individual transaction contexts that are linked together by passing a single log reservation ticket between transaction contexts. xfs_trans_roll() effectively does: ntp = xfs_trans_dup(tp) .... xfs_trans_commit(tp) return ntp; i.e. it duplicates the current transaction handle, takes the unused block reservation from it, grabs the log reservation ticket from it, moves the defered ops from the old to the new transaction context, then commits the old transaction context and returns the new one. tl;dr: a rolling transaction is a series of linked independent transactions that share a common set of block and log reservations. To make a rolling transaction chain an atomic operation on a specific object (e.g. an inode) that object has to remain locked and be logged in every transaction in the chain of rolling transactions. This keeps it moving forward in the log and prevents it from pinning the tail of the log and causing log space deadlocks. Fundamentally, though, each individual transaction in a rolling transaction is an independent atomic change set. So when you say "roll the transaction", what you are actually saying is "commit the current transaction and start a new transaction using the reservations the current transaction already holds." When I say "transaction commit" I am only refering to the process that closes off the current transaction. If this is in the middle of a rolling transaction, then what I am talking about is xfs_trans_roll() calling xfs_trans_commit() after it has duplicated the current transaction context. Transaction contexts running defered operations, intent chains, etc are *always* rolling transactions, and each time we roll the transaction we commit it. IOWS, when I say "transaction commit" and you say "transaction roll" we are talking about exactly the same thing - transaction commit is the operation that roll performs to finish off the current change set... Ideally, nobody should be calling xfs_trans_roll() directly anymore. All rolling transactions should be implemented using deferred ops nowdays - xfs_trans_roll() is the historic method of running rolling transactions. e.g. see the recent rework of the attribute code to use deferred ops rather than explicit calls to xfs_trans_roll() so we can use intents for running attr operations. These days the transaction model is based around chains of deferred operations. We attach deferred work to the transaction, and then when we call xfs_trans_commit() it goes off into the deferred work infrastructure that creates intents and manages the transaction context for processing the intents itself. This is the primary reason we are trying to move away from high level code using transaction rolling - we can't easily change the way we process operations when the high level code controls transaction contexts. Using deferred intent chains gives us fine grained control over transaction context in the deferred ops processing code - we can roll to a new transaction whenever we need to.... > One question is that if only one EFI is safe per transaction, how > we ensure that there is only one EFI per transaction in case there > are more than 16 (current XFS_EFI_MAX_FAST_EXTENTS) extents to > free in current code? That will handled exactly the same way it does with your change - it will simply split up the work items so there are multiple intents logged. Cheers, Dave. -- Dave Chinner david@xxxxxxxxxxxxx