> On Apr 24, 2023, at 8:53 AM, Wengang Wang <wen.gang.wang@xxxxxxxxxx> wrote: > > > >> On Apr 21, 2023, at 8:22 PM, Wengang Wang <wen.gang.wang@xxxxxxxxxx> wrote: >> >> >> >>> On Apr 21, 2023, at 11:23 AM, Wengang Wang <wen.gang.wang@xxxxxxxxxx> wrote: >>> >>> >>> >>>> On Apr 21, 2023, at 2:34 AM, Dave Chinner <david@xxxxxxxxxxxxx> wrote: >>>> >>>> On Fri, Apr 21, 2023 at 12:24:49AM +0000, Wengang Wang wrote: >>>>>> On Apr 20, 2023, at 4:22 PM, Dave Chinner <david@xxxxxxxxxxxxx> wrote: >>>>>> We don't do that anymore for partially processed multi-extent >>>>>> intents anymore. Instead, we use deferred ops to chain updates. i.e. >>>>>> we log a complete intent done items alongside a new intent >>>>>> containing the remaining work to be done in the same transaction. >>>>>> This cancels the original intent and atomically replaces it with a >>>>>> new intent containing the remaining work to be done. >>>>>> >>>>>> So when I say "update the EFD" I'm using historic terminology for >>>>>> processing and recovering multi-extent intents. In modern terms, >>>>>> what I mean is "update the deferred work intent chain to reflect the >>>>>> work remaining to be done". >>>>> >>>>> OK. so let’s see the difference between your implementation from mine. >>>>> Say, there are three extents to free. >>>>> >>>>> My patch would result in: >>>>> >>>>> EFI 1 with extent1 >>>>> free extent1 >>>>> EFD 1 with extent1 >>>>> transaction roll >>>>> EFI 2 with extent2 >>>>> free extent2 >>>>> EFD 2 with extent2 >>>>> transaction roll >>>>> EFI 3 with extent3 >>>>> free extent3 >>>>> EFD3 with extent3 >>>>> transaction commit >>>> >>>> No, it wouldn't. This isn't how the deferred work processes work >>>> items on the transaction. A work item with multiple extents on it >>>> would result in this: >>>> >>>> xfs_defer_finish(tp) # tp contains three xefi work items >>>> xfs_defer_finish_noroll >>>> xfs_defer_create_intents() >>>> list_for_each_defer_pending >>>> xfs_defer_create_intent(dfp) >>>> ops->create_intent(tp, &dfp->dfp_work, dfp->dfp_count, sort); >>>> xfs_extent_free_create_intent() >>>> <create EFI> >>>> list_for_each_xefi >>>> xfs_extent_free_log_item(xefi) >>>> <adds extent to current EFI> >>>> >>>> xfs_defer_trans_roll() >>>> <save> >>>> xfs_trans_roll() >>>> xfs_trans_dup() >>>> xfs_trans_commit() >>>> <restore> >>>> >>>> At this point we have this committed to the CIL >>>> >>>> EFI 1 with extent1 >>>> EFI 2 with extent2 >>>> EFI 3 with extent3 >>>> >>>> And xfs_defer_finish_noroll() continues with >>>> >>>> <grabs first work item> >>>> xfs_defer_finish_one(dfp) >>>> ->create_done(EFI 1) >>>> xfs_extent_free_create_done >>>> <create EFD> >>>> list_for_each_xefi >>>> ops->finish_item(tp, dfp->dfp_done, li, &state); >>>> xfs_extent_free_finish_item() >>>> xfs_trans_free_extent >>>> __xfs_free_extent >>>> <adds extent to EFD> >>>> >>>> And once the processing of the single work item is done we loop >>>> back to the start of the xfs_defer_finish_noroll() loop. We don't >>>> have any new intents, so xfs_defer_create_intents() returns false, >>>> but we completed a dfp work item, so we run: >>>> >>>> xfs_defer_trans_roll() >>>> <save> >>>> xfs_trans_roll() >>>> xfs_trans_dup() >>>> xfs_trans_commit() >>>> <restore> >>>> >>>> At this point we have this committed to the CIL >>>> >>>> EFI 1 with extent1 >>>> EFI 2 with extent2 >>>> EFI 3 with extent3 >>>> <AGF, AGFL, free space btree block mods> >>>> EFD 1 with extent1 >>>> >>>> Then we run xfs_defer_finish_one() on EFI 2, commit, then run >>>> xfs_defer_finish_one() on EFI 3. At this point, we have in the log: >>>> >>>> EFI 1 with extent1 >>>> EFI 2 with extent2 >>>> EFI 3 with extent3 >>>> <AGF, AGFL, free space btree block mods> >>>> EFD 1 with extent1 >>>> <AGF, AGFL, free space btree block mods> >>>> EFD 2 with extent2 >>>> >>>> But we have not committed the final extent free or EFD 3 - that's in >>>> the last transaction context we pass back to the _xfs_trans_commit() >>>> context for it to finalise and close off the rolling transaction >>>> chain. At this point, we finally have this in the CIL: >>>> >>>> EFI 1 with extent1 >>>> EFI 2 with extent2 >>>> EFI 3 with extent3 >>>> <AGF, AGFL, free space btree block mods> >>>> EFD 1 with extent1 >>>> <AGF, AGFL, free space btree block mods> >>>> EFD 2 with extent2 >>>> <AGF, AGFL, free space btree block mods> >>>> EFD 3 with extent3 >>> >>> >>> Yes, correct. thanks for so much details! I mis-read xfs_defer_create_intents() >>> thinking it only create intent for the first in tp->t_dfops. >>> >>>> >>>>> The EFI/EFD log item pairs should not be written to log as they appear in same checkpoint. >>>> >>>> I *always* ignore CIL intent whiteouts when thinking about >>>> transaction chains and intents. That is purely a journal efficiency >>>> optimisation, not something that is necessary for correct operation. >>> >>> OK. >>> >>>> >>>>> Your idea yields this: >>>>> >>>>> EFI 1 with extent1 extent2 extent3 >>>>> free extent1 >>>>> EFI 2 with extent2 extent3 >>>>> EFD 1 with extent1 extent2 extent3 >>>>> transaction commit >>>>> create transaction >>>>> free extent2 >>>>> EFI 3 with extent3 >>>>> EFD 2 with extent extent2 extent3 >>>>> transaction commit >>>>> create transaction >>>>> free extent3 >>>>> EFD 3 with extent3 >>>>> transaction commit. >>>> >>>> The EFI/EFD contents are correct, but the rest of it is not - I am >>>> not suggesting open coding transaction rolling like that. Everything >>>> I am suggesting works through the same defer ops mechanism as you >>>> are describing. So if we start with the initial journal contents >>>> looks like this: >>>> >>>> EFI 1 with extent1 extent2 extent3. >>>> >>>> Then we run xfs_defer_finish_one() on that work, >>>> >>>> xfs_defer_finish_one(dfp) >>>> ->create_done(EFI 1) >>>> xfs_extent_free_create_done >>>> <create EFD> >>>> list_for_each_xefi >>>> ops->finish_item(tp, dfp->dfp_done, li, &state); >>>> xfs_extent_free_finish_item() >>>> xfs_trans_free_extent >>>> __xfs_free_extent >>>> <adds extent to EFD> >>>> >>>> But now we have 3 xefis on the work to process. So on success of >>>> the first call to xfs_trans_free_extent(), we want it to return >>>> -EAGAIN to trigger the existing relogging code to create the new >>>> EFI. How this works is describe in the section "Requesting a >>>> Fresh Transaction while Finishing Deferred Work" in >>>> fs/xfs/libxfs/xfs_defer.c, no point me duplicating that here. >>>> >>>> The result is that the deferred work infrastructure does the work >>>> of updaing the done intent and creating the new intents for the work >>>> remaining. Hence after the next transaction roll, we have in the CIL >>>> >>>> EFI 1 with extent1 extent2 extent3. >>>> <AGF, AGFL, free space btree block mods> >>>> EFD 1 with extent1 extent2 extent3. >>>> EFI 2 with extent2 extent3. >>>> >>> >>> Taking transaction rolls into account (also adding up to EFD3), above would be: >>> >>> EFI 1 with extent1 extent2 extent3. >>> transaction roll >>> <AGF, AGFL, free space btree block mods> for extent 1 >>> EFD 1 with extent1 extent2 extent3. >>> EFI 2 with extent2 extent3. >>> transaction roll >>> free extent 2 >>> EFD 2 with extent2 extent3 >>> EFI 3 with extent3 >>> transaction roll >>> free extent 3 >>> EFD 3 with extent3 >>> >>> Because EFD N is always be with EFI N+1 in the SAME transaction, so EFI N+1 doesn’t have to be placed before EFD N. >>> I got it. >>> >>>> And so the loop goes until there is no more work remaining. >>>> >>>>> Your implementation also includes three EFI/EFD pairs, that’s the same as mine. >>>>> So actually it’s still one extent per EFI per transaction. Though you are not changing >>>>> XFS_EFI_MAX_FAST_EXTENTS. >>>> >>>> The difference is that what I'm suggesting is that the extent free >>>> code can decide when it needs a transaction to be rolled. It isn't >>>> forced to always run a single free per transaction, it can decide >>>> that it can free multiple extents per transaction if there is no >>>> risk of deadlocks (e.g. extents are in different AGs). Forcing >>>> everything to be limited to a transaction per extent free even when >>>> there is no risk of deadlocks freeing multiple extents in a single >>>> transaction is unnecessary. >>>> >>>> Indeed, if the second extent is in a different AG, we don't risk >>>> busy extents causing us issues, so we could do: >>>> >>>> EFI 1 with extent1 extent2 extent3. >>>> <AGF 1, AGFL 1, free space btree block mods> >>>> <AGF 2, AGFL 2, free space btree block mods> >>>> EFD 1 with extent1 extent2 extent3. >>>> EFI 2 with extent3. >>>> ..... >>>> >>> >>> My thumb is up. >> >> Well, I am wondering if there is ABBA deadlock instead of self deadlock then. >> Say process 1 produced busy extent 1 on AG 0 and now blocking at AGFL >> allocation on AG 1. And at the same time, process 2 produced busy extent 2 on AG 1 >> and now blocking at AGFL allocation on AG 0. Anything prevents that from happening? >> >> Shall we introduce a per FS list, say named “pending_busy_AGs", where the AG number >> are stored there. For those AGs >> they have pending busy extents in in-memory transactions (not committed to CIL). >> We add the AG number to pending_busy_AGs at the start of xfs_trans_free_extent() >> if it’s not there, and continue to free the extent. Otherwise if the AG number is already >> there in pending_busy_AGs, we return -EAGAIN. The AG number is removed when >> the busy extents are committed to the xfs_cil_ctx. >> Well, wondering if the pending busy extents would remain in pending state long before >> committed to CIL and that become a bottle neck for freeing extents. > > Thinking more, we could add per AG counter indicating current pending busy extents on the AG. > For the 2nd+ extent in the EFI, return -EAGAIN on seeing the counter set (>0). As the first extent > is always OK to go, there shouldn’t be a bottle neck. I just sent the first attempt (this way), pls review. thanks, wengang