Re: [PATCH 1/2] xfs: IO time one extent per EFI

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 




> On Apr 24, 2023, at 8:53 AM, Wengang Wang <wen.gang.wang@xxxxxxxxxx> wrote:
> 
> 
> 
>> On Apr 21, 2023, at 8:22 PM, Wengang Wang <wen.gang.wang@xxxxxxxxxx> wrote:
>> 
>> 
>> 
>>> On Apr 21, 2023, at 11:23 AM, Wengang Wang <wen.gang.wang@xxxxxxxxxx> wrote:
>>> 
>>> 
>>> 
>>>> On Apr 21, 2023, at 2:34 AM, Dave Chinner <david@xxxxxxxxxxxxx> wrote:
>>>> 
>>>> On Fri, Apr 21, 2023 at 12:24:49AM +0000, Wengang Wang wrote:
>>>>>> On Apr 20, 2023, at 4:22 PM, Dave Chinner <david@xxxxxxxxxxxxx> wrote:
>>>>>> We don't do that anymore for partially processed multi-extent
>>>>>> intents anymore. Instead, we use deferred ops to chain updates. i.e.
>>>>>> we log a complete intent done items alongside a new intent
>>>>>> containing the remaining work to be done in the same transaction.
>>>>>> This cancels the original intent and atomically replaces it with a
>>>>>> new intent containing the remaining work to be done.
>>>>>> 
>>>>>> So when I say "update the EFD" I'm using historic terminology for
>>>>>> processing and recovering multi-extent intents. In modern terms,
>>>>>> what I mean is "update the deferred work intent chain to reflect the
>>>>>> work remaining to be done".
>>>>> 
>>>>> OK. so let’s see the difference between your implementation from mine.
>>>>> Say, there are three extents to free.
>>>>> 
>>>>> My patch would result in:
>>>>> 
>>>>> EFI 1  with extent1
>>>>> free extent1
>>>>> EFD 1 with extent1
>>>>> transaction roll
>>>>> EFI 2 with extent2
>>>>> free extent2
>>>>> EFD 2 with extent2
>>>>> transaction roll
>>>>> EFI 3 with extent3
>>>>> free extent3
>>>>> EFD3 with extent3
>>>>> transaction commit
>>>> 
>>>> No, it wouldn't. This isn't how the deferred work processes work
>>>> items on the transaction. A work item with multiple extents on it
>>>> would result in this:
>>>> 
>>>> xfs_defer_finish(tp)  # tp contains three xefi work items 
>>>> xfs_defer_finish_noroll
>>>> xfs_defer_create_intents()
>>>>   list_for_each_defer_pending
>>>>     xfs_defer_create_intent(dfp)
>>>> ops->create_intent(tp, &dfp->dfp_work, dfp->dfp_count, sort);
>>>> xfs_extent_free_create_intent()
>>>>   <create EFI>
>>>>   list_for_each_xefi
>>>>     xfs_extent_free_log_item(xefi)
>>>> <adds extent to current EFI>
>>>> 
>>>> xfs_defer_trans_roll()
>>>> <save>
>>>> xfs_trans_roll()
>>>>   xfs_trans_dup()
>>>>   xfs_trans_commit()
>>>> <restore>
>>>> 
>>>> At this point we have this committed to the CIL
>>>> 
>>>> EFI 1 with extent1
>>>> EFI 2 with extent2
>>>> EFI 3 with extent3
>>>> 
>>>> And xfs_defer_finish_noroll() continues with
>>>> 
>>>> <grabs first work item>
>>>> xfs_defer_finish_one(dfp)
>>>> ->create_done(EFI 1)
>>>>   xfs_extent_free_create_done
>>>> <create EFD>
>>>> list_for_each_xefi
>>>>   ops->finish_item(tp, dfp->dfp_done, li, &state);
>>>>     xfs_extent_free_finish_item()
>>>> xfs_trans_free_extent
>>>> __xfs_free_extent
>>>>   <adds extent to EFD>
>>>> 
>>>> And once the processing of the single work item is done we loop
>>>> back to the start of the xfs_defer_finish_noroll() loop. We don't
>>>> have any new intents, so xfs_defer_create_intents() returns false,
>>>> but we completed a dfp work item, so we run:
>>>> 
>>>> xfs_defer_trans_roll()
>>>> <save>
>>>> xfs_trans_roll()
>>>>   xfs_trans_dup()
>>>>   xfs_trans_commit()
>>>> <restore>
>>>> 
>>>> At this point we have this committed to the CIL
>>>> 
>>>> EFI 1 with extent1
>>>> EFI 2 with extent2
>>>> EFI 3 with extent3
>>>> <AGF, AGFL, free space btree block mods>
>>>> EFD 1 with extent1
>>>> 
>>>> Then we run xfs_defer_finish_one() on EFI 2, commit, then run
>>>> xfs_defer_finish_one() on EFI 3. At this point, we have in the log:
>>>> 
>>>> EFI 1 with extent1
>>>> EFI 2 with extent2
>>>> EFI 3 with extent3
>>>> <AGF, AGFL, free space btree block mods>
>>>> EFD 1 with extent1
>>>> <AGF, AGFL, free space btree block mods>
>>>> EFD 2 with extent2
>>>> 
>>>> But we have not committed the final extent free or EFD 3 - that's in
>>>> the last transaction context we pass back to the _xfs_trans_commit()
>>>> context for it to finalise and close off the rolling transaction
>>>> chain. At this point, we finally have this in the CIL:
>>>> 
>>>> EFI 1 with extent1
>>>> EFI 2 with extent2
>>>> EFI 3 with extent3
>>>> <AGF, AGFL, free space btree block mods>
>>>> EFD 1 with extent1
>>>> <AGF, AGFL, free space btree block mods>
>>>> EFD 2 with extent2
>>>> <AGF, AGFL, free space btree block mods>
>>>> EFD 3 with extent3
>>> 
>>> 
>>> Yes, correct. thanks for so much details! I mis-read xfs_defer_create_intents()
>>> thinking it only create intent for the first in tp->t_dfops.
>>> 
>>>> 
>>>>> The EFI/EFD log item pairs should not be written to log as they appear in same checkpoint.
>>>> 
>>>> I *always* ignore CIL intent whiteouts when thinking about
>>>> transaction chains and intents. That is purely a journal efficiency
>>>> optimisation, not something that is necessary for correct operation.
>>> 
>>> OK.
>>> 
>>>> 
>>>>> Your idea yields this:
>>>>> 
>>>>> EFI 1 with extent1 extent2 extent3
>>>>> free extent1
>>>>> EFI 2 with extent2 extent3
>>>>> EFD 1 with extent1 extent2 extent3
>>>>> transaction commit
>>>>> create transaction
>>>>> free extent2
>>>>> EFI 3 with extent3
>>>>> EFD 2 with extent extent2 extent3
>>>>> transaction commit
>>>>> create transaction
>>>>> free extent3
>>>>> EFD 3 with extent3
>>>>> transaction commit.
>>>> 
>>>> The EFI/EFD contents are correct, but the rest of it is not - I am
>>>> not suggesting open coding transaction rolling like that. Everything
>>>> I am suggesting works through the same defer ops mechanism as you
>>>> are describing. So if we start with the initial journal contents
>>>> looks like this:
>>>> 
>>>> EFI 1 with extent1 extent2 extent3.
>>>> 
>>>> Then we run xfs_defer_finish_one() on that work, 
>>>> 
>>>> xfs_defer_finish_one(dfp)
>>>> ->create_done(EFI 1)
>>>>   xfs_extent_free_create_done
>>>> <create EFD>
>>>> list_for_each_xefi
>>>>   ops->finish_item(tp, dfp->dfp_done, li, &state);
>>>>     xfs_extent_free_finish_item()
>>>> xfs_trans_free_extent
>>>> __xfs_free_extent
>>>>   <adds extent to EFD>
>>>> 
>>>> But now we have 3 xefis on the work to process. So on success of
>>>> the first call to xfs_trans_free_extent(), we want it to return
>>>> -EAGAIN to trigger the existing relogging code to create the new
>>>> EFI. How this works is describe in the section "Requesting a
>>>> Fresh Transaction while Finishing Deferred Work" in
>>>> fs/xfs/libxfs/xfs_defer.c, no point me duplicating that here.
>>>> 
>>>> The result is that the deferred work infrastructure does the work
>>>> of updaing the done intent and creating the new intents for the work
>>>> remaining. Hence after the next transaction roll, we have in the CIL
>>>> 
>>>> EFI 1 with extent1 extent2 extent3.
>>>> <AGF, AGFL, free space btree block mods>
>>>> EFD 1 with extent1 extent2 extent3.
>>>> EFI 2 with extent2 extent3.
>>>> 
>>> 
>>> Taking transaction rolls into account (also adding up to EFD3), above would be:
>>> 
>>> EFI 1 with extent1 extent2 extent3.
>>> transaction roll
>>> <AGF, AGFL, free space btree block mods>  for extent 1
>>> EFD 1 with extent1 extent2 extent3.
>>> EFI 2 with extent2 extent3.
>>> transaction roll
>>> free extent 2
>>> EFD 2 with extent2 extent3
>>> EFI 3 with extent3
>>> transaction roll
>>> free extent 3
>>> EFD 3 with extent3
>>> 
>>> Because EFD N is always be with EFI N+1 in the SAME transaction, so EFI N+1 doesn’t have to be placed before EFD N.
>>> I got it.
>>> 
>>>> And so the loop goes until there is no more work remaining.
>>>> 
>>>>> Your implementation also includes three EFI/EFD pairs, that’s the same as mine.
>>>>> So actually it’s still one extent per EFI per transaction. Though you are not changing
>>>>> XFS_EFI_MAX_FAST_EXTENTS.
>>>> 
>>>> The difference is that what I'm suggesting is that the extent free
>>>> code can decide when it needs a transaction to be rolled. It isn't
>>>> forced to always run a single free per transaction, it can decide
>>>> that it can free multiple extents per transaction if there is no
>>>> risk of deadlocks (e.g. extents are in different AGs).  Forcing
>>>> everything to be limited to a transaction per extent free even when
>>>> there is no risk of deadlocks freeing multiple extents in a single
>>>> transaction is unnecessary.
>>>> 
>>>> Indeed, if the second extent is in a different AG, we don't risk
>>>> busy extents causing us issues, so we could do:
>>>> 
>>>> EFI 1 with extent1 extent2 extent3.
>>>> <AGF 1, AGFL 1, free space btree block mods>
>>>> <AGF 2, AGFL 2, free space btree block mods>
>>>> EFD 1 with extent1 extent2 extent3.
>>>> EFI 2 with extent3.
>>>> .....
>>>> 
>>> 
>>> My thumb is up.
>> 
>> Well, I am wondering if there is ABBA deadlock instead of self deadlock then.
>> Say process 1 produced busy extent 1 on AG 0 and now blocking at AGFL
>> allocation on AG 1. And at the same time, process 2 produced busy extent 2 on AG 1
>> and now blocking at AGFL allocation on AG 0. Anything prevents that from happening?
>> 
>> Shall we introduce a per FS list, say named “pending_busy_AGs", where the AG number
>> are stored there. For those AGs
>> they have pending busy extents in in-memory transactions (not committed to CIL).
>> We add the AG number to pending_busy_AGs at the start of xfs_trans_free_extent()
>> if it’s not there, and continue to free the extent. Otherwise if the AG number is already
>> there in pending_busy_AGs, we return -EAGAIN. The AG number is removed when
>> the busy extents are committed to the xfs_cil_ctx.
>> Well, wondering if the pending busy extents would remain in pending state long before
>> committed to CIL and that become a bottle neck for freeing extents.  
> 
> Thinking more, we could add per AG counter indicating current pending busy extents on the AG.
> For the 2nd+ extent in the EFI, return -EAGAIN on seeing the counter set (>0). As the first extent
> is always OK to go, there shouldn’t be a bottle neck.

I just sent the first attempt (this way), pls review.

thanks,
wengang





[Index of Archives]     [XFS Filesystem Development (older mail)]     [Linux Filesystem Development]     [Linux Audio Users]     [Yosemite Trails]     [Linux Kernel]     [Linux RAID]     [Linux SCSI]


  Powered by Linux