Re: [PATCH 1/2] xfs: IO time one extent per EFI

Dave Chinner <david@xxxxxxxxxxxxx> · Fri, 21 Apr 2023 19:34:56 +1000

On Fri, Apr 21, 2023 at 12:24:49AM +0000, Wengang Wang wrote:
> > On Apr 20, 2023, at 4:22 PM, Dave Chinner <david@xxxxxxxxxxxxx> wrote:
> > We don't do that anymore for partially processed multi-extent
> > intents anymore. Instead, we use deferred ops to chain updates. i.e.
> > we log a complete intent done items alongside a new intent
> > containing the remaining work to be done in the same transaction.
> > This cancels the original intent and atomically replaces it with a
> > new intent containing the remaining work to be done.
> > 
> > So when I say "update the EFD" I'm using historic terminology for
> > processing and recovering multi-extent intents. In modern terms,
> > what I mean is "update the deferred work intent chain to reflect the
> > work remaining to be done".
> 
> OK. so let’s see the difference between your implementation from mine.
> Say, there are three extents to free.
> 
> My patch would result in:
> 
> EFI 1  with extent1
> free extent1
> EFD 1 with extent1
> transaction roll
> EFI 2 with extent2
> free extent2
> EFD 2 with extent2
> transaction roll
> EFI 3 with extent3
> free extent3
> EFD3 with extent3
> transaction commit

No, it wouldn't. This isn't how the deferred work processes work
items on the transaction. A work item with multiple extents on it
would result in this:

xfs_defer_finish(tp)  # tp contains three xefi work items 
  xfs_defer_finish_noroll
    xfs_defer_create_intents()
      list_for_each_defer_pending
        xfs_defer_create_intent(dfp)
	  ops->create_intent(tp, &dfp->dfp_work, dfp->dfp_count, sort);
	    xfs_extent_free_create_intent()
	      <create EFI>
	      list_for_each_xefi
	        xfs_extent_free_log_item(xefi)
		  <adds extent to current EFI>

  xfs_defer_trans_roll()
    <save>
    xfs_trans_roll()
      xfs_trans_dup()
      xfs_trans_commit()
    <restore>

At this point we have this committed to the CIL

EFI 1 with extent1
EFI 2 with extent2
EFI 3 with extent3

And xfs_defer_finish_noroll() continues with

  <grabs first work item>
  xfs_defer_finish_one(dfp)
    ->create_done(EFI 1)
      xfs_extent_free_create_done
	<create EFD>
    list_for_each_xefi
      ops->finish_item(tp, dfp->dfp_done, li, &state);
        xfs_extent_free_finish_item()
	  xfs_trans_free_extent
	    __xfs_free_extent
	      <adds extent to EFD>

And once the processing of the single work item is done we loop
back to the start of the xfs_defer_finish_noroll() loop. We don't
have any new intents, so xfs_defer_create_intents() returns false,
but we completed a dfp work item, so we run:

  xfs_defer_trans_roll()
    <save>
    xfs_trans_roll()
      xfs_trans_dup()
      xfs_trans_commit()
    <restore>

At this point we have this committed to the CIL

EFI 1 with extent1
EFI 2 with extent2
EFI 3 with extent3
<AGF, AGFL, free space btree block mods>
EFD 1 with extent1

Then we run xfs_defer_finish_one() on EFI 2, commit, then run
xfs_defer_finish_one() on EFI 3. At this point, we have in the log:

EFI 1 with extent1
EFI 2 with extent2
EFI 3 with extent3
<AGF, AGFL, free space btree block mods>
EFD 1 with extent1
<AGF, AGFL, free space btree block mods>
EFD 2 with extent2

But we have not committed the final extent free or EFD 3 - that's in
the last transaction context we pass back to the _xfs_trans_commit()
context for it to finalise and close off the rolling transaction
chain. At this point, we finally have this in the CIL:

EFI 1 with extent1
EFI 2 with extent2
EFI 3 with extent3
<AGF, AGFL, free space btree block mods>
EFD 1 with extent1
<AGF, AGFL, free space btree block mods>
EFD 2 with extent2
<AGF, AGFL, free space btree block mods>
EFD 3 with extent3

> The EFI/EFD log item pairs should not be written to log as they appear in same checkpoint.

I *always* ignore CIL intent whiteouts when thinking about
transaction chains and intents. That is purely a journal efficiency
optimisation, not something that is necessary for correct operation.

> Your idea yields this:
> 
> EFI 1 with extent1 extent2 extent3
> free extent1
> EFI 2 with extent2 extent3
> EFD 1 with extent1 extent2 extent3
> transaction commit
> create transaction
> free extent2
> EFI 3 with extent3
> EFD 2 with extent extent2 extent3
> transaction commit
> create transaction
> free extent3
> EFD 3 with extent3
> transaction commit.

The EFI/EFD contents are correct, but the rest of it is not - I am
not suggesting open coding transaction rolling like that. Everything
I am suggesting works through the same defer ops mechanism as you
are describing. So if we start with the initial journal contents
looks like this:

EFI 1 with extent1 extent2 extent3.

Then we run xfs_defer_finish_one() on that work, 

  xfs_defer_finish_one(dfp)
    ->create_done(EFI 1)
      xfs_extent_free_create_done
	<create EFD>
    list_for_each_xefi
      ops->finish_item(tp, dfp->dfp_done, li, &state);
        xfs_extent_free_finish_item()
	  xfs_trans_free_extent
	    __xfs_free_extent
	      <adds extent to EFD>

But now we have 3 xefis on the work to process. So on success of
the first call to xfs_trans_free_extent(), we want it to return
-EAGAIN to trigger the existing relogging code to create the new
EFI. How this works is describe in the section "Requesting a
Fresh Transaction while Finishing Deferred Work" in
fs/xfs/libxfs/xfs_defer.c, no point me duplicating that here.

The result is that the deferred work infrastructure does the work
of updaing the done intent and creating the new intents for the work
remaining. Hence after the next transaction roll, we have in the CIL

EFI 1 with extent1 extent2 extent3.
<AGF, AGFL, free space btree block mods>
EFD 1 with extent1 extent2 extent3.
EFI 2 with extent2 extent3.

And so the loop goes until there is no more work remaining.

> Your implementation also includes three EFI/EFD pairs, that’s the same as mine.
> So actually it’s still one extent per EFI per transaction. Though you are not changing
> XFS_EFI_MAX_FAST_EXTENTS.

The difference is that what I'm suggesting is that the extent free
code can decide when it needs a transaction to be rolled. It isn't
forced to always run a single free per transaction, it can decide
that it can free multiple extents per transaction if there is no
risk of deadlocks (e.g. extents are in different AGs).  Forcing
everything to be limited to a transaction per extent free even when
there is no risk of deadlocks freeing multiple extents in a single
transaction is unnecessary.

Indeed, if the second extent is in a different AG, we don't risk
busy extents causing us issues, so we could do:

EFI 1 with extent1 extent2 extent3.
<AGF 1, AGFL 1, free space btree block mods>
<AGF 2, AGFL 2, free space btree block mods>
EFD 1 with extent1 extent2 extent3.
EFI 2 with extent3.
.....

The difference is that limiting the number of xefi items per
deferred work item means we can only process a single extent per
work item regardless of the current context.

Using the existing defered work "on demand transaction rolling"
mechanism I'm suggesting we use doesn't put any artificial
constrains on how we log and process the intents. This allows us to
aggregate multiple extent frees into a single transaction when there
is no risk associated with doing so and we have sufficient
transaction reservations remaining to guarantee we can free the
extent. It's far more efficient, and the infrastructure we have in
place already supports this sort of functionality....

When we go back to the original problem of the AGFL allocation
needing to roll the transaction to get busy extents processed, then
we could have it return -EAGAIN all the way back up to the defer-ops
level and we simply then roll the transaction and retry the same
extent free operation again. i.e. where extent freeing needs to
block waiting on stuff like busy extents, we could now have these
commit the current transaction, push the current work item to the
back of the current context's queue and come back to it later.

At this point, I think the "single extent per transaction"
constraint that is needed to avoid the busy extent deadlock goes
away, and with it the need for limiting EFI processing to a single
extent goes away too....

> And your implementation may use more log space than mine in case the EFI
> (and EFD pair) can’t be cancelled.  :D

True, but it's really not a concern.  Don't conflate "intent
recovery intent amplification can cause log space deadlocks" with
"intent size is a problem". :)

> The only difference if that you use transaction commit and I am using transaction roll
> which is not safe as you said. 
> 
> Is my understanding correct?

I think I understand where we are misunderstanding each other :)
Perhaps it is now obvious to you as well from the analysis above.
If so, ignore the rest of this :)

What does "transaction roll" actually mean?

XFS has a concept of "rolling transactions". These are made up of a
series of individual transaction contexts that are linked together
by passing a single log reservation ticket between transaction
contexts.

xfs_trans_roll() effectively does:

	ntp = xfs_trans_dup(tp)
	....
	xfs_trans_commit(tp)

	return ntp;

i.e. it duplicates the current transaction handle, takes the unused
block reservation from it, grabs the log reservation ticket from it,
moves the defered ops from the old to the new transaction context,
then commits the old transaction context and returns the new one.

tl;dr: a rolling transaction is a series of linked independent
transactions that share a common set of block and log reservations.

To make a rolling transaction chain an atomic operation on a
specific object (e.g. an inode) that object has to remain locked and
be logged in every transaction in the chain of rolling transactions.
This keeps it moving forward in the log and prevents it from pinning
the tail of the log and causing log space deadlocks.

Fundamentally, though, each individual transaction in a rolling
transaction is an independent atomic change set. So when you say
"roll the transaction", what you are actually saying is "commit the
current transaction and start a new transaction using the
reservations the current transaction already holds."

When I say "transaction commit" I am only refering to the process
that closes off the current transaction. If this is in the middle of
a rolling transaction, then what I am talking about is
xfs_trans_roll() calling xfs_trans_commit() after it has duplicated
the current transaction context.

Transaction contexts running defered operations, intent chains, etc
are *always* rolling transactions, and each time we roll the
transaction we commit it.

IOWS, when I say "transaction commit" and you say "transaction roll"
we are talking about exactly the same thing - transaction commit is
the operation that roll performs to finish off the current change
set...

Ideally, nobody should be calling xfs_trans_roll() directly anymore.
All rolling transactions should be implemented using deferred ops
nowdays - xfs_trans_roll() is the historic method of running rolling
transactions. e.g. see the recent rework of the attribute code to
use deferred ops rather than explicit calls to xfs_trans_roll() so
we can use intents for running attr operations.

These days the transaction model is based around chains of deferred
operations. We attach deferred work to the transaction, and then
when we call xfs_trans_commit() it goes off into the deferred work
infrastructure that creates intents and manages the transaction
context for processing the intents itself.

This is the primary reason we are trying to move away from high
level code using transaction rolling - we can't easily change the
way we process operations when the high level code controls
transaction contexts. Using deferred intent chains gives us fine
grained control over transaction context in the deferred ops
processing code - we can roll to a new transaction whenever we need
to....

> One question is that if only one EFI is safe per transaction, how
> we ensure that there is only one EFI per transaction in case there
> are more than 16 (current XFS_EFI_MAX_FAST_EXTENTS) extents to
> free in current code?

That will handled exactly the same way it does with your change - it
will simply split up the work items so there are multiple intents
logged.

Cheers,

Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx