Re: [PATCH v5 03/10] xfs: Refactor xfs_reflink_end_cow_extent()

On Wed, Mar 12, 2025 at 08:46:36AM -0700, Darrick J. Wong wrote:
> On Wed, Mar 12, 2025 at 01:35:23AM -0700, Christoph Hellwig wrote:
> > On Wed, Mar 12, 2025 at 08:27:05AM +0000, John Garry wrote:
> > > On 12/03/2025 07:24, Christoph Hellwig wrote:
> > > > On Mon, Mar 10, 2025 at 06:39:39PM +0000, John Garry wrote:
> > > > > Refactor xfs_reflink_end_cow_extent() into separate parts which process
> > > > > the CoW range and commit the transaction.
> > > > > 
> > > > > This refactoring will be used in future when it is required to commit
> > > > > a range of extents as a single transaction, similar to how it was done
> > > > > pre-commit d6f215f359637.
> > > > 
> > > > Darrick pointed out that if you do more than just a tiny number
> > > > of extents per transactions you run out of log reservations very
> > > > quickly here:
> > > > 
> > > > https://lore.kernel.org/all/20240329162936.GI6390@frogsfrogsfrogs/
> > > > 
> > > > how does your scheme deal with that?
> > > > 
> > > The resblks calculation in xfs_reflink_end_atomic_cow() takes care of this,
> > > right? Or does the log reservation have a hard size limit, regardless of
> > > that calculation?
> > 
> > The resblks calculated there are the reserved disk blocks

Used for btree block allocations that might be needed during the
processing of the transaction.

> > and have
> > nothing to do with the log reservations, which comes from the
> > tr_write field passed in.  There is some kind of upper limit to it,
> > obviously bounded by the log size, although I'm not sure if we've formalized
> > that somewhere.  Dave might be the right person to ask about that.
> 
> The (very very rough) upper limit for how many intent items you can
> attach to a tr_write transaction is:
> 
> per_extent_cost = (cui_size + rui_size + bui_size + efi_size + ili_size)
> max_blocks = tr_write::tr_logres / per_extent_cost
> 
> (ili_size is the inode log item size)

That doesn't sound right. The number of intents we can log is not
dependent on the aggregated size of all intent types. We do not log
all those intent types in a single transaction, nor do we process
more than one type of intent in a given transaction. Also, we only
log the inode once per transaction, so that is not a per-extent
overhead.

Realistically, the tr_write transaction reservation is going to be at
least 100kB because it has to be big enough to log full splits of
multiple btrees (e.g. BMBT + both free space trees). Yeah, a small
filesystem with 4kB blocks spits out:

xfs_trans_resv_calc:  dev 7:0 type 0 logres 193528 logcount 5 flags 0x4

About 190kB.

However, intents are typically very small - around 32 bytes in size
plus another 12 bytes for the log region ophdr.

This implies that we can fit thousands of individual intents in a
single tr_write log reservation on any given filesystem, and the
number of loop iterations in a transaction is therefore dependent
largely on how many intents are logged per iteration.

Hence if we are walking a range of extents in the BMBT to unmap
them, then we should only be generating 2 intents per loop - a BUI
for the BMBT removal and a CUI for the shared refcount decrease.
That means we should be able to run at least a thousand iterations
of that loop per transaction without getting anywhere near the
transaction reservation limits.

*However!*

We have to relog every intent we haven't processed in the deferred
batch every-so-often to prevent the outstanding intents from pinning
the tail of the log. Hence the larger the number of intents in the
initial batch, the more work we have to do later on (and the more
overall log space and bandwidth they will consume) to relog them
over and over again until they pop to the head of the
processing queue.
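
To put a rough shape on that (a sketch, not a measurement): if the
batch starts with N pending intents and tail-pushing forces k relog
passes before the batch drains, that is roughly an extra k * N logged
items - and k itself grows with N, because a larger batch takes
longer to drain - so the relogging overhead grows faster than
linearly with the batch size.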

Hence there is no real performance advantage to creating massive intent
batches because we end up doing more work later on to relog those
intents to prevent journal space deadlocks. It also doesn't speed up
processing, because we still process the intent chains one at a time
from start to completion before moving on to the next high level
intent chain that needs to be processed.

Further, after the first couple of intent chains have been
processed, the initial log space reservation will have run out, and
we are now asking for a new reservation on every transaction roll we
do, i.e. we are now doing a log space reservation on every
transaction roll in the processing chain instead of only doing it
once per high level intent chain.
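
Concretely (simplifying the grant accounting, and reading the
logcount of 5 from the trace above): the permanent reservation only
pre-grants about 5 x 190kB up front; once those units are consumed by
the first few rolls, every further roll in a long chain has to go
back through the grant head for another ~190kB reservation.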

Hence from a log space accounting perspective (the hottest code path
in the journal), it is far more efficient to perform a single high
level transaction per extent unmap operation than it is to batch
intents into a single high level transaction.

My advice is this: we should never batch high level iterative
intent-based operations into a single transaction because it's a
false optimisation.  It might look like it is an efficiency
improvement from the high level, but it ends up hammering the hot,
performance critical paths in the transaction subsystem much, much
harder and so will end up being slower than the single transaction
per intent-based operation algorithm when it matters most....

-Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx



