Re: [PATCH 3/6] xfs: xfs_itruncate_extents has no extent count limitation

Dave Chinner <david@xxxxxxxxxxxxx> · Tue, 1 Jun 2021 09:28:20 +1000

On Mon, May 31, 2021 at 06:35:40PM +0530, Chandan Babu R wrote:
> On 31 May 2021 at 18:25, Chandan Babu R wrote:
> > On 27 May 2021 at 10:21, Dave Chinner wrote:
> >> From: Dave Chinner <dchinner@xxxxxxxxxx>
> >>
> >> Ever since we moved to freeing of extents by deferred operations,
> >> we've already freed extents via individual transactions. Hence the
> >> only limitation of how many extents we can mark for freeing in a
> >> single xfs_bunmapi() call bound only by how many deferrals we want
> >> to queue.
> >>
> >> That is xfs_bunmapi() doesn't actually do any AG based extent
> >> freeing, so there's no actually transaction reservation used up by
> >> calling bunmapi with a large count of extents to be freed. RT
> >> extents have always been freed directly by bunmapi, but that doesn't
> >> require modification of large number of blocks as there are no
> >> btrees to split.
> >>
> >> Some callers of xfs_bunmapi assume that the extent count being freed
> >> is bound by geometry (e.g. directories) and these can ask bunmapi to
> >> free up to 64 extents in a single call. These functions just work as
> >> tehy stand, so there's no reason for truncate to have a limit of
> >> just two extents per bunmapi call anymore.
> >>
> >> Increase XFS_ITRUNC_MAX_EXTENTS to 64 to match the number of extents
> >> that can be deferred in a single loop to match what the directory
> >> code already uses.
> >>
> >> For realtime inodes, where xfs_bunmapi() directly frees extents,
> >> leave the limit at 2 extents per loop as this is all the space that
> >> the transaction reservation will cover.
> >>
> >> Signed-off-by: Dave Chinner <dchinner@xxxxxxxxxx>
> >> ---
> >>  fs/xfs/xfs_inode.c | 15 ++++++++++++---
> >>  1 file changed, 12 insertions(+), 3 deletions(-)
> >>
> >> diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
> >> index 0369eb22c1bb..db220eaa34b8 100644
> >> --- a/fs/xfs/xfs_inode.c
> >> +++ b/fs/xfs/xfs_inode.c
> >> @@ -40,9 +40,18 @@ kmem_zone_t *xfs_inode_zone;
> >>
> >>  /*
> >>   * Used in xfs_itruncate_extents().  This is the maximum number of extents
> >> - * freed from a file in a single transaction.
> >> + * we will unmap and defer for freeing in a single call to xfs_bunmapi().
> >> + * Realtime inodes directly free extents in xfs_bunmapi(), so are bound
> >> + * by transaction reservation size to 2 extents.
> >>   */
> >> -#define	XFS_ITRUNC_MAX_EXTENTS	2
> >> +static inline int
> >> +xfs_itrunc_max_extents(
> >> +	struct xfs_inode	*ip)
> >> +{
> >> +	if (XFS_IS_REALTIME_INODE(ip))
> >> +		return 2;
> >> +	return 64;
> >> +}
> >>
> >>  STATIC int xfs_iunlink(struct xfs_trans *, struct xfs_inode *);
> >>  STATIC int xfs_iunlink_remove(struct xfs_trans *, struct xfs_inode *);
> >> @@ -1402,7 +1411,7 @@ xfs_itruncate_extents_flags(
> >>  	while (unmap_len > 0) {
> >>  		ASSERT(tp->t_firstblock == NULLFSBLOCK);
> >>  		error = __xfs_bunmapi(tp, ip, first_unmap_block, &unmap_len,
> >> -				flags, XFS_ITRUNC_MAX_EXTENTS);
> >> +				flags, xfs_itrunc_max_extents(ip));
> >>  		if (error)
> >>  			goto out;
> >
> > The list of free extent items at xfs_defer_pending->dfp_work could
> > now contain XFS_EFI_MAX_FAST_EXTENTS (i.e. 16) entries in the worst case.

Yes, but we do exactly this when freeing a large fragmented directly
block. That is, we ask xfs_bunmapi to unmap a 64kB range regardless
of how many extents map that range.

IOWs, the limitation in extent count placed in
xfs_itruncate_extents() doesn't actually address the underlying
problem - it's just a band-aid that has been placed over the easy to
trigger transaction overrun symptom that has always been present in
the underlying extent freeing code.

> > For a single transaction, xfs_calc_itruncate_reservation() reserves space for
> > logging only 4 extents (i.e. 4 exts * 2 trees * (2 * max depth - 1) * block
> > size).
> 
> ... Sorry, I meant to say "xfs_calc_itruncate_reservation() reserves log space
> required for freeing 4 extents ..."

My point exactly - the code has always had a mismatch between
reservations and what we can stuff into an EFI.  EFIs are unbound in
size, so having a fixed "4 extents per EFI" reservation limit has
never made any real sense given that unmaping a 64kB directory block
on a 1kb block size filesystem has a worst case of freeing 64
extents in a single transaction. As I said above, the limitations
placed on xfs_itruncate_extents is largely a hack because it doesn't
address other avenues to the same overruns...

This "4 extents per transaction" reservation makes even less sense
now that extents are freed by defer ops, not by the transaction that
unmaps them. We've completely decoupled extent freeing from the
higher level code that unmaps them, and so now the freeing
transactions are independent of the high level code that runs the
freeing operations.

IOWs, having a reservation big enough to free a single extent is all
we should need now as we can break the extent freeing up into
individual transactions and still have them complete atomically even
after a crash. That's where I'm trying to get to here, but it's
clear that I need to refine the code a bit further such that we only
allow individual EFIs to queue up and free multiple extents within
an AG to be freed in the same transaction....

FWIW, I suspect that the right thing to do here is make use of the
xfs_defer_finish_one() mechanism for relogging the remaining intents
in work list while it is processing that list. All we need is for
xfs_extent_free_finish_item() to return -EAGAIN instead of running
the free transaction, and we will log the completed EFDs and the
remaining EFIs to be run and roll the transaction.

Hence we can control exactly when we roll the extent freeing
transaction (e.g. when the next extent to free is in a different AG
to the one we've just been freeing extents in) and as a result we
can define a fixed transaction reservation for extent freeing that
works in every situation without having to care about arbitrary
"how many extents can we unmap without overrun" concerns.

That's the goal here - fixed, small reservation for extent freeing
that is decoupled from and independent of the number of extents that
need to be freed atomically by the high level operation...

Hence I think that a minor amount of rework will allow the EFI code
to log large numbers of extents to be freed whilst still processing
them within single AG free space tree modification reservation
bounds...

Cheers,

Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx