Re: [PATCH 3/6] xfs: xfs_itruncate_extents has no extent count limitation

Chandan Babu R <chandanrlinux@xxxxxxxxx> · Tue, 01 Jun 2021 12:12:36 +0530

On 01 Jun 2021 at 04:58, Dave Chinner wrote:
> On Mon, May 31, 2021 at 06:35:40PM +0530, Chandan Babu R wrote:
>> On 31 May 2021 at 18:25, Chandan Babu R wrote:
>> > On 27 May 2021 at 10:21, Dave Chinner wrote:
>> >> From: Dave Chinner <dchinner@xxxxxxxxxx>
>> >>
>> >> Ever since we moved to freeing of extents by deferred operations,
>> >> we've already freed extents via individual transactions. Hence the
>> >> only limitation of how many extents we can mark for freeing in a
>> >> single xfs_bunmapi() call bound only by how many deferrals we want
>> >> to queue.
>> >>
>> >> That is xfs_bunmapi() doesn't actually do any AG based extent
>> >> freeing, so there's no actually transaction reservation used up by
>> >> calling bunmapi with a large count of extents to be freed. RT
>> >> extents have always been freed directly by bunmapi, but that doesn't
>> >> require modification of large number of blocks as there are no
>> >> btrees to split.
>> >>
>> >> Some callers of xfs_bunmapi assume that the extent count being freed
>> >> is bound by geometry (e.g. directories) and these can ask bunmapi to
>> >> free up to 64 extents in a single call. These functions just work as
>> >> tehy stand, so there's no reason for truncate to have a limit of
>> >> just two extents per bunmapi call anymore.
>> >>
>> >> Increase XFS_ITRUNC_MAX_EXTENTS to 64 to match the number of extents
>> >> that can be deferred in a single loop to match what the directory
>> >> code already uses.
>> >>
>> >> For realtime inodes, where xfs_bunmapi() directly frees extents,
>> >> leave the limit at 2 extents per loop as this is all the space that
>> >> the transaction reservation will cover.
>> >>
>> >> Signed-off-by: Dave Chinner <dchinner@xxxxxxxxxx>
>> >> ---
>> >>  fs/xfs/xfs_inode.c | 15 ++++++++++++---
>> >>  1 file changed, 12 insertions(+), 3 deletions(-)
>> >>
>> >> diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
>> >> index 0369eb22c1bb..db220eaa34b8 100644
>> >> --- a/fs/xfs/xfs_inode.c
>> >> +++ b/fs/xfs/xfs_inode.c
>> >> @@ -40,9 +40,18 @@ kmem_zone_t *xfs_inode_zone;
>> >>
>> >>  /*
>> >>   * Used in xfs_itruncate_extents().  This is the maximum number of extents
>> >> - * freed from a file in a single transaction.
>> >> + * we will unmap and defer for freeing in a single call to xfs_bunmapi().
>> >> + * Realtime inodes directly free extents in xfs_bunmapi(), so are bound
>> >> + * by transaction reservation size to 2 extents.
>> >>   */
>> >> -#define	XFS_ITRUNC_MAX_EXTENTS	2
>> >> +static inline int
>> >> +xfs_itrunc_max_extents(
>> >> +	struct xfs_inode	*ip)
>> >> +{
>> >> +	if (XFS_IS_REALTIME_INODE(ip))
>> >> +		return 2;
>> >> +	return 64;
>> >> +}
>> >>
>> >>  STATIC int xfs_iunlink(struct xfs_trans *, struct xfs_inode *);
>> >>  STATIC int xfs_iunlink_remove(struct xfs_trans *, struct xfs_inode *);
>> >> @@ -1402,7 +1411,7 @@ xfs_itruncate_extents_flags(
>> >>  	while (unmap_len > 0) {
>> >>  		ASSERT(tp->t_firstblock == NULLFSBLOCK);
>> >>  		error = __xfs_bunmapi(tp, ip, first_unmap_block, &unmap_len,
>> >> -				flags, XFS_ITRUNC_MAX_EXTENTS);
>> >> +				flags, xfs_itrunc_max_extents(ip));
>> >>  		if (error)
>> >>  			goto out;
>> >
>> > The list of free extent items at xfs_defer_pending->dfp_work could
>> > now contain XFS_EFI_MAX_FAST_EXTENTS (i.e. 16) entries in the worst case.
>
> Yes, but we do exactly this when freeing a large fragmented directly
> block. That is, we ask xfs_bunmapi to unmap a 64kB range regardless
> of how many extents map that range.
>
> IOWs, the limitation in extent count placed in
> xfs_itruncate_extents() doesn't actually address the underlying
> problem - it's just a band-aid that has been placed over the easy to
> trigger transaction overrun symptom that has always been present in
> the underlying extent freeing code.
>
>> > For a single transaction, xfs_calc_itruncate_reservation() reserves space for
>> > logging only 4 extents (i.e. 4 exts * 2 trees * (2 * max depth - 1) * block
>> > size).
>>
>> ... Sorry, I meant to say "xfs_calc_itruncate_reservation() reserves log space
>> required for freeing 4 extents ..."
>
> My point exactly - the code has always had a mismatch between
> reservations and what we can stuff into an EFI.  EFIs are unbound in
> size, so having a fixed "4 extents per EFI" reservation limit has
> never made any real sense given that unmaping a 64kB directory block
> on a 1kb block size filesystem has a worst case of freeing 64
> extents in a single transaction. As I said above, the limitations
> placed on xfs_itruncate_extents is largely a hack because it doesn't
> address other avenues to the same overruns...
>

Ok. Thanks for the explanation.

> This "4 extents per transaction" reservation makes even less sense
> now that extents are freed by defer ops, not by the transaction that
> unmaps them. We've completely decoupled extent freeing from the
> higher level code that unmaps them, and so now the freeing
> transactions are independent of the high level code that runs the
> freeing operations.
>
> IOWs, having a reservation big enough to free a single extent is all
> we should need now as we can break the extent freeing up into
> individual transactions and still have them complete atomically even
> after a crash. That's where I'm trying to get to here, but it's
> clear that I need to refine the code a bit further such that we only
> allow individual EFIs to queue up and free multiple extents within
> an AG to be freed in the same transaction....

Ok. That should limit the number of CNTBT/BNOBT blocks being logged by a
transaction which is freeing extents.

>
> FWIW, I suspect that the right thing to do here is make use of the
> xfs_defer_finish_one() mechanism for relogging the remaining intents
> in work list while it is processing that list. All we need is for
> xfs_extent_free_finish_item() to return -EAGAIN instead of running
> the free transaction, and we will log the completed EFDs and the
> remaining EFIs to be run and roll the transaction.
>
> Hence we can control exactly when we roll the extent freeing
> transaction (e.g. when the next extent to free is in a different AG
> to the one we've just been freeing extents in) and as a result we
> can define a fixed transaction reservation for extent freeing that
> works in every situation without having to care about arbitrary
> "how many extents can we unmap without overrun" concerns.
>

That sounds perfect.

> That's the goal here - fixed, small reservation for extent freeing
> that is decoupled from and independent of the number of extents that
> need to be freed atomically by the high level operation...
>
> Hence I think that a minor amount of rework will allow the EFI code
> to log large numbers of extents to be freed whilst still processing
> them within single AG free space tree modification reservation
> bounds...

--
chandan