Re: [PATCH v2 04/14] fs: xfs: Make file data allocations observe the 'forcealign' flag

Dave Chinner <david@xxxxxxxxxxxxx> · Tue, 5 Mar 2024 11:44:02 +1100

On Mon, Mar 04, 2024 at 01:04:18PM +0000, John Garry wrote:
> From: "Darrick J. Wong" <djwong@xxxxxxxxxx>
> 
> The existing extsize hint code already did the work of expanding file
> range mapping requests so that the range is aligned to the hint value.
> Now add the code we need to guarantee that the space allocations are
> also always aligned.
> 
> XXX: still need to check all this with reflink
> 
> Signed-off-by: "Darrick J. Wong" <djwong@xxxxxxxxxx>
> Co-developed-by: John Garry <john.g.garry@xxxxxxxxxx>
> Signed-off-by: John Garry <john.g.garry@xxxxxxxxxx>
> ---
>  fs/xfs/libxfs/xfs_bmap.c | 22 +++++++++++++++++-----
>  fs/xfs/xfs_iomap.c       |  4 +++-
>  2 files changed, 20 insertions(+), 6 deletions(-)
> 
> diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
> index 60d100134280..8dee60795cf4 100644
> --- a/fs/xfs/libxfs/xfs_bmap.c
> +++ b/fs/xfs/libxfs/xfs_bmap.c
> @@ -3343,6 +3343,19 @@ xfs_bmap_compute_alignments(
>  		align = xfs_get_cowextsz_hint(ap->ip);
>  	else if (ap->datatype & XFS_ALLOC_USERDATA)
>  		align = xfs_get_extsz_hint(ap->ip);
> +
> +	/*
> +	 * xfs_get_cowextsz_hint() returns extsz_hint for when forcealign is
> +	 * set as forcealign and cowextsz_hint are mutually exclusive
> +	 */
> +	if (xfs_inode_forcealign(ap->ip) && align) {
> +		args->alignment = align;
> +		if (stripe_align % align)
> +			stripe_align = align;

This kinda does the right thing, but it strikes me as the wrong
thing to be doing - it seems to conflates two different physical
alignment properties in potentially bad ways, and we should never
get this far into the allocator to discover these issues.

Stripe alignment is alignment to physical disks in a RAID array.
Inode forced alignment is file offset alignment to specificly
defined LBA address ranges.  The stripe unit/width is set at mkfs
time, the inode extent size hint at runtime.

These can only work together in specific situations:

	1. max sized RWF_ATOMIC IO must be the same or smaller size
	   than the stripe unit. Alternatively: stripe unit must be
	   an integer (power of 2?) multiple of max atomic IO size.

	   IOWs, stripe unit vs atomic IO constraints must be
	   resolved in a compatible manner at mkfs time. If they
	   can't be resolved (e.g. non-power of 2 stripe unit) then
	   atomic IO cannot be done on the filesystem at all. Stripe
	   unit constraints always win over atomic IO.

	2. min sized RWF_ATOMIC IO must be an integer divider of
	   stripe unit so that small atomic IOs are always correctly
	   aligned to the stripe unit.

	3. Inode forced alignment constraints set at runtime (i.e.
	   extsize hints) must fit within the above stripe unit vs
	   min/max atomic IO requirements.

	   i.e. extent size hint will typically need to an integer
	   multiple of stripe unit size which will always be
	   compatible with stripe unit and atomic IO size and
	   alignments...

IOWs, these are static geometry constraints and so should be checked
and rejected at the point where alignments are specified (i.e.
mkfs, mount and ioctls). Then the allocator can simply assume that
forced inode alignments are always stripe alignment compatible and
we don't need separate handling of two possibly incompatible
alignments.

Regardless, I think the code as it stands won't work correctly when
a stripe unit is not set.

The correct functioning of this code appears to be dependent on
stripe_align being set >= the extent size hint. If stripe align is
not set, then what does (0 % align) return? If my checks are
correct, this will return 0, and so the above code will end up with
stripe_align = 0.

Now, consider that args->alignment is used to signal to the
allocator what the -start alignment- of the allocation is supposed
to be, whilst args->prod/args->mod are used to tell the allocator
what the tail adjustment is supposed to be.

Stripe alignment wants start alignment, extent size hints want tail
adjustment and force aligned allocation wants both start alignment
and tail adjustment.

However, the allocator overrides these depending on what it is
doing. e.g. exact block EOF allocation turns off start alignment but
has to ensure space is reserved for start alignment if it fails.
This reserves space for stripe_align in args->minalignslop, but if
force alignment has a start alignment requirement larger than stripe
unit alignment, this code fails to reserve the necessary alignment
slop for the force alignment code.

If we aren't doing exact block EOF allocation, then we do stripe
unit alignment. Again, if this fails and the forced alignment is
larger than stripe alignment, this code does not reserve space for
the larger alignment in args->minalignslop, so it will also do the
wrong when we fall back to an args->alignment > 1 allocation.

Failing to correctly account for minalignslop is known to cause
accounting problems when the AG is very near empty, and the usual
result of that is cancelling of a dirty allocation transaction and a
forced shutdown. So this is something we need to get right.

More below....

> +	} else {
> +		args->alignment = 1;
> +	}

Just initialise the allocation args structure with a value of 1 like
we already do?

> +
>  	if (align) {
>  		if (xfs_bmap_extsize_align(mp, &ap->got, &ap->prev, align, 0,
>  					ap->eof, 0, ap->conv, &ap->offset,
> @@ -3438,7 +3451,6 @@ xfs_bmap_exact_minlen_extent_alloc(
>  	args.minlen = args.maxlen = ap->minlen;
>  	args.total = ap->total;
>  
> -	args.alignment = 1;
>  	args.minalignslop = 0;

This likely breaks the debug code. This isn't intended for testing
atomic write capability, it's for exercising low space allocation
algorithms with worst case fragmentation patterns. This is only
called when error injection is turned on to trigger this path, so we
shouldn't need to support forced IO alignment here.

>  
>  	args.minleft = ap->minleft;
> @@ -3484,6 +3496,7 @@ xfs_bmap_btalloc_at_eof(
>  {
>  	struct xfs_mount	*mp = args->mp;
>  	struct xfs_perag	*caller_pag = args->pag;
> +	int			orig_alignment = args->alignment;
>  	int			error;
>  
>  	/*
> @@ -3558,10 +3571,10 @@ xfs_bmap_btalloc_at_eof(
>  
>  	/*
>  	 * Allocation failed, so turn return the allocation args to their
> -	 * original non-aligned state so the caller can proceed on allocation
> -	 * failure as if this function was never called.
> +	 * original state so the caller can proceed on allocation failure as
> +	 * if this function was never called.
>  	 */
> -	args->alignment = 1;
> +	args->alignment = orig_alignment;
>  	return 0;
>  }

As I said above, we can't set an alignment of > 1 here if we haven't
accounted for that alignment in args->minalignslop above. This leads
to unexpected ENOSPC conditions and filesystem shutdowns.

I suspect what we need to do is get rid of the separate stripe_align
variable altogether and always just set args->alignment to what we
need the extent start alignment to be, regardless of whether it is
from stripe alignment or forced alignment.

Then the code in xfs_bmap_btalloc_at_eof() doesn't need to know what
'stripe_align' is - the exact EOF block allocation can simply save
and restore the args->alignment value and use it for minalignslop
calculations for the initial exact block allocation.

Then, if xfs_bmap_btalloc_at_eof() fails and xfs_inode_forcealign()
is true, we can abort allocation immediately, and not bother to fall
back on further aligned/unaligned attempts that will also fail or do
the wrong them.

Similarly, if we aren't doing EOF allocation, having args->alignment
set means it will do the right thing for the first allocation
attempt. Again, if that fails, we can check if
xfs_inode_forcealign() is true and fail the aligned allocation
instead of running the low space algorithm. This now makes it clear
that we're failing the allocation because of the forced alignment
requirement, and now the low space allocation code can explicitly
turn off start alignment as it isn't required...

> diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
> index 18c8f168b153..70fe873951f3 100644
> --- a/fs/xfs/xfs_iomap.c
> +++ b/fs/xfs/xfs_iomap.c
> @@ -181,7 +181,9 @@ xfs_eof_alignment(
>  		 * If mounted with the "-o swalloc" option the alignment is
>  		 * increased from the strip unit size to the stripe width.
>  		 */
> -		if (mp->m_swidth && xfs_has_swalloc(mp))
> +		if (xfs_inode_forcealign(ip))
> +			align = xfs_get_extsz_hint(ip);
> +		else if (mp->m_swidth && xfs_has_swalloc(mp))
>  			align = mp->m_swidth;
>  		else if (mp->m_dalign)
>  			align = mp->m_dalign;

This doesn't work for files with a current size of less than
"align". The next line of code does:

	if (align && XFS_ISIZE(ip) < XFS_FSB_TO_B(mp, align))
		align = 0;

IOWs, the first aligned write allocation will occur with a file size
of zero, and that will result in this function returning 0. i.e.
speculative post EOF delalloc doesn't turn on and align the EOF
until writes up to offset == align have already been completed.

Essentially, this code wants to be:

	if (xfs_inode_forcealign(ip))
		return xfs_get_extsz_hint(ip);

So that any write to the a force aligned inode will always trigger
extent size hint EOF alignment.

-Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx