On Mon, Mar 04, 2024 at 01:04:18PM +0000, John Garry wrote: > From: "Darrick J. Wong" <djwong@xxxxxxxxxx> > > The existing extsize hint code already did the work of expanding file > range mapping requests so that the range is aligned to the hint value. > Now add the code we need to guarantee that the space allocations are > also always aligned. > > XXX: still need to check all this with reflink > > Signed-off-by: "Darrick J. Wong" <djwong@xxxxxxxxxx> > Co-developed-by: John Garry <john.g.garry@xxxxxxxxxx> > Signed-off-by: John Garry <john.g.garry@xxxxxxxxxx> > --- > fs/xfs/libxfs/xfs_bmap.c | 22 +++++++++++++++++----- > fs/xfs/xfs_iomap.c | 4 +++- > 2 files changed, 20 insertions(+), 6 deletions(-) > > diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c > index 60d100134280..8dee60795cf4 100644 > --- a/fs/xfs/libxfs/xfs_bmap.c > +++ b/fs/xfs/libxfs/xfs_bmap.c > @@ -3343,6 +3343,19 @@ xfs_bmap_compute_alignments( > align = xfs_get_cowextsz_hint(ap->ip); > else if (ap->datatype & XFS_ALLOC_USERDATA) > align = xfs_get_extsz_hint(ap->ip); > + > + /* > + * xfs_get_cowextsz_hint() returns extsz_hint for when forcealign is > + * set as forcealign and cowextsz_hint are mutually exclusive > + */ > + if (xfs_inode_forcealign(ap->ip) && align) { > + args->alignment = align; > + if (stripe_align % align) > + stripe_align = align; This kinda does the right thing, but it strikes me as the wrong thing to be doing - it seems to conflates two different physical alignment properties in potentially bad ways, and we should never get this far into the allocator to discover these issues. Stripe alignment is alignment to physical disks in a RAID array. Inode forced alignment is file offset alignment to specificly defined LBA address ranges. The stripe unit/width is set at mkfs time, the inode extent size hint at runtime. These can only work together in specific situations: 1. max sized RWF_ATOMIC IO must be the same or smaller size than the stripe unit. Alternatively: stripe unit must be an integer (power of 2?) multiple of max atomic IO size. IOWs, stripe unit vs atomic IO constraints must be resolved in a compatible manner at mkfs time. If they can't be resolved (e.g. non-power of 2 stripe unit) then atomic IO cannot be done on the filesystem at all. Stripe unit constraints always win over atomic IO. 2. min sized RWF_ATOMIC IO must be an integer divider of stripe unit so that small atomic IOs are always correctly aligned to the stripe unit. 3. Inode forced alignment constraints set at runtime (i.e. extsize hints) must fit within the above stripe unit vs min/max atomic IO requirements. i.e. extent size hint will typically need to an integer multiple of stripe unit size which will always be compatible with stripe unit and atomic IO size and alignments... IOWs, these are static geometry constraints and so should be checked and rejected at the point where alignments are specified (i.e. mkfs, mount and ioctls). Then the allocator can simply assume that forced inode alignments are always stripe alignment compatible and we don't need separate handling of two possibly incompatible alignments. Regardless, I think the code as it stands won't work correctly when a stripe unit is not set. The correct functioning of this code appears to be dependent on stripe_align being set >= the extent size hint. If stripe align is not set, then what does (0 % align) return? If my checks are correct, this will return 0, and so the above code will end up with stripe_align = 0. Now, consider that args->alignment is used to signal to the allocator what the -start alignment- of the allocation is supposed to be, whilst args->prod/args->mod are used to tell the allocator what the tail adjustment is supposed to be. Stripe alignment wants start alignment, extent size hints want tail adjustment and force aligned allocation wants both start alignment and tail adjustment. However, the allocator overrides these depending on what it is doing. e.g. exact block EOF allocation turns off start alignment but has to ensure space is reserved for start alignment if it fails. This reserves space for stripe_align in args->minalignslop, but if force alignment has a start alignment requirement larger than stripe unit alignment, this code fails to reserve the necessary alignment slop for the force alignment code. If we aren't doing exact block EOF allocation, then we do stripe unit alignment. Again, if this fails and the forced alignment is larger than stripe alignment, this code does not reserve space for the larger alignment in args->minalignslop, so it will also do the wrong when we fall back to an args->alignment > 1 allocation. Failing to correctly account for minalignslop is known to cause accounting problems when the AG is very near empty, and the usual result of that is cancelling of a dirty allocation transaction and a forced shutdown. So this is something we need to get right. More below.... > + } else { > + args->alignment = 1; > + } Just initialise the allocation args structure with a value of 1 like we already do? > + > if (align) { > if (xfs_bmap_extsize_align(mp, &ap->got, &ap->prev, align, 0, > ap->eof, 0, ap->conv, &ap->offset, > @@ -3438,7 +3451,6 @@ xfs_bmap_exact_minlen_extent_alloc( > args.minlen = args.maxlen = ap->minlen; > args.total = ap->total; > > - args.alignment = 1; > args.minalignslop = 0; This likely breaks the debug code. This isn't intended for testing atomic write capability, it's for exercising low space allocation algorithms with worst case fragmentation patterns. This is only called when error injection is turned on to trigger this path, so we shouldn't need to support forced IO alignment here. > > args.minleft = ap->minleft; > @@ -3484,6 +3496,7 @@ xfs_bmap_btalloc_at_eof( > { > struct xfs_mount *mp = args->mp; > struct xfs_perag *caller_pag = args->pag; > + int orig_alignment = args->alignment; > int error; > > /* > @@ -3558,10 +3571,10 @@ xfs_bmap_btalloc_at_eof( > > /* > * Allocation failed, so turn return the allocation args to their > - * original non-aligned state so the caller can proceed on allocation > - * failure as if this function was never called. > + * original state so the caller can proceed on allocation failure as > + * if this function was never called. > */ > - args->alignment = 1; > + args->alignment = orig_alignment; > return 0; > } As I said above, we can't set an alignment of > 1 here if we haven't accounted for that alignment in args->minalignslop above. This leads to unexpected ENOSPC conditions and filesystem shutdowns. I suspect what we need to do is get rid of the separate stripe_align variable altogether and always just set args->alignment to what we need the extent start alignment to be, regardless of whether it is from stripe alignment or forced alignment. Then the code in xfs_bmap_btalloc_at_eof() doesn't need to know what 'stripe_align' is - the exact EOF block allocation can simply save and restore the args->alignment value and use it for minalignslop calculations for the initial exact block allocation. Then, if xfs_bmap_btalloc_at_eof() fails and xfs_inode_forcealign() is true, we can abort allocation immediately, and not bother to fall back on further aligned/unaligned attempts that will also fail or do the wrong them. Similarly, if we aren't doing EOF allocation, having args->alignment set means it will do the right thing for the first allocation attempt. Again, if that fails, we can check if xfs_inode_forcealign() is true and fail the aligned allocation instead of running the low space algorithm. This now makes it clear that we're failing the allocation because of the forced alignment requirement, and now the low space allocation code can explicitly turn off start alignment as it isn't required... > diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c > index 18c8f168b153..70fe873951f3 100644 > --- a/fs/xfs/xfs_iomap.c > +++ b/fs/xfs/xfs_iomap.c > @@ -181,7 +181,9 @@ xfs_eof_alignment( > * If mounted with the "-o swalloc" option the alignment is > * increased from the strip unit size to the stripe width. > */ > - if (mp->m_swidth && xfs_has_swalloc(mp)) > + if (xfs_inode_forcealign(ip)) > + align = xfs_get_extsz_hint(ip); > + else if (mp->m_swidth && xfs_has_swalloc(mp)) > align = mp->m_swidth; > else if (mp->m_dalign) > align = mp->m_dalign; This doesn't work for files with a current size of less than "align". The next line of code does: if (align && XFS_ISIZE(ip) < XFS_FSB_TO_B(mp, align)) align = 0; IOWs, the first aligned write allocation will occur with a file size of zero, and that will result in this function returning 0. i.e. speculative post EOF delalloc doesn't turn on and align the EOF until writes up to offset == align have already been completed. Essentially, this code wants to be: if (xfs_inode_forcealign(ip)) return xfs_get_extsz_hint(ip); So that any write to the a force aligned inode will always trigger extent size hint EOF alignment. -Dave. -- Dave Chinner david@xxxxxxxxxxxxx