Re: [PATCH v4 00/14] forcealign for xfs

Dave Chinner <david@xxxxxxxxxxxxx> · Mon, 16 Sep 2024 15:25:06 +1000

On Mon, Sep 09, 2024 at 05:18:43PM +0100, John Garry wrote:
> > > > Patch 10 also modifies xfs_can_free_eofblocks() to take alignment
> > > > into account for the post-eof block removal, but doesn't change
> > > > xfs_free_eofblocks() at all. i.e  it also relies on
> > > > xfs_itruncate_extents_flags() to do the right thing for force
> > > > aligned inodes.
> > > 
> > > What state should the blocks post-EOF blocks be? A simple example of
> > > partially truncating an alloc unit is:
> > > 
> > > $xfs_io -c "extsize" mnt/file
> > > [16384] mnt/file
> > > 
> > > 
> > > $xfs_bmap -vvp mnt/file
> > > mnt/file:
> > >   EXT: FILE-OFFSET      BLOCK-RANGE      AG AG-OFFSET        TOTAL FLAGS
> > >     0: [0..20479]:      192..20671        0 (192..20671)     20480 000000
> > > 
> > > 
> > > $truncate -s 10461184 mnt/file # 10M - 6FSB
> > > 
> > > $xfs_bmap -vvp mnt/file
> > > mnt/file:
> > >   EXT: FILE-OFFSET      BLOCK-RANGE      AG AG-OFFSET        TOTAL FLAGS
> > >     0: [0..20431]:      192..20623        0 (192..20623)     20432 000000
> > >     1: [20432..20447]:  20624..20639      0 (20624..20639)      16 010000
> > >   FLAG Values:
> > >      0010000 Unwritten preallocated extent
> > > 
> > > Is that incorrect state?
> > 
> > Think about it: what happens if you now truncate it back up to 10MB
> > (i.e. aligned length) and then do an aligned atomic write on it.
> > 
> > First: What happens when you truncate up?
> > 
> > ......
> > 
> > Yes, iomap_zero_range() will see the unwritten extent and skip it.
> > i.e. The unwritten extent stays as an unwritten extent, it's now
> > within EOF. That written->unwritten extent boundary is not on an
> > aligned file offset.
> 
> Right
> 
> > 
> > Second: What happens when you do a correctly aligned atomic write
> > that spans this range now?
> > 
> > ......
> > 
> > Iomap only maps a single extent at a time, so it will only map the
> > written range from the start of the IO (aligned) to the start of the
> > unwritten extent (unaligned).  Hence the atomic write will be
> > rejected because we can't do the atomic write to such an unaligned
> > extent.
> 
> It was being considered to change this handling for atomic writes. More
> below at *.

I don't think that this is something specific to atomic writes -
forced alignment means -alignment is guaranteed- regardless of what
ends up using it.

Yes, we can track unwritten extents on an -unaligned- boundary, but
that doesn't mean that we should allow it when we are trying to
guarantee logical and physical alignment of the file offset and
extent boundaries. i.e. The definition of forced alignment behaviour
is that all file offsets and extents in the file are aligned to the
same alignment.

I don't see an exception that allows for unaligned unwritten
extents in that definition.

> > That's not a bug in the atomic write path - this failure occurs
> > because of the truncate behaviour doing post-eof unwritten extent
> > conversion....
> > 
> > Yes, I agree that the entire -physical- extent is still correctly
> > aligned on disk so you could argue that the unwritten conversion
> > that xfs_bunmapi_range is doing is valid forced alignment behaviour.
> > However, the fact is that breaking the aligned physical extent into
> > two unaligned contiguous extents in different states in the BMBT
> > means that they are treated as two seperate unaligned extents, not
> > one contiguous aligned physical extent.
> 
> Right, this is problematic.
> 
> * I guess that you had not been following the recent discussion on this
> topic in the latest xfs atomic writes series @ https://lore.kernel.org/linux-xfs/20240817094800.776408-1-john.g.garry@xxxxxxxxxx/
> and also mentioned earlier in
> https://lore.kernel.org/linux-xfs/20240726171358.GA27612@xxxxxx/
> 
> There I dropped the sub-alloc unit zeroing. The concept to iter for a single
> bio seems sane, but as Darrick mentioned, we have issue of non-atomically
> committing all the extent conversions.

Yes, I understand these problems exist.  My entire point is that the
forced alignment implemention should never allow such unaligned
extent patterns to be created in the first place. If we avoid
creating such situations in the first place, then we never have to
care about about unaligned unwritten extent conversion breaking
atomic IO.

FWIW, I also understand things are different if we are doing 128kB
atomic writes on 16kB force aligned files. However, in this
situation we are treating the 128kB atomic IO as eight individual
16kB atomic IOs that are physically contiguous. Hence in this
situation it doesn't matter if we have a mix of 16kB aligned
written/unwritten/hole extents as each 16kB chunks is independent of
the others.

What matters is that each indivudal 16kB chunk shows either the old
data or the new data - we are not guaranteeing that the entire 128kB
write is atomic. Hence in this situation we can both submit and
process each 16kB shunk as independent IOs with independent IO
compeltion transactions. All that matters is that we don't signal
completion to userspace until all the IO is complete, and we already
do that for fragmented DIO writes...

> > Again, this is different to the traditional RT file behaviour - it
> > can use unwritten extents for sub-alloc-unit alignment unmaps
> > because the RT device can align file offset to any physical offset,
> > and issue unaligned sector sized IO without any restrictions. Forced
> > alignment does not have this freedom, and when we extend forced
> > alignment to RT files, it will not have the freedom to use
> > unwritten extents for sub-alloc-unit unmapping, either.
> > 
> So how do you think that we should actually implement
> xfs_itruncate_extents_flags() properly for forcealign? Would it simply be
> like:
> 
> --- a/fs/xfs/xfs_inode.c
> +++ b/fs/xfs/xfs_inode.c
> @@ -1050,7 +1050,7 @@ xfs_itruncate_extents_flags(
>                 WARN_ON_ONCE(first_unmap_block > XFS_MAX_FILEOFF);
>                 return 0;
>         }
> +	if (xfs_inode_has_forcealign(ip))
> +	       first_unmap_block = xfs_inode_roundup_alloc_unit(ip,
> first_unmap_block);
>         error = xfs_bunmapi_range(&tp, ip, flags, first_unmap_block,

Yes, it would be something like that, except it would have to be
done before first_unmap_block is verified.

-Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx