AFAICS, forcealign behaviour is same as RT, so then this would be a mainline
bug, right?
No, I don't think so. I think this is a case where forcealign and RT
behaviours differ, primarily because RT doesn't actually care about
file offset -> physical offset alignment and can do unaligned IO
whenever it wants. Hence having an unaligned written->unwritten
extent state transition doesnt' affect anything for rt files...
ok
We change the space reservation in xfs-setattr_size() for this case
(patch 9) but then don't do any alignment there - it relies on
xfs_itruncate_extents_flags() to do the right thing w.r.t. extent
removal alignment w.r.t. the new EOF.
i.e. The xfs_setattr_size() code takes care of EOF block zeroing and
page cache removal so the user doesn't see old data beyond EOF,
whilst xfs_itruncate_extents_flags() is supposed to take care of the
extent removal and the details of that operation (e.g. alignment).
So we should roundup the unmap block to the alloc unit, correct? I have my
doubts about that, and thought that xfs_bunmapi_range() takes care of any
alignment handling.
The start should round up, yes. And, no, xfs_bunmapi_range() isn't
specifically "taking care" of alignment. It's just marking a partial
alloc unit range as unwritten, which means we now have -unaligned
extents- in the BMBT.
Yes, I have noticed this previously.
Patch 10 also modifies xfs_can_free_eofblocks() to take alignment
into account for the post-eof block removal, but doesn't change
xfs_free_eofblocks() at all. i.e it also relies on
xfs_itruncate_extents_flags() to do the right thing for force
aligned inodes.
What state should the blocks post-EOF blocks be? A simple example of
partially truncating an alloc unit is:
$xfs_io -c "extsize" mnt/file
[16384] mnt/file
$xfs_bmap -vvp mnt/file
mnt/file:
EXT: FILE-OFFSET BLOCK-RANGE AG AG-OFFSET TOTAL FLAGS
0: [0..20479]: 192..20671 0 (192..20671) 20480 000000
$truncate -s 10461184 mnt/file # 10M - 6FSB
$xfs_bmap -vvp mnt/file
mnt/file:
EXT: FILE-OFFSET BLOCK-RANGE AG AG-OFFSET TOTAL FLAGS
0: [0..20431]: 192..20623 0 (192..20623) 20432 000000
1: [20432..20447]: 20624..20639 0 (20624..20639) 16 010000
FLAG Values:
0010000 Unwritten preallocated extent
Is that incorrect state?
Think about it: what happens if you now truncate it back up to 10MB
(i.e. aligned length) and then do an aligned atomic write on it.
First: What happens when you truncate up?
......
Yes, iomap_zero_range() will see the unwritten extent and skip it.
i.e. The unwritten extent stays as an unwritten extent, it's now
within EOF. That written->unwritten extent boundary is not on an
aligned file offset.
Right
Second: What happens when you do a correctly aligned atomic write
that spans this range now?
......
Iomap only maps a single extent at a time, so it will only map the
written range from the start of the IO (aligned) to the start of the
unwritten extent (unaligned). Hence the atomic write will be
rejected because we can't do the atomic write to such an unaligned
extent.
It was being considered to change this handling for atomic writes. More
below at *.
That's not a bug in the atomic write path - this failure occurs
because of the truncate behaviour doing post-eof unwritten extent
conversion....
Yes, I agree that the entire -physical- extent is still correctly
aligned on disk so you could argue that the unwritten conversion
that xfs_bunmapi_range is doing is valid forced alignment behaviour.
However, the fact is that breaking the aligned physical extent into
two unaligned contiguous extents in different states in the BMBT
means that they are treated as two seperate unaligned extents, not
one contiguous aligned physical extent.
Right, this is problematic.
* I guess that you had not been following the recent discussion on this
topic in the latest xfs atomic writes series @
https://lore.kernel.org/linux-xfs/20240817094800.776408-1-john.g.garry@xxxxxxxxxx/
and also mentioned earlier in
https://lore.kernel.org/linux-xfs/20240726171358.GA27612@xxxxxx/
There I dropped the sub-alloc unit zeroing. The concept to iter for a
single bio seems sane, but as Darrick mentioned, we have issue of
non-atomically committing all the extent conversions.
FWIW, this is how I think that the change according to Darrick's idea
would look:
---->8----
diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
index 72c981e3dc92..ec76d6a8750d 100644
--- a/fs/xfs/xfs_iomap.c
+++ b/fs/xfs/xfs_iomap.c
@@ -244,7 +244,8 @@ xfs_iomap_write_direct(
xfs_fileoff_t count_fsb,
unsigned int flags,
struct xfs_bmbt_irec *imap,
- u64 *seq)
+ u64 *seq,
+ bool zeroing)
{
struct xfs_mount *mp = ip->i_mount;
struct xfs_trans *tp;
@@ -285,7 +286,7 @@ xfs_iomap_write_direct(
* the reserve block pool for bmbt block allocation if there is no space
* left but we need to do unwritten extent conversion.
*/
+ if (zeroing)
+ bmapi_flags = XFS_BMAPI_CONVERT | XFS_BMAPI_ZERO;
if (flags & IOMAP_DAX) {
bmapi_flags = XFS_BMAPI_CONVERT | XFS_BMAPI_ZERO;
if (imap->br_state == XFS_EXT_UNWRITTEN) {
force = true;
@@ -780,6 +781,11 @@ xfs_direct_write_iomap_begin(
u16 iomap_flags = 0;
unsigned int lockmode;
u64 seq;
+ bool atomic = flags & IOMAP_ATOMIC;
+ struct xfs_bmbt_irec imap2[XFS_BMAP_MAX_NMAP];
+ xfs_fileoff_t _offset_fsb = xfs_inode_rounddown_alloc_unit(ip,
offset_fsb);
+ xfs_fileoff_t _end_fsb = xfs_inode_roundup_alloc_unit(ip, end_fsb);
+ int loop_index;
ASSERT(flags & (IOMAP_WRITE | IOMAP_ZERO));
@@ -843,6 +849,9 @@ xfs_direct_write_iomap_begin(
if (imap_needs_alloc(inode, flags, &imap, nimaps))
goto allocate_blocks;
+ if (atomic && !imap_spans_range(&imap, offset_fsb, end_fsb))
+ goto out_atomic;
+
/*
* NOWAIT and OVERWRITE I/O needs to span the entire requested I/O with
* a single map so that we avoid partial IO failures due to the rest of
@@ -897,7 +906,7 @@ xfs_direct_write_iomap_begin(
xfs_iunlock(ip, lockmode);
error = xfs_iomap_write_direct(ip, offset_fsb, end_fsb - offset_fsb,
- flags, &imap, &seq);
+ flags, &imap, &seq, false);
if (error)
return error;
@@ -918,6 +927,28 @@ xfs_direct_write_iomap_begin(
xfs_iunlock(ip, lockmode);
return xfs_bmbt_to_iomap(ip, iomap, &cmap, flags, IOMAP_F_SHARED, seq);
+out_atomic:
+ nimaps = XFS_BMAP_MAX_NMAP;
+
+ error = xfs_bmapi_read(ip, _offset_fsb, _end_fsb - _offset_fsb, &imap2[0],
+ &nimaps, 0);
+ for (loop_index = 0; loop_index < nimaps; loop_index++) {
+ struct xfs_bmbt_irec *_imap2 = &imap2[loop_index];
+
+ if (_imap2->br_state == XFS_EXT_UNWRITTEN) {
+
+ xfs_iunlock(ip, lockmode);
+
+ error = xfs_iomap_write_direct(ip, _imap2->br_startoff,
_imap2->br_blockcount,
+ flags, &imap, &seq, true);
+ if (error)
+ return error;
+ goto relock;
+ }
+ }
+ /* We should not reach this, but TODO better handling */
+ error = -EINVAL;
+
out_unlock:
if (lockmode)
xfs_iunlock(ip, lockmode);
-----8<----
But I have my doubts about using XFS_BMAPI_ZERO, even if it is just for
pre-zeroing unwritten parts of the alloc unit for an atomic write.
This extent state discontiunity is results in breaking physical IO
across the extent state boundary. Hence such an unaligned extent
state change violates the physical IO alignment guarantees that
forced alignment is supposed to provide atomic writes...
Yes
This is the reason why operations like hole punching round to extent
size for forced alignment at the highest level. i.e. We cannot allow
xfs_bunmapi_range() to "take care" of alignment handling because
managing partial extent unmappings with sub-alloc-unit unwritten
extents does not work for forced alignment.
Again, this is different to the traditional RT file behaviour - it
can use unwritten extents for sub-alloc-unit alignment unmaps
because the RT device can align file offset to any physical offset,
and issue unaligned sector sized IO without any restrictions. Forced
alignment does not have this freedom, and when we extend forced
alignment to RT files, it will not have the freedom to use
unwritten extents for sub-alloc-unit unmapping, either.
So how do you think that we should actually implement
xfs_itruncate_extents_flags() properly for forcealign? Would it simply
be like:
--- a/fs/xfs/xfs_inode.c
+++ b/fs/xfs/xfs_inode.c
@@ -1050,7 +1050,7 @@ xfs_itruncate_extents_flags(
WARN_ON_ONCE(first_unmap_block > XFS_MAX_FILEOFF);
return 0;
}
+ if (xfs_inode_has_forcealign(ip))
+ first_unmap_block = xfs_inode_roundup_alloc_unit(ip,
first_unmap_block);
error = xfs_bunmapi_range(&tp, ip, flags, first_unmap_block,
Thanks,
John