On 2024/7/1 9:16, Dave Chinner wrote: > On Tue, Jun 18, 2024 at 10:21:11PM +0800, Zhang Yi wrote: >> From: Zhang Yi <yi.zhang@xxxxxxxxxx> >> >> When unaligned truncate down a big realtime file, xfs_truncate_page() >> only zeros out the tail EOF block, __xfs_bunmapi() should split the tail >> written extent and convert the later one that beyond EOF block to >> unwritten, but it couldn't work as expected now since the reserved block >> is zero in xfs_setattr_size(), this could expose stale data just after >> commit '943bc0882ceb ("iomap: don't increase i_size if it's not a write >> operation")'. >> >> If we truncate file that contains a large enough written extent: >> >> |< rxext >|< rtext >| >> ...WWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWW >> ^ (new EOF) ^ old EOF >> >> Since we only zeros out the tail of the EOF block, and >> xfs_itruncate_extents()->..->__xfs_bunmapi() unmap the whole ailgned >> extents, it becomes this state: >> >> |< rxext >| >> ...WWWzWWWWWWWWWWWWW >> ^ new EOF >> >> Then if we do an extending write like this, the blocks in the previous >> tail extent becomes stale: >> >> |< rxext >| >> ...WWWzSSSSSSSSSSSSS..........WWWWWWWWWWWWWWWWW >> ^ old EOF ^ append start ^ new EOF >> >> Fix this by reserving XFS_DIOSTRAT_SPACE_RES blocks for big realtime >> inode. > > This same problem is going to happen with force aligned allocations, > right? i.e. it is a result of having a allocation block size larger > than one filesystem block? > Yeah, right. +cc John > If so, then.... > >> Signed-off-by: Zhang Yi <yi.zhang@xxxxxxxxxx> >> Reviewed-by: Christoph Hellwig <hch@xxxxxx> >> --- >> fs/xfs/xfs_iops.c | 15 ++++++++++++++- >> 1 file changed, 14 insertions(+), 1 deletion(-) >> >> diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c >> index ff222827e550..a00dcbc77e12 100644 >> --- a/fs/xfs/xfs_iops.c >> +++ b/fs/xfs/xfs_iops.c >> @@ -17,6 +17,8 @@ >> #include "xfs_da_btree.h" >> #include "xfs_attr.h" >> #include "xfs_trans.h" >> +#include "xfs_trans_space.h" >> +#include "xfs_bmap_btree.h" >> #include "xfs_trace.h" >> #include "xfs_icache.h" >> #include "xfs_symlink.h" >> @@ -811,6 +813,7 @@ xfs_setattr_size( >> struct xfs_trans *tp; >> int error; >> uint lock_flags = 0; >> + uint resblks = 0; >> bool did_zeroing = false; >> >> xfs_assert_ilocked(ip, XFS_IOLOCK_EXCL | XFS_MMAPLOCK_EXCL); >> @@ -917,7 +920,17 @@ xfs_setattr_size( >> return error; >> } >> >> - error = xfs_trans_alloc(mp, &M_RES(mp)->tr_itruncate, 0, 0, 0, &tp); >> + /* >> + * For realtime inode with more than one block rtextsize, we need the >> + * block reservation for bmap btree block allocations/splits that can >> + * happen since it could split the tail written extent and convert the >> + * right beyond EOF one to unwritten. >> + */ >> + if (xfs_inode_has_bigrtalloc(ip)) >> + resblks = XFS_DIOSTRAT_SPACE_RES(mp, 0); > > .... should this be doing this generic check instead: > > if (xfs_inode_alloc_unitsize(ip) > 1) if (xfs_inode_alloc_unitsize(ip) > i_blocksize(inode)) ? > resblks = XFS_DIOSTRAT_SPACE_RES(mp, 0); > Yeah, it makes sense to me, but Christoph suggested to think about force aligned allocations later, so I only dealt with the big RT inode case here. I can revise it if John and Christoph don't object. Thanks, Yi.