On Tue, Jun 18, 2024 at 10:21:11PM +0800, Zhang Yi wrote: > From: Zhang Yi <yi.zhang@xxxxxxxxxx> > > When unaligned truncate down a big realtime file, xfs_truncate_page() > only zeros out the tail EOF block, __xfs_bunmapi() should split the tail > written extent and convert the later one that beyond EOF block to > unwritten, but it couldn't work as expected now since the reserved block > is zero in xfs_setattr_size(), this could expose stale data just after > commit '943bc0882ceb ("iomap: don't increase i_size if it's not a write > operation")'. > > If we truncate file that contains a large enough written extent: > > |< rxext >|< rtext >| > ...WWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWW > ^ (new EOF) ^ old EOF > > Since we only zeros out the tail of the EOF block, and > xfs_itruncate_extents()->..->__xfs_bunmapi() unmap the whole ailgned > extents, it becomes this state: > > |< rxext >| > ...WWWzWWWWWWWWWWWWW > ^ new EOF > > Then if we do an extending write like this, the blocks in the previous > tail extent becomes stale: > > |< rxext >| > ...WWWzSSSSSSSSSSSSS..........WWWWWWWWWWWWWWWWW > ^ old EOF ^ append start ^ new EOF > > Fix this by reserving XFS_DIOSTRAT_SPACE_RES blocks for big realtime > inode. This same problem is going to happen with force aligned allocations, right? i.e. it is a result of having a allocation block size larger than one filesystem block? If so, then.... > Signed-off-by: Zhang Yi <yi.zhang@xxxxxxxxxx> > Reviewed-by: Christoph Hellwig <hch@xxxxxx> > --- > fs/xfs/xfs_iops.c | 15 ++++++++++++++- > 1 file changed, 14 insertions(+), 1 deletion(-) > > diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c > index ff222827e550..a00dcbc77e12 100644 > --- a/fs/xfs/xfs_iops.c > +++ b/fs/xfs/xfs_iops.c > @@ -17,6 +17,8 @@ > #include "xfs_da_btree.h" > #include "xfs_attr.h" > #include "xfs_trans.h" > +#include "xfs_trans_space.h" > +#include "xfs_bmap_btree.h" > #include "xfs_trace.h" > #include "xfs_icache.h" > #include "xfs_symlink.h" > @@ -811,6 +813,7 @@ xfs_setattr_size( > struct xfs_trans *tp; > int error; > uint lock_flags = 0; > + uint resblks = 0; > bool did_zeroing = false; > > xfs_assert_ilocked(ip, XFS_IOLOCK_EXCL | XFS_MMAPLOCK_EXCL); > @@ -917,7 +920,17 @@ xfs_setattr_size( > return error; > } > > - error = xfs_trans_alloc(mp, &M_RES(mp)->tr_itruncate, 0, 0, 0, &tp); > + /* > + * For realtime inode with more than one block rtextsize, we need the > + * block reservation for bmap btree block allocations/splits that can > + * happen since it could split the tail written extent and convert the > + * right beyond EOF one to unwritten. > + */ > + if (xfs_inode_has_bigrtalloc(ip)) > + resblks = XFS_DIOSTRAT_SPACE_RES(mp, 0); .... should this be doing this generic check instead: if (xfs_inode_alloc_unitsize(ip) > 1) resblks = XFS_DIOSTRAT_SPACE_RES(mp, 0); -Dave. -- Dave Chinner david@xxxxxxxxxxxxx