On Sat, May 11, 2024 at 11:11:32AM +0800, Zhang Yi wrote: > On 2024/5/8 17:01, Chandan Babu R wrote: > > Hi, > > > > generic/561 fails when testing XFS on a next-20240506 kernel as shown below, > > > > # ./check generic/561 > > FSTYP -- xfs (debug) > > PLATFORM -- Linux/x86_64 xfs-crc-rtdev-extsize-28k 6.9.0-rc7-next-20240506+ #1 SMP PREEMPT_DYNAMIC Mon May 6 07:53:46 GMT 2024 > > MKFS_OPTIONS -- -f -rrtdev=/dev/loop14 -f -m reflink=0,rmapbt=0, -d rtinherit=1 -r extsize=28k /dev/loop5 > > MOUNT_OPTIONS -- -o context=system_u:object_r:root_t:s0 -ortdev=/dev/loop14 /dev/loop5 /media/scratch > > > > generic/561 - output mismatch (see /var/lib/xfstests/results/xfs-crc-rtdev-extsize-28k/6.9.0-rc7-next-20240506+/xfs_crc_rtdev_extsize_28k/generic/561.out.bad) > > --- tests/generic/561.out 2024-05-06 08:18:09.681430366 +0000 > > +++ /var/lib/xfstests/results/xfs-crc-rtdev-extsize-28k/6.9.0-rc7-next-20240506+/xfs_crc_rtdev_extsize_28k/generic/561.out.bad 2024-05-08 09:14:24.908010133 +0000 > > @@ -1,2 +1,5 @@ > > QA output created by 561 > > +/media/scratch/dir/p0/d0XXXXXXXXXXXXXXXXXXXXXXX/d486/d4bXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX/d5bXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX/d212XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX/d11XXXXXXXXX/d54/de4/d158/d27f/d895/d1307XXX/d8a4/d832XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX/r112fXXXXXXXXXXX: FAILED > > +/media/scratch/dir/p0/d0XXXXXXXXXXXXXXXXXXXXXXX/d486/d4bXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX/d5bXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX/d212XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX/d11XXXXXXXXX/d54/de4/d158/d27f/d13a3XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX/d13c0XXXXXXXX/d2301X/d222bXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX/d1240XXXXXXXXXXXXXXXXXXXXXXXX/d722XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX/d1380XXXXXXXXXXXXXXXX/dc62XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX/r10d5: FAILED > > +md5sum: WARNING: 2 computed checksums did NOT match > > Silence is golden > > ... > > (Run 'diff -u /var/lib/xfstests/tests/generic/561.out /var/lib/xfstests/results/xfs-crc-rtdev-extsize-28k/6.9.0-rc7-next-20240506+/xfs_crc_rtdev_extsize_28k/generic/561.out.bad' to see the entire diff) > > Ran: generic/561 > > Failures: generic/561 > > Failed 1 of 1 tests > > > > Sorry about this regression. After debuging and analyzing the code, I notice > that this problem could only happens on xfs realtime inode. The real problem > is about realtime extent alignment. > > Please assume that if we have a file that contains a written extent [A, D). > We unaligned truncate to the file to B, in the middle of this written extent. > > A B D > +wwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwww > > After the truncate, the i_size is set to B, but due to the sb_rextsize, > xfs_itruncate_extents() truncate and aligned the written extent to C, so the > data in [B, C) doesn't zeroed and becomes stale. > > A B C > +wwwwwwwwwwwwwwSSSSSS > ^ > EOF This region must be zeroed on disk before we call xfs_itruncate_extents(). i.e completed xfs_setattr_size() via xfs_truncate_page() and flushed to disk before we start removing extents. The problem is that iomap_truncate_page() only zeros the trailing portion of the i_blocksize() value, which is wrong for realtime devices with rtextsize != fs blocksize. Further, xfs_setattr_size() then calls truncate_setsize(newsize) before the zeroing has been written back to disk, which means that the flush that occurs immediately after the truncate_setsize() call can not write blocks beyond the new EOF regardless of whether iomap_truncate_page() wrote zeroes to them or not. > The if we write [E, F) beyond this written extent, xfs_file_write_checks()-> > xfs_zero_range() would zero [B, C) in page cache, but since we don't increase > i_size in iomap_zero_iter(), the writeback process doesn't write zero data > to disk. After write, the data in [B, C) is still stale so once we clear the > pagecache, this stale data is exposed. > > A B C E F > +wwwwwwwwwwwwwwSSSSSS wwwwwwww > > The reason this problem doesn't occur on normal inode is because normal inode > doesn't have a post EOF written extent. That's incorrect - we can have post-eof written extents on normal files. The reason this doesn't get exposed for normal files is that the zeroing range used in iomap_truncate_page() covers the entire filesystem block and writeback can write the entire EOF page that covers that block containing the zeroes. Hence when we remove all the written extents beyond EOF later in the truncate, we don't leave any blocks beyond EOF that we haven't zeroed. > For realtime inode, I guess it's not > enough to just zero the EOF block (xfs_setattr_size()->xfs_truncate_page()), > we should also zero the extra blocks that aligned to realtime extent size > before updating i_size. Any suggestions? Right. xfs_setattr_size() needs fixing to flush the entire zeroed range *before* truncating the page cache and changing the inode size. Of course, xfs_truncate_page() also needs fixing to zero the entire rtextsize range, not use iomap_truncate_page() which only zeroes to the end of the EOF filesystem block. I note that dax_truncate_page() has the same problem with RT device block sizes as iomap_truncate_page(), so we need the same fix for both dax and non-dax paths here. It might actually be easiest to pass the block size for zeroing into iomap_truncate_page() rather than relying on being able to extract the zeroing range from the inode via i_blocksize(). We can't use i_blocksize() for RT files, because inode->i_blkbits and hence i_blocksize() only supports power of 2 block sizes. Changing that is a *much* bigger job, so fixing xfs_truncate_page() is likely the best thing to do right now... Cheers, Dave. -- Dave Chinner david@xxxxxxxxxxxxx