On 2024/5/11 11:45, Dave Chinner wrote: > On Sat, May 11, 2024 at 11:11:32AM +0800, Zhang Yi wrote: >> On 2024/5/8 17:01, Chandan Babu R wrote: >>> Hi, >>> >>> generic/561 fails when testing XFS on a next-20240506 kernel as shown below, >>> >>> # ./check generic/561 >>> FSTYP -- xfs (debug) >>> PLATFORM -- Linux/x86_64 xfs-crc-rtdev-extsize-28k 6.9.0-rc7-next-20240506+ #1 SMP PREEMPT_DYNAMIC Mon May 6 07:53:46 GMT 2024 >>> MKFS_OPTIONS -- -f -rrtdev=/dev/loop14 -f -m reflink=0,rmapbt=0, -d rtinherit=1 -r extsize=28k /dev/loop5 >>> MOUNT_OPTIONS -- -o context=system_u:object_r:root_t:s0 -ortdev=/dev/loop14 /dev/loop5 /media/scratch >>> >>> generic/561 - output mismatch (see /var/lib/xfstests/results/xfs-crc-rtdev-extsize-28k/6.9.0-rc7-next-20240506+/xfs_crc_rtdev_extsize_28k/generic/561.out.bad) >>> --- tests/generic/561.out 2024-05-06 08:18:09.681430366 +0000 >>> +++ /var/lib/xfstests/results/xfs-crc-rtdev-extsize-28k/6.9.0-rc7-next-20240506+/xfs_crc_rtdev_extsize_28k/generic/561.out.bad 2024-05-08 09:14:24.908010133 +0000 >>> @@ -1,2 +1,5 @@ >>> QA output created by 561 >>> +/media/scratch/dir/p0/d0XXXXXXXXXXXXXXXXXXXXXXX/d486/d4bXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX/d5bXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX/d212XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX/d11XXXXXXXXX/d54/de4/d158/d27f/d895/d1307XXX/d8a4/d832XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX/r112fXXXXXXXXXXX: FAILED >>> +/media/scratch/dir/p0/d0XXXXXXXXXXXXXXXXXXXXXXX/d486/d4bXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX/d5bXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX/d212XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX/d11XXXXXXXXX/d54/de4/d158/d27f/d13a3XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX/d13c0XXXXXXXX/d2301X/d222bXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX/d1240XXXXXXXXXXXXXXXXXXXXXXXX/d722XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX/d1380XXXXXXXXXXXXXXXX/dc62XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX/r10d5: FAILED >>> +md5sum: WARNING: 2 computed checksums did NOT match >>> Silence is golden >>> ... >>> (Run 'diff -u /var/lib/xfstests/tests/generic/561.out /var/lib/xfstests/results/xfs-crc-rtdev-extsize-28k/6.9.0-rc7-next-20240506+/xfs_crc_rtdev_extsize_28k/generic/561.out.bad' to see the entire diff) >>> Ran: generic/561 >>> Failures: generic/561 >>> Failed 1 of 1 tests >>> >> >> Sorry about this regression. After debuging and analyzing the code, I notice >> that this problem could only happens on xfs realtime inode. The real problem >> is about realtime extent alignment. >> >> Please assume that if we have a file that contains a written extent [A, D). >> We unaligned truncate to the file to B, in the middle of this written extent. >> >> A B D >> +wwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwww >> >> After the truncate, the i_size is set to B, but due to the sb_rextsize, >> xfs_itruncate_extents() truncate and aligned the written extent to C, so the >> data in [B, C) doesn't zeroed and becomes stale. >> >> A B C >> +wwwwwwwwwwwwwwSSSSSS >> ^ >> EOF > > This region must be zeroed on disk before we call > xfs_itruncate_extents(). i.e completed xfs_setattr_size() via > xfs_truncate_page() and flushed to disk before we start removing > extents. > > The problem is that iomap_truncate_page() only zeros the trailing > portion of the i_blocksize() value, which is wrong for realtime > devices with rtextsize != fs blocksize. > > Further, xfs_setattr_size() then calls truncate_setsize(newsize) > before the zeroing has been written back to disk, which means > that the flush that occurs immediately after the truncate_setsize() > call can not write blocks beyond the new EOF regardless of whether > iomap_truncate_page() wrote zeroes to them or not. > >> The if we write [E, F) beyond this written extent, xfs_file_write_checks()-> >> xfs_zero_range() would zero [B, C) in page cache, but since we don't increase >> i_size in iomap_zero_iter(), the writeback process doesn't write zero data >> to disk. After write, the data in [B, C) is still stale so once we clear the >> pagecache, this stale data is exposed. >> >> A B C E F >> +wwwwwwwwwwwwwwSSSSSS wwwwwwww >> >> The reason this problem doesn't occur on normal inode is because normal inode >> doesn't have a post EOF written extent. > > That's incorrect - we can have post-eof written extents on normal > files. The reason this doesn't get exposed for normal files is that > the zeroing range used in iomap_truncate_page() covers the entire > filesystem block and writeback can write the entire EOF page that > covers that block containing the zeroes. Hence when we remove all > the written extents beyond EOF later in the truncate, we don't leave > any blocks beyond EOF that we haven't zeroed. > >> For realtime inode, I guess it's not >> enough to just zero the EOF block (xfs_setattr_size()->xfs_truncate_page()), >> we should also zero the extra blocks that aligned to realtime extent size >> before updating i_size. Any suggestions? > > Right. xfs_setattr_size() needs fixing to flush the entire zeroed > range *before* truncating the page cache and changing the inode size. > > Of course, xfs_truncate_page() also needs fixing to zero the > entire rtextsize range, not use iomap_truncate_page() which only > zeroes to the end of the EOF filesystem block. > > I note that dax_truncate_page() has the same problem with RT device > block sizes as iomap_truncate_page(), so we need the same fix for > both dax and non-dax paths here. > > It might actually be easiest to pass the block size for zeroing into > iomap_truncate_page() rather than relying on being able to extract > the zeroing range from the inode via i_blocksize(). We can't use > i_blocksize() for RT files, because inode->i_blkbits and hence > i_blocksize() only supports power of 2 block sizes. Changing that is > a *much* bigger job, so fixing xfs_truncate_page() is likely the > best thing to do right now... > Thanks for the explanation and suggestion, I agree with you. However, why do you recommend to pass block size for zeroing in to iomap_truncate_page()? It's looks like we could fix xfs_truncate_page() by using iomap_zero_range() and dax_zero_range() instead and don't use iomap_truncate_page() and dax_truncate_page(). Thanks, Yi.