Re: [BUG REPORT] generic/561 fails when testing xfs on next-20240506 kernel

Zhang Yi <yi.zhang@xxxxxxxxxxxxxxx> · Sat, 11 May 2024 15:43:17 +0800

On 2024/5/11 11:45, Dave Chinner wrote:
> On Sat, May 11, 2024 at 11:11:32AM +0800, Zhang Yi wrote:
>> On 2024/5/8 17:01, Chandan Babu R wrote:
>>> Hi,
>>>
>>> generic/561 fails when testing XFS on a next-20240506 kernel as shown below,
>>>
>>> # ./check generic/561
>>> FSTYP         -- xfs (debug)
>>> PLATFORM      -- Linux/x86_64 xfs-crc-rtdev-extsize-28k 6.9.0-rc7-next-20240506+ #1 SMP PREEMPT_DYNAMIC Mon May  6 07:53:46 GMT 2024
>>> MKFS_OPTIONS  -- -f -rrtdev=/dev/loop14 -f -m reflink=0,rmapbt=0, -d rtinherit=1 -r extsize=28k /dev/loop5
>>> MOUNT_OPTIONS -- -o context=system_u:object_r:root_t:s0 -ortdev=/dev/loop14 /dev/loop5 /media/scratch
>>>
>>> generic/561       - output mismatch (see /var/lib/xfstests/results/xfs-crc-rtdev-extsize-28k/6.9.0-rc7-next-20240506+/xfs_crc_rtdev_extsize_28k/generic/561.out.bad)
>>>     --- tests/generic/561.out   2024-05-06 08:18:09.681430366 +0000
>>>     +++ /var/lib/xfstests/results/xfs-crc-rtdev-extsize-28k/6.9.0-rc7-next-20240506+/xfs_crc_rtdev_extsize_28k/generic/561.out.bad        2024-05-08 09:14:24.908010133 +0000
>>>     @@ -1,2 +1,5 @@
>>>      QA output created by 561
>>>     +/media/scratch/dir/p0/d0XXXXXXXXXXXXXXXXXXXXXXX/d486/d4bXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX/d5bXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX/d212XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX/d11XXXXXXXXX/d54/de4/d158/d27f/d895/d1307XXX/d8a4/d832XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX/r112fXXXXXXXXXXX: FAILED
>>>     +/media/scratch/dir/p0/d0XXXXXXXXXXXXXXXXXXXXXXX/d486/d4bXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX/d5bXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX/d212XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX/d11XXXXXXXXX/d54/de4/d158/d27f/d13a3XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX/d13c0XXXXXXXX/d2301X/d222bXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX/d1240XXXXXXXXXXXXXXXXXXXXXXXX/d722XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX/d1380XXXXXXXXXXXXXXXX/dc62XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX/r10d5: FAILED
>>>     +md5sum: WARNING: 2 computed checksums did NOT match
>>>      Silence is golden
>>>     ...
>>>     (Run 'diff -u /var/lib/xfstests/tests/generic/561.out /var/lib/xfstests/results/xfs-crc-rtdev-extsize-28k/6.9.0-rc7-next-20240506+/xfs_crc_rtdev_extsize_28k/generic/561.out.bad'  to see the entire diff)
>>> Ran: generic/561
>>> Failures: generic/561
>>> Failed 1 of 1 tests
>>>
>>
>> Sorry about this regression. After debuging and analyzing the code, I notice
>> that this problem could only happens on xfs realtime inode. The real problem
>> is about realtime extent alignment.
>>
>> Please assume that if we have a file that contains a written extent [A, D).
>> We unaligned truncate to the file to B, in the middle of this written extent.
>>
>>        A            B                  D
>>       +wwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwww
>>
>> After the truncate, the i_size is set to B, but due to the sb_rextsize,
>> xfs_itruncate_extents() truncate and aligned the written extent to C, so the
>> data in [B, C) doesn't zeroed and becomes stale.
>>
>>        A            B     C
>>       +wwwwwwwwwwwwwwSSSSSS
>>                     ^
>>                    EOF
> 
> This region must be zeroed on disk before we call
> xfs_itruncate_extents().  i.e completed xfs_setattr_size() via
> xfs_truncate_page() and flushed to disk before we start removing
> extents.
> 
> The problem is that iomap_truncate_page() only zeros the trailing
> portion of the i_blocksize() value, which is wrong for realtime
> devices with rtextsize != fs blocksize.
> 
> Further, xfs_setattr_size() then calls truncate_setsize(newsize)
> before the zeroing has been written back to disk, which means
> that the flush that occurs immediately after the truncate_setsize()
> call can not write blocks beyond the new EOF regardless of whether
> iomap_truncate_page() wrote zeroes to them or not.
> 
>> The if we write [E, F) beyond this written extent, xfs_file_write_checks()->
>> xfs_zero_range() would zero [B, C) in page cache, but since we don't increase
>> i_size in iomap_zero_iter(), the writeback process doesn't write zero data
>> to disk. After write, the data in [B, C) is still stale so once we clear the
>> pagecache, this stale data is exposed.
>>
>>        A            B     C        E      F
>>       +wwwwwwwwwwwwwwSSSSSS        wwwwwwww
>>
>> The reason this problem doesn't occur on normal inode is because normal inode
>> doesn't have a post EOF written extent.
> 
> That's incorrect - we can have post-eof written extents on normal
> files. The reason this doesn't get exposed for normal files is that
> the zeroing range used in iomap_truncate_page() covers the entire
> filesystem block and writeback can write the entire EOF page that
> covers that block containing the zeroes. Hence when we remove all
> the written extents beyond EOF later in the truncate, we don't leave
> any blocks beyond EOF that we haven't zeroed.
> 
>> For realtime inode, I guess it's not
>> enough to just zero the EOF block (xfs_setattr_size()->xfs_truncate_page()),
>> we should also zero the extra blocks that aligned to realtime extent size
>> before updating i_size. Any suggestions?
> 
> Right. xfs_setattr_size() needs fixing to flush the entire zeroed
> range *before* truncating the page cache and changing the inode size.
> 
> Of course, xfs_truncate_page() also needs fixing to zero the 
> entire rtextsize range, not use iomap_truncate_page() which only
> zeroes to the end of the EOF filesystem block.
> 
> I note that dax_truncate_page() has the same problem with RT device
> block sizes as iomap_truncate_page(), so we need the same fix for
> both dax and non-dax paths here.
> 
> It might actually be easiest to pass the block size for zeroing into
> iomap_truncate_page() rather than relying on being able to extract
> the zeroing range from the inode via i_blocksize(). We can't use
> i_blocksize() for RT files, because inode->i_blkbits and hence
> i_blocksize() only supports power of 2 block sizes. Changing that is
> a *much* bigger job, so fixing xfs_truncate_page() is likely the
> best thing to do right now...
> 

Thanks for the explanation and suggestion, I agree with you. However,
why do you recommend to pass block size for zeroing in to
iomap_truncate_page()? It's looks like we could fix xfs_truncate_page()
by using iomap_zero_range() and dax_zero_range() instead and don't use
iomap_truncate_page() and dax_truncate_page().

Thanks,
Yi.