Re: [BUG REPORT] generic/561 fails when testing xfs on next-20240506 kernel

Dave Chinner <david@xxxxxxxxxxxxx> · Sat, 11 May 2024 13:45:17 +1000

On Sat, May 11, 2024 at 11:11:32AM +0800, Zhang Yi wrote:
> On 2024/5/8 17:01, Chandan Babu R wrote:
> > Hi,
> > 
> > generic/561 fails when testing XFS on a next-20240506 kernel as shown below,
> > 
> > # ./check generic/561
> > FSTYP         -- xfs (debug)
> > PLATFORM      -- Linux/x86_64 xfs-crc-rtdev-extsize-28k 6.9.0-rc7-next-20240506+ #1 SMP PREEMPT_DYNAMIC Mon May  6 07:53:46 GMT 2024
> > MKFS_OPTIONS  -- -f -rrtdev=/dev/loop14 -f -m reflink=0,rmapbt=0, -d rtinherit=1 -r extsize=28k /dev/loop5
> > MOUNT_OPTIONS -- -o context=system_u:object_r:root_t:s0 -ortdev=/dev/loop14 /dev/loop5 /media/scratch
> > 
> > generic/561       - output mismatch (see /var/lib/xfstests/results/xfs-crc-rtdev-extsize-28k/6.9.0-rc7-next-20240506+/xfs_crc_rtdev_extsize_28k/generic/561.out.bad)
> >     --- tests/generic/561.out   2024-05-06 08:18:09.681430366 +0000
> >     +++ /var/lib/xfstests/results/xfs-crc-rtdev-extsize-28k/6.9.0-rc7-next-20240506+/xfs_crc_rtdev_extsize_28k/generic/561.out.bad        2024-05-08 09:14:24.908010133 +0000
> >     @@ -1,2 +1,5 @@
> >      QA output created by 561
> >     +/media/scratch/dir/p0/d0XXXXXXXXXXXXXXXXXXXXXXX/d486/d4bXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX/d5bXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX/d212XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX/d11XXXXXXXXX/d54/de4/d158/d27f/d895/d1307XXX/d8a4/d832XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX/r112fXXXXXXXXXXX: FAILED
> >     +/media/scratch/dir/p0/d0XXXXXXXXXXXXXXXXXXXXXXX/d486/d4bXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX/d5bXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX/d212XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX/d11XXXXXXXXX/d54/de4/d158/d27f/d13a3XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX/d13c0XXXXXXXX/d2301X/d222bXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX/d1240XXXXXXXXXXXXXXXXXXXXXXXX/d722XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX/d1380XXXXXXXXXXXXXXXX/dc62XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX/r10d5: FAILED
> >     +md5sum: WARNING: 2 computed checksums did NOT match
> >      Silence is golden
> >     ...
> >     (Run 'diff -u /var/lib/xfstests/tests/generic/561.out /var/lib/xfstests/results/xfs-crc-rtdev-extsize-28k/6.9.0-rc7-next-20240506+/xfs_crc_rtdev_extsize_28k/generic/561.out.bad'  to see the entire diff)
> > Ran: generic/561
> > Failures: generic/561
> > Failed 1 of 1 tests
> > 
> 
> Sorry about this regression. After debugging and analyzing the code, I noticed
> that this problem can only happen on XFS realtime inodes. The real problem
> is realtime extent alignment.
> 
> Assume we have a file that contains a written extent [A, D), and we do an
> unaligned truncate of the file to B, in the middle of this written extent.
> 
>        A            B                  D
>       +wwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwww
> 
> After the truncate, i_size is set to B, but because of sb_rextsize,
> xfs_itruncate_extents() truncates the written extent at the aligned offset C,
> so the data in [B, C) isn't zeroed and becomes stale.
> 
>        A            B     C
>       +wwwwwwwwwwwwwwSSSSSS
>                     ^
>                    EOF

This region must be zeroed on disk before we call
xfs_itruncate_extents(), i.e. the zeroing must be completed in
xfs_setattr_size() via xfs_truncate_page() and flushed to disk before
we start removing extents.

The problem is that iomap_truncate_page() only zeros the trailing
portion of the i_blocksize() value, which is wrong for realtime
devices with rtextsize != fs blocksize.
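
To put numbers on that gap, here is a minimal userspace sketch of the
arithmetic. It is not kernel code: round_up_to() and unzeroed_bytes() are
hypothetical helpers, the 28k rtextsize comes from the MKFS_OPTIONS above,
and the 4k filesystem block size is an assumption.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Round pos up to the next multiple of unit. This divides rather than
 * masks, so unit does not have to be a power of two (28k is not).
 */
uint64_t round_up_to(uint64_t pos, uint64_t unit)
{
	assert(unit != 0);
	return ((pos + unit - 1) / unit) * unit;
}

/*
 * Bytes left unzeroed if zeroing stops at the end of the EOF filesystem
 * block while extent removal only starts at the next rt extent boundary.
 */
uint64_t unzeroed_bytes(uint64_t newsize, uint64_t blocksize,
			uint64_t rtextsize)
{
	uint64_t zeroed_to = round_up_to(newsize, blocksize);
	uint64_t extent_end = round_up_to(newsize, rtextsize);

	return extent_end > zeroed_to ? extent_end - zeroed_to : 0;
}
```

For a truncate to 10000 bytes, unzeroed_bytes(10000, 4096, 28672) is 16384:
zeroing stops at 12288 while the rt extent runs to 28672. With rtextsize
equal to the block size the result is 0, which is the normal-file case.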

Further, xfs_setattr_size() then calls truncate_setsize(newsize)
before the zeroing has been written back to disk, which means
that the flush that occurs immediately after the truncate_setsize()
call cannot write blocks beyond the new EOF regardless of whether
iomap_truncate_page() wrote zeroes to them or not.
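
The ordering problem can be shown with a toy model. This is a rough sketch,
not the kernel's page cache: struct toy_file and every toy_*() function are
made up for illustration, and the model assumes writeback skips any block
that starts at or beyond the file size.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define TOY_BLK		4u	/* toy filesystem block size, in bytes */
#define TOY_NBLK	7u	/* blocks per toy rt extent (like 28k / 4k) */
#define TOY_LEN		(TOY_BLK * TOY_NBLK)

struct toy_file {
	char	disk[TOY_LEN];		/* what is on stable storage */
	char	cache[TOY_LEN];		/* toy page cache contents */
	bool	dirty[TOY_NBLK];	/* per-block dirty state */
	size_t	size;			/* toy i_size */
};

/* Zero [from, to) in the cache only and mark the touched blocks dirty. */
void toy_zero_cache(struct toy_file *f, size_t from, size_t to)
{
	for (size_t i = from; i < to; i++) {
		f->cache[i] = 0;
		f->dirty[i / TOY_BLK] = true;
	}
}

/* Write back dirty blocks, skipping blocks that start at or past EOF. */
void toy_writeback(struct toy_file *f)
{
	for (size_t b = 0; b < TOY_NBLK; b++) {
		if (!f->dirty[b] || b * TOY_BLK >= f->size)
			continue;
		memcpy(f->disk + b * TOY_BLK, f->cache + b * TOY_BLK, TOY_BLK);
		f->dirty[b] = false;
	}
}

/*
 * Truncate one fully written extent to newsize and return how many
 * non-zero (stale) bytes remain on disk in [newsize, extent end).
 */
size_t toy_truncate(size_t newsize, bool flush_before_setsize)
{
	struct toy_file f;
	size_t stale = 0;

	memset(f.disk, 'w', TOY_LEN);
	memset(f.cache, 'w', TOY_LEN);
	memset(f.dirty, 0, sizeof(f.dirty));
	f.size = TOY_LEN;

	toy_zero_cache(&f, newsize, TOY_LEN);
	if (flush_before_setsize)
		toy_writeback(&f);	/* zeroes reach disk while in-size */
	f.size = newsize;		/* stands in for truncate_setsize() */
	toy_writeback(&f);		/* the flush after the size change */

	for (size_t i = newsize; i < TOY_LEN; i++)
		if (f.disk[i] != 0)
			stale++;
	return stale;
}
```

In this model toy_truncate(10, false) leaves 16 stale bytes on disk, because
only the block containing the new EOF is written after the size change, while
toy_truncate(10, true) leaves none.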

> Then if we write [E, F) beyond this written extent, xfs_file_write_checks()->
> xfs_zero_range() zeroes [B, C) in the page cache, but since we don't increase
> i_size in iomap_zero_iter(), writeback doesn't write the zeroed data
> to disk. After the write, the data in [B, C) is still stale, so once we drop
> the page cache, this stale data is exposed.
> 
>        A            B     C        E      F
>       +wwwwwwwwwwwwwwSSSSSS        wwwwwwww
> 
> The reason this problem doesn't occur on normal inodes is that normal inodes
> don't have a post-EOF written extent.

That's incorrect - we can have post-eof written extents on normal
files. The reason this doesn't get exposed for normal files is that
the zeroing range used in iomap_truncate_page() covers the entire
filesystem block and writeback can write the entire EOF page that
covers that block containing the zeroes. Hence when we remove all
the written extents beyond EOF later in the truncate, we don't leave
any blocks beyond EOF that we haven't zeroed.

> For realtime inodes, I guess it's not
> enough to just zero the EOF block (xfs_setattr_size()->xfs_truncate_page());
> we should also zero the extra blocks up to the realtime extent size boundary
> before updating i_size. Any suggestions?

Right. xfs_setattr_size() needs fixing to flush the entire zeroed
range *before* truncating the page cache and changing the inode size.

Of course, xfs_truncate_page() also needs fixing to zero the
entire rtextsize range rather than using iomap_truncate_page(), which
only zeroes to the end of the EOF filesystem block.

I note that dax_truncate_page() has the same problem with RT device
block sizes as iomap_truncate_page(), so we need the same fix for
both dax and non-dax paths here.

It might actually be easiest to pass the block size for zeroing into
iomap_truncate_page() rather than relying on being able to extract
the zeroing range from the inode via i_blocksize(). We can't use
i_blocksize() for RT files, because inode->i_blkbits and hence
i_blocksize() only support power of 2 block sizes. Changing that is
a *much* bigger job, so fixing xfs_truncate_page() is likely the
best thing to do right now...
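
As a rough illustration of the power of 2 limitation (userspace C, not a
proposed patch): is_pow2(), offset_by_mask(), offset_by_mod() and
bytes_to_zero() are hypothetical names, and bytes_to_zero() only sketches
what a helper taking an explicit zeroing unit might compute.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* A size fits a shift count like i_blkbits only if it is a power of 2. */
bool is_pow2(uint64_t n)
{
	return n != 0 && (n & (n - 1)) == 0;
}

/* Offset within a unit using a mask: only valid for power of 2 units. */
uint64_t offset_by_mask(uint64_t pos, uint64_t unit)
{
	return pos & (unit - 1);
}

/* Offset within a unit using modulo: valid for any non-zero unit. */
uint64_t offset_by_mod(uint64_t pos, uint64_t unit)
{
	assert(unit != 0);
	return pos % unit;
}

/*
 * Bytes from pos to the end of the unit containing it, or 0 if pos is
 * already aligned. The caller passes the unit explicitly, so a 28k rt
 * extent size works as well as a 4k block size.
 */
uint64_t bytes_to_zero(uint64_t pos, uint64_t unit)
{
	uint64_t off = offset_by_mod(pos, unit);

	return off ? unit - off : 0;
}
```

For a 4096 byte unit the mask and modulo forms agree, but for 28672 they do
not: at pos 30000 the modulo offset is 1328 and the mask form gives a
different, meaningless value. That is why a helper taking the zeroing size
as an argument would have to avoid shift and mask arithmetic.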

Cheers,

Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx