On Sun, Nov 18, 2018 at 09:12:06AM -0500, Brian Foster wrote:
> On Sat, Nov 17, 2018 at 07:37:56AM +1100, Dave Chinner wrote:
> > On Fri, Nov 16, 2018 at 11:07:24AM -0800, Darrick J. Wong wrote:
> > > From: Darrick J. Wong <darrick.wong@xxxxxxxxxx>
> > > 
> > > If we're remapping into a range that starts beyond EOF, we have to
> > > zero the memory between EOF and the start of the target range, as
> > > established in 410fdc72b05af. However, in 4918ef4ea008, we extended
> > > the pagecache truncation range downwards to a page boundary to
> > > guarantee that pagecache pages are removed and that there's no
> > > possibility that we end up zeroing subpage blocks within a page.
> > > Unfortunately, we never commit the posteof zeroing to disk, so on a
> > > filesystem where page size > block size the truncation partially
> > > undoes the zeroing and we end up with stale disk contents.
> > > 
> > > Brian and I reproduced this problem by running generic/091 on a 1k
> > > block xfs filesystem, assuming fsx in fstests supports
> > > clone/dedupe/copyrange.
> > > 
> > > Fixes: 410fdc72b05a ("xfs: zero posteof blocks when cloning above eof")
> > > Fixes: 4918ef4ea008 ("xfs: fix pagecache truncation prior to reflink")
> > > Simultaneously-diagnosed-by: Brian Foster <bfoster@xxxxxxxxxx>
> > > Signed-off-by: Darrick J. Wong <darrick.wong@xxxxxxxxxx>
> > 
> > Ok, I have a different fix for this again.
> > 
> > > ---
> > > Note: I haven't tested this thoroughly but wanted to push this out
> > > for everyone to look at ASAP.
> > > ---
> > >  fs/xfs/xfs_reflink.c |    8 +++++++-
> > >  1 file changed, 7 insertions(+), 1 deletion(-)
> > > 
> > > diff --git a/fs/xfs/xfs_reflink.c b/fs/xfs/xfs_reflink.c
> > > index c56bdbfcf7ae..8ea09a7e550c 100644
> > > --- a/fs/xfs/xfs_reflink.c
> > > +++ b/fs/xfs/xfs_reflink.c
> > > @@ -1255,13 +1255,19 @@ xfs_reflink_zero_posteof(
> > >  	loff_t			pos)
> > >  {
> > >  	loff_t			isize = i_size_read(VFS_I(ip));
> > > +	int			error;
> > >  
> > >  	if (pos <= isize)
> > >  		return 0;
> > >  
> > >  	trace_xfs_zero_eof(ip, isize, pos - isize);
> > > -	return iomap_zero_range(VFS_I(ip), isize, pos - isize, NULL,
> > > +	error = iomap_zero_range(VFS_I(ip), isize, pos - isize, NULL,
> > >  			&xfs_iomap_ops);
> > > +	if (error)
> > > +		return error;
> > > +
> > > +	return filemap_write_and_wait_range(VFS_I(ip)->i_mapping,
> > > +			isize, pos - 1);
> > 
> > This doesn't work on block size > page size setups, unfortunately.
> > 
> > Immediately after this we truncate the page cache, which also
> > doesn't do the right thing on block size > page size setups.
> > So there's a couple of bugs here.
> > 
> > IMO, the truncate needs fixing, not the zeroing. Flushing after
> > zeroing leaves a potential landmine of other dirty data not getting
> > flushed properly before the truncate, so we should fix the truncate
> > to do a flush first. And we should fix it in a way that doesn't mean
> > we need to fix it again in the very near future, i.e. the patch
> > below that uses xfs_flush_unmap_range().
> > 
> > FWIW, I'm working on cleaning up the ~10 patches I have for various
> > fsx and other corruption fixes so I can post them - it'll be Monday
> > before I get that done - but if you're having fsx failures w/
> > copy/dedupe/clone on fsx I've probably already got a fix for it...
> 
> Ok, so FYI this doesn't actually address the writeback issue I
> reproduced because the added flush targets the start of the
> destination offset.

Oh, sorry, I didn't notice that difference. Fixed that now.
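For reference, a sketch of what the xfs_flush_unmap_range() approach
looks like - an illustrative reconstruction, not the actual patch
referenced above, and the exact rounding expression is an assumption.
The key detail is that the flush/invalidate range is rounded out to
the larger of the filesystem block size and the page size, so both
"block size < page size" and "block size > page size" configurations
are covered:

int
xfs_flush_unmap_range(
	struct xfs_inode	*ip,
	xfs_off_t		offset,
	xfs_off_t		len)
{
	struct xfs_mount	*mp = ip->i_mount;
	struct inode		*inode = VFS_I(ip);
	xfs_off_t		rounding, start, end;
	int			error;

	/*
	 * Round the range out to the larger of the fs block size and
	 * PAGE_SIZE so that sub-page blocks and multi-page blocks are
	 * both fully covered by the writeback and the invalidation.
	 */
	rounding = max_t(xfs_off_t, 1 << mp->m_sb.sb_blocklog, PAGE_SIZE);
	start = round_down(offset, rounding);
	end = round_up(offset + len, rounding) - 1;

	/* Write back dirty data over the range before invalidating it. */
	error = filemap_write_and_wait_range(inode->i_mapping, start, end);
	if (error)
		return error;

	/* The truncate can no longer throw away dirty, unwritten data. */
	truncate_pagecache_range(inode, start, end);
	return 0;
}

Doing the flush inside the same helper that invalidates the page
cache means every caller that truncates gets the writeback for free,
which is the "fix it so we don't have to fix it again" argument made
above.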
That might actually be one(*) of the fsx bugs I've been chasing for
several days.

> Note again that I think this is distinct from the issue both you
> and Darrick have documented in each commit log. Darrick's patch
> addresses it because the flush targets the range that has been
> zeroed (isize to dest offset) and thus potentially dirtied in cache.
> The zeroing is what leads to sharing a block with an active dirty
> page (and eventually leads to writeback of a page to a shared
> block).

Yes, we may be seeing *different symptoms* but the underlying problem
is the same - we are not writing back pages over the range we are
about to share, and so we don't trigger a COW on the range before
writeback occurs.

IMO, using xfs_flush_unmap_range() is still the right change to make
here, even if my initial patch didn't address this specific problem
but a different flush/inval problem with this code.

Cheers,

Dave.

(*) I've still got several different fsx variants that fail on either
default configs and/or 1k block size with different signatures.
Problem is they take between 370,000 ops and 5 million ops to
trigger, and so generate tens to hundreds of GB of trace data....

e.g. on a default 4k filesystem on pmem, this fails after 377k ops:

# ltp/fsx -q -p50000 -l 500000 -r 4096 -t 512 -w 512 -Z -R -W /mnt/scratch/foo
....
READ BAD DATA: offset = 0x31000, size = 0xa000, fname = /mnt/scratch/foo
OFFSET      GOOD    BAD     RANGE
0x36000     0x9084  0x7940  0x00000
....

and this slight variant (buffered IO rather than direct IO) fails
after 2.17 million ops:

# ltp/fsx -q -p50000 -l 500000 -r 4096 -t 512 -w 512 -R -W /mnt/scratch/foo
....
READ BAD DATA: offset = 0xf000, size = 0xf318, fname = /mnt/scratch/foo
OFFSET      GOOD    BAD     RANGE
0x15000     0x990b  0x6c0b  0x00000
....

I'm also seeing MAPREAD failures with data after EOF on several
different configs, and there's a couple of other failures that show
up every so often, too.

If I turn off copy/dedupe/clone file range, they all run for billions
of ops without failure....

-- 
Dave Chinner
david@xxxxxxxxxxxxx