On Sun, Nov 18, 2018 at 09:12:06AM -0500, Brian Foster wrote:
> On Sat, Nov 17, 2018 at 07:37:56AM +1100, Dave Chinner wrote:
> > On Fri, Nov 16, 2018 at 11:07:24AM -0800, Darrick J. Wong wrote:
> > > From: Darrick J. Wong <darrick.wong@xxxxxxxxxx>
> > > 
> > > If we're remapping into a range that starts beyond EOF, we have to
> > > zero the memory between EOF and the start of the target range, as
> > > established in 410fdc72b05af. However, in 4918ef4ea008, we extended
> > > the pagecache truncation range downwards to a page boundary to
> > > guarantee that pagecache pages are removed and that there's no
> > > possibility that we end up zeroing subpage blocks within a page.
> > > Unfortunately, we never commit the posteof zeroing to disk, so on a
> > > filesystem where page size > block size the truncation partially
> > > undoes the zeroing and we end up with stale disk contents.
> > > 
> > > Brian and I reproduced this problem by running generic/091 on a 1k
> > > block xfs filesystem, assuming fsx in fstests supports
> > > clone/dedupe/copyrange.
> > > 
> > > Fixes: 410fdc72b05a ("xfs: zero posteof blocks when cloning above eof")
> > > Fixes: 4918ef4ea008 ("xfs: fix pagecache truncation prior to reflink")
> > > Simultaneously-diagnosed-by: Brian Foster <bfoster@xxxxxxxxxx>
> > > Signed-off-by: Darrick J. Wong <darrick.wong@xxxxxxxxxx>
> > 
> > Ok, I have a different fix for this again.
> > 
> > > ---
> > > Note: I haven't tested this thoroughly but wanted to push this out
> > > for everyone to look at ASAP.
> > > ---
> > >  fs/xfs/xfs_reflink.c |    8 +++++++-
> > >  1 file changed, 7 insertions(+), 1 deletion(-)
> > > 
> > > diff --git a/fs/xfs/xfs_reflink.c b/fs/xfs/xfs_reflink.c
> > > index c56bdbfcf7ae..8ea09a7e550c 100644
> > > --- a/fs/xfs/xfs_reflink.c
> > > +++ b/fs/xfs/xfs_reflink.c
> > > @@ -1255,13 +1255,19 @@ xfs_reflink_zero_posteof(
> > >  	loff_t			pos)
> > >  {
> > >  	loff_t			isize = i_size_read(VFS_I(ip));
> > > +	int			error;
> > >  
> > >  	if (pos <= isize)
> > >  		return 0;
> > >  
> > >  	trace_xfs_zero_eof(ip, isize, pos - isize);
> > > -	return iomap_zero_range(VFS_I(ip), isize, pos - isize, NULL,
> > > +	error = iomap_zero_range(VFS_I(ip), isize, pos - isize, NULL,
> > >  			&xfs_iomap_ops);
> > > +	if (error)
> > > +		return error;
> > > +
> > > +	return filemap_write_and_wait_range(VFS_I(ip)->i_mapping,
> > > +			isize, pos - 1);
> > 
> > This doesn't work on block size > page size setups, unfortunately.
> > 
> > Immediately after this we truncate the page cache, which also
> > doesn't do the right thing on block size > page size setups.
> > So there's a couple of bugs here.
> > 
> > IMO, the truncate needs fixing, not the zeroing. Flushing after
> > zeroing leaves a potential landmine of other dirty data not getting
> > flushed properly before the truncate, so we should fix the truncate
> > to do a flush first. And we should fix it in a way that doesn't mean
> > we need to fix it again in the very near future, i.e. the patch
> > below that uses xfs_flush_unmap_range().
> > 
> > FWIW, I'm working on cleaning up the ~10 patches I have for various
> > fsx and other corruption fixes so I can post them - it'll be Monday
> > before I get that done - but if you're having fsx failures w/
> > copy/dedupe/clone on fsx I've probably already got a fix for it...
> 
> Ok, so FYI this doesn't actually address the writeback issue I
> reproduced because the added flush targets the start of the
> destination offset.

Oh, sorry, I didn't notice that difference. Fixed that now.
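For reference, a sketch of what the xfs_flush_unmap_range() approach
looks like - an illustrative reconstruction, not the actual patch
referenced above, and the exact rounding expression is an assumption.
The key detail is that the flush/invalidate range is rounded out to
the larger of the filesystem block size and the page size, so both
"block size < page size" and "block size > page size" configurations
are covered:

int
xfs_flush_unmap_range(
	struct xfs_inode	*ip,
	xfs_off_t		offset,
	xfs_off_t		len)
{
	struct xfs_mount	*mp = ip->i_mount;
	struct inode		*inode = VFS_I(ip);
	xfs_off_t		rounding, start, end;
	int			error;

	/*
	 * Round the range out to the larger of the fs block size and
	 * PAGE_SIZE so that sub-page blocks and multi-page blocks are
	 * both fully covered by the writeback and the invalidation.
	 */
	rounding = max_t(xfs_off_t, 1 << mp->m_sb.sb_blocklog, PAGE_SIZE);
	start = round_down(offset, rounding);
	end = round_up(offset + len, rounding) - 1;

	/* Write back dirty data over the range before invalidating it. */
	error = filemap_write_and_wait_range(inode->i_mapping, start, end);
	if (error)
		return error;

	/* The truncate can no longer throw away dirty, unwritten data. */
	truncate_pagecache_range(inode, start, end);
	return 0;
}

Doing the flush inside the same helper that invalidates the page
cache means every caller that truncates gets the writeback for free,
which is the "fix it so we don't have to fix it again" argument made
above.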
That might actually be one(*) of the fsx bugs I've been chasing for
several days.

> Note again that I think this is distinct from the issue both you
> and Darrick have documented in each commit log. Darrick's patch
> addresses it because the flush targets the range that has been
> zeroed (isize to dest offset) and thus potentially dirtied in cache.
> The zeroing is what leads to sharing a block with an active dirty
> page (and eventually leads to writeback of a page to a shared
> block).

Yes, we may be seeing *different symptoms* but the underlying problem
is the same - we are not writing back pages over the range we are
about to share, and so we don't trigger a COW on the range before
writeback occurs.

IMO, using xfs_flush_unmap_range() is still the right change to make
here, even if my initial patch didn't address this specific problem
but a different flush/inval problem with this code.

Cheers,

Dave.

(*) I've still got several different fsx variants that fail on either
default configs and/or 1k block size with different signatures.
Problem is they take between 370,000 ops and 5 million ops to
trigger, and so generate tens to hundreds of GB of trace data....

e.g. on a default 4k filesystem on pmem, this fails after 377k ops:

# ltp/fsx -q -p50000 -l 500000 -r 4096 -t 512 -w 512 -Z -R -W /mnt/scratch/foo
....
READ BAD DATA: offset = 0x31000, size = 0xa000, fname = /mnt/scratch/foo
OFFSET      GOOD    BAD     RANGE
0x36000     0x9084  0x7940  0x00000
....

and this slight variant (buffered IO rather than direct IO) fails
after 2.17 million ops:

# ltp/fsx -q -p50000 -l 500000 -r 4096 -t 512 -w 512 -R -W /mnt/scratch/foo
....
READ BAD DATA: offset = 0xf000, size = 0xf318, fname = /mnt/scratch/foo
OFFSET      GOOD    BAD     RANGE
0x15000     0x990b  0x6c0b  0x00000
....

I'm also seeing MAPREAD failures with data after EOF on several
different configs, and there's a couple of other failures that show
up every so often, too.

If I turn off copy/dedupe/clone file range, they all run for billions
of ops without failure....

-- 
Dave Chinner
david@xxxxxxxxxxxxx