Re: [PATCH 6/7] xfs: use iomap_valid method to detect stale cached iomaps

Dave Chinner <david@xxxxxxxxxxxxx> · Thu, 3 Nov 2022 09:36:05 +1100

On Wed, Nov 02, 2022 at 10:19:33AM -0700, Darrick J. Wong wrote:
> On Tue, Nov 01, 2022 at 11:34:11AM +1100, Dave Chinner wrote:
> > From: Dave Chinner <dchinner@xxxxxxxxxx>
> > 
> > Now that iomap supports a mechanism to validate cached iomaps for
> > buffered write operations, hook it up to the XFS buffered write ops
> > so that we can avoid data corruptions that result from stale cached
> > iomaps. See:
> > 
> > https://lore.kernel.org/linux-xfs/20220817093627.GZ3600936@xxxxxxxxxxxxxxxxxxx/
> > 
> > or the ->iomap_valid() introduction commit for exact details of the
> > corruption vector.
> > 
> > Signed-off-by: Dave Chinner <dchinner@xxxxxxxxxx>
> > ---
> >  fs/xfs/libxfs/xfs_bmap.c |  4 +--
> >  fs/xfs/xfs_aops.c        |  2 +-
> >  fs/xfs/xfs_iomap.c       | 69 ++++++++++++++++++++++++++++++----------
> >  fs/xfs/xfs_iomap.h       |  4 +--
> >  fs/xfs/xfs_pnfs.c        |  5 +--
> >  5 files changed, 61 insertions(+), 23 deletions(-)
> > 
> > diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
> > index 49d0d4ea63fc..db225130618c 100644
> > --- a/fs/xfs/libxfs/xfs_bmap.c
> > +++ b/fs/xfs/libxfs/xfs_bmap.c
> > @@ -4551,8 +4551,8 @@ xfs_bmapi_convert_delalloc(
> >  	 * the extent.  Just return the real extent at this offset.
> >  	 */
> >  	if (!isnullstartblock(bma.got.br_startblock)) {
> > -		xfs_bmbt_to_iomap(ip, iomap, &bma.got, 0, flags);
> >  		*seq = READ_ONCE(ifp->if_seq);
> > +		xfs_bmbt_to_iomap(ip, iomap, &bma.got, 0, flags, *seq);
> >  		goto out_trans_cancel;
> >  	}
> >  
> > @@ -4599,8 +4599,8 @@ xfs_bmapi_convert_delalloc(
> >  	XFS_STATS_INC(mp, xs_xstrat_quick);
> >  
> >  	ASSERT(!isnullstartblock(bma.got.br_startblock));
> > -	xfs_bmbt_to_iomap(ip, iomap, &bma.got, 0, flags);
> >  	*seq = READ_ONCE(ifp->if_seq);
> > +	xfs_bmbt_to_iomap(ip, iomap, &bma.got, 0, flags, *seq);
> >  
> >  	if (whichfork == XFS_COW_FORK)
> 
> Hmm.  @ifp here could be the cow fork, which means *seq will be from the
> cow fork.  This I think is going to cause problems in
> xfs_buffered_write_iomap_valid...

We can't get here from the buffered write path. This can only be
reached by the writeback path via:

iomap_do_writepage
  iomap_writepage_map
    xfs_map_blocks
      xfs_convert_blocks
        xfs_bmapi_convert_delalloc

Hence the sequence numbers stored in iomap->private here are
completely ignored by the writeback code. They still get stored
in the struct xfs_writepage_ctx and checked by xfs_imap_valid(), so
the changes here were really just making sure all the
xfs_bmbt_to_iomap() callers stored the sequence number consistently.

> > @@ -839,24 +848,25 @@ xfs_direct_write_iomap_begin(
> >  	xfs_iunlock(ip, lockmode);
> >  
> >  	error = xfs_iomap_write_direct(ip, offset_fsb, end_fsb - offset_fsb,
> > -			flags, &imap);
> > +			flags, &imap, &seq);
> >  	if (error)
> >  		return error;
> >  
> >  	trace_xfs_iomap_alloc(ip, offset, length, XFS_DATA_FORK, &imap);
> >  	return xfs_bmbt_to_iomap(ip, iomap, &imap, flags,
> > -				 iomap_flags | IOMAP_F_NEW);
> > +				 iomap_flags | IOMAP_F_NEW, seq);
> >  
> >  out_found_cow:
> > +	seq = READ_ONCE(ip->i_df.if_seq);
> >  	xfs_iunlock(ip, lockmode);
> >  	length = XFS_FSB_TO_B(mp, cmap.br_startoff + cmap.br_blockcount);
> >  	trace_xfs_iomap_found(ip, offset, length - offset, XFS_COW_FORK, &cmap);
> >  	if (imap.br_startblock != HOLESTARTBLOCK) {
> > -		error = xfs_bmbt_to_iomap(ip, srcmap, &imap, flags, 0);
> > +		error = xfs_bmbt_to_iomap(ip, srcmap, &imap, flags, 0, seq);
> >  		if (error)
> >  			return error;
> >  	}
> > -	return xfs_bmbt_to_iomap(ip, iomap, &cmap, flags, IOMAP_F_SHARED);
> > +	return xfs_bmbt_to_iomap(ip, iomap, &cmap, flags, IOMAP_F_SHARED, seq);
> 
> Here we've found a mapping from the COW fork and we're encoding it into
> the struct iomap.  Why is the sequence number from the *data* fork and
> not the COW fork?

Because my brain isn't big enough to hold all this code straight in
my head.  I found it all too easy to get confused by what iomap is
passed where and how to tell them apart and I could not tell if
there was an easy way to tell if the iomap passed to
xfs_iomap_valid() needed to check against the data or cow fork.

Hence I set it up such that the xfs_iomap_valid() callback only
checks the data fork sequence so the only correct thing to set in
the iomap is the data fork sequence and moved on to the next part of
the problem...

But, again, this information probably should be encoded into the
sequence cookie we store. Which begs the question - if we are doing
a COW operation, we have to check that neither the data fork (the
srcmap) nor the COW fork (the iter->iomap) have changed, right? This
is what the writeback code does (i.e. checks both data and COW fork
sequence numbers), so perhaps that's also what we should be doing
here?

i.e. either the cookie for COW operations needs to contain both
data+cow sequence numbers, and both need to match to proceed, or we
have to validate both the srcmap and the iter->iomap with separate
callouts once the folio is locked.

Thoughts?

> > @@ -1328,9 +1342,26 @@ xfs_buffered_write_iomap_end(
> >  	return 0;
> >  }
> >  
> > +/*
> > + * Check that the iomap passed to us is still valid for the given offset and
> > + * length.
> > + */
> > +static bool
> > +xfs_buffered_write_iomap_valid(
> > +	struct inode		*inode,
> > +	const struct iomap	*iomap)
> > +{
> > +	int			seq = *((int *)&iomap->private);
> > +
> > +	if (seq != READ_ONCE(XFS_I(inode)->i_df.if_seq))
> 
> Why is it ok to sample the data fork's sequence number even if the
> mapping came from the COW fork?  That doesn't make sense to me,
> conceptually.  I definitely think it's buggy that the revalidation
> might not sample from the same sequence counter as the original
> measurement.

Yup, see above.

> >  const struct iomap_ops xfs_read_iomap_ops = {
> > @@ -1438,7 +1471,7 @@ xfs_seek_iomap_begin(
> >  			end_fsb = min(end_fsb, data_fsb);
> >  		xfs_trim_extent(&cmap, offset_fsb, end_fsb);
> >  		error = xfs_bmbt_to_iomap(ip, iomap, &cmap, flags,
> > -					  IOMAP_F_SHARED);
> > +				IOMAP_F_SHARED, READ_ONCE(ip->i_cowfp->if_seq));
> 
> Here we explicitly sample from the cow fork but the comparison is done
> against the data fork.  That /probably/ doesn't matter for reads since
> you're not proposing that we revalidate on those.  Should we?

As I said elsewhere, I haven't even looked at the read path. The
iomap code is now complex enough that I have trouble following it
and keeping all the corner cases in my head. This is the down side
of having people smarter than you write the code you need to debug
when shit inevitably goes wrong.

> What
> happens if userspace hands us a large read() from an unwritten extent
> and the same dirty/writeback/reclaim sequence happens to the last folio
> in that read() range before read_iter actually gets there?

No idea - shit may well go wrong there in the same way, but I've had
my hands full just fixing the write path. I'm *really* tired of
fighting with the iomap code right now...

> > @@ -189,7 +190,7 @@ xfs_fs_map_blocks(
> >  		xfs_iunlock(ip, lock_flags);
> >  
> >  		error = xfs_iomap_write_direct(ip, offset_fsb,
> > -				end_fsb - offset_fsb, 0, &imap);
> > +				end_fsb - offset_fsb, 0, &imap, &seq);
> 
> I got a compiler warning about seq not being set in the case where we
> don't call xfs_iomap_write_direct.

Yup, gcc 11.2 doesn't warn about it. Now to find out why my build
system is using gcc 11.2 when I upgraded it months ago to use
gcc-12 because of exactly these sorts of issues.... :(

-Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx