Re: [PATCH v2 2/2] xfs: fix buffer state management in xrep_findroot_block

Brian Foster <bfoster@xxxxxxxxxx> · Wed, 3 Oct 2018 07:05:27 -0400

On Tue, Oct 02, 2018 at 07:02:11PM -0700, Darrick J. Wong wrote:
> We don't handle buffer state properly in online repair's findroot
> routine.  If a buffer already has b_ops set, we don't ever want to touch
> that, and we don't want to call the read verifiers on a buffer that
> could be dirty (CRCs are only recomputed during log checkpoints).
> 
> Therefore, be more careful about what we do with a buffer -- if someone
> else already attached ops that are not the ones for this btree type,
> just ignore the buffer.  We only attach our btree type's buf ops if it
> matches the magic/uuid and structure checks.
> 
> We also modify xfs_buf_read_map to allow callers to set buffer ops on a
> DONE buffer with NULL ops so that repair doesn't leave behind buffers
> which won't have buffers attached to them.
> 
> Signed-off-by: Darrick J. Wong <darrick.wong@xxxxxxxxxx>
> ---
>  fs/xfs/scrub/repair.c  |   63 +++++++++++++++++++++++++++++++++++++-----------
>  fs/xfs/xfs_buf.c       |   12 +++++++++
>  fs/xfs/xfs_trans.h     |    1 +
>  fs/xfs/xfs_trans_buf.c |   13 ++++++++++
>  4 files changed, 75 insertions(+), 14 deletions(-)
> 
> diff --git a/fs/xfs/scrub/repair.c b/fs/xfs/scrub/repair.c
> index 63786341ac2a..cebaebb26566 100644
> --- a/fs/xfs/scrub/repair.c
> +++ b/fs/xfs/scrub/repair.c
> @@ -29,6 +29,8 @@
>  #include "xfs_ag_resv.h"
>  #include "xfs_trans_space.h"
>  #include "xfs_quota.h"
> +#include "xfs_attr.h"
> +#include "xfs_reflink.h"
>  #include "scrub/xfs_scrub.h"
>  #include "scrub/scrub.h"
>  #include "scrub/common.h"
> @@ -699,7 +701,7 @@ xrep_findroot_block(
>  	struct xfs_btree_block		*btblock;
>  	xfs_daddr_t			daddr;
>  	int				block_level;
> -	int				error;
> +	int				error = 0;
>  
>  	daddr = XFS_AGB_TO_DADDR(mp, ri->sc->sa.agno, agbno);
>  
> @@ -718,28 +720,61 @@ xrep_findroot_block(
>  			return error;
>  	}
>  
> +	/*
> +	 * Read the buffer into memory so that we can see if it's a match for
> +	 * our btree type.  We have no clue if it is beforehand, and we want to
> +	 * avoid xfs_trans_read_buf's behavior of dumping the DONE state (which
> +	 * will cause needless disk reads in subsequent calls to this function)
> +	 * and logging metadata verifier failures.
> +	 *
> +	 * Therefore, pass in NULL buffer ops.  If the buffer was already in
> +	 * memory from some other caller it will already have b_ops assigned.
> +	 * If it was in memory from a previous unsuccessful findroot_block
> +	 * call, the buffer won't have b_ops but it should be clean and ready
> +	 * for us to try to verify if the read call succeeds.  The same applies
> +	 * if the buffer wasn't in memory at all.
> +	 *
> +	 * Note: If we never match a btree type with this buffer, it will be
> +	 * left in memory with NULL b_ops.  This shouldn't be a problem unless
> +	 * the buffer gets written.
> +	 */
>  	error = xfs_trans_read_buf(mp, ri->sc->tp, mp->m_ddev_targp, daddr,
>  			mp->m_bsize, 0, &bp, NULL);
>  	if (error)
>  		return error;
>  
> -	/*
> -	 * Does this look like a block matching our fs and higher than any
> -	 * other block we've found so far?  If so, reattach buffer verifiers
> -	 * so the AIL won't complain if the buffer is also dirty.
> -	 */
> +	/* Ensure the block magic matches the btree type we're looking for. */
>  	btblock = XFS_BUF_TO_BLOCK(bp);
>  	if (be32_to_cpu(btblock->bb_magic) != fab->magic)
>  		goto out;
> -	if (xfs_sb_version_hascrc(&mp->m_sb) &&
> -	    !uuid_equal(&btblock->bb_u.s.bb_uuid, &mp->m_sb.sb_meta_uuid))
> -		goto out;
> -	bp->b_ops = fab->buf_ops;
>  
> -	/* Make sure we pass the verifiers. */
> -	bp->b_ops->verify_read(bp);
> -	if (bp->b_error)
> -		goto out;
> +	/*
> +	 * If the buffer already has ops applied and they're not the ones for
> +	 * this btree type, we know this block doesn't match the btree and we
> +	 * can bail out.
> +	 *
> +	 * If the buffer ops match ours, someone else has already validated
> +	 * the block for us, so we can move on to checking if this is a root
> +	 * block candidate.
> +	 *
> +	 * If the buffer does not have ops, nobody has successfully validated
> +	 * the contents and the buffer cannot be dirty.  If the magic, uuid,
> +	 * and structure match this btree type then we'll move on to checking
> +	 * if it's a root block candidate.  If there is no match, bail out.
> +	 */
> +	if (bp->b_ops) {
> +		if (bp->b_ops != fab->buf_ops)
> +			goto out;
> +	} else {
> +		ASSERT(!xfs_trans_buf_is_dirty(bp));
> +		if (!uuid_equal(&btblock->bb_u.s.bb_uuid,
> +				&mp->m_sb.sb_meta_uuid))
> +			goto out;
> +		fab->buf_ops->verify_read(bp);
> +		if (bp->b_error)
> +			goto out;
> +		bp->b_ops = fab->buf_ops;
> +	}
>  
>  	/*
>  	 * This block passes the magic/uuid and verifier tests for this btree
> diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
> index e839907e8492..7aa78cef9c3a 100644
> --- a/fs/xfs/xfs_buf.c
> +++ b/fs/xfs/xfs_buf.c

Separate patch for the core buffer changes please. :)

> @@ -777,6 +777,18 @@ xfs_buf_read_map(
>  			xfs_buf_relse(bp);
>  			return NULL;
>  		} else {
> +			/*
> +			 * Online repair can leave done buffers in memory with
> +			 * no b_ops assigned.  If the caller passed in an ops
> +			 * structure, see if the buffer passes the verifier.
> +			 * If successful, assign the ops to the buffer.
> +			 */
> +			if (!bp->b_ops && ops) {
> +				ops->verify_read(bp);
> +				if (!bp->b_error)
> +					bp->b_ops = ops;

Why the conditional assignment? The read branch above looks like all
current ops != NULL callers essentially force attach ->b_ops. E.g., a
verification failure is an unconditional I/O error in this context, so
ops is not considered "prospective" at this layer like it is in the
scrub context. Not that it matters that much, but I think it's strange
to attach ->b_ops (on failure) based on whether the buffer had already
been read unverified or not.

Another quirk to consider here is that we currently don't set XBF_DONE
on read verifier failure. Should we clear XBF_DONE in the event of
deferred verification failure so a follow on read can retry?

> +			}
> +

And since this is a core infrastructure change, I also think we need to
step back and think about how this could be used in a variety of cases
going forward (not just our current scrub use case). For one (and nobody
does this right now), but it looks like an xfs_trans_read_buf(..., NULL)
caller followed by an xfs_trans_read_buf(..., ops) caller would still
never verify the buffer because it can be looked up in the transaction
without hitting the read path.

Also, it may be technically Ok to defer verification in the readahead
case of a previously unverified buffer, but I _think_ I'm of the opinion
that we should verify the buffer ASAP when the caller has provided a
non-NULL ops. So if verification currently happens on read-ahead in some
particular workload (in combination with scrub), we should preserve that
behavior. At minimum, doing so ensures we don't lose potential
verification failures if a particular readahead buffer never otherwise
ends up being read "for real."

Hmm, I'm wondering if it might be a good idea to factor some combination
of the current ->verify_read() call and ->b_error handling into a small
helper, make sure we call it in all the right places and conditionalize
the error handling to callers that aren't in ioend context. I dunno,
maybe there's a cleaner way to do that. I also think we could use some
asserts to (for example) make sure we always return a ->b_ops != NULL
buffer if the caller passed a buf_ops. FWIW, I'm also not against the
XBF_VERIFIED thing to facilitate that, if that's cleaner than just
checking ->b_ops everywhere.

Brian

>  			/* We do not want read in the flags */
>  			bp->b_flags &= ~XBF_READ;
>  		}
> diff --git a/fs/xfs/xfs_trans.h b/fs/xfs/xfs_trans.h
> index c3d278e96ad1..a0c5dbda18aa 100644
> --- a/fs/xfs/xfs_trans.h
> +++ b/fs/xfs/xfs_trans.h
> @@ -220,6 +220,7 @@ void		xfs_trans_ijoin(struct xfs_trans *, struct xfs_inode *, uint);
>  void		xfs_trans_log_buf(struct xfs_trans *, struct xfs_buf *, uint,
>  				  uint);
>  void		xfs_trans_dirty_buf(struct xfs_trans *, struct xfs_buf *);
> +bool		xfs_trans_buf_is_dirty(struct xfs_buf *bp);
>  void		xfs_trans_log_inode(xfs_trans_t *, struct xfs_inode *, uint);
>  
>  void		xfs_extent_free_init_defer_op(void);
> diff --git a/fs/xfs/xfs_trans_buf.c b/fs/xfs/xfs_trans_buf.c
> index 15919f67a88f..bdc5be4d0210 100644
> --- a/fs/xfs/xfs_trans_buf.c
> +++ b/fs/xfs/xfs_trans_buf.c
> @@ -321,6 +321,19 @@ xfs_trans_read_buf_map(
>  
>  }
>  
> +/* Has this buffer been dirtied by anyone? */
> +bool
> +xfs_trans_buf_is_dirty(
> +	struct xfs_buf		*bp)
> +{
> +	struct xfs_buf_log_item	*bip = bp->b_log_item;
> +
> +	if (!bip)
> +		return false;
> +	ASSERT(bip->bli_item.li_type == XFS_LI_BUF);
> +	return test_bit(XFS_LI_DIRTY, &bip->bli_item.li_flags);
> +}
> +
>  /*
>   * Release the buffer bp which was previously acquired with one of the
>   * xfs_trans_... buffer allocation routines if the buffer has not