Re: [PATCH 08/27] xfs: scrub the shape of a metadata btree

"Darrick J. Wong" <darrick.wong@xxxxxxxxxx> · Fri, 22 Sep 2017 10:22:07 -0700

On Fri, Sep 22, 2017 at 11:22:33AM -0400, Brian Foster wrote:
> On Wed, Sep 20, 2017 at 05:18:26PM -0700, Darrick J. Wong wrote:
> > From: Darrick J. Wong <darrick.wong@xxxxxxxxxx>
> > 
> > Create a function that can check the shape of a btree -- each block
> > passes basic inspection and all the pointers look ok.  In the next patch
> > we'll add the ability to check the actual keys and records stored within
> > the btree.  Add some helper functions so that we report detailed scrub
> > errors in a uniform manner in dmesg.  These are helper functions for
> > subsequent patches.
> > 
> > Signed-off-by: Darrick J. Wong <darrick.wong@xxxxxxxxxx>
> > ---
> >  fs/xfs/libxfs/xfs_btree.c |   16 +++
> >  fs/xfs/libxfs/xfs_btree.h |    7 +
> >  fs/xfs/scrub/btree.c      |  236 +++++++++++++++++++++++++++++++++++++++++++++
> >  fs/xfs/scrub/common.h     |   13 ++
> >  4 files changed, 268 insertions(+), 4 deletions(-)
> > 
> > 
> ...
> > diff --git a/fs/xfs/scrub/btree.c b/fs/xfs/scrub/btree.c
> > index adf5d09..a9c2bf3 100644
> > --- a/fs/xfs/scrub/btree.c
> > +++ b/fs/xfs/scrub/btree.c
> ...
> > @@ -109,6 +255,92 @@ xfs_scrub_btree(
> >  	struct xfs_owner_info		*oinfo,
> >  	void				*private)
> >  {
> > -	xfs_scrub_btree_op_ok(sc, cur, 0, false);
> > -	return -EOPNOTSUPP;
> > +	struct xfs_scrub_btree		bs = {0};
> > +	union xfs_btree_ptr		ptr;
> > +	union xfs_btree_ptr		*pp;
> > +	struct xfs_btree_block		*block;
> > +	int				level;
> > +	struct xfs_buf			*bp;
> > +	int				i;
> > +	int				error = 0;
> > +
> > +	/* Initialize scrub state */
> > +	bs.cur = cur;
> > +	bs.scrub_rec = scrub_fn;
> > +	bs.oinfo = oinfo;
> > +	bs.firstrec = true;
> > +	bs.private = private;
> > +	bs.sc = sc;
> > +	for (i = 0; i < XFS_BTREE_MAXLEVELS; i++)
> > +		bs.firstkey[i] = true;
> > +	INIT_LIST_HEAD(&bs.to_check);
> > +
> > +	/* Don't try to check a tree with a height we can't handle. */
> > +	if (!xfs_scrub_btree_check_ok(sc, cur, 0, cur->bc_nlevels > 0 &&
> > +			cur->bc_nlevels <= XFS_BTREE_MAXLEVELS))
> > +		goto out;
> > +
> > +	/* Make sure the root isn't in the superblock. */
> > +	if (!(cur->bc_flags & XFS_BTREE_ROOT_IN_INODE)) {
> > +		cur->bc_ops->init_ptr_from_cur(cur, &ptr);
> > +		error = xfs_scrub_btree_ptr(&bs, cur->bc_nlevels, &ptr);
> > +		if (!xfs_scrub_btree_op_ok(sc, cur, cur->bc_nlevels - 1, &error))
> > +			goto out;
> > +	}
> > +
> 
> This is kind of in line with Dave's comments on the previous patch that
> introduce some of these helpers. I just glanced over them for now
> because I didn't have enough context to grok the error processing.

(That's been a struggle with this patchset -- some of these helpers
don't get used until much further in the patchset.  I could have
sprinkled them into whichever patch uses them first, but now the hunks
are all over the series, I'd have to do more dependency tracking to make
sure bisect continues to work, and the frequency of auto-merge failures
as I push and pop the stack increase dramatically.)

> FWIW, the btree_op_ok()/btree_check_ok() stuff kind of makes my eyes
> cross a bit because I can't easily see the logic checks or distinguish
> between those and error code checks. This is also a bit confusing
> because it looks like we overload return codes for various things. E.g.,
> we generate -EFSCORRUPTED in some cases just so the caller can set the
> state on the context and clear it, but then we still use the fact that
> an error _was_ set to control the flow of the task via the op_ok()
> return value. This makes some of the code flow/decision making logic
> hard to follow, particularly since some of that state looks like it can
> be lost.
> 
> Case in point.. what happens if say xfs_btree_increment() returns
> -EFSCORRUPTED back to xfs_scrub_btree_block_check_sibling()? It looks to
> me that the latter calls btree_op_ok() to set the corrupt state, clears
> the error code and skips out appropriately.
> xfs_scrub_btree_block_check_sibling() now returns zero, which
> potentially bubbles up to xfs_scrub_btree() where we check the error
> code again. Is it expected that error == 0 here? What is supposed to
> happen here?

Yes, error == 0 is intended here.  Given a block B, we want to check
that B->rightsib->leftsib (if the sibling exists at all) point back to
B.  If the btree_increment operation returns EFSCORRUPTED we don't know
if that's because the B->rightsib points at something it's not supposed
to, or if B->rightsib points at a btree block, but that sibling block is
corrupt.  Therefore we set the corrupt flag and bubble error == 0 up the
call stack so that we can check the other records in the btree.   This
enables those following with ftrace to see everything that scrub thinks
is wrong with that piece of metadata.

IOWs, we only use error code returns for "runtime error, abort the whole
operation immediately".

> I'm wondering if this could all be made more clear by trying to
> explicitly separate out operational errors, scrub failures and whatever
> we want to call the logic that clears an -EFSCORRUPTED/-EFSBADCRC error
> code but still indicates something happened. :P
> 
> For starters, rather than wrap every logic check with btree_op_check(),
> could we use explicit logic and let each function update the context
> based on problems it found? For example, something like the following is
> much more easy to read for me than the associated logic above:
> 
> 	/* Don't try to check a tree with a height we can't handle. */
> 	if (!(cur->bc_nlevels > 0 &&
> 	      cur->bc_nlevels <= XFS_BTREE_MAXLEVELS)) {
> 		xfs_scrub_sc_corrupt(...);
> 		goto out;
> 	}
> 
> And of course the context update calls could be factored into an
> out_corrupt label or something where appropriate.

Yes, that could be done.

> Beyond that, where we need to identify a bit of metadata is busted to
> perhaps do something like skip it but not abort (as we may have filtered
> out an -EFSCORRUPTED) return code, could we pass a flag down a
> particular callchain (i.e., think 'bool *bad' or 'int *stat' a la the
> core btree code)? Then we can still transfer that state back up the
> chain and the caller(s) can distinguish operational errors from "this
> thing is corrupted, act accordingly," regardless of how the corruption
> was detected.

So far I haven't needed to distinguish between "no problems encountered"
and "this callchain hit a verifier error so we just set _CORRUPT" --
scrub always keeps going until it runs out of things to check.

(Maybe I'm missing something?)

--D

> 
> Brian
> 
> > +	/* Load the root of the btree. */
> > +	level = cur->bc_nlevels - 1;
> > +	cur->bc_ops->init_ptr_from_cur(cur, &ptr);
> > +	error = xfs_scrub_btree_block(&bs, level, &ptr, &block, &bp);
> > +	if (!xfs_scrub_btree_op_ok(sc, cur, cur->bc_nlevels - 1, &error))
> > +		goto out;
> > +
> > +	cur->bc_ptrs[level] = 1;
> > +
> > +	while (level < cur->bc_nlevels) {
> > +		block = xfs_btree_get_block(cur, level, &bp);
> > +
> > +		if (level == 0) {
> > +			/* End of leaf, pop back towards the root. */
> > +			if (cur->bc_ptrs[level] >
> > +			    be16_to_cpu(block->bb_numrecs)) {
> > +				if (level < cur->bc_nlevels - 1)
> > +					cur->bc_ptrs[level + 1]++;
> > +				level++;
> > +				continue;
> > +			}
> > +
> > +			if (xfs_scrub_should_terminate(&error))
> > +				break;
> > +
> > +			cur->bc_ptrs[level]++;
> > +			continue;
> > +		}
> > +
> > +		/* End of node, pop back towards the root. */
> > +		if (cur->bc_ptrs[level] > be16_to_cpu(block->bb_numrecs)) {
> > +			if (level < cur->bc_nlevels - 1)
> > +				cur->bc_ptrs[level + 1]++;
> > +			level++;
> > +			continue;
> > +		}
> > +
> > +		/* Drill another level deeper. */
> > +		pp = xfs_btree_ptr_addr(cur, cur->bc_ptrs[level], block);
> > +		error = xfs_scrub_btree_ptr(&bs, level, pp);
> > +		if (error) {
> > +			error = 0;
> > +			cur->bc_ptrs[level]++;
> > +			continue;
> > +		}
> > +		level--;
> > +		error = xfs_scrub_btree_block(&bs, level, pp, &block, &bp);
> > +		if (!xfs_scrub_btree_op_ok(sc, cur, level, &error))
> > +			goto out;
> > +
> > +		cur->bc_ptrs[level] = 1;
> > +	}
> > +
> > +out:
> > +	return error;
> >  }
> > diff --git a/fs/xfs/scrub/common.h b/fs/xfs/scrub/common.h
> > index e1bb14b..9920488 100644
> > --- a/fs/xfs/scrub/common.h
> > +++ b/fs/xfs/scrub/common.h
> > @@ -20,6 +20,19 @@
> >  #ifndef __XFS_SCRUB_COMMON_H__
> >  #define __XFS_SCRUB_COMMON_H__
> >  
> > +/* Should we end the scrub early? */
> > +static inline bool
> > +xfs_scrub_should_terminate(
> > +	int		*error)
> > +{
> > +	if (fatal_signal_pending(current)) {
> > +		if (*error == 0)
> > +			*error = -EAGAIN;
> > +		return true;
> > +	}
> > +	return false;
> > +}
> > +
> >  /*
> >   * Grab a transaction.  If we're going to repair something, we need to
> >   * ensure there's enough reservation to make all the changes.  If not,
> > 
> > --
> > To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> > the body of a message to majordomo@xxxxxxxxxxxxxxx
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> --
> To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> the body of a message to majordomo@xxxxxxxxxxxxxxx
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html