Re: [PATCH 08/27] xfs: scrub the shape of a metadata btree

Brian Foster <bfoster@xxxxxxxxxx> · Fri, 22 Sep 2017 17:15:20 -0400

On Fri, Sep 22, 2017 at 01:14:39PM -0700, Darrick J. Wong wrote:
> On Fri, Sep 22, 2017 at 03:13:20PM -0400, Brian Foster wrote:
> > On Fri, Sep 22, 2017 at 10:22:07AM -0700, Darrick J. Wong wrote:
> > > On Fri, Sep 22, 2017 at 11:22:33AM -0400, Brian Foster wrote:
> > > > On Wed, Sep 20, 2017 at 05:18:26PM -0700, Darrick J. Wong wrote:
> > > > > From: Darrick J. Wong <darrick.wong@xxxxxxxxxx>
> > > > > 
> > > > > Create a function that can check the shape of a btree -- each block
> > > > > passes basic inspection and all the pointers look ok.  In the next patch
> > > > > we'll add the ability to check the actual keys and records stored within
> > > > > the btree.  Add some helper functions so that we report detailed scrub
> > > > > errors in a uniform manner in dmesg.  These are helper functions for
> > > > > subsequent patches.
> > > > > 
> > > > > Signed-off-by: Darrick J. Wong <darrick.wong@xxxxxxxxxx>
> > > > > ---
> > > > >  fs/xfs/libxfs/xfs_btree.c |   16 +++
> > > > >  fs/xfs/libxfs/xfs_btree.h |    7 +
> > > > >  fs/xfs/scrub/btree.c      |  236 +++++++++++++++++++++++++++++++++++++++++++++
> > > > >  fs/xfs/scrub/common.h     |   13 ++
> > > > >  4 files changed, 268 insertions(+), 4 deletions(-)
> > > > > 
> > > > > 
> > > > ...
> > > > > diff --git a/fs/xfs/scrub/btree.c b/fs/xfs/scrub/btree.c
> > > > > index adf5d09..a9c2bf3 100644
> > > > > --- a/fs/xfs/scrub/btree.c
> > > > > +++ b/fs/xfs/scrub/btree.c
> > > > ...
> > > > > @@ -109,6 +255,92 @@ xfs_scrub_btree(
> > > > >  	struct xfs_owner_info		*oinfo,
> > > > >  	void				*private)
> > > > >  {
> > > > > -	xfs_scrub_btree_op_ok(sc, cur, 0, false);
> > > > > -	return -EOPNOTSUPP;
> > > > > +	struct xfs_scrub_btree		bs = {0};
> > > > > +	union xfs_btree_ptr		ptr;
> > > > > +	union xfs_btree_ptr		*pp;
> > > > > +	struct xfs_btree_block		*block;
> > > > > +	int				level;
> > > > > +	struct xfs_buf			*bp;
> > > > > +	int				i;
> > > > > +	int				error = 0;
> > > > > +
> > > > > +	/* Initialize scrub state */
> > > > > +	bs.cur = cur;
> > > > > +	bs.scrub_rec = scrub_fn;
> > > > > +	bs.oinfo = oinfo;
> > > > > +	bs.firstrec = true;
> > > > > +	bs.private = private;
> > > > > +	bs.sc = sc;
> > > > > +	for (i = 0; i < XFS_BTREE_MAXLEVELS; i++)
> > > > > +		bs.firstkey[i] = true;
> > > > > +	INIT_LIST_HEAD(&bs.to_check);
> > > > > +
> > > > > +	/* Don't try to check a tree with a height we can't handle. */
> > > > > +	if (!xfs_scrub_btree_check_ok(sc, cur, 0, cur->bc_nlevels > 0 &&
> > > > > +			cur->bc_nlevels <= XFS_BTREE_MAXLEVELS))
> > > > > +		goto out;
> > > > > +
> > > > > +	/* Make sure the root isn't in the superblock. */
> > > > > +	if (!(cur->bc_flags & XFS_BTREE_ROOT_IN_INODE)) {
> > > > > +		cur->bc_ops->init_ptr_from_cur(cur, &ptr);
> > > > > +		error = xfs_scrub_btree_ptr(&bs, cur->bc_nlevels, &ptr);
> > > > > +		if (!xfs_scrub_btree_op_ok(sc, cur, cur->bc_nlevels - 1, &error))
> > > > > +			goto out;
> > > > > +	}
> > > > > +
> > > > 
> > > > This is kind of in line with Dave's comments on the previous patch that
> > > > introduce some of these helpers. I just glanced over them for now
> > > > because I didn't have enough context to grok the error processing.
> > > 
> > > (That's been a struggle with this patchset -- some of these helpers
> > > don't get used until much further in the patchset.  I could have
> > > sprinkled them into whichever patch uses them first, but now the hunks
> > > are all over the series, I'd have to do more dependency tracking to make
> > > sure bisect continues to work, and the frequency of auto-merge failures
> > > as I push and pop the stack increase dramatically.)
> > > 
> > > > FWIW, the btree_op_ok()/btree_check_ok() stuff kind of makes my eyes
> > > > cross a bit because I can't easily see the logic checks or distinguish
> > > > between those and error code checks. This is also a bit confusing
> > > > because it looks like we overload return codes for various things. E.g.,
> > > > we generate -EFSCORRUPTED in some cases just so the caller can set the
> > > > state on the context and clear it, but then we still use the fact that
> > > > an error _was_ set to control the flow of the task via the op_ok()
> > > > return value. This makes some of the code flow/decision making logic
> > > > hard to follow, particularly since some of that state looks like it can
> > > > be lost.
> > > > 
> > > > Case in point.. what happens if say xfs_btree_increment() returns
> > > > -EFSCORRUPTED back to xfs_scrub_btree_block_check_sibling()? It looks to
> > > > me that the latter calls btree_op_ok() to set the corrupt state, clears
> > > > the error code and skips out appropriately.
> > > > xfs_scrub_btree_block_check_sibling() now returns zero, which
> > > > potentially bubbles up to xfs_scrub_btree() where we check the error
> > > > code again. Is it expected that error == 0 here? What is supposed to
> > > > happen here?
> > > 
> > > Yes, error == 0 is intended here.  Given a block B, we want to check
> > > that B->rightsib->leftsib (if the sibling exists at all) point back to
> > > B.  If the btree_increment operation returns EFSCORRUPTED we don't know
> > > if that's because the B->rightsib points at something it's not supposed
> > > to, or if B->rightsib points at a btree block, but that sibling block is
> > > corrupt.  Therefore we set the corrupt flag and bubble error == 0 up the
> > > call stack so that we can check the other records in the btree.   This
> > > enables those following with ftrace to see everything that scrub thinks
> > > is wrong with that piece of metadata.
> > > 
> > > IOWs, we only use error code returns for "runtime error, abort the whole
> > > operation immediately".
> > > 
> > 
> > Ok, so we intentionally have to consume the error here because it
> > doesn't necessarily reflect the corrupted state of the scrubbed block.
> 
> Not quite -- while we're examining blocks or otherwise operating on the
> btree we've been told to scrub, we always want to consume an
> EFSCORRUPTED error (and set the CORRUPT flag) because we always want to
> try to check everything, even if we find problems midway through a scan.
> 

Then why don't we consume -EFSCORRUPTED errors from the call to
xfs_btree_check_block() in xfs_scrub_btree_block(), but leave it for the
caller? Note that the point I'm trying to get at is not to say that we
sometimes don't consume -EFSCORRUPTED errors at all, but rather we
consume them in different places to implicitly affect the code flow.

In the xfs_btree_check_block() case, my understanding is that we have to
leave that one for the caller to consume rather than consume it
immediately like we do in the check_sibling() code. If my understand is
not correct, what's the difference between those two sites with regard
to consuming -EFSCORRUPTED immediately in one and not the other?

> This is similar to how gcc will keep processing past the first error to
> try to report everything that's wrong in the source file instead of
> bailing out at the first error like it used to do.
> 
> > So IIUC, overloading return codes as such means error handling must
> > either return -EFSCORRUPTED for the current object being corrupted,
> 
> We /never/ return EFSCORRUPTED to userspace, because we have the
> OFLAG_CORRUPT flag to indicate any kind of corruption anywhere in this
> data structure we're checking.
> 

Understood, I'm not suggesting we'd return -EFSCORRUPTED to userspace.

> Let's say that a 2-level btree looks like this:
> 
>            B0
>             |
> +-----+-----+--------+-----+----------------+-------------------+
> |     |     |        |     |                |                   |
> |     |     |        V     |                |                   |
> |     |     |  someblock   |                |                   |
> |     |     |              |                |                   |
> V     V     V              V                V                   V
> B1 -> B2 -> B3 ----> B4 -> B5 (badmagic) -> B6 (bad records) -> B7
> 
> Here we have a bad pointer in B0 that should point to B4 but now points
> to something that was never part of B.  In B6 we have some incorrect
> records, and in B5 we have a bad checksum.
> 
> First we visit B0.  Nothing obviously wrong there, so we proceed with
> the depth-first search of B.  We examine B1 and B2 via pointer[1] and
> pointer[2], respectively, and find nothing wrong.  Now we try
> pointer[3], which we follow to B3.
> 
> Then we get to B3's sibling pointer check.  Leftsib is ok, but when we
> move on to checking rightsib, the xfs_btree_increment returns EFSBADCRC
> because the pointer[4] in B0 points to a block that isn't in the btree.
> Here we want to consume the EFSBADCRC, so we set OFLAG_CORRUPT and
> continue walking the tree.
> 
> Next we try to walk pointer[4] in B0 and again hit a EFSBADCRC error.
> Again we set OFLAG_CORRUPT and continue walking the tree.
> 
> Then we try to walk pointer[5] in B0 and encounter B5.  The CRC matches,
> but the magic number is wrong, so we hit EFSCORRUPTED.  The block is
> toast, but we still need to keep walking.
> 
> Now we walk pointer[6] and encounter B6.  We encounter no operational
> errors but then we see some incorrect records so we set OFLAG_CORRUPT
> (it's still set) and continue.
> 
> Finally we get to pointer[7], where everything is fine again.  If we
> haven't encountered any operational problems like ENOMEM then we'll
> return to userspace with OFLAG_CORRUPT set, a return value of zero.
> 
> The ftrace buffer will have a report about the operational error trying
> to walk down pointer[4], another one about B5, record check failures
> from B6.
> 
> (If instead we run out of memory checking B7 then we'll return the
> ENOMEM to userspace.)
> 

Got it, thanks. Note that I'm not trying to suggest to deviate from this
behavior. It's just a refactoring such that -EFSCORRUPTED is always
consumed immediately from any external function that could generate it
and is translated into the local scrub state (i.e., corrupted = true
and/or OFLAG_CORRUPT, where the latter is a subset of the former).
Basically, think of it as instead of never returning -EFSCORRUPTED to
userspace, it should not be returned by any scrub function, ever.

> > return some other error for an error in the infrastructure, or clear any
> > -EFSCORRUPTED error generated by checks that don't necessarily mean the
> > current object is corrupted (or too much so to interrupt processing).
> > 
> > > > I'm wondering if this could all be made more clear by trying to
> > > > explicitly separate out operational errors, scrub failures and whatever
> > > > we want to call the logic that clears an -EFSCORRUPTED/-EFSBADCRC error
> > > > code but still indicates something happened. :P
> > > > 
> > > > For starters, rather than wrap every logic check with btree_op_check(),
> > > > could we use explicit logic and let each function update the context
> > > > based on problems it found? For example, something like the following is
> > > > much more easy to read for me than the associated logic above:
> > > > 
> > > > 	/* Don't try to check a tree with a height we can't handle. */
> > > > 	if (!(cur->bc_nlevels > 0 &&
> > > > 	      cur->bc_nlevels <= XFS_BTREE_MAXLEVELS)) {
> > > > 		xfs_scrub_sc_corrupt(...);
> > > > 		goto out;
> > > > 	}
> > > > 
> > > > And of course the context update calls could be factored into an
> > > > out_corrupt label or something where appropriate.
> > > 
> > > Yes, that could be done.
> > > 
> > > > Beyond that, where we need to identify a bit of metadata is busted to
> > > > perhaps do something like skip it but not abort (as we may have filtered
> > > > out an -EFSCORRUPTED) return code, could we pass a flag down a
> > > > particular callchain (i.e., think 'bool *bad' or 'int *stat' a la the
> > > > core btree code)? Then we can still transfer that state back up the
> > > > chain and the caller(s) can distinguish operational errors from "this
> > > > thing is corrupted, act accordingly," regardless of how the corruption
> > > > was detected.
> > > 
> > > So far I haven't needed to distinguish between "no problems encountered"
> > > and "this callchain hit a verifier error so we just set _CORRUPT" --
> > > scrub always keeps going until it runs out of things to check.
> > > 
> > 
> > Ok. This is partly speculation on the above (trying to wrap my head
> > around the error consumption bits as is) and partly to try and see if we
> > can make the flow more readable.
> > 
> > In my mind, this is more clear if return codes are reserved for
> > operational/infrastructure errors
> 
> Yes, this is true.
> 
> > and the corrupted state of a piece of metadata is its own state.
> 
> Also true -- this is OFLAG_CORRUPT.
> 
> > Using the example above, any -EFSCORRUPTED errors from external calls
> > (xfs_btree_check_block(), xfs_btree_increment(), etc.) would always be
> > cleared and replaced with a return 0.
> 
> <nod>
> 
> > The difference between those is the former (check_block()) error sets
> > a 'bad = true' state on the currently scrubbed bit of metadata and the
> > latter (check_sibling()) does not.
> 
> Nothing in the scrub code needs to track badness at that fine-grained of
> a level.  When we get to the repair patches you'll see that any kind of
> error triggers a complete rebuild of the btree index, with absolutely no
> attempt to touch the existing (inconsistent) btree.
> 

Perhaps, but that's not really the point. My argument is that it
simplifies the logic and thus makes the code more clear. :)

> We /could/ return to userspace as soon as we hit the first EFSCORRUPTED
> or failed check, TBH.
> 

That's a separate question that I haven't really thought much about yet
tbh. I suppose I could see use for a oneshot option or something for
users who may want to exit quickly in favor of offline repair rather
than wait for a thorough scan.

> > The latter can of course still set the global corrupted state on the
> > context to track that there is an inconsistency in the fs. Thoughts?
> 
> I've wondered if it might be clearer if we did something like:
> 
> int error;
> bool bailout;
> 
> error = xfs_btree_increment(...);
> bailout = xfs_scrub_op_error(..., &error);
> if (bailout)
> 	return error;
> 
> if (ptr->field == BADVAL) {
> 	xfs_scrub_corrupt(...);
> 	return error;
> }
> 

It still appears confusing to me because isn't whatever we were going to
do that depends on the xfs_btree_increment() call now bogus if the
increment itself fails, for whatever reason? When would you not bail out
of here on increment failure?

Using a more simple example of xfs_scrub_btree_block(), I'd expect it to
look something like this:

xfs_scrub_btree_block(..., bool *bad, ...)
{
	...

	error = xfs_btree_lookup_get_block(bs->cur, ...);
	if (error)
		goto out_err;

	xfs_btree_get_block(bs->cur, ...);
	error = xfs_btree_check_block(bs->cur, ...);
	if (error)
		goto out_err;

	error = xfs_scrub_btree_block_check_siblings(bs, ...);
	ASSERT(error != -EFSCORRUPTED);
	return error;

out_err:
	if (error == -EFSCORRUPTED) {
		*bad = true;
		error = 0;
	}
	return error;
}

check_siblings() doesn't need bad because it's checking the siblings. It
just sets OFLAG_CORRUPT if it finds corruption or returns a non
corruption error if one occurs. The caller sets OFLAG_CORRUPT if bad ==
true or bails if error != 0.

Brian

> <shrug>
> 
> --D
> 
> > 
> > Brian
> > 
> > > (Maybe I'm missing something?)
> > > 
> > > --D
> > > 
> > > > 
> > > > Brian
> > > > 
> > > > > +	/* Load the root of the btree. */
> > > > > +	level = cur->bc_nlevels - 1;
> > > > > +	cur->bc_ops->init_ptr_from_cur(cur, &ptr);
> > > > > +	error = xfs_scrub_btree_block(&bs, level, &ptr, &block, &bp);
> > > > > +	if (!xfs_scrub_btree_op_ok(sc, cur, cur->bc_nlevels - 1, &error))
> > > > > +		goto out;
> > > > > +
> > > > > +	cur->bc_ptrs[level] = 1;
> > > > > +
> > > > > +	while (level < cur->bc_nlevels) {
> > > > > +		block = xfs_btree_get_block(cur, level, &bp);
> > > > > +
> > > > > +		if (level == 0) {
> > > > > +			/* End of leaf, pop back towards the root. */
> > > > > +			if (cur->bc_ptrs[level] >
> > > > > +			    be16_to_cpu(block->bb_numrecs)) {
> > > > > +				if (level < cur->bc_nlevels - 1)
> > > > > +					cur->bc_ptrs[level + 1]++;
> > > > > +				level++;
> > > > > +				continue;
> > > > > +			}
> > > > > +
> > > > > +			if (xfs_scrub_should_terminate(&error))
> > > > > +				break;
> > > > > +
> > > > > +			cur->bc_ptrs[level]++;
> > > > > +			continue;
> > > > > +		}
> > > > > +
> > > > > +		/* End of node, pop back towards the root. */
> > > > > +		if (cur->bc_ptrs[level] > be16_to_cpu(block->bb_numrecs)) {
> > > > > +			if (level < cur->bc_nlevels - 1)
> > > > > +				cur->bc_ptrs[level + 1]++;
> > > > > +			level++;
> > > > > +			continue;
> > > > > +		}
> > > > > +
> > > > > +		/* Drill another level deeper. */
> > > > > +		pp = xfs_btree_ptr_addr(cur, cur->bc_ptrs[level], block);
> > > > > +		error = xfs_scrub_btree_ptr(&bs, level, pp);
> > > > > +		if (error) {
> > > > > +			error = 0;
> > > > > +			cur->bc_ptrs[level]++;
> > > > > +			continue;
> > > > > +		}
> > > > > +		level--;
> > > > > +		error = xfs_scrub_btree_block(&bs, level, pp, &block, &bp);
> > > > > +		if (!xfs_scrub_btree_op_ok(sc, cur, level, &error))
> > > > > +			goto out;
> > > > > +
> > > > > +		cur->bc_ptrs[level] = 1;
> > > > > +	}
> > > > > +
> > > > > +out:
> > > > > +	return error;
> > > > >  }
> > > > > diff --git a/fs/xfs/scrub/common.h b/fs/xfs/scrub/common.h
> > > > > index e1bb14b..9920488 100644
> > > > > --- a/fs/xfs/scrub/common.h
> > > > > +++ b/fs/xfs/scrub/common.h
> > > > > @@ -20,6 +20,19 @@
> > > > >  #ifndef __XFS_SCRUB_COMMON_H__
> > > > >  #define __XFS_SCRUB_COMMON_H__
> > > > >  
> > > > > +/* Should we end the scrub early? */
> > > > > +static inline bool
> > > > > +xfs_scrub_should_terminate(
> > > > > +	int		*error)
> > > > > +{
> > > > > +	if (fatal_signal_pending(current)) {
> > > > > +		if (*error == 0)
> > > > > +			*error = -EAGAIN;
> > > > > +		return true;
> > > > > +	}
> > > > > +	return false;
> > > > > +}
> > > > > +
> > > > >  /*
> > > > >   * Grab a transaction.  If we're going to repair something, we need to
> > > > >   * ensure there's enough reservation to make all the changes.  If not,
> > > > > 
> > > > > --
> > > > > To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> > > > > the body of a message to majordomo@xxxxxxxxxxxxxxx
> > > > > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> > > > --
> > > > To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> > > > the body of a message to majordomo@xxxxxxxxxxxxxxx
> > > > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> > --
> > To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> > the body of a message to majordomo@xxxxxxxxxxxxxxx
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html