Re: [PATCH 1/9] xfs_repair: port the online repair newbt structure

Brian Foster <bfoster@xxxxxxxxxx> · Fri, 15 May 2020 15:43:43 -0400

On Fri, May 15, 2020 at 11:52:39AM -0700, Darrick J. Wong wrote:
> On Fri, May 15, 2020 at 07:41:16AM -0400, Brian Foster wrote:
> > On Thu, May 14, 2020 at 12:20:37PM -0700, Darrick J. Wong wrote:
> > > On Thu, May 14, 2020 at 11:09:33AM -0400, Brian Foster wrote:
> > > > On Sat, May 09, 2020 at 09:31:47AM -0700, Darrick J. Wong wrote:
> > > > > From: Darrick J. Wong <darrick.wong@xxxxxxxxxx>
> > > > > 
> > > > > Port the new btree staging context and related block reservation helper
> > > > > code from the kernel to repair.  We'll use this in subsequent patches to
> > > > > implement btree bulk loading.
> > > > > 
> > > > > Signed-off-by: Darrick J. Wong <darrick.wong@xxxxxxxxxx>
> > > > > ---
> > > > >  include/libxfs.h         |    1 
> > > > >  libxfs/libxfs_api_defs.h |    2 
> > > > >  repair/Makefile          |    4 -
> > > > >  repair/bload.c           |  276 ++++++++++++++++++++++++++++++++++++++++++++++
> > > > >  repair/bload.h           |   79 +++++++++++++
> > > > >  repair/xfs_repair.c      |   17 +++
> > > > >  6 files changed, 377 insertions(+), 2 deletions(-)
> > > > >  create mode 100644 repair/bload.c
> > > > >  create mode 100644 repair/bload.h
> > > > > 
> > > > > 
> > > > ...
> > > > > diff --git a/repair/bload.c b/repair/bload.c
> > > > > new file mode 100644
> > > > > index 00000000..ab05815c
> > > > > --- /dev/null
> > > > > +++ b/repair/bload.c
> > > > > @@ -0,0 +1,276 @@
> > > > > +// SPDX-License-Identifier: GPL-2.0-or-later
> > > > > +/*
> > > > > + * Copyright (C) 2020 Oracle.  All Rights Reserved.
> > > > > + * Author: Darrick J. Wong <darrick.wong@xxxxxxxxxx>
> > > > > + */
> > > > > +#include <libxfs.h>
> > > > > +#include "bload.h"
> > > > > +
> > > > > +#define trace_xrep_newbt_claim_block(...)	((void) 0)
> > > > > +#define trace_xrep_newbt_reserve_space(...)	((void) 0)
> > > > > +#define trace_xrep_newbt_unreserve_space(...)	((void) 0)
> > > > > +#define trace_xrep_newbt_claim_block(...)	((void) 0)
> > > > > +
> > > > > +int bload_leaf_slack = -1;
> > > > > +int bload_node_slack = -1;
> > > > > +
> > > > > +/* Ported routines from fs/xfs/scrub/repair.c */
> > > > > +
> > > > 
> > > > Any plans to generalize/lift more of this stuff into libxfs if it's
> > > > going to be shared with xfsprogs?
> > > 
> > > That depends on what the final online repair code looks like.
> > > I suspect it'll be different enough that it's not worth sharing, but I
> > > wouldn't be opposed to sharing identical functions.
> > > 
> > 
> > Ok, I was just going off the above note around porting existing code
> > from kernel scrub. I think it's reasonable to consider generalizations
> > later once both implementations are solidified.
> > 
> > > > ...
> > > > > > +/* Free all the accounting info and disk space we reserved for a new btree. */
> > > > > +void
> > > > > +xrep_newbt_destroy(
> > > > > +	struct xrep_newbt	*xnr,
> > > > > +	int			error)
> > > > > +{
> > > > > +	struct repair_ctx	*sc = xnr->sc;
> > > > > +	struct xrep_newbt_resv	*resv, *n;
> > > > > +
> > > > > +	if (error)
> > > > > +		goto junkit;
> > > > 
> > > > Could use a comment on why we skip block freeing here..
> > > 
> > > I wonder what was the original reason for that?
> > > 
> > > IIRC if we actually error out of btree rebuilds then we've done
> > > something totally wrong while setting up the btree loader, or the
> > > storage is so broken that writes failed.  Repair is just going to call
> > > do_error() to terminate (and leave us with a broken filesystem) so we
> > > could just terminate right there at the top.
> > > 
> > 
> > Indeed.
> 
> Bah, I just realized that you and I have already reviewed a lot of this
> stuff for the kernel, and apparently I never backported that. :(
> 

Ok, I thought that stuff was actually merged so I'm kind of confused at
this point. :P

> In looking at what's in the kernel now, I realized that in general,
> the xfs_btree_bload_compute_geometry function will estimate the correct
> number of blocks to reserve for the new btree, so all this code exists
> to deal with either (a) overestimates when rebuilding the free space
> btrees; or (b) the kernel encountering a runtime error (e.g. ENOMEM) and
> needing to back out everything it's done.
> 
> For repair, (a) is still a possibility.  (b) is not, since repair will
> abort, but on the other hand it'll be easier to review a patch to unify
> the two implementations if the code stays identical.
> 
> Looking even further ahead, I plan to add two more users of the bulk
> loader: rebuilders for the bmap btrees, and (even later) the realtime
> rmapbt.  It would be helpful to keep as much of the code the same
> between repair and scrub.
> 
> So for now we don't really need the ability to free an over-reservation,
> but in the longer run it will make unification more obvious.
> 

It's also easier to review code that's already been reviewed from the
kernel and is being carted over for reuse, so I think it makes sense to
keep things in sync for that reason as well.

> /me vaguely wonders if we ought to be reviewing both of these patchsets
> in parallel....
> 

Re: above. I thought that stuff was merged and the approach was to move
the code over for reuse between scrub/xfs_repair. In any event, I think
what would facilitate subsequent reviews is some explicit separation
between patches for shared code and repair-specific code as well as some
references in the cover letter for the source of the former if those
bits haven't landed in the kernel yet...

Brian

> > > > I'm also wondering if we can check error in the primary loop and kill
> > > > the label and duplicate loop, but I guess that depends on whether the
> > > > fields are always valid.
> > > 
> > > I think they are.
> > > 
> > > > > +
> > > > > +	list_for_each_entry_safe(resv, n, &xnr->reservations, list) {
> > > > > +		/* We don't have EFIs here so skip the EFD. */
> > > > > +
> > > > > +		/* Free every block we didn't use. */
> > > > > +		resv->fsbno += resv->used;
> > > > > +		resv->len -= resv->used;
> > > > > +		resv->used = 0;
> > > > > +
> > > > > +		if (resv->len > 0) {
> > > > > +			trace_xrep_newbt_unreserve_space(sc->mp,
> > > > > +					XFS_FSB_TO_AGNO(sc->mp, resv->fsbno),
> > > > > +					XFS_FSB_TO_AGBNO(sc->mp, resv->fsbno),
> > > > > +					resv->len, xnr->oinfo.oi_owner);
> > > > > +
> > > > > +			__libxfs_bmap_add_free(sc->tp, resv->fsbno, resv->len,
> > > > > +					&xnr->oinfo, true);
> > > 
> > > TBH for repair I don't even think we need this, since in theory we
> > > reserved *exactly* the correct number of blocks for the btree.  Hmm.
> > > 
> > 
> > Ok, well it would be good to settle this one way or another: remove it,
> > clean it up, or perhaps document why we wouldn't look at the resv fields
> > on error if there turns out to be a specific reason for that.
> 
> <nod>
> 
> > > > > +		}
> > > > > +
> > > > > +		list_del(&resv->list);
> > > > > +		kmem_free(resv);
> > > > > +	}
> > > > > +
> > > > > +junkit:
> > > > > +	list_for_each_entry_safe(resv, n, &xnr->reservations, list) {
> > > > > +		list_del(&resv->list);
> > > > > +		kmem_free(resv);
> > > > > +	}
> > > > > +
> > > > > +	if (sc->ip) {
> > > > > +		kmem_cache_free(xfs_ifork_zone, xnr->ifake.if_fork);
> > > > > +		xnr->ifake.if_fork = NULL;
> > > > > +	}
> > > > > +}
> > > > > +
> > > > ...
> > > > > diff --git a/repair/xfs_repair.c b/repair/xfs_repair.c
> > > > > index 9d72fa8e..8fbd3649 100644
> > > > > --- a/repair/xfs_repair.c
> > > > > +++ b/repair/xfs_repair.c
> > > > ...
> > > > > @@ -49,6 +52,8 @@ static char *o_opts[] = {
> > > > >  	[AG_STRIDE]		= "ag_stride",
> > > > >  	[FORCE_GEO]		= "force_geometry",
> > > > >  	[PHASE2_THREADS]	= "phase2_threads",
> > > > > +	[BLOAD_LEAF_SLACK]	= "debug_bload_leaf_slack",
> > > > > +	[BLOAD_NODE_SLACK]	= "debug_bload_node_slack",
> > > > 
> > > > Why the "debug_" in the option names?
> > > 
> > > These are debugging knobs; there's no reason why any normal user would
> > > want to override the automatic slack sizing algorithms.  I also
> > > refrained from documenting them in the manpage. :P
> > > 
> > 
> > Oh, Ok. Perhaps that explains why they aren't in the usage() either. ;)
> 
> Yup.
> 
> --D
> 
> > Brian
> > 
> > > However, the knobs have been useful for stress-testing w/ fstests.
> > > 
> > > --D
> > > 
> > > > Brian
> > > > 
> > > > >  	[O_MAX_OPTS]		= NULL,
> > > > >  };
> > > > >  
> > > > > @@ -260,6 +265,18 @@ process_args(int argc, char **argv)
> > > > >  		_("-o phase2_threads requires a parameter\n"));
> > > > >  					phase2_threads = (int)strtol(val, NULL, 0);
> > > > >  					break;
> > > > > +				case BLOAD_LEAF_SLACK:
> > > > > +					if (!val)
> > > > > +						do_abort(
> > > > > +		_("-o debug_bload_leaf_slack requires a parameter\n"));
> > > > > +					bload_leaf_slack = (int)strtol(val, NULL, 0);
> > > > > +					break;
> > > > > +				case BLOAD_NODE_SLACK:
> > > > > +					if (!val)
> > > > > +						do_abort(
> > > > > +		_("-o debug_bload_node_slack requires a parameter\n"));
> > > > > +					bload_node_slack = (int)strtol(val, NULL, 0);
> > > > > +					break;
> > > > >  				default:
> > > > >  					unknown('o', val);
> > > > >  					break;
> > > > > 
> > > > 
> > > 
> > 
>
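Side note on the option parsing quoted above: the new cases hand the
result of strtol(val, NULL, 0) straight to the slack variables, so a typo
such as "12x" or "abc" is accepted silently. If stricter parsing were ever
wanted for these debug knobs, a minimal sketch might look like this.
parse_slack() is a hypothetical helper, not something in the patch, and
treating -1 as the lowest acceptable value is an assumption based only on
the variables defaulting to -1.

```c
#include <errno.h>
#include <limits.h>
#include <stdlib.h>

/*
 * Hypothetical helper: parse a slack value from an option string.
 * Returns 0 and stores the value in *out on success, -1 on bad input.
 */
static int
parse_slack(const char *val, int *out)
{
	char	*end;
	long	v;

	if (!val || *val == '\0')
		return -1;

	errno = 0;
	v = strtol(val, &end, 0);	/* base 0: accepts decimal, 0x.., 0.. */
	if (errno == ERANGE || *end != '\0')
		return -1;		/* overflow or trailing garbage */

	/* Assumption: -1 is the only meaningful negative value. */
	if (v < -1 || v > INT_MAX)
		return -1;

	*out = (int)v;
	return 0;
}
```

A caller would presumably report a failed parse the same way the missing
parameter case is reported today.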