Re: [PATCH v2 18/21] xfs: cross-reference the rmapbt data with the refcountbt

"Darrick J. Wong" <darrick.wong@xxxxxxxxxx> · Tue, 16 Jan 2018 11:47:04 -0800

On Mon, Jan 15, 2018 at 10:49:55PM -0800, Darrick J. Wong wrote:
> On Tue, Jan 16, 2018 at 10:49:28AM +1100, Dave Chinner wrote:
> > On Tue, Jan 09, 2018 at 01:25:17PM -0800, Darrick J. Wong wrote:
> > > From: Darrick J. Wong <darrick.wong@xxxxxxxxxx>
> > > 
> > > Cross reference the refcount data with the rmap data to check that the
> > > number of rmaps for a given block match the refcount of that block, and
> > > that CoW blocks (which are owned entirely by the refcountbt) are tracked
> > > as well.
> > > 
> > > Signed-off-by: Darrick J. Wong <darrick.wong@xxxxxxxxxx>
> > > ---
> > > v2: streamline scrubber arguments, remove stack allocated objects
> > > ---
> > >  fs/xfs/scrub/refcount.c |  318 +++++++++++++++++++++++++++++++++++++++++++++++
> > >  1 file changed, 316 insertions(+), 2 deletions(-)
> > > 
> > > diff --git a/fs/xfs/scrub/refcount.c b/fs/xfs/scrub/refcount.c
> > > index 700f8f1..df18e47 100644
> > > --- a/fs/xfs/scrub/refcount.c
> > > +++ b/fs/xfs/scrub/refcount.c
> > > @@ -50,6 +50,274 @@ xfs_scrub_setup_ag_refcountbt(
> > >  
> > >  /* Reference count btree scrubber. */
> > >  
> > > +/*
> > > + * Confirming Reference Counts via Reverse Mappings
> > > + *
> > > + * We want to count the reverse mappings overlapping a refcount record
> > > + * (bno, len, refcount), allowing for the possibility that some of the
> > > + * overlap may come from smaller adjoining reverse mappings, while some
> > > + * comes from single extents which overlap the range entirely.  The
> > > + * outer loop is as follows:
> > > + *
> > > + * 1. For all reverse mappings overlapping the refcount extent,
> > > + *    a. If a given rmap completely overlaps, mark it as seen.
> > > + *    b. Otherwise, record the fragment for later processing.
> > > + *
> > > + * Once we've seen all the rmaps, we know that for all blocks in the
> > > + * refcount record we want to find $refcount owners and we've already
> > > + * visited $seen extents that overlap all the blocks.  Therefore, we
> > > + * need to find ($refcount - $seen) owners for every block in the
> > > + * extent; call that quantity $target_nr.  Proceed as follows:
> > > + *
> > > + * 2. Pull the first $target_nr fragments from the list; all of them
> > > + *    should start at or before the start of the extent.
> > > + *    Call this subset of fragments the working set.
> > > + * 3. Until there are no more unprocessed fragments,
> > > + *    a. Find the shortest fragments in the set and remove them.
> > > + *    b. Note the block number of the end of these fragments.
> > > + *    c. Pull the same number of fragments from the list.  All of these
> > > + *       fragments should start at the block number recorded in the
> > > + *       previous step.
> > > + *    d. Put those fragments in the set.
> > > + * 4. Check that there are $target_nr fragments remaining in the list,
> > > + *    and that they all end at or beyond the end of the refcount extent.
> > > + *
> > > + * If the refcount is correct, all the check conditions in the algorithm
> > > + * should always hold true.  If not, the refcount is incorrect.
> > 
> > This needs a comment somewhere in here describing the order of
> > the records on the fragment list. AFAICT, it's ordered by start
> > bno, but I'm not 100% sure and it seems the code is dependent on
> > strict ordering of records the frag list....
> 
> Yes, the list must be ordered by agbno, which should be the case if we
> iterated the rmap records in order.  We /could/ potentially list_sort
> to ensure that this code doesn't blow up even if the rmapbt decides to
> feed us out of order garbage.
> 
> Could?  Should.

Hm.  In theory the rmap should never feed us out of order rmaps, but if
it does that's an xref corruption, so we can employ a computationally
less expensive order check (over list_sort) and bail out if the list
is out of order.

> > .....
> > 
> > > +	} else {
> > > +		/*
> > > +		 * This rmap covers only part of the refcount record, so
> > > +		 * save the fragment for later processing.
> > > +		 */
> > > +		frag = kmem_alloc(sizeof(struct xfs_scrub_refcnt_frag),
> > > +				KM_MAYFAIL | KM_NOFS);
> > > +		if (!frag)
> > > +			return -ENOMEM;
> > > +		memcpy(&frag->rm, rec, sizeof(frag->rm));
> > > +		list_add_tail(&frag->list, &refchk->fragments);
> > 
> > I'm making the assumption here that we're seeing records in the
> > order they are in the rmap tree and that it's in increase startblock
> > order, hence the list is ordered that way....
> > 
> > > +	}
> > > +
> > > +	return 0;
> > > +}
> > > +
> > > +/*
> > > + * Given a bunch of rmap fragments, iterate through them, keeping
> > > + * a running tally of the refcount.  If this ever deviates from
> > > + * what we expect (which is the refcountbt's refcount minus the
> > > + * number of extents that totally covered the refcountbt extent),
> > > + * we have a refcountbt error.
> > > + */
> > > +STATIC void
> > > +xfs_scrub_refcountbt_process_rmap_fragments(
> > > +	struct xfs_scrub_refcnt_check	*refchk)
> > > +{
> > > +	struct list_head		worklist;
> > > +	struct xfs_scrub_refcnt_frag	*frag;
> > > +	struct xfs_scrub_refcnt_frag	*n;
> > > +	xfs_agblock_t			bno;
> > > +	xfs_agblock_t			rbno;
> > > +	xfs_agblock_t			next_rbno;
> > > +	xfs_nlink_t			nr;
> > > +	xfs_nlink_t			target_nr;
> > > +
> > > +	target_nr = refchk->refcount - refchk->seen;
> > > +	if (target_nr == 0)
> > > +		return;
> > > +
> > > +	/*
> > > +	 * There are (refchk->rc.rc_refcount - refchk->nr refcount)
> > > +	 * references we haven't found yet.  Pull that many off the
> > > +	 * fragment list and figure out where the smallest rmap ends
> > > +	 * (and therefore the next rmap should start).  All the rmaps
> > > +	 * we pull off should start at or before the beginning of the
> > > +	 * refcount record's range.
> > > +	 */
> > > +	INIT_LIST_HEAD(&worklist);
> > > +	rbno = NULLAGBLOCK;
> > > +	nr = 1;
> > > +
> > > +	/* Find all the rmaps that start at or before the refc extent. */
> > > +	list_for_each_entry_safe(frag, n, &refchk->fragments, list) {
> > > +		if (frag->rm.rm_startblock > refchk->bno)
> > > +			goto done;
> > 
> > .... and is where the code implies lowest to highest startblock
> > ordering on the frag list.
> > 
> > > +		bno = frag->rm.rm_startblock + frag->rm.rm_blockcount;
> > > +		if (rbno > bno)
> > > +			rbno = bno;
> > 
> > Can we put that check the other way around? we're looking for the
> > shortest/smallest end block, so
> > 
> > 		if (bno < rbno)
> > 			rbno = bno;
> > 
> > Makes a lot more sense to me.
> 
> Ok.
> 
> > 
> > > +		list_del(&frag->list);
> > > +		list_add_tail(&frag->list, &worklist);
> > 
> > list_move_tail()?
> > 
> > Ok, so we are moving fragments that start before the recount bno to
> > the work list.
> 
> <nod>
> 
> > > +		if (nr == target_nr)
> > > +			break;
> > > +		nr++;
> > > +	}
> > > +
> > > +	/*
> > > +	 * We should have found exactly $target_nr rmap fragments starting
> > > +	 * at or before the refcount extent.
> > > +	 */
> > > +	if (nr != target_nr)
> > > +		goto done;
> > 
> > ok. so on error we clean up and free the frag list and work list....
> > 
> > > +
> > > +	while (!list_empty(&refchk->fragments)) {
> > > +		/* Discard any fragments ending at rbno. */
> > > +		nr = 0;
> > > +		next_rbno = NULLAGBLOCK;
> > > +		list_for_each_entry_safe(frag, n, &worklist, list) {
> > 
> > Ok, this needs to be clearer than it's walking the working set
> > of fragments. I had to read this code several times before I worked
> > out this is where the "working set" was being processed....
> 
> /* Walk the working set of rmap fragments... */

Correction: just update the comment to:
/* Discard any fragments ending at rbno from the worklist. */

--D

> 
> > 
> > > +			bno = frag->rm.rm_startblock + frag->rm.rm_blockcount;
> > > +			if (bno != rbno) {
> > > +				if (next_rbno > bno)
> > > +					next_rbno = bno;
> > 
> > Same comment here about being a next_rbno "smallest bno" variable.
> 
> <nod>
> 
> > > +				continue;
> > > +			}
> > > +			list_del(&frag->list);
> > > +			kmem_free(frag);
> > > +			nr++;
> > > +		}
> > > +
> > > +		/* Empty list?  We're done. */
> > > +		if (list_empty(&refchk->fragments))
> > > +			break;
> > > +
> > > +		/* Try to add nr rmaps starting at rbno to the worklist. */
> > > +		list_for_each_entry_safe(frag, n, &refchk->fragments, list) {
> > > +			bno = frag->rm.rm_startblock + frag->rm.rm_blockcount;
> > > +			if (frag->rm.rm_startblock != rbno)
> > > +				goto done;
> > 
> > definitely assuming the frag list is ordered here :P
> > 
> > > +			list_del(&frag->list);
> > > +			list_add_tail(&frag->list, &worklist);
> > > +			if (next_rbno > bno)
> > > +				next_rbno = bno;
> > > +			nr--;
> > > +			if (nr == 0)
> > > +				break;
> > > +		}
> > 
> > Ok, so if we get here with nr > 0, then we must have emptied the
> > fragment list onto the work list, right? At this point, the outer
> > loop will terminate. Don't we need to run the worklist processing
> > loop one last time?
> 
> Nope.  Throughout xfs_scrub_refcountbt_process_rmap_fragments, we're
> checking that the number of rmaps for a given refcount extent (i.e.
> target_nr) remains the same.  nr is the number of rmaps we discarded
> from the worklist at the top of the loop, so if we can't add the exact
> same number of rmaps back to the worklist then we know that the refcount
> is wrong.
> 
> So there /is/ a bug here, and the bug is that we need "if (nr) break;"
> because we don't need to process more of the loop, we already know the
> refcountbt is broken.
> 
> > > +		rbno = next_rbno;
> > > +	}
> > 
> > .....
> > > +/* Use the rmap entries covering this extent to verify the refcount. */
> > > +STATIC void
> > > +xfs_scrub_refcountbt_xref_rmap(
> > > +	struct xfs_scrub_context	*sc,
> > > +	xfs_agblock_t			bno,
> > > +	xfs_extlen_t			len,
> > > +	xfs_nlink_t			refcount)
> > > +{
> > > +	struct xfs_scrub_refcnt_check	refchk = {
> > > +		.sc = sc,
> > > +		.bno = bno,
> > > +		.len = len,
> > > +		.refcount = refcount,
> > > +		.seen = 0,
> > > +	};
> > > +	struct xfs_rmap_irec		low;
> > > +	struct xfs_rmap_irec		high;
> > > +	struct xfs_scrub_refcnt_frag	*frag;
> > > +	struct xfs_scrub_refcnt_frag	*n;
> > > +	int				error;
> > > +
> > > +	if (!sc->sa.rmap_cur)
> > > +		return;
> > > +
> > > +	/* Cross-reference with the rmapbt to confirm the refcount. */
> > > +	memset(&low, 0, sizeof(low));
> > > +	low.rm_startblock = bno;
> > > +	memset(&high, 0xFF, sizeof(high));
> > > +	high.rm_startblock = bno + len - 1;
> > 
> > This range query init feels like a familiar pattern now. Helper
> > function (separate patch)?
> 
> Yeah, these can all get cleaned up at the end of the series.
> 
> > ....
> > 
> > > +/* Make sure we have as many refc blocks as the rmap says. */
> > > +STATIC void
> > > +xfs_scrub_refcount_xref_rmap(
> > > +	struct xfs_scrub_context	*sc,
> > > +	struct xfs_owner_info		*oinfo,
> > > +	xfs_filblks_t			cow_blocks)
> > > +{
> > > +	xfs_extlen_t			refcbt_blocks = 0;
> > > +	xfs_filblks_t			blocks;
> > > +	int				error;
> > > +
> > > +	/* Check that we saw as many refcbt blocks as the rmap knows about. */
> > > +	error = xfs_btree_count_blocks(sc->sa.refc_cur, &refcbt_blocks);
> > > +	if (!xfs_scrub_btree_process_error(sc, sc->sa.refc_cur, 0, &error))
> > > +		return;
> > > +	error = xfs_scrub_count_rmap_ownedby_ag(sc, sc->sa.rmap_cur, oinfo,
> > > +			&blocks);
> > > +	if (xfs_scrub_should_check_xref(sc, &error, &sc->sa.rmap_cur) &&
> > > +	    blocks != refcbt_blocks)
> > > +		xfs_scrub_btree_xref_set_corrupt(sc, sc->sa.rmap_cur, 0);
> > > +
> > > +	if (!sc->sa.rmap_cur)
> > > +		return;
> > > +
> > > +	/* Check that we saw as many cow blocks as the rmap knows about. */
> > > +	xfs_rmap_ag_owner(oinfo, XFS_RMAP_OWN_COW);
> > > +	error = xfs_scrub_count_rmap_ownedby_ag(sc, sc->sa.rmap_cur, oinfo,
> > > +			&blocks);
> > > +	if (xfs_scrub_should_check_xref(sc, &error, &sc->sa.rmap_cur) &&
> > > +	    blocks != cow_blocks)
> > > +		xfs_scrub_btree_xref_set_corrupt(sc, sc->sa.rmap_cur, 0);
> > 
> > Bit of a landmine that this code changes the owner info structure
> > that was passed in....
> 
> Will fix.
> 
> > > +}
> > > +
> > >  /* Scrub the refcount btree for some AG. */
> > >  int
> > >  xfs_scrub_refcountbt(
> > >  	struct xfs_scrub_context	*sc)
> > >  {
> > >  	struct xfs_owner_info		oinfo;
> > > +	xfs_agblock_t			cow_blocks = 0;
> > > +	int				error;
> > >  
> > >  	xfs_rmap_ag_owner(&oinfo, XFS_RMAP_OWN_REFC);
> > > -	return xfs_scrub_btree(sc, sc->sa.refc_cur, xfs_scrub_refcountbt_rec,
> > > -			&oinfo, NULL);
> > > +	error = xfs_scrub_btree(sc, sc->sa.refc_cur, xfs_scrub_refcountbt_rec,
> > > +			&oinfo, &cow_blocks);
> > > +	if (error)
> > > +		return error;
> > > +
> > > +	if (sc->sa.rmap_cur)
> > > +		xfs_scrub_refcount_xref_rmap(sc, &oinfo, cow_blocks);
> > 
> > .... because that's not obvious in this code here if we add anymore
> > code after this call.
> 
> <nod>
> 
> > > +
> > > +	return error;
> > 
> > error is zero here, so "return 0" instead?
> 
> Ok.
> 
> --D
> 
> > 
> > Cheers,
> > 
> > Dave.
> > 
> > -- 
> > Dave Chinner
> > david@xxxxxxxxxxxxx
> > --
> > To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> > the body of a message to majordomo@xxxxxxxxxxxxxxx
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> --
> To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> the body of a message to majordomo@xxxxxxxxxxxxxxx
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html