Re: [PATCH 05/22] xfs: scrub in-memory metadata buffers

"Darrick J. Wong" <darrick.wong@xxxxxxxxxx> · Mon, 24 Jul 2017 15:36:54 -0700

On Mon, Jul 24, 2017 at 11:43:27AM +1000, Dave Chinner wrote:
> On Thu, Jul 20, 2017 at 09:39:00PM -0700, Darrick J. Wong wrote:
> > From: Darrick J. Wong <darrick.wong@xxxxxxxxxx>
> > 
> > Call the verifier function for all in-memory metadata buffers, looking
> > for memory corruption either due to bad memory or coding bugs.
> 
> How does this fit into the bigger picture? We can't do an exhaustive
> search of the in memory buffer cache, because access is racy w.r.t.
> the life cycle of in memory buffers.
> 
> Also, if we are doing a full scrub, we're going to hit and then
> check the cached in-memory buffers anyway, so I'm missing the
> context that explains why this code is necessary.

Before we start scanning the filesystem (which could lead to clean
buffers being pushed out of memory and later reread), we want to check
the buffers that have been sitting around in memory to see if they've
mutated since the last time the verifiers ran.

> >  #endif	/* __XFS_REPAIR_COMMON_H__ */
> > diff --git a/fs/xfs/scrub/metabufs.c b/fs/xfs/scrub/metabufs.c
> > new file mode 100644
> > index 0000000..63faaa6
> > --- /dev/null
> > +++ b/fs/xfs/scrub/metabufs.c
> > @@ -0,0 +1,177 @@
> > +/*
> > + * Copyright (C) 2017 Oracle.  All Rights Reserved.
> > + *
> > + * Author: Darrick J. Wong <darrick.wong@xxxxxxxxxx>
> > + *
> > + * This program is free software; you can redistribute it and/or
> > + * modify it under the terms of the GNU General Public License
> > + * as published by the Free Software Foundation; either version 2
> > + * of the License, or (at your option) any later version.
> > + *
> > + * This program is distributed in the hope that it would be useful,
> > + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> > + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> > + * GNU General Public License for more details.
> > + *
> > + * You should have received a copy of the GNU General Public License
> > + * along with this program; if not, write the Free Software Foundation,
> > + * Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301, USA.
> > + */
> > +#include "xfs.h"
> > +#include "xfs_fs.h"
> > +#include "xfs_shared.h"
> > +#include "xfs_format.h"
> > +#include "xfs_trans_resv.h"
> > +#include "xfs_mount.h"
> > +#include "xfs_defer.h"
> > +#include "xfs_btree.h"
> > +#include "xfs_bit.h"
> > +#include "xfs_log_format.h"
> > +#include "xfs_trans.h"
> > +#include "xfs_trace.h"
> > +#include "xfs_sb.h"
> > +#include "scrub/common.h"
> > +
> > +/* We only iterate buffers one by one, so we don't need any setup. */
> > +int
> > +xfs_scrub_setup_metabufs(
> > +	struct xfs_scrub_context	*sc,
> > +	struct xfs_inode		*ip)
> > +{
> > +	return 0;
> > +}
> > +
> > +#define XFS_SCRUB_METABUFS_TOO_MANY_RETRIES	10
> > +struct xfs_scrub_metabufs_info {
> > +	struct xfs_scrub_context	*sc;
> > +	unsigned int			retries;
> > +};
> 
> So do we get 10 retries per buffer, or 10 retries across an entire
> scan?

Ten per scan.

> > +/* In-memory buffer corruption. */
> > +
> > +#define XFS_SCRUB_BUF_OP_ERROR_GOTO(label) \
> > +	XFS_SCRUB_OP_ERROR_GOTO(smi->sc, \
> > +			xfs_daddr_to_agno(smi->sc->mp, bp->b_bn), \
> > +			xfs_daddr_to_agbno(smi->sc->mp, bp->b_bn), "buf", \
> > +			&error, label)
> 
> Nested macros - yuck!
> 
> > +STATIC int
> > +xfs_scrub_metabufs_scrub_buf(
> > +	struct xfs_scrub_metabufs_info	*smi,
> > +	struct xfs_buf			*bp)
> > +{
> > +	int				olderror;
> > +	int				error = 0;
> > +
> > +	/*
> > +	 * We hold the rcu lock during the rhashtable walk, so we can't risk
> > +	 * having the log forced due to a stale buffer by xfs_buf_lock.
> > +	 */
> > +	if (bp->b_flags & XBF_STALE)
> > +		return 0;
> > +
> > +	atomic_inc(&bp->b_hold);
> 
> This looks wrong. I think it can race with reclaim because we don't
> hold the pag->pag_buf_lock. i.e.  xfs_buf_rele() does this:
> 
> 	release = atomic_dec_and_lock(&bp->b_hold, &pag->pag_buf_lock);
> 
> to prevent lookups - which are done under the pag->pag_buf_lock -
> from finding the buffer while it has a zero hold count and may be
> removed from the cache and freed.

I could be misunderstanding rhashtable here -- as I understand it, the
rhashtable_walk_start function calls rcu_read_lock and doesn't release
it until we call rhashtable_walk_stop.  The rhashtable lookup, insert,
and remove functions each call rcu_read_lock before fidding with the
hashtable internals.  I /think/ this means that even if the scrubber
gets ahold of a buffer with zero b_hold that's being xfs_buf_rele'd,
that concurrent xfs_buf_rele will be waiting for rcu_read_lock, and
therefore the buffer cannot be freed until the walk stops.

> Further, if we are going to iterate the cache, I'd much prefer the
> iteration code to be in fs/xfs/xfs_buf.c - nothing outside that file
> should be touching core buffer cache structures....
> 
> FWIW, the LRU walks already handle this problem by the fact the LRU
> owns a hold count on the buffer. So it may be better to do this via
> a LRU walk rather than a hashtable walk....

Yeah, once we clear up the above I'll move it to xfs_buf.c.

> [snip]
> 
> > +	olderror = bp->b_error;
> > +	if (bp->b_fspriv)
> > +		bp->b_ops->verify_write(bp);
> 
> Should we be recalculating the CRC on buffers we aren't about to 
> be writing to disk?

I don't think it causes any harm to recalculate the crc early.  If the
buffer is dirty and corrupt we can't fix it and write out will flag it
and shut down the fs anyway, so it likely doesn't matter anyway.

> Should we be verifying a buffer that has a non-zero error value on it?

No.

> > +	else
> > +		bp->b_ops->verify_read(bp);
> > +	error = bp->b_error;
> > +	bp->b_error = olderror;
> > +
> > +	/* Mark any corruption errors we might find. */
> > +	XFS_SCRUB_BUF_OP_ERROR_GOTO(out_unlock);
> 
> Ah, what? Why does this need a goto? And why doesn't it report the
> error that was found? (bloody macros!).
> 
> > +out_unlock:
> > +	xfs_buf_unlock(bp);
> > +out_dec:
> > +	atomic_dec(&bp->b_hold);
> > +	return error;
> > +}
> > +#undef XFS_SCRUB_BUF_OP_ERROR_GOTO
> 
> Oh, that's the macro defined above the function. Which I paid little
> attention to other than it called another macro. Now I realise that
> it (ab)uses local variables without them being passed into the
> macro. Yup, another reason we need to get rid of the macros from
> this code....

Ok.

> > +	struct xfs_scrub_metabufs_info	*smi,
> > +	struct rhashtable_iter		*iter)
> > +{
> > +	struct xfs_buf			*bp;
> > +	int				error = 0;
> > +
> > +	do {
> > +		if (xfs_scrub_should_terminate(&error))
> > +			break;
> > +
> > +		bp = rhashtable_walk_next(iter);
> > +		if (IS_ERR(bp))
> > +			return PTR_ERR(bp);
> > +		else if (bp == NULL)
> > +			return 0;
> > +
> > +		error = xfs_scrub_metabufs_scrub_buf(smi, bp);
> > +	} while (error != 0);
> > +
> > +	return error;
> > +}
> > +
> > +/* Try to walk the buffers in this AG in order to scrub them. */
> > +int
> > +xfs_scrub_metabufs(
> 
> Ah, please put an "_ag_" in this so it's clear it's only scrubbing a
> single AG. This is hidden deep inside the scrub context, so it took
> me a little bit of back tracking to understand that this wasn't a
> global scan....

Ok.

--D

> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@xxxxxxxxxxxxx
> --
> To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> the body of a message to majordomo@xxxxxxxxxxxxxxx
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html