Re: [PATCH 2/3] xfs: fix an inode lookup race in xchk_get_inode

Dave Chinner <david@xxxxxxxxxxxxx> · Tue, 15 Nov 2022 14:49:54 +1100

On Sun, Oct 02, 2022 at 11:20:29AM -0700, Darrick J. Wong wrote:
> From: Darrick J. Wong <djwong@xxxxxxxxxx>
> 
> In commit d658e, we tried to improve the robustnes of xchk_get_inode in
> the face of EINVAL returns from iget by calling xfs_imap to see if the
> inobt itself thinks that the inode is allocated.  Unfortunately, that
> commit didn't consider the possibility that the inode gets allocated
> after iget but before imap.  In this case, the imap call will succeed,
> but we turn that into a corruption error and tell userspace the inode is
> corrupt.
> 
> Avoid this false corruption report by grabbing the AGI header and
> retrying the iget before calling imap.  If the iget succeeds, we can
> proceed with the usual scrub-by-handle code.  Fix all the incorrect
> comments too, since unreadable/corrupt inodes no longer result in EINVAL
> returns.
> 
> Fixes: d658e72b4a09 ("xfs: distinguish between corrupt inode and invalid inum in xfs_scrub_get_inode")
> Signed-off-by: Darrick J. Wong <djwong@xxxxxxxxxx>

OK.

> ---
>  fs/xfs/scrub/common.c |  203 ++++++++++++++++++++++++++++++++++++++++---------
>  fs/xfs/scrub/common.h |    4 +
>  fs/xfs/xfs_icache.c   |    3 -
>  fs/xfs/xfs_icache.h   |    1 
>  4 files changed, 172 insertions(+), 39 deletions(-)
> 
> 
> diff --git a/fs/xfs/scrub/common.c b/fs/xfs/scrub/common.c
> index 42a25488bd25..9a811c5fa8f7 100644
> --- a/fs/xfs/scrub/common.c
> +++ b/fs/xfs/scrub/common.c
> @@ -635,6 +635,14 @@ xchk_ag_init(
>  
>  /* Per-scrubber setup functions */
>  
> +void
> +xchk_trans_cancel(
> +	struct xfs_scrub	*sc)
> +{
> +	xfs_trans_cancel(sc->tp);
> +	sc->tp = NULL;
> +}
> +
>  /*
>   * Grab an empty transaction so that we can re-grab locked buffers if
>   * one of our btrees turns out to be cyclic.
> @@ -720,6 +728,80 @@ xchk_iget(
>  	return xfs_iget(sc->mp, sc->tp, inum, XFS_IGET_UNTRUSTED, 0, ipp);
>  }
>  
> +/*
> + * Try to grab an inode.  If we can't, return the AGI buffer so that the caller
> + * can single-step the loading process to see where things went wrong.

"Try to grab an inode in a manner that avoids races with physical inode
allocation. If we can't, return the locked AGI buffer so that the
caller can single-step the loading process to see where things went
wrong."

> + *
> + * If the iget succeeds, return 0, a NULL AGI, and the inode.
> + * If the iget fails, return the error, the AGI, and a NULL inode.  This can

"... the locked AGI, ...."

> + * include -EINVAL and -ENOENT for invalid inode numbers or inodes that are no
> + * longer allocated; or any other corruption or runtime error.
> + * If the AGI read fails, return the error, a NULL AGI, and NULL inode.
> + * If a fatal signal is pending, return -EINTR, a NULL AGI, and a NULL inode.
> + */
> +int
> +xchk_iget_agi(
> +	struct xfs_scrub	*sc,
> +	xfs_ino_t		inum,
> +	struct xfs_buf		**agi_bpp,
> +	struct xfs_inode	**ipp)
> +{
> +	struct xfs_mount	*mp = sc->mp;
> +	struct xfs_trans	*tp = sc->tp;
> +	struct xfs_perag	*pag;
> +	int			error;
> +
> +again:
> +	*agi_bpp = NULL;
> +	*ipp = NULL;
> +	error = 0;
> +
> +	if (xchk_should_terminate(sc, &error))
> +		return error;
> +
> +	pag = xfs_perag_get(mp, XFS_INO_TO_AGNO(mp, inum));
> +	error = xfs_ialloc_read_agi(pag, tp, agi_bpp);
> +	xfs_perag_put(pag);
> +	if (error)
> +		return error;
> +
> +	error = xfs_iget(mp, tp, inum,
> +			XFS_IGET_NOWAIT | XFS_IGET_UNTRUSTED, 0, ipp);

OK, IGET_NOWAIT is new....

> +#define XFS_IGET_NOWAIT		0x10	/* return EAGAIN instead of blocking */

But it doesn't prevent blocking. XFS_IGET_UNTRUSTED means we do a
inobt record lookup (btree buffer locking and IO that can block) as
well as reading the inode cluster from disk if it's not already in
cache. Hence this isn't what I'd call a "non-blocking" or "no wait"
operation. 

AFAICT, what is needed here is avoiding the -retry mechanism- of the
lookup, not "non blocking/no wait" semantics. i.e. this seems
reasonable to get an -EAGAIN error instead of delaying and trying
again if we are using XFS_IGET_NORETRY semantics....

> +	if (error == -EAGAIN) {
> +		/*
> +		 * Cached inode awaiting inactivation.  Drop the AGI buffer in
> +		 * case the inactivation worker is now waiting for it, and try
> +		 * the iget again.
> +		 */

That's not the only reason xfs_iget() could return -EAGAIN,
right? radix_tree_preload() failing can cause -EAGAIN to be returned
from xfs_iget_cache_miss(), as can an instantiation race inserting
the new inode into the radix tree. The cache hit path has a bunch
more cases, too. Perhaps:

		/*
		 * The inode may be in core but temporarily
		 * unavailable and may require the AGI buffer before
		 * it can be returned. Drop the AGI buffer and retry
		 * the lookup.
		 */

> +		xfs_trans_brelse(tp, *agi_bpp);
> +		delay(1);
> +		goto again;
> +	}
> +	if (error == 0) {
> +		/* We got the inode, so we can release the AGI. */
> +		ASSERT(*ipp != NULL);
> +		xfs_trans_brelse(tp, *agi_bpp);
> +		*agi_bpp = NULL;
> +	}
> +
> +	return error;

That's better written as:

	if (error)
		return error;

	/* We got the inode, so we can release the AGI. */
	ASSERT(*ipp != NULL);
	xfs_trans_brelse(tp, *agi_bpp);
	*agi_bpp = NULL;
	return 0;

Other than that, the code makes sense.

-Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx