Re: [PATCH 01/30] xfs: Don't allow logging of XFS_ISTALE inodes

"Darrick J. Wong" <darrick.wong@xxxxxxxxxx> · Mon, 1 Jun 2020 21:30:52 -0700

On Tue, Jun 02, 2020 at 07:42:22AM +1000, Dave Chinner wrote:
> From: Dave Chinner <dchinner@xxxxxxxxxx>
> 
> In tracking down a problem in this patchset, I discovered we are
> reclaiming dirty stale inodes. This wasn't discovered until inodes
> were always attached to the cluster buffer and then the rcu callback
> that freed inodes was assert failing because the inode still had an
> active pointer to the cluster buffer after it had been reclaimed.
> 
> Debugging the issue indicated that this was a pre-existing issue
> resulting from the way the inodes are handled in xfs_inactive_ifree.
> When we free a cluster buffer from xfs_ifree_cluster, all the inodes
> in cache are marked XFS_ISTALE. Those that are clean have nothing
> else done to them and so eventually get cleaned up by background
> reclaim. i.e. it is assumed we'll never dirty/relog an inode marked
> XFS_ISTALE.
> 
> On journal commit dirty stale inodes as are handled by both
> buffer and inode log items to run though xfs_istale_done() and
> removed from the AIL (buffer log item commit) or the log item will
> simply unpin it because the buffer log item will clean it. What happens
> to any specific inode is entirely dependent on which log item wins
> the commit race, but the result is the same - stale inodes are
> clean, not attached to the cluster buffer, and not in the AIL. Hence
> inode reclaim can just free these inodes without further care.
> 
> However, if the stale inode is relogged, it gets dirtied again and
> relogged into the CIL. Most of the time this isn't an issue, because
> relogging simply changes the inode's location in the current
> checkpoint. Problems arise, however, when the CIL checkpoints
> between two transactions in the xfs_inactive_ifree() deferops
> processing. This results in the XFS_ISTALE inode being redirtied
> and inserted into the CIL without any of the other stale cluster
> buffer infrastructure being in place.
> 
> Hence on journal commit, it simply gets unpinned, so it remains
> dirty in memory. Everything in inode writeback avoids XFS_ISTALE
> inodes so it can't be written back, and it is not tracked in the AIL
> so there's not even a trigger to attempt to clean the inode. Hence
> the inode just sits dirty in memory until inode reclaim comes along,
> sees that it is XFS_ISTALE, and goes to reclaim it. This reclaiming
> of a dirty inode caused use after free, list corruptions and other
> nasty issues later in this patchset.
> 
> Hence this patch addresses a violation of the "never log XFS_ISTALE
> inodes" caused by the deferops processing rolling a transaction
> and relogging a stale inode in xfs_inactive_free. It also adds a
> bunch of asserts to catch this problem in debug kernels so that
> we don't reintroduce this problem in future.
> 
> Reproducer for this issue was generic/558 on a v4 filesystem.
> 
> Signed-off-by: Dave Chinner <dchinner@xxxxxxxxxx>
> ---
>  fs/xfs/libxfs/xfs_trans_inode.c |  2 ++
>  fs/xfs/xfs_icache.c             |  3 ++-
>  fs/xfs/xfs_inode.c              | 25 ++++++++++++++++++++++---
>  3 files changed, 26 insertions(+), 4 deletions(-)
> 
> diff --git a/fs/xfs/libxfs/xfs_trans_inode.c b/fs/xfs/libxfs/xfs_trans_inode.c
> index b5dfb66548422..4504d215cd590 100644
> --- a/fs/xfs/libxfs/xfs_trans_inode.c
> +++ b/fs/xfs/libxfs/xfs_trans_inode.c
> @@ -36,6 +36,7 @@ xfs_trans_ijoin(
>  
>  	ASSERT(iip->ili_lock_flags == 0);
>  	iip->ili_lock_flags = lock_flags;
> +	ASSERT(!xfs_iflags_test(ip, XFS_ISTALE));
>  
>  	/*
>  	 * Get a log_item_desc to point at the new item.
> @@ -89,6 +90,7 @@ xfs_trans_log_inode(
>  
>  	ASSERT(ip->i_itemp != NULL);
>  	ASSERT(xfs_isilocked(ip, XFS_ILOCK_EXCL));
> +	ASSERT(!xfs_iflags_test(ip, XFS_ISTALE));
>  
>  	/*
>  	 * Don't bother with i_lock for the I_DIRTY_TIME check here, as races
> diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
> index 0a5ac6f9a5834..dbba4c1946386 100644
> --- a/fs/xfs/xfs_icache.c
> +++ b/fs/xfs/xfs_icache.c
> @@ -1141,7 +1141,7 @@ xfs_reclaim_inode(
>  			goto out_ifunlock;
>  		xfs_iunpin_wait(ip);
>  	}
> -	if (xfs_iflags_test(ip, XFS_ISTALE) || xfs_inode_clean(ip)) {
> +	if (xfs_inode_clean(ip)) {
>  		xfs_ifunlock(ip);
>  		goto reclaim;
>  	}
> @@ -1228,6 +1228,7 @@ xfs_reclaim_inode(
>  	xfs_ilock(ip, XFS_ILOCK_EXCL);
>  	xfs_qm_dqdetach(ip);
>  	xfs_iunlock(ip, XFS_ILOCK_EXCL);
> +	ASSERT(xfs_inode_clean(ip));
>  
>  	__xfs_inode_free(ip);
>  	return error;
> diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
> index 64f5f9a440aed..53a1d64782c35 100644
> --- a/fs/xfs/xfs_inode.c
> +++ b/fs/xfs/xfs_inode.c
> @@ -1740,10 +1740,31 @@ xfs_inactive_ifree(
>  		return error;
>  	}
>  
> +	/*
> +	 * We do not hold the inode locked across the entire rolling transaction
> +	 * here. We only need to hold it for the first transaction that
> +	 * xfs_ifree() builds, which may mark the inode XFS_ISTALE if the
> +	 * underlying cluster buffer is freed. Relogging an XFS_ISTALE inode
> +	 * here breaks the relationship between cluster buffer invalidation and
> +	 * stale inode invalidation on cluster buffer item journal commit
> +	 * completion, and can result in leaving dirty stale inodes hanging
> +	 * around in memory.
> +	 *
> +	 * We have no need for serialising this inode operation against other
> +	 * operations - we freed the inode and hence reallocation is required
> +	 * and that will serialise on reallocating the space the deferops need
> +	 * to free. Hence we can unlock the inode on the first commit of
> +	 * the transaction rather than roll it right through the deferops. This
> +	 * avoids relogging the XFS_ISTALE inode.

Hmm.  What defer ops causes a transaction roll?  Is it the EFI that
frees the inode cluster blocks?

> +	 *
> +	 * We check that xfs_ifree() hasn't grown an internal transaction roll
> +	 * by asserting that the inode is still locked when it returns.
> +	 */
>  	xfs_ilock(ip, XFS_ILOCK_EXCL);
> -	xfs_trans_ijoin(tp, ip, 0);
> +	xfs_trans_ijoin(tp, ip, XFS_ILOCK_EXCL);

This looks right to me since we should be marking the inode free in the
first transaction and therefore should not keep it attached to the
transaction...

Reviewed-by: Darrick J. Wong <darrick.wong@xxxxxxxxxx>

--D

>  	error = xfs_ifree(tp, ip);
> +	ASSERT(xfs_isilocked(ip, XFS_ILOCK_EXCL));
>  	if (error) {
>  		/*
>  		 * If we fail to free the inode, shut down.  The cancel
> @@ -1756,7 +1777,6 @@ xfs_inactive_ifree(
>  			xfs_force_shutdown(mp, SHUTDOWN_META_IO_ERROR);
>  		}
>  		xfs_trans_cancel(tp);
> -		xfs_iunlock(ip, XFS_ILOCK_EXCL);
>  		return error;
>  	}
>  
> @@ -1774,7 +1794,6 @@ xfs_inactive_ifree(
>  		xfs_notice(mp, "%s: xfs_trans_commit returned error %d",
>  			__func__, error);
>  
> -	xfs_iunlock(ip, XFS_ILOCK_EXCL);
>  	return 0;
>  }
>  
> -- 
> 2.26.2.761.g0e0b3e54be
>