On Tue, Jan 10, 2017 at 11:33:05AM +0100, gregkh@xxxxxxxxxxxxxxxxxxx wrote:
> 
> This is a note to let you know that I've just added the patch titled
> 
>     xfs: fix unbalanced inode reclaim flush locking
> 
> to the 4.9-stable tree which can be found at:
>     http://www.kernel.org/git/?p=linux/kernel/git/stable/stable-queue.git;a=summary
> 
> The filename of the patch is:
>     xfs-fix-unbalanced-inode-reclaim-flush-locking.patch
> and it can be found in the queue-4.9 subdirectory.
> 
> If you, or anyone else, feels it should not be added to the stable tree,
> please let <stable@xxxxxxxxxxxxxxx> know about it.

I found this bug after a patch backported from upstream to the RHEL-6
kernel introduced many regressions: a number of xfstests cases would
sometimes hang or panic, and this patch fixed them. So I think it's a
useful patch, even though not every released kernel will hit these
regressions.

Thanks,
Zorro

> 
> 
> From hch@xxxxxx Tue Jan 10 11:23:57 2017
> From: Christoph Hellwig <hch@xxxxxx>
> Date: Mon, 9 Jan 2017 16:38:38 +0100
> Subject: xfs: fix unbalanced inode reclaim flush locking
> To: stable@xxxxxxxxxxxxxxx
> Cc: linux-xfs@xxxxxxxxxxxxxxx, Brian Foster <bfoster@xxxxxxxxxx>, Dave Chinner <david@xxxxxxxxxxxxx>
> Message-ID: <1483976343-661-8-git-send-email-hch@xxxxxx>
> 
> 
> From: Brian Foster <bfoster@xxxxxxxxxx>
> 
> commit 98efe8af1c9ffac47e842b7a75ded903e2f028da upstream.
> 
> Filesystem shutdown testing on an older distro kernel has uncovered an
> imbalanced locking pattern for the inode flush lock in
> xfs_reclaim_inode(). Specifically, there is a double unlock sequence
> between the call to xfs_iflush_abort() and xfs_reclaim_inode() at the
> "reclaim:" label.
> 
> This actually does not cause obvious problems on current kernels due to
> the current flush lock implementation. Older kernels use a counting
> based flush lock mechanism, however, which effectively breaks the lock
> indefinitely when an already unlocked flush lock is repeatedly unlocked.
> Though this only currently occurs on filesystem shutdown, it has
> reproduced the effect of elevating an fs shutdown to a system-wide crash
> or hang.
> 
> As it turns out, the flush lock is not actually required for the reclaim
> logic in xfs_reclaim_inode() because by that time we have already cycled
> the flush lock once while holding ILOCK_EXCL. Therefore, remove the
> additional flush lock/unlock cycle around the 'reclaim:' label and
> update branches into this label to release the flush lock where
> appropriate. Add an assert to xfs_ifunlock() to help prevent future
> occurrences of the same problem.
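
A side note on the failure mode described above, for anyone reading
along: on those older kernels the flush lock is a counting
(completion-based) construct, so an unbalanced unlock does not fail
loudly; it silently makes the lock available more than once. The
sketch below is a hypothetical userspace simplification of such a
counting lock, not the real XFS implementation, but it shows why the
lock stays broken indefinitely once the count is pushed past its
"unlocked" value (a toy model of the full reclaim-path invariant
follows at the end of this mail):

    #include <assert.h>

    /* Toy counting lock: count == 1 means unlocked, 0 means held. */
    struct flush_lock {
            int count;
    };

    static void flock_init(struct flush_lock *fl)
    {
            fl->count = 1;
    }

    static int flock_trylock(struct flush_lock *fl)
    {
            if (fl->count > 0) {
                    fl->count--;            /* 1 -> 0: lock acquired */
                    return 1;
            }
            return 0;
    }

    static void flock_unlock(struct flush_lock *fl)
    {
            fl->count++;                    /* 0 -> 1: lock released */
    }

    int main(void)
    {
            struct flush_lock fl;

            flock_init(&fl);
            assert(flock_trylock(&fl));
            flock_unlock(&fl);
            flock_unlock(&fl);      /* unbalanced unlock: count is now 2 */

            /* Mutual exclusion is now silently lost: two acquirers both
             * "succeed", and the lock stays broken until reinitialised. */
            assert(flock_trylock(&fl));
            assert(flock_trylock(&fl));
            return 0;
    }

On current kernels xfs_ifunlock() just clears the XFS_IFLOCK bit (see
the xfs_inode.h hunk below), so the same double unlock is mostly
silent there, which is exactly why the new ASSERT() is worth having.
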
> 
> Reported-by: Zorro Lang <zlang@xxxxxxxxxx>
> Signed-off-by: Brian Foster <bfoster@xxxxxxxxxx>
> Reviewed-by: Dave Chinner <dchinner@xxxxxxxxxx>
> Signed-off-by: Dave Chinner <david@xxxxxxxxxxxxx>
> Cc: Christoph Hellwig <hch@xxxxxx>
> Signed-off-by: Greg Kroah-Hartman <gregkh@xxxxxxxxxxxxxxxxxxx>
> ---
>  fs/xfs/xfs_icache.c |   27 ++++++++++++++-------------
>  fs/xfs/xfs_inode.h  |   11 ++++++-----
>  2 files changed, 20 insertions(+), 18 deletions(-)
> 
> --- a/fs/xfs/xfs_icache.c
> +++ b/fs/xfs/xfs_icache.c
> @@ -123,7 +123,6 @@ __xfs_inode_free(
>  {
>  	/* asserts to verify all state is correct here */
>  	ASSERT(atomic_read(&ip->i_pincount) == 0);
> -	ASSERT(!xfs_isiflocked(ip));
>  	XFS_STATS_DEC(ip->i_mount, vn_active);
>  
>  	call_rcu(&VFS_I(ip)->i_rcu, xfs_inode_free_callback);
> @@ -133,6 +132,8 @@ void
>  xfs_inode_free(
>  	struct xfs_inode	*ip)
>  {
> +	ASSERT(!xfs_isiflocked(ip));
> +
>  	/*
>  	 * Because we use RCU freeing we need to ensure the inode always
>  	 * appears to be reclaimed with an invalid inode number when in the
> @@ -981,6 +982,7 @@ restart:
>  
>  		if (XFS_FORCED_SHUTDOWN(ip->i_mount)) {
>  			xfs_iunpin_wait(ip);
> +			/* xfs_iflush_abort() drops the flush lock */
>  			xfs_iflush_abort(ip, false);
>  			goto reclaim;
>  		}
> @@ -989,10 +991,10 @@ restart:
>  			goto out_ifunlock;
>  		xfs_iunpin_wait(ip);
>  	}
> -	if (xfs_iflags_test(ip, XFS_ISTALE))
> -		goto reclaim;
> -	if (xfs_inode_clean(ip))
> +	if (xfs_iflags_test(ip, XFS_ISTALE) || xfs_inode_clean(ip)) {
> +		xfs_ifunlock(ip);
>  		goto reclaim;
> +	}
>  
>  	/*
>  	 * Never flush out dirty data during non-blocking reclaim, as it would
> @@ -1030,25 +1032,24 @@ restart:
>  		xfs_buf_relse(bp);
>  	}
>  
> -	xfs_iflock(ip);
>  reclaim:
> +	ASSERT(!xfs_isiflocked(ip));
> +
>  	/*
>  	 * Because we use RCU freeing we need to ensure the inode always appears
>  	 * to be reclaimed with an invalid inode number when in the free state.
> -	 * We do this as early as possible under the ILOCK and flush lock so
> -	 * that xfs_iflush_cluster() can be guaranteed to detect races with us
> -	 * here. By doing this, we guarantee that once xfs_iflush_cluster has
> -	 * locked both the XFS_ILOCK and the flush lock that it will see either
> -	 * a valid, flushable inode that will serialise correctly against the
> -	 * locks below, or it will see a clean (and invalid) inode that it can
> -	 * skip.
> +	 * We do this as early as possible under the ILOCK so that
> +	 * xfs_iflush_cluster() can be guaranteed to detect races with us here.
> +	 * By doing this, we guarantee that once xfs_iflush_cluster has locked
> +	 * XFS_ILOCK that it will see either a valid, flushable inode that will
> +	 * serialise correctly, or it will see a clean (and invalid) inode that
> +	 * it can skip.
>  	 */
>  	spin_lock(&ip->i_flags_lock);
>  	ip->i_flags = XFS_IRECLAIM;
>  	ip->i_ino = 0;
>  	spin_unlock(&ip->i_flags_lock);
>  
> -	xfs_ifunlock(ip);
>  	xfs_iunlock(ip, XFS_ILOCK_EXCL);
>  
>  	XFS_STATS_INC(ip->i_mount, xs_ig_reclaims);
> --- a/fs/xfs/xfs_inode.h
> +++ b/fs/xfs/xfs_inode.h
> @@ -246,6 +246,11 @@ static inline bool xfs_is_reflink_inode(
>   * Synchronize processes attempting to flush the in-core inode back to disk.
>   */
>  
> +static inline int xfs_isiflocked(struct xfs_inode *ip)
> +{
> +	return xfs_iflags_test(ip, XFS_IFLOCK);
> +}
> +
>  extern void __xfs_iflock(struct xfs_inode *ip);
>  
>  static inline int xfs_iflock_nowait(struct xfs_inode *ip)
> @@ -261,16 +266,12 @@ static inline void xfs_iflock(struct xfs
>  
>  static inline void xfs_ifunlock(struct xfs_inode *ip)
>  {
> +	ASSERT(xfs_isiflocked(ip));
>  	xfs_iflags_clear(ip, XFS_IFLOCK);
>  	smp_mb();
>  	wake_up_bit(&ip->i_flags, __XFS_IFLOCK_BIT);
>  }
>  
> -static inline int xfs_isiflocked(struct xfs_inode *ip)
> -{
> -	return xfs_iflags_test(ip, XFS_IFLOCK);
> -}
> -
>  /*
>   * Flags for inode locking.
>   * Bit ranges:	1<<1  - 1<<16-1 -- iolock/ilock modes (bitfield)
> 
> 
> Patches currently in stable-queue which might be from hch@xxxxxx are
> 
> queue-4.9/xfs-always-succeed-when-deduping-zero-bytes.patch
> queue-4.9/xfs-fix-crash-and-data-corruption-due-to-removal-of-busy-cow-extents.patch
> queue-4.9/xfs-don-t-allow-di_size-with-high-bit-set.patch
> queue-4.9/xfs-new-inode-extent-list-lookup-helpers.patch
> queue-4.9/xfs-don-t-call-xfs_sb_quota_from_disk-twice.patch
> queue-4.9/xfs-factor-rmap-btree-size-into-the-indlen-calculations.patch
> queue-4.9/xfs-check-return-value-of-_trans_reserve_quota_nblks.patch
> queue-4.9/xfs-complain-if-we-don-t-get-nextents-bmap-records.patch
> queue-4.9/xfs-check-for-bogus-values-in-btree-block-headers.patch
> queue-4.9/xfs-use-gpf_nofs-when-allocating-btree-cursors.patch
> queue-4.9/xfs-fix-max_retries-_show-and-_store-functions.patch
> queue-4.9/xfs-fix-double-cleanup-when-cui-recovery-fails.patch
> queue-4.9/xfs-don-t-skip-cow-forks-w-delalloc-blocks-in-cowblocks-scan.patch
> queue-4.9/xfs-track-preallocation-separately-in-xfs_bmapi_reserve_delalloc.patch
> queue-4.9/xfs-use-the-actual-ag-length-when-reserving-blocks.patch
> queue-4.9/xfs-ignore-leaf-attr-ichdr.count-in-verifier-during-log-replay.patch
> queue-4.9/xfs-pass-post-eof-speculative-prealloc-blocks-to-bmapi.patch
> queue-4.9/xfs-don-t-cap-maximum-dedupe-request-length.patch
> queue-4.9/xfs-pass-state-not-whichfork-to-trace_xfs_extlist.patch
> queue-4.9/xfs-move-agi-buffer-type-setting-to-xfs_read_agi.patch
> queue-4.9/xfs-check-minimum-block-size-for-crc-filesystems.patch
> queue-4.9/xfs-handle-cow-fork-in-xfs_bmap_trace_exlist.patch
> queue-4.9/pci-msi-check-for-null-affinity-mask-in-pci_irq_get_affinity.patch
> queue-4.9/xfs-error-out-if-trying-to-add-attrs-and-anextents-0.patch
> queue-4.9/xfs-don-t-bug-on-mixed-direct-and-mapped-i-o.patch
> queue-4.9/xfs-use-new-extent-lookup-helpers-xfs_file_iomap_begin_delay.patch
> queue-4.9/xfs-fix-unbalanced-inode-reclaim-flush-locking.patch
> queue-4.9/genirq-affinity-fix-node-generation-from-cpumask.patch
> queue-4.9/xfs-use-new-extent-lookup-helpers-in-__xfs_reflink_reserve_cow.patch
> queue-4.9/xfs-don-t-crash-if-reading-a-directory-results-in-an-unexpected-hole.patch
> queue-4.9/xfs-remove-prev-argument-to-xfs_bmapi_reserve_delalloc.patch
> queue-4.9/xfs-clean-up-cow-fork-reservation-and-tag-inodes-correctly.patch
> queue-4.9/xfs-forbid-ag-btrees-with-level-0.patch
> queue-4.9/xfs-provide-helper-for-counting-extents-from-if_bytes.patch
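
To summarise the fix, here is a toy model of the invariant the patch
establishes: every branch into the "reclaim:" label must have dropped
the flush lock exactly once, which the new ASSERT(!xfs_isiflocked(ip))
checks. The model_* names below are mine, not kernel code; only the
lock-ownership shape mirrors the diff above:

    #include <assert.h>
    #include <stdbool.h>
    #include <stdio.h>

    /* Just enough state to walk the three branches into "reclaim:". */
    struct model_inode {
            bool iflocked;          /* stands in for XFS_IFLOCK */
            bool shutdown;          /* XFS_FORCED_SHUTDOWN() branch */
            bool stale_or_clean;    /* XFS_ISTALE / xfs_inode_clean() branch */
    };

    static void model_iflock(struct model_inode *ip)
    {
            assert(!ip->iflocked);
            ip->iflocked = true;
    }

    static void model_ifunlock(struct model_inode *ip)
    {
            assert(ip->iflocked);   /* the assert the patch adds */
            ip->iflocked = false;
    }

    static void model_reclaim_inode(struct model_inode *ip)
    {
            model_iflock(ip);

            if (ip->shutdown) {
                    /* xfs_iflush_abort() drops the flush lock */
                    model_ifunlock(ip);
                    goto reclaim;
            }
            if (ip->stale_or_clean) {
                    /* the explicit unlock the patch adds */
                    model_ifunlock(ip);
                    goto reclaim;
            }
            /* write-back branch: ends up unlocked as well */
            model_ifunlock(ip);
    reclaim:
            /* the invariant asserted at "reclaim:" after the patch */
            assert(!ip->iflocked);
    }

    int main(void)
    {
            struct model_inode shut  = { .shutdown = true };
            struct model_inode stale = { .stale_or_clean = true };
            struct model_inode dirty = { false, false, false };

            model_reclaim_inode(&shut);
            model_reclaim_inode(&stale);
            model_reclaim_inode(&dirty);
            puts("every path into reclaim: dropped the flush lock exactly once");
            return 0;
    }

Built with any C99 compiler, all three paths pass; removing any one of
the model_ifunlock() calls trips an assert, mirroring the imbalance
the patch fixes.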