Re: [PATCH] NFS: fix the handling of NFS_INO_INVALID_DATA flag in nfs_revalidate_mapping

Jeff Layton <jlayton@xxxxxxxxxx> · Tue, 28 Jan 2014 08:38:29 -0500

On Mon, 27 Jan 2014 13:46:15 -0500
Jeff Layton <jlayton@xxxxxxxxxx> wrote:

> There is a possible race in how the nfs_invalidate_mapping function is
> handled.  Currently, we go and invalidate the pages in the file and then
> clear NFS_INO_INVALID_DATA.
> 
> The problem is that it's possible for a stale page to creep into the
> mapping after the page was invalidated (i.e., via readahead). If another
> writer comes along and sets the flag after that happens but before
> invalidate_inode_pages2 returns then we could clear the flag
> without the cache having been properly invalidated.
> 
> So, we must clear the flag first and then invalidate the pages. Doing
> this however, opens another race:
> 
> It's possible to have two concurrent read() calls that end up in
> nfs_revalidate_mapping at the same time. The first one clears the
> NFS_INO_INVALID_DATA flag and then goes to call nfs_invalidate_mapping.
> 
> Just before calling that though, the other task races in, checks the
> flag and finds it cleared. At that point, it trusts that the mapping is
> good and gets the lock on the page, allowing the read() to be satisfied
> from the cache even though the data is no longer valid.
> 
> These effects are easily manifested by running diotest3 from the LTP
> test suite on NFS. That program does a series of DIO writes and buffered
> reads. The operations are serialized and page-aligned but the existing
> code fails the test since it occasionally allows a read to come out of
> the cache incorrectly. While mixing direct and buffered I/O isn't
> recommended, I believe it's possible to hit this in other ways that just
> use buffered I/O, though that situation is much harder to reproduce.
> 
> The problem is that the checking/clearing of that flag and the
> invalidation of the mapping really need to be atomic. Fix this by
> serializing concurrent invalidations with a bitlock.
> 
> At the same time, we also need to allow other places that check
> NFS_INO_INVALID_DATA to check whether we might be in the middle of
> invalidating the file, so fix up a couple of places that do that
> to look for the new NFS_INO_INVALIDATING flag.
> 
> Doing this requires us to be careful not to set the bitlock
> unnecessarily, so this code only does that if it believes it will
> be doing an invalidation.
> 
> Signed-off-by: Jeff Layton <jlayton@xxxxxxxxxx>
> ---
>  fs/nfs/dir.c           |  3 ++-
>  fs/nfs/inode.c         | 42 ++++++++++++++++++++++++++++++++++++++----
>  fs/nfs/nfstrace.h      |  1 +
>  fs/nfs/write.c         |  6 +++++-
>  include/linux/nfs_fs.h |  1 +
>  5 files changed, 47 insertions(+), 6 deletions(-)
> 
> diff --git a/fs/nfs/dir.c b/fs/nfs/dir.c
> index b266f73..b39a046 100644
> --- a/fs/nfs/dir.c
> +++ b/fs/nfs/dir.c
> @@ -288,7 +288,8 @@ int nfs_readdir_search_for_cookie(struct nfs_cache_array *array, nfs_readdir_des
>  
>  			new_pos = desc->current_index + i;
>  			if (ctx->attr_gencount != nfsi->attr_gencount
> -			    || (nfsi->cache_validity & (NFS_INO_INVALID_ATTR|NFS_INO_INVALID_DATA))) {
> +			    || (nfsi->cache_validity & (NFS_INO_INVALID_ATTR|NFS_INO_INVALID_DATA))
> +			    || test_bit(NFS_INO_INVALIDATING, &nfsi->flags)) {
>  				ctx->duped = 0;
>  				ctx->attr_gencount = nfsi->attr_gencount;
>  			} else if (new_pos < desc->ctx->pos) {
> diff --git a/fs/nfs/inode.c b/fs/nfs/inode.c
> index c63e152..0a972ee 100644
> --- a/fs/nfs/inode.c
> +++ b/fs/nfs/inode.c
> @@ -977,11 +977,11 @@ static int nfs_invalidate_mapping(struct inode *inode, struct address_space *map
>  		if (ret < 0)
>  			return ret;
>  	}
> -	spin_lock(&inode->i_lock);
> -	nfsi->cache_validity &= ~NFS_INO_INVALID_DATA;
> -	if (S_ISDIR(inode->i_mode))
> +	if (S_ISDIR(inode->i_mode)) {
> +		spin_lock(&inode->i_lock);
>  		memset(nfsi->cookieverf, 0, sizeof(nfsi->cookieverf));
> -	spin_unlock(&inode->i_lock);
> +		spin_unlock(&inode->i_lock);
> +	}
>  	nfs_inc_stats(inode, NFSIOS_DATAINVALIDATE);
>  	nfs_fscache_wait_on_invalidate(inode);
>  
> @@ -1008,6 +1008,7 @@ static bool nfs_mapping_need_revalidate_inode(struct inode *inode)
>  int nfs_revalidate_mapping(struct inode *inode, struct address_space *mapping)
>  {
>  	struct nfs_inode *nfsi = NFS_I(inode);
> +	unsigned long *bitlock = &nfsi->flags;
>  	int ret = 0;
>  
>  	/* swapfiles are not supposed to be shared. */
> @@ -1019,12 +1020,45 @@ int nfs_revalidate_mapping(struct inode *inode, struct address_space *mapping)
>  		if (ret < 0)
>  			goto out;
>  	}
> +
> +	/*
> +	 * We must clear NFS_INO_INVALID_DATA first to ensure that
> +	 * invalidations that come in while we're shooting down the mappings
> +	 * are respected. But, that leaves a race window where one revalidator
> +	 * can clear the flag, and then another checks it before the mapping
> +	 * gets invalidated. Fix that by serializing access to this part of
> +	 * the function.
> +	 *
> +	 * At the same time, we need to allow other tasks to see whether we
> +	 * might be in the middle of invalidating the pages, so we only set
> +	 * the bit lock here if it looks like we're going to be doing that.
> +	 */
> +	for (;;) {
> +		ret = wait_on_bit(bitlock, NFS_INO_INVALIDATING,
> +				  nfs_wait_bit_killable, TASK_KILLABLE);
> +		if (ret)
> +			goto out;
> +		if (!(nfsi->cache_validity & NFS_INO_INVALID_DATA))
> +			goto out;
> +		if (!test_and_set_bit_lock(NFS_INO_INVALIDATING, bitlock))
> +			break;
> +	}
> +

So, I should mention that while the testcase does pass with this
patchset, I think there are still races in here.

I think it's possible for two tasks to enter this code nearly
simultaneously such that the bitlock is clear. One gets the lock and
then clears NFS_INO_INVALID_DATA, and the other then checks the flag,
finds it clear and exits the function before the pages were invalidated.

I wonder if we'd be better served with a new cache_validity flag
NFS_INO_INVALIDATING_DATA or something, and not have other functions
trying to peek at the bitlock.

> +	spin_lock(&inode->i_lock);
>  	if (nfsi->cache_validity & NFS_INO_INVALID_DATA) {
> +		nfsi->cache_validity &= ~NFS_INO_INVALID_DATA;
> +		spin_unlock(&inode->i_lock);
>  		trace_nfs_invalidate_mapping_enter(inode);
>  		ret = nfs_invalidate_mapping(inode, mapping);
>  		trace_nfs_invalidate_mapping_exit(inode, ret);
> +	} else {
> +		/* something raced in and cleared the flag */
> +		spin_unlock(&inode->i_lock);
>  	}
>  
> +	clear_bit_unlock(NFS_INO_INVALIDATING, bitlock);
> +	smp_mb__after_clear_bit();
> +	wake_up_bit(bitlock, NFS_INO_INVALIDATING);
>  out:
>  	return ret;
>  }
> diff --git a/fs/nfs/nfstrace.h b/fs/nfs/nfstrace.h
> index 89fe741..59f838c 100644
> --- a/fs/nfs/nfstrace.h
> +++ b/fs/nfs/nfstrace.h
> @@ -36,6 +36,7 @@
>  	__print_flags(v, "|", \
>  			{ 1 << NFS_INO_ADVISE_RDPLUS, "ADVISE_RDPLUS" }, \
>  			{ 1 << NFS_INO_STALE, "STALE" }, \
> +			{ 1 << NFS_INO_INVALIDATING, "INVALIDATING" }, \
>  			{ 1 << NFS_INO_FLUSHING, "FLUSHING" }, \
>  			{ 1 << NFS_INO_FSCACHE, "FSCACHE" }, \
>  			{ 1 << NFS_INO_COMMIT, "COMMIT" }, \
> diff --git a/fs/nfs/write.c b/fs/nfs/write.c
> index a44a872..5511a42 100644
> --- a/fs/nfs/write.c
> +++ b/fs/nfs/write.c
> @@ -909,9 +909,13 @@ bool nfs_ctx_key_to_expire(struct nfs_open_context *ctx)
>   */
>  static bool nfs_write_pageuptodate(struct page *page, struct inode *inode)
>  {
> +	struct nfs_inode *nfsi = NFS_I(inode);
> +
>  	if (nfs_have_delegated_attributes(inode))
>  		goto out;
> -	if (NFS_I(inode)->cache_validity & (NFS_INO_INVALID_DATA|NFS_INO_REVAL_PAGECACHE))
> +	if (nfsi->cache_validity & (NFS_INO_INVALID_DATA|NFS_INO_REVAL_PAGECACHE))
> +		return false;
> +	if (test_bit(NFS_INO_INVALIDATING, &nfsi->flags))
>  		return false;
>  out:
>  	return PageUptodate(page) != 0;
> diff --git a/include/linux/nfs_fs.h b/include/linux/nfs_fs.h
> index 4899737..18fb16f 100644
> --- a/include/linux/nfs_fs.h
> +++ b/include/linux/nfs_fs.h
> @@ -215,6 +215,7 @@ struct nfs_inode {
>  #define NFS_INO_ADVISE_RDPLUS	(0)		/* advise readdirplus */
>  #define NFS_INO_STALE		(1)		/* possible stale inode */
>  #define NFS_INO_ACL_LRU_SET	(2)		/* Inode is on the LRU list */
> +#define NFS_INO_INVALIDATING	(3)		/* inode is being invalidated */
>  #define NFS_INO_FLUSHING	(4)		/* inode is flushing out data */
>  #define NFS_INO_FSCACHE		(5)		/* inode can be cached by FS-Cache */
>  #define NFS_INO_FSCACHE_LOCK	(6)		/* FS-Cache cookie management lock */

-- 
Jeff Layton <jlayton@xxxxxxxxxx>
--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html