Re: File Read Returns Non-existent Null Bytes

Chuck Lever <chuck.lever@xxxxxxxxxx> · Thu, 26 Feb 2015 10:08:34 -0500

On Feb 25, 2015, at 8:27 PM, Trond Myklebust <trond.myklebust@xxxxxxxxxxxxxxx> wrote:

> On Wed, 2015-02-25 at 19:43 -0500, Trond Myklebust wrote:
>> On Wed, 2015-02-25 at 19:37 -0500, Trond Myklebust wrote:
>>> On Wed, 2015-02-25 at 17:32 -0500, Chuck Lever wrote:
>>>> FWIW it’s easy to reproduce a similar race with fsx, and I encounter
>>>> it frequently while running xfstests on fast NFS servers.
>>>> 
>>>> fsx invokes ftruncate following a set of asynchronous reads
>>>> (generated possibly due to readahead). The reads are started first,
>>>> then the SETATTR, but they complete out of order.
>>>> 
>>>> The SETATTR changes the test file’s size, and the completion
>>>> updates the file size in the client’s inode. Then the read requests
>>>> complete on the client and set the file’s size back to its old value.
>>>> 
>>>> All it takes is one late read completion, and the cached file size
>>>> is corrupted. fsx detects the file size mismatch and terminates the
>>>> test. The file size is corrected by a subsequent GETATTR (say, an
>>>> “ls -l” to check it after fsx has terminated).
>>>> 
>>>> While SETATTR blocks concurrent writes, there’s no serialization
>>>> on either the client or server to help guarantee the ordering of
>>>> SETATTR with read operations.
>>>> 
>>>> I’ve found a successful workaround by forcing the client to ignore
>>>> post-op attrs in read replies. A stronger solution might simply set
>>>> the “file attributes need update” flag in the inode if any file
>>>> attribute mutation is noticed during a read completion.
>>> 
>>> That's different. We definitely should aim to fix this kind of issue
>>> since you are talking about a single client being the only thing
>>> accessing the file on the server.
>>> How about the following patch?
>> 
>> Let me retry without the typos. The following will actually
>> compile... :-/
> 
> Third version: this contains a minor optimisation so that we don't force
> the re-read of the last page in the file. Also a little more detail in
> the changelog.

Thanks, I’ll give this a shot, hopefully today.

> 8<--------------------------------------------------------------------
> From d2c822d64a68542665e4887b4d7bccd6e8e3a741 Mon Sep 17 00:00:00 2001
> From: Trond Myklebust <trond.myklebust@xxxxxxxxxxxxxxx>
> Date: Wed, 25 Feb 2015 19:26:28 -0500
> Subject: [PATCH] NFS: Quiesce reads before updating the file size after
> truncating
> 
> Chuck Lever reports seeing readaheads racing with truncate operations
> and causing the file size to be reverted. Fix is to quiesce reads
> after truncating the file on the server, but before updating the
> size on the client.
> 
> Note:
> 
> 1) We believe that waiting for those reads that lie outside the new
> file boundaries is sufficient, since this means any attempt to read
> the data in those areas will trigger a new RPC call, which will be
> ordered w.r.t.  the size change on the server.
> 
> 2) We do not move the call to truncate_pagecache(), since we want
> to ensure that any partial page zeroing is ordered w.r.t. the
> update of inode->i_size.
> 
> Reported-by: Chuck Lever <chuck.lever@xxxxxxxxxx>
> Signed-off-by: Trond Myklebust <trond.myklebust@xxxxxxxxxxxxxxx>
> ---
> fs/nfs/inode.c | 5 +++++
> 1 file changed, 5 insertions(+)
> 
> diff --git a/fs/nfs/inode.c b/fs/nfs/inode.c
> index 83107be3dd01..daf3b1e183fe 100644
> --- a/fs/nfs/inode.c
> +++ b/fs/nfs/inode.c
> @@ -565,6 +565,11 @@ static int nfs_vmtruncate(struct inode * inode, loff_t offset)
> 	if (err)
> 		goto out;
> 
> +	/* Quiesce reads that lie outside the new file size */
> +	invalidate_inode_pages2_range(inode->i_mapping,
> +			(offset + PAGE_CACHE_SHIFT - 1) >> PAGE_CACHE_SHIFT,
> +			-1);
> +
> 	spin_lock(&inode->i_lock);
> 	i_size_write(inode, offset);
> 	/* Optimisation */
> -- 
> 2.1.0
> 
> 
> 

--
Chuck Lever
chuck[dot]lever[at]oracle[dot]com

--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html