Re: File Read Returns Non-existent Null Bytes

On Feb 25, 2015, at 4:47 PM, Trond Myklebust <trond.myklebust@xxxxxxxxxxxxxxx> wrote:

> On Wed, Feb 25, 2015 at 4:02 PM, Chris Perl <cperl@xxxxxxxxxxxxxx> wrote:
>>> So imagine 2 WRITE calls that are being sent to an initially empty
>>> file. One WRITE call is for offset 0, and length 4096 bytes. The
>>> second call is for offset 4096 and length 4096 bytes.
>>> Imagine now that the first WRITE gets delayed (either because the page
>>> cache isn't flushing that part of the file yet or because it gets
>>> re-ordered in the RPC layer or on the server), and the second WRITE is
>>> received and processed by the server first.
>>> Once the delayed WRITE is processed there will be data at offset 0,
>>> but until that happens, anyone reading the file on the server will see
>>> a hole of length 4096 bytes.
>>> 
>>> This kind of issue is why close-to-open cache consistency relies on
>>> only one client accessing the file on the server when it is open for
>>> writing.
>> 
>> Fair enough.  I am taking note of the fact that you said "This kind of
>> issue" implying there are probably other subtle cases I'm not thinking
>> about or that your example does not illustrate.
>> 
>> That said, in your example, there exists some moment in time when the
>> file on the server actually does have a hole in it full of 0's.  In my
>> case, the file never contains 0's.
>> 
>> To be fair, when testing with an Isilon, I can't actually inspect the
>> state of the file on the server in any meaningful way, so I can't be
>> certain that's true.  But, from the view point of the reading client
>> at the NFS layer there are never 0's read back across the wire.  I've
>> confirmed this by matching up wireshark traces while reproducing and
>> the READ replies never contain 0's.  The 0's manifest due to reading
>> too far past where there is valid data in the page cache.
> 
> Then that could be a GETATTR or something similar extending the file
> size outside the READ rpc call. Since the pagecache data is copied to
> userspace without any locks being held, we cannot prevent that race.

FWIW it’s easy to reproduce a similar race with fsx, and I encounter
it frequently while running xfstests on fast NFS servers.

fsx invokes ftruncate following a set of asynchronous reads
(possibly generated by readahead). The reads are issued first,
then the SETATTR, but their replies complete out of order.

The SETATTR changes the test file’s size, and the completion
updates the file size in the client’s inode. Then the read requests
complete on the client and set the file’s size back to its old value.

All it takes is one late read completion, and the cached file size
is corrupted. fsx detects the file size mismatch and terminates the
test. The file size is corrected by a subsequent GETATTR (say, an
“ls -l” to check it after fsx has terminated).
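
A toy sketch of the ordering described above may help. It is not NFS
client code: CachedInode, complete_setattr, complete_read and run_race
are hypothetical names, and the byte counts are made up. It only shows
how a late READ completion carrying pre-truncate attributes can undo
the size set by the SETATTR completion.

```python
# Toy model of the race; all names and sizes here are illustrative assumptions.
from dataclasses import dataclass


@dataclass
class CachedInode:
    size: int  # file size as cached on the client


def complete_setattr(inode: CachedInode, new_size: int) -> None:
    """SETATTR completion: the client adopts the size it just set."""
    inode.size = new_size


def complete_read(inode: CachedInode, post_op_size: int) -> None:
    """READ completion: post-op attributes are applied unconditionally."""
    inode.size = post_op_size


def run_race() -> tuple[int, int]:
    server_size = 8192
    inode = CachedInode(size=server_size)

    # The READ is processed before the truncate, so its reply carries
    # the old size in its post-op attributes.
    read_post_op_size = server_size

    # ftruncate: the SETATTR is processed and its reply arrives first.
    server_size = 4096
    complete_setattr(inode, server_size)

    # The READ reply arrives late and overwrites the cached size.
    complete_read(inode, read_post_op_size)

    return inode.size, server_size


if __name__ == "__main__":
    cached, actual = run_race()
    print(f"cached size {cached}, server size {actual}")
```

In this model the cached size ends up at the old value while the server
holds the truncated one, which is the mismatch fsx reports.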

While SETATTR blocks concurrent writes, there’s no serialization
on either the client or server to help guarantee the ordering of
SETATTR with read operations.

I’ve found a successful workaround by forcing the client to ignore
post-op attrs in read replies. A stronger solution might simply set
the “file attributes need update” flag in the inode if any file
attribute mutation is noticed during a read completion.
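
The two approaches can be contrasted in the same toy style. This is a
sketch under assumptions, not the actual client change: attrs_stale
stands in for the "file attributes need update" flag, and every function
name below is hypothetical.

```python
# Toy model of the two mitigations; names are illustrative assumptions.
from dataclasses import dataclass
from typing import Callable


@dataclass
class CachedInode:
    size: int
    attrs_stale: bool = False  # stand-in for "file attributes need update"


def complete_read_ignoring_attrs(inode: CachedInode, post_op_size: int) -> None:
    """Workaround: drop the post-op attributes carried by READ replies."""
    return None


def complete_read_marking_stale(inode: CachedInode, post_op_size: int) -> None:
    """Stronger variant: do not trust the reply's size, but note the change."""
    if post_op_size != inode.size:
        inode.attrs_stale = True


def cached_size(inode: CachedInode, getattr_from_server: Callable[[], int]) -> int:
    """Return the cached size, revalidating first if it was marked stale."""
    if inode.attrs_stale:
        inode.size = getattr_from_server()
        inode.attrs_stale = False
    return inode.size


if __name__ == "__main__":
    inode = CachedInode(size=4096)           # size after the SETATTR completion
    complete_read_marking_stale(inode, 8192)  # late READ reply with the old size
    print(cached_size(inode, lambda: 4096))   # revalidates against the server
```

Either variant keeps the late READ reply from overwriting the cached
size. The second one additionally forces a fresh attribute fetch before
the size is used again.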

--
Chuck Lever
chuck[dot]lever[at]oracle[dot]com






