On Feb 26, 2015, at 12:27 PM, Trond Myklebust <trond.myklebust@xxxxxxxxxxxxxxx> wrote: > On Thu, Feb 26, 2015 at 11:26 AM, Chuck Lever <chuck.lever@xxxxxxxxxx> wrote: >> I’m breaking this into a separate thread, since you correctly >> pointed out that this is not the same problem that Chris sees. >> >> Begin forwarded message: >> >>> From: Trond Myklebust <trond.myklebust@xxxxxxxxxxxxxxx> >>> Subject: Re: File Read Returns Non-existent Null Bytes >>> Date: February 25, 2015 at 8:27:37 PM EST >>> To: Chuck Lever <chuck.lever@xxxxxxxxxx> >>> Cc: Chris Perl <cperl@xxxxxxxxxxxxxx>, Linux NFS Mailing List <linux-nfs@xxxxxxxxxxxxxxx>, Chris Perl <chris.perl@xxxxxxxxx> >>> >>> On Wed, 2015-02-25 at 19:43 -0500, Trond Myklebust wrote: >>>> On Wed, 2015-02-25 at 19:37 -0500, Trond Myklebust wrote: >>>>> On Wed, 2015-02-25 at 17:32 -0500, Chuck Lever wrote: >>>>>> FWIW it’s easy to reproduce a similar race with fsx, and I encounter >>>>>> it frequently while running xfstests on fast NFS servers. >>>>>> >>>>>> fsx invokes ftruncate following a set of asynchronous reads >>>>>> (generated possibly due to readahead). The reads are started first, >>>>>> then the SETATTR, but they complete out of order. >>>>>> >>>>>> The SETATTR changes the test file’s size, and the completion >>>>>> updates the file size in the client’s inode. Then the read requests >>>>>> complete on the client and set the file’s size back to its old value. >>>>>> >>>>>> All it takes is one late read completion, and the cached file size >>>>>> is corrupted. fsx detects the file size mismatch and terminates the >>>>>> test. The file size is corrected by a subsequent GETATTR (say, an >>>>>> “ls -l” to check it after fsx has terminated). >>>>>> >>>>>> While SETATTR blocks concurrent writes, there’s no serialization >>>>>> on either the client or server to help guarantee the ordering of >>>>>> SETATTR with read operations. >>>>>> >>>>>> I’ve found a successful workaround by forcing the client to ignore >>>>>> post-op attrs in read replies. A stronger solution might simply set >>>>>> the “file attributes need update” flag in the inode if any file >>>>>> attribute mutation is noticed during a read completion. >>>>> >>>>> That's different. We definitely should aim to fix this kind of issue >>>>> since you are talking about a single client being the only thing >>>>> accessing the file on the server. >>>>> How about the following patch? >>>> >>>> Let me retry without the typos. The following will actually >>>> compile... :-/ >>> >>> Third version: this contains a minor optimisation so that we don't force >>> the re-read of the last page in the file. Also a little more detail in >>> the changelog. >> >> Test results: >> >> generic/127 74s ... - output mismatch (see /home/cel/src/xfstests/results//generic/127.out.bad) >> --- tests/generic/127.out 2014-02-13 15:40:45.202103271 -0500 >> +++ /home/cel/src/xfstests/results//generic/127.out.bad 2015-02-26 10:46:24.284735977 -0500 >> @@ -4,7 +4,10011 @@ >> === FSX Light Mode, Memory Mapping === >> All operations completed A-OK! >> === FSX Standard Mode, No Memory Mapping === >> -All operations completed A-OK! >> +ltp/fsx -q -l 262144 -o 65536 -S 191110531 -N 100000 -R -W fsx_std_nommap >> +Size error: expected 0x36f6f stat 0x3e67a seek 0x3e67a >> +LOG DUMP (83542 total operations): >> ... >> (Run 'diff -u tests/generic/127.out /home/cel/src/xfstests/results//generic/127.out.bad' to see the entire diff) >> >> This happens when the server exports a tmpfs, for example, which >> mimics the performance characteristics of an NVM-based filesystem. >> Otherwise, SETATTR will be slow and the race is typically avoided. >> Well, that’s my theory, anyway. >> >> >>> 8<-------------------------------------------------------------------- >>> From d2c822d64a68542665e4887b4d7bccd6e8e3a741 Mon Sep 17 00:00:00 2001 >>> From: Trond Myklebust <trond.myklebust@xxxxxxxxxxxxxxx> >>> Date: Wed, 25 Feb 2015 19:26:28 -0500 >>> Subject: [PATCH] NFS: Quiesce reads before updating the file size after >>> truncating >>> >>> Chuck Lever reports seeing readaheads racing with truncate operations >>> and causing the file size to be reverted. Fix is to quiesce reads >>> after truncating the file on the server, but before updating the >>> size on the client. >>> >>> Note: >>> >>> 1) We believe that waiting for those reads that lie outside the new >>> file boundaries is sufficient, since this means any attempt to read >>> the data in those areas will trigger a new RPC call, which will be >>> ordered w.r.t. the size change on the server. >> >> Certainly reads outside the new file size have _data_ that can >> be discarded. >> >> But in this case: The server processes the operations in the order >> the client sent them. The READs first, then the SETATTR. That’s why >> the READ replies still have the old file size. >> >> But queuing on the server causes the replies to be returned to the >> client out of order. The client applies the post-op attributes in >> that order, and the old file size replaces the new size. >> >> Therefore I think _any_ READ that is performed on the server before >> the SETATTR, but completes on the client after the SETATTR, will >> poison the file’s attribute cache on the client. >> > > Why doesn't our usual trick of using the ctime to discriminate between > older and newer file sizes work here? Is the problem that the > resolution is insufficient? I’m not sure. These operations can occur during the same millisecond, so it is plausible there’s a timestamp resolution issue. Or the server doesn’t return pre-op attrs. Timestamp is more likely, though: I have seen this issue occur on NFSv4 as well. > The alternative would be to add a "barrier" operation to allow the > setattr code to wait for all RPC calls before ours to complete (it > would have to wait not just for READ, but also GETATTR and all other > operations to the file). We could do that too. Yes. A barrier is usually heavyweight, though. And I thought there was some exclusion between GETATTR and SETATTR, so forcing an inode revalidation should work too? Or would that result in an unneeded data cache invalidation? -- Chuck Lever chuck[dot]lever[at]oracle[dot]com -- To unsubscribe from this list: send the line "unsubscribe linux-nfs" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html