Fwd: NFS file size anomaly?

Chuck Lever <chuck.lever@xxxxxxxxxx> · Mon, 9 Dec 2013 16:05:40 -0500

Copying Trond's new work address.

Begin forwarded message:

> From: Dan Duval <dan.duval@xxxxxxxxxx>
> Subject: NFS file size anomaly?
> Date: December 9, 2013 3:04:27 PM EST
> To: Trond Myklebust <Trond.Myklebust@xxxxxxxxxx>
> Cc: <linux-nfs@xxxxxxxxxxxxxxx>, <linux-fsdevel@xxxxxxxxxxxxxxx>
> 
> [NOTE: cross-posted to linux-nfs and linux-fsdevel]
> 
> I'm seeing some unexpected behavior with NFS and file sizes.
> 
> The test cases are from the LTP (Linux Test Project), tests
> ftest01, ftest05, and ftest07.  I'll concentrate on ftest01
> to explain what I've found.
> 
> ftest01 fires off 5 subprocesses, each of which opens an empty
> file and does the following, repeatedly:
> 
>        . lseek to some point in the file
>        . read 2048 bytes
>        . lseek back to the same point
>        . write 2048 bytes
> 
> The "point in the file" is determined by a pseudo-random
> sequence.  All such points are on 2048-byte boundaries.
> 
> Occasionally, also driven pseudo-randomly, ftest01 will throw
> in a call to ftruncate(), truncate(), sync(), or fstat().
> 
> With the fstat() calls, the returned .st_size is compared
> with the test's expected size for the file, and an error is
> declared if they don't match.
> 
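The workload described above can be approximated with a short Python sketch. This is not the LTP source: `run_workload`, its parameters, and the 2%/2% probabilities for truncate and fstat are assumptions chosen for illustration, and only a single process is modeled. On a local filesystem the size check is expected to pass; pointing `path` at an NFS mount is how one would try to reproduce the mismatch.

```python
import os
import random

BLOCK = 2048  # every seek target is a multiple of this


def run_workload(path, steps=2000, max_blocks=512, seed=1):
    """Random read/write/truncate/fstat loop; returns the number of size checks."""
    rng = random.Random(seed)
    fd = os.open(path, os.O_RDWR | os.O_CREAT | os.O_TRUNC, 0o600)
    expected = 0  # the size the test believes the file has
    checks = 0
    try:
        for _ in range(steps):
            roll = rng.random()
            if roll < 0.02:
                # Occasional truncate to a block-aligned size (may shrink or grow).
                expected = rng.randrange(max_blocks) * BLOCK
                os.ftruncate(fd, expected)
            elif roll < 0.04:
                # Occasional size check, as the test does with fstat().
                actual = os.fstat(fd).st_size
                if actual != expected:
                    raise AssertionError(
                        f"st_size {actual} != expected {expected}")
                checks += 1
            else:
                # lseek, read 2048, lseek back, write 2048.
                offset = rng.randrange(max_blocks) * BLOCK
                os.lseek(fd, offset, os.SEEK_SET)
                os.read(fd, BLOCK)
                os.lseek(fd, offset, os.SEEK_SET)
                os.write(fd, b"x" * BLOCK)
                expected = max(expected, offset + BLOCK)
    finally:
        os.close(fd)
    return checks
```
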
> What's happening is that, some way into the test, this fstat()
> check is failing.  Specifically, the .st_size reported by
> fstat() is greater than the computed size.
> 
> The sequence of operations leading up to this is:
> 
>        lseek 1034240 0
>        read 2048
>        lseek 0 1
>        write 2048
> 
>        lseek 638976 0
>        (read, lseek, write)
> 
>        lseek 708608 0
>        (read, lseek, write)
> 
>        lseek 708608 0
>        (read, lseek, write)
> 
>        lseek 679584 0
>        (read, lseek, write)
> 
>        truncate 266240
> 
>        lseek 960512 0
>        (read, lseek, write)
> 
>        (a bunch of lseek/read/lseek/write ops that do not
>         extend the file)
> 
>        fstat
> 
> So the expected size of the file is 960512 + 2048 == 962560.
> But the fstat reports a size of 1036288.
> 
> A look at what's happening on the wire, distilled from the
> output of tethereal, is instructive.
> 
>        READ Call 638976 4096 (byte offset and size to read)
>        READ Reply 4096 995382 (bytes read and current file size)
> 
>        SETATTR Call 266240 (this corresponds to the truncate() call)
> 
>        WRITE Call 638976 4096 (byte offset and size to write)
>        WRITE Call 708608 4096
>        WRITE Call 1032192 4096
> 
>        SETATTR Reply 266240 (current size of file)
> 
>        WRITE Reply 643072 (current size of file after write)
>        WRITE Reply 1036288
>        WRITE Reply 1036288
> 
>        GETATTR (initiated internally by NFS code?)
> 
>        READ Call 958464 4096
>        READ Reply 4096 1036288
> 
>        ... (a bunch of READ and WRITE ops that do not extend the file)
> 
>        GETATTR Call (this corresponds to the fstat() call)
>        GETATTR Reply 1036288
> 
> So what appears to have happened here is that three of the
> WRITE operations that the program issued before the truncate()
> call have "bled past" the SETATTR, extending the file further
> than the SETATTR did.  Since none of the operations issued
> after SETATTR extends the file further, by the time we get to
> the GETATTR, the file is larger than the test expects.
> 
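To make the effect of the reordering concrete, here is a small replay of the two orderings. It is only a model: `replay` is a hypothetical helper, and the wire order assumes the server applied the SETATTR before the three WRITEs, which is what the reply sizes suggest (the order among the WRITEs themselves does not change the result).

```python
def replay(ops):
    """Return the file size after applying ops in the given order."""
    size = 0
    for op in ops:
        if op[0] == "truncate":
            size = op[1]                       # sets the size outright
        else:
            _, offset, length = op
            size = max(size, offset + length)  # a write can only extend
    return size


# Order in which the program issued the operations (2048-byte writes).
program_order = [
    ("write", 1034240, 2048),
    ("write", 638976, 2048),
    ("write", 708608, 2048),
    ("write", 708608, 2048),
    ("write", 679584, 2048),
    ("truncate", 266240),
    ("write", 960512, 2048),
]

# Order suggested by the wire trace (4096-byte WRITEs), assuming the
# SETATTR was applied before the three delayed WRITEs.
wire_order = [
    ("truncate", 266240),
    ("write", 638976, 4096),
    ("write", 708608, 4096),
    ("write", 1032192, 4096),
    ("write", 958464, 4096),
]

print(replay(program_order))  # 962560, the size the test expects
print(replay(wire_order))     # 1036288, the size fstat() reports
```
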
> There are two strange things going on here.  The first,
> identified above, is that write()s that were initiated before
> the truncate() call are being processed after the resulting
> SETATTR call.  The second is that WRITE operations are being
> initiated while the SETATTR is outstanding.
> 
> It seems to me that a size-changing SETATTR operation should
> act essentially as an I/O barrier. It should wait for all outstanding
> read/write requests to complete, then issue the SETATTR,
> wait for the reply, and only then re-enable read/write requests.
> 
> In other words, SETATTR should be atomic with respect to other
> I/O operations.
> 
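The barrier behavior proposed above can be sketched in user space as follows. This is purely illustrative and says nothing about how the kernel NFS client is or should be structured: `SizeChangeBarrier` and its method names are hypothetical, and Python threads stand in for in-flight RPCs.

```python
import threading


class SizeChangeBarrier:
    """Ordinary I/O runs concurrently; a size change gets exclusive access."""

    def __init__(self):
        self._cond = threading.Condition()
        self._in_flight = 0      # reads/writes currently outstanding
        self._resizing = False   # True while a size change is pending or running

    def begin_io(self):
        with self._cond:
            while self._resizing:
                self._cond.wait()    # hold new I/O until the resize is done
            self._in_flight += 1

    def end_io(self):
        with self._cond:
            self._in_flight -= 1
            self._cond.notify_all()

    def begin_resize(self):
        with self._cond:
            while self._resizing:
                self._cond.wait()    # one size change at a time
            self._resizing = True    # stop new I/O from starting
            while self._in_flight:
                self._cond.wait()    # drain outstanding I/O

    def end_resize(self):
        with self._cond:
            self._resizing = False
            self._cond.notify_all()  # re-enable read/write requests
```

A caller would wrap each read or write in `begin_io()`/`end_io()` and the SETATTR in `begin_resize()`/`end_resize()`, so that no WRITE can be issued or completed while the size change is outstanding.
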
> A git bisect indicates that this problem first appeared (or
> was first uncovered) with this commit:
> 
>    4f8ad65 writeback: Refactor writeback_single_inode()
> 
> It persists through the most recent mainline kernels.
> 
> NFS v3 vs. v4 doesn't seem to matter.
> 
> Has anyone else seen this?  Any pointers you can provide?
> 
> Thanks,
> Dan Duval
> Oracle Corp.
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
> the body of a message to majordomo@xxxxxxxxxxxxxxx
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

-- 
Chuck Lever
chuck[dot]lever[at]oracle[dot]com
