Copying Trond's new work address. Begin forwarded message: > From: Dan Duval <dan.duval@xxxxxxxxxx> > Subject: NFS file size anomaly? > Date: December 9, 2013 3:04:27 PM EST > To: Trond Myklebust <Trond.Myklebust@xxxxxxxxxx> > Cc: <linux-nfs@xxxxxxxxxxxxxxx>, <linux-fsdevel@xxxxxxxxxxxxxxx> > > [NOTE: cross-posted to linux-nfs and linux-fsdevel] > > I'm seeing some unexpected behavior with NFS and file sizes. > > The test cases are from the LTP (Linux Test Project), tests > ftest01, ftest05, and ftest07. I'll concentrate on ftest01 > to explain what I've found. > > ftest01 fires off 5 subprocesses, each of which opens an empty > file and does the following, repeatedly: > > . lseek to some point in the file > . read 2048 bytes > . lseek back to the same point > . write 2048 bytes > > The "point in the file" is determined by a pseudo-random > sequence. All such points are on 2048-byte boundaries. > > Occasionally, also driven pseudo-randomly, ftest01 will throw > in a call to ftruncate(), truncate(), sync(), or fstat(). > > With the fstat() calls, the returned .st_size is compared > with the test's expected size for the file, and an error is > declared if they don't match. > > What's happening is that, some way into the test, this fstat() > check is failing. Specifically, the .st_size reported by > fstat() is greater than the computed size. > > The sequence of operations leading up to this is: > > lseek 1034240 0 > read 2048 > lseek 0 1 > write 2048 > > lseek 638976 0 > (read, lseek, write) > > lseek 708608 0 > (read, lseek, write) > > lseek 708608 0 > (read, lseek, write) > > lseek 679584 0 > (read, lseek, write) > > truncate 266240 > > lseek 960512 0 > (read, lseek, write) > > (a bunch of lseek/read/lseek/write ops that do not > extend the file) > > fstat > > So the expected size of the file is 960512 + 2048 == 960560. > But the fstat reports a size of 1036288. > > A look at what's happening on the wire, distilled from the > output of tethereal, is instructive. > > READ Call 638976 4096 (byte offset and size to read) > READ Reply 4096 995382 (bytes read and current file size) > > SETATTR Call 266240 (this corresponds to the truncate() call) > > WRITE Call 638976 4096 (byte offset and size to write) > WRITE Call 708608 4096 > WRITE Call 1032192 4096 > > SETATTR Reply 266240 (current size of file) > > WRITE Reply 643072 (current size of file after write) > WRITE Reply 1036288 > WRITE Reply 1036288 > > GETATTR (initiated internally by NFS code?) > > READ Call 958464 4096 READ Reply 4096 1036288 > > ... (a bunch of READ and WRITE ops that do not extend the file) > > GETATTR Call (this corresponds to the fstat() call) > GETATTR Reply 1036288 > > So what appears to have happened here is that three of the > WRITE operations that the program issued before the truncate() > call have "bled past" the SETATTR, extending the file further > than the SETATTR did. Since none of the operations issued > after SETATTR extends the file further, by the time we get to > the GETATTR, the file is larger than the test expects. > > There are two strange things going on here. The first, > identified above, is that write()s that were initiated before > the truncate() call are being processed after the resulting > SETATTR call. The second is that WRITE operations are being > initiated while the SETATTR is outstanding. > > It seems to me that a size-changing SETATTR operation should > act essentially as an I/O barrier. It should wait for all outstanding > read/write requests to complete, then issue the SETATTR, > wait for the reply, and only then re-enable read/write requests. > > In other words, SETATTR should be atomic with respect to other > I/O operations. > > A git bisect indicates that this problem first appeared (or > was first uncovered) with this commit: > > 4f8ad65 writeback: Refactor writeback_single_inode() > > It continues to the most recent mainline kernels. > > NFS v3 vs. v4 doesn't seem to matter. > > Has anyone else seen this? Any pointers you can provide? > > Thanks, > Dan Duval > Oracle Corp. > > -- > To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in > the body of a message to majordomo@xxxxxxxxxxxxxxx > More majordomo info at http://vger.kernel.org/majordomo-info.html -- Chuck Lever chuck[dot]lever[at]oracle[dot]com -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html