On Tue, 2013-12-10 at 15:36 +0100, Jan Kara wrote: > On Mon 09-12-13 19:56:16, Chuck Lever wrote: > > On Dec 9, 2013, at 4:05 PM, Chuck Lever <chuck.lever@xxxxxxxxxx> wrote: > > > Copying Trond's new work address. > > > > > > Begin forwarded message: > > > > > >> From: Dan Duval <dan.duval@xxxxxxxxxx> > > >> Subject: NFS file size anomaly? > > >> Date: December 9, 2013 3:04:27 PM EST > > >> To: Trond Myklebust <Trond.Myklebust@xxxxxxxxxx> > > >> Cc: <linux-nfs@xxxxxxxxxxxxxxx>, <linux-fsdevel@xxxxxxxxxxxxxxx> > > >> > > >> [NOTE: cross-posted to linux-nfs and linux-fsdevel] > > >> > > >> I'm seeing some unexpected behavior with NFS and file sizes. > > >> > > >> The test cases are from the LTP (Linux Test Project), tests > > >> ftest01, ftest05, and ftest07. I'll concentrate on ftest01 > > >> to explain what I've found. > > >> > > >> ftest01 fires off 5 subprocesses, each of which opens an empty > > >> file and does the following, repeatedly: > > >> > > >> . lseek to some point in the file > > >> . read 2048 bytes > > >> . lseek back to the same point > > >> . write 2048 bytes > > >> > > >> The "point in the file" is determined by a pseudo-random > > >> sequence. All such points are on 2048-byte boundaries. > > >> > > >> Occasionally, also driven pseudo-randomly, ftest01 will throw > > >> in a call to ftruncate(), truncate(), sync(), or fstat(). > > >> > > >> With the fstat() calls, the returned .st_size is compared > > >> with the test's expected size for the file, and an error is > > >> declared if they don't match. > > >> > > >> What's happening is that, some way into the test, this fstat() > > >> check is failing. Specifically, the .st_size reported by > > >> fstat() is greater than the computed size. > > >> > > >> The sequence of operations leading up to this is: > > >> > > >> lseek 1034240 0 > > >> read 2048 > > >> lseek 0 1 > > >> write 2048 > > >> > > >> lseek 638976 0 > > >> (read, lseek, write) > > >> > > >> lseek 708608 0 > > >> (read, lseek, write) > > >> > > >> lseek 708608 0 > > >> (read, lseek, write) > > >> > > >> lseek 679584 0 > > >> (read, lseek, write) > > >> > > >> truncate 266240 > > >> > > >> lseek 960512 0 > > >> (read, lseek, write) > > >> > > >> (a bunch of lseek/read/lseek/write ops that do not > > >> extend the file) > > >> > > >> fstat > > >> > > >> So the expected size of the file is 960512 + 2048 == 960560. > > >> But the fstat reports a size of 1036288. > > >> > > >> A look at what's happening on the wire, distilled from the > > >> output of tethereal, is instructive. > > >> > > >> READ Call 638976 4096 (byte offset and size to read) > > >> READ Reply 4096 995382 (bytes read and current file size) > > >> > > >> SETATTR Call 266240 (this corresponds to the truncate() call) > > >> > > >> WRITE Call 638976 4096 (byte offset and size to write) > > >> WRITE Call 708608 4096 > > >> WRITE Call 1032192 4096 > > >> > > >> SETATTR Reply 266240 (current size of file) > > >> > > >> WRITE Reply 643072 (current size of file after write) > > >> WRITE Reply 1036288 > > >> WRITE Reply 1036288 > > >> > > >> GETATTR (initiated internally by NFS code?) > > >> > > >> READ Call 958464 4096 READ Reply 4096 1036288 > > >> > > >> ... (a bunch of READ and WRITE ops that do not extend the file) > > >> > > >> GETATTR Call (this corresponds to the fstat() call) > > >> GETATTR Reply 1036288 > > >> > > >> So what appears to have happened here is that three of the > > >> WRITE operations that the program issued before the truncate() > > >> call have "bled past" the SETATTR, extending the file further > > >> than the SETATTR did. Since none of the operations issued > > >> after SETATTR extends the file further, by the time we get to > > >> the GETATTR, the file is larger than the test expects. > > >> > > >> There are two strange things going on here. The first, > > >> identified above, is that write()s that were initiated before > > >> the truncate() call are being processed after the resulting > > >> SETATTR call. The second is that WRITE operations are being > > >> initiated while the SETATTR is outstanding. > > >> > > >> It seems to me that a size-changing SETATTR operation should > > >> act essentially as an I/O barrier. It should wait for all outstanding > > >> read/write requests to complete, then issue the SETATTR, > > >> wait for the reply, and only then re-enable read/write requests. > > >> > > >> In other words, SETATTR should be atomic with respect to other > > >> I/O operations. > > > > nfs_setattr() relies on writeback_single_inode() [ via nfs_setattr() -> > > nfs_wb_all() -> sync_inode() -> writeback_single_inode() ] to serialize > > itself with other pending I/O operations. The NFS client already treats > > SETATTR as an I/O barrier. > > > > >> A git bisect indicates that this problem first appeared (or > > >> was first uncovered) with this commit: > > >> > > >> 4f8ad65 writeback: Refactor writeback_single_inode() > > >> > > >> It continues to the most recent mainline kernels. > > >> > > >> NFS v3 vs. v4 doesn't seem to matter. > > >> > > >> Has anyone else seen this? Any pointers you can provide? > > > > After commit 4f8ad65, writeback_single_inode() occasionally returns > > immediately instead of waiting for writeback to finish. The behavior > > would be difficult or impossible to spot with a local filesystem, but > > with NFS, a race is clearly exposed. > That looks like a bug. But the only thing that's changed by 4f8ad65 is > the additional test: > if (!(inode->i_state & I_DIRTY)) > goto out; > So it seems NFS has inodes that have dirty data but don't have (any of) inode > dirty bits set? But I've been staring at the code and don't see any > problem. Any idea Chuck? Otherwise I guess I'll have to reproduce this and > debug what's going on. Hi Jan, I'm guessing that the problem is the fact that you can clear the I_DIRTY_PAGES flag inside __writeback_single_inode() on a non-waiting call (i.e. one which does not have WB_SYNC_ALL set). When that happens, the NFS client ends up with inode->i_state == 0 until one of the WRITEs returns with the "unstable" flag set (which triggers a call to __mark_inode_dirty(inode, I_DIRTY_DATASYNC) so that we call write_inode()). Basically, if you have WB_SYNC_ALL set, I think that you cannot apply the test for (inode->i_state & I_DIRTY) until you have called filemap_fdatawait(). Cheers Trond -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html