Re: NFS file size anomaly?

On Tue, 2013-12-10 at 15:36 +0100, Jan Kara wrote:
> On Mon 09-12-13 19:56:16, Chuck Lever wrote:
> > On Dec 9, 2013, at 4:05 PM, Chuck Lever <chuck.lever@xxxxxxxxxx> wrote:
> > > Copying Trond's new work address.
> > > 
> > > Begin forwarded message:
> > > 
> > >> From: Dan Duval <dan.duval@xxxxxxxxxx>
> > >> Subject: NFS file size anomaly?
> > >> Date: December 9, 2013 3:04:27 PM EST
> > >> To: Trond Myklebust <Trond.Myklebust@xxxxxxxxxx>
> > >> Cc: <linux-nfs@xxxxxxxxxxxxxxx>, <linux-fsdevel@xxxxxxxxxxxxxxx>
> > >> 
> > >> [NOTE: cross-posted to linux-nfs and linux-fsdevel]
> > >> 
> > >> I'm seeing some unexpected behavior with NFS and file sizes.
> > >> 
> > >> The test cases are from the LTP (Linux Test Project), tests
> > >> ftest01, ftest05, and ftest07.  I'll concentrate on ftest01
> > >> to explain what I've found.
> > >> 
> > >> ftest01 fires off 5 subprocesses, each of which opens an empty
> > >> file and does the following, repeatedly:
> > >> 
> > >>       . lseek to some point in the file
> > >>       . read 2048 bytes
> > >>       . lseek back to the same point
> > >>       . write 2048 bytes
> > >> 
> > >> The "point in the file" is determined by a pseudo-random
> > >> sequence.  All such points are on 2048-byte boundaries.
> > >> 
> > >> Occasionally, also driven pseudo-randomly, ftest01 will throw
> > >> in a call to ftruncate(), truncate(), sync(), or fstat().
> > >> 
> > >> With the fstat() calls, the returned .st_size is compared
> > >> with the test's expected size for the file, and an error is
> > >> declared if they don't match.
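> > >> 
> > >> In outline, each subprocess does something like the sketch below
> > >> (a simplified illustration, not the actual LTP source; error
> > >> handling and most of the test's bookkeeping are omitted):
> > >> 
> > >> #include <fcntl.h>
> > >> #include <stdlib.h>
> > >> #include <string.h>
> > >> #include <unistd.h>
> > >> #include <sys/stat.h>
> > >> 
> > >> #define RECSIZE 2048
> > >> #define NRECS   512     /* file spans at most NRECS * RECSIZE bytes */
> > >> 
> > >> static void worker(const char *path)
> > >> {
> > >>         char buf[RECSIZE];
> > >>         off_t expected = 0;     /* the test's idea of the file size */
> > >>         int fd = open(path, O_RDWR | O_CREAT, 0644);
> > >> 
> > >>         memset(buf, 0, sizeof(buf));
> > >>         for (;;) {
> > >>                 off_t off = (off_t)(rand() % NRECS) * RECSIZE;
> > >> 
> > >>                 lseek(fd, off, SEEK_SET);   /* seek to a 2048-aligned point */
> > >>                 read(fd, buf, RECSIZE);
> > >>                 lseek(fd, off, SEEK_SET);   /* seek back to the same point  */
> > >>                 write(fd, buf, RECSIZE);
> > >>                 if (off + RECSIZE > expected)
> > >>                         expected = off + RECSIZE;
> > >> 
> > >>                 if (rand() % 50 == 0) {     /* occasional truncate()        */
> > >>                         expected = (off_t)(rand() % NRECS) * RECSIZE;
> > >>                         ftruncate(fd, expected);
> > >>                 }
> > >>                 if (rand() % 50 == 0) {     /* occasional fstat() check     */
> > >>                         struct stat st;
> > >> 
> > >>                         fstat(fd, &st);
> > >>                         if (st.st_size != expected)
> > >>                                 abort();    /* the failure described below  */
> > >>                 }
> > >>         }
> > >> }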
> > >> 
> > >> What's happening is that, some way into the test, this fstat()
> > >> check is failing.  Specifically, the .st_size reported by
> > >> fstat() is greater than the computed size.
> > >> 
> > >> The sequence of operations leading up to this is:
> > >> 
> > >>       lseek 1034240 0
> > >>       read 2048
> > >>       lseek 0 1
> > >>       write 2048
> > >> 
> > >>       lseek 638976 0
> > >>       (read, lseek, write)
> > >> 
> > >>       lseek 708608 0
> > >>       (read, lseek, write)
> > >> 
> > >>       lseek 708608 0
> > >>       (read, lseek, write)
> > >> 
> > >>       lseek 679584 0
> > >>       (read, lseek, write)
> > >> 
> > >>       truncate 266240
> > >> 
> > >>       lseek 960512 0
> > >>       (read, lseek, write)
> > >> 
> > >>       (a bunch of lseek/read/lseek/write ops that do not
> > >>        extend the file)
> > >> 
> > >>       fstat
> > >> 
> > >> So the expected size of the file is 960512 + 2048 == 962560.
> > >> But the fstat reports a size of 1036288.
> > >> 
> > >> A look at what's happening on the wire, distilled from the
> > >> output of tethereal, is instructive.
> > >> 
> > >>       READ Call 638976 4096 (byte offset and size to read)
> > >>       READ Reply 4096 995382 (bytes read and current file size)
> > >> 
> > >>       SETATTR Call 266240 (this corresponds to the truncate() call)
> > >> 
> > >>       WRITE Call 638976 4096 (byte offset and size to write)
> > >>       WRITE Call 708608 4096
> > >>       WRITE Call 1032192 4096
> > >> 
> > >>       SETATTR Reply 266240 (current size of file)
> > >> 
> > >>       WRITE Reply 643072 (current size of file after write)
> > >>       WRITE Reply 1036288
> > >>       WRITE Reply 1036288
> > >> 
> > >>       GETATTR (initiated internally by NFS code?)
> > >> 
> > >>       READ Call 958464 4096
> > >>       READ Reply 4096 1036288
> > >> 
> > >>       ... (a bunch of READ and WRITE ops that do not extend the file)
> > >> 
> > >>       GETATTR Call (this corresponds to the fstat() call)
> > >>       GETATTR Reply 1036288
> > >> 
> > >> So what appears to have happened here is that three of the
> > >> WRITE operations that the program issued before the truncate()
> > >> call have "bled past" the SETATTR, extending the file further
> > >> than the SETATTR did.  Since none of the operations issued
> > >> after SETATTR extends the file further, by the time we get to
> > >> the GETATTR, the file is larger than the test expects.
> > >> 
> > >> There are two strange things going on here.  The first,
> > >> identified above, is that write()s that were initiated before
> > >> the truncate() call are being processed after the resulting
> > >> SETATTR call.  The second is that WRITE operations are being
> > >> initiated while the SETATTR is outstanding.
> > >> 
> > >> It seems to me that a size-changing SETATTR operation should
> > >> act essentially as an I/O barrier: stop issuing new read/write
> > >> requests, wait for all outstanding ones to complete, then issue
> > >> the SETATTR, wait for the reply, and only then resume read/write
> > >> traffic.
> > >> 
> > >> In other words, SETATTR should be atomic with respect to other
> > >> I/O operations.
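> > >> 
> > >> In pseudo-C, the ordering I have in mind is roughly the following
> > >> (the helper names are invented purely to illustrate the sequencing;
> > >> they are not real client functions):
> > >> 
> > >>         /* Illustration only; these helpers do not exist in the client. */
> > >>         static void truncate_as_barrier(struct inode *inode, loff_t newsize)
> > >>         {
> > >>                 block_new_io(inode);            /* stop issuing new READ/WRITE RPCs */
> > >>                 wait_for_outstanding_io(inode); /* drain WRITEs already on the wire */
> > >>                 send_setattr_and_wait(inode, newsize);
> > >>                 unblock_io(inode);              /* only now resume READ/WRITE traffic */
> > >>         }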
> > 
> > nfs_setattr() relies on writeback_single_inode() [ via nfs_setattr() ->
> > nfs_wb_all() -> sync_inode() -> writeback_single_inode() ] to serialize
> > itself with other pending I/O operations.  The NFS client already treats
> > SETATTR as an I/O barrier. 
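> > 
> > Roughly, the relevant part of that path looks like this (a greatly
> > simplified sketch of the call chain, not the verbatim kernel source):
> > 
> > int nfs_setattr(struct dentry *dentry, struct iattr *attr)
> > {
> >         struct inode *inode = dentry->d_inode;
> > 
> >         /* Write back and wait for all dirty pages before the SETATTR
> >          * RPC goes out, so that WRITEs issued earlier cannot race with
> >          * the size change. */
> >         if (S_ISREG(inode->i_mode))
> >                 nfs_wb_all(inode);  /* -> sync_inode() -> writeback_single_inode() */
> > 
> >         /* ... only then is the SETATTR RPC built and sent ... */
> >         return 0;
> > }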
> > 
> > >> A git bisect indicates that this problem first appeared (or
> > >> was first uncovered) with this commit:
> > >> 
> > >>   4f8ad65 writeback: Refactor writeback_single_inode()
> > >> 
> > >> It persists through the most recent mainline kernels.
> > >> 
> > >> NFS v3 vs. v4 doesn't seem to matter.
> > >> 
> > >> Has anyone else seen this?  Any pointers you can provide?
> > 
> > After commit 4f8ad65, writeback_single_inode() occasionally returns
> > immediately instead of waiting for writeback to finish.  The behavior
> > would be difficult or impossible to spot with a local filesystem, but
> > with NFS, a race is clearly exposed.
>   That looks like a bug. But the only thing that's changed by 4f8ad65 is
> the additional test:
> 	if (!(inode->i_state & I_DIRTY))
> 		goto out;
> So it seems NFS has inodes that have dirty data but don't have any of the
> inode dirty bits set? But I've been staring at the code and don't see any
> problem. Any idea Chuck? Otherwise I guess I'll have to reproduce this and
> debug what's going on.

Hi Jan,

I'm guessing the problem is that the I_DIRTY_PAGES flag can be cleared
inside __writeback_single_inode() on a non-waiting call (i.e. one that
does not have WB_SYNC_ALL set).
When that happens, the NFS client ends up with inode->i_state == 0 until
one of the WRITEs returns with the "unstable" flag set (which triggers a
call to __mark_inode_dirty(inode, I_DIRTY_DATASYNC) so that we call
write_inode()).

Basically, if you have WB_SYNC_ALL set, I think that you cannot apply
the test for (inode->i_state & I_DIRTY) until you have called 
filemap_fdatawait().
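
One way to express that in writeback_single_inode() (an untested sketch
of the idea, not a tested patch) would be to take the early exit only
when this is not a data-integrity sync, or when nothing is left under
writeback to wait for:

	/*
	 * Sketch: skip a "clean" inode only if no WB_SYNC_ALL caller
	 * still has pages under writeback to wait for.  I_DIRTY_PAGES
	 * may already have been cleared by an earlier WB_SYNC_NONE pass
	 * even though writeback has not finished.
	 */
	if (!(inode->i_state & I_DIRTY) &&
	    (wbc->sync_mode != WB_SYNC_ALL ||
	     !mapping_tagged(inode->i_mapping, PAGECACHE_TAG_WRITEBACK)))
		goto out;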

Cheers
  Trond
