On Tue 10-12-13 19:01:21, Trond Myklebust wrote: > On Tue, 2013-12-10 at 15:36 +0100, Jan Kara wrote: > > On Mon 09-12-13 19:56:16, Chuck Lever wrote: > > > On Dec 9, 2013, at 4:05 PM, Chuck Lever <chuck.lever@xxxxxxxxxx> wrote: > > > > Copying Trond's new work address. > > > > > > > > Begin forwarded message: > > > > > > > >> From: Dan Duval <dan.duval@xxxxxxxxxx> > > > >> Subject: NFS file size anomaly? > > > >> Date: December 9, 2013 3:04:27 PM EST > > > >> To: Trond Myklebust <Trond.Myklebust@xxxxxxxxxx> > > > >> Cc: <linux-nfs@xxxxxxxxxxxxxxx>, <linux-fsdevel@xxxxxxxxxxxxxxx> > > > >> > > > >> [NOTE: cross-posted to linux-nfs and linux-fsdevel] > > > >> > > > >> I'm seeing some unexpected behavior with NFS and file sizes. > > > >> > > > >> The test cases are from the LTP (Linux Test Project), tests > > > >> ftest01, ftest05, and ftest07. I'll concentrate on ftest01 > > > >> to explain what I've found. > > > >> > > > >> ftest01 fires off 5 subprocesses, each of which opens an empty > > > >> file and does the following, repeatedly: > > > >> > > > >> . lseek to some point in the file > > > >> . read 2048 bytes > > > >> . lseek back to the same point > > > >> . write 2048 bytes > > > >> > > > >> The "point in the file" is determined by a pseudo-random > > > >> sequence. All such points are on 2048-byte boundaries. > > > >> > > > >> Occasionally, also driven pseudo-randomly, ftest01 will throw > > > >> in a call to ftruncate(), truncate(), sync(), or fstat(). > > > >> > > > >> With the fstat() calls, the returned .st_size is compared > > > >> with the test's expected size for the file, and an error is > > > >> declared if they don't match. > > > >> > > > >> What's happening is that, some way into the test, this fstat() > > > >> check is failing. Specifically, the .st_size reported by > > > >> fstat() is greater than the computed size. > > > >> > > > >> The sequence of operations leading up to this is: > > > >> > > > >> lseek 1034240 0 > > > >> read 2048 > > > >> lseek 0 1 > > > >> write 2048 > > > >> > > > >> lseek 638976 0 > > > >> (read, lseek, write) > > > >> > > > >> lseek 708608 0 > > > >> (read, lseek, write) > > > >> > > > >> lseek 708608 0 > > > >> (read, lseek, write) > > > >> > > > >> lseek 679584 0 > > > >> (read, lseek, write) > > > >> > > > >> truncate 266240 > > > >> > > > >> lseek 960512 0 > > > >> (read, lseek, write) > > > >> > > > >> (a bunch of lseek/read/lseek/write ops that do not > > > >> extend the file) > > > >> > > > >> fstat > > > >> > > > >> So the expected size of the file is 960512 + 2048 == 960560. > > > >> But the fstat reports a size of 1036288. > > > >> > > > >> A look at what's happening on the wire, distilled from the > > > >> output of tethereal, is instructive. > > > >> > > > >> READ Call 638976 4096 (byte offset and size to read) > > > >> READ Reply 4096 995382 (bytes read and current file size) > > > >> > > > >> SETATTR Call 266240 (this corresponds to the truncate() call) > > > >> > > > >> WRITE Call 638976 4096 (byte offset and size to write) > > > >> WRITE Call 708608 4096 > > > >> WRITE Call 1032192 4096 > > > >> > > > >> SETATTR Reply 266240 (current size of file) > > > >> > > > >> WRITE Reply 643072 (current size of file after write) > > > >> WRITE Reply 1036288 > > > >> WRITE Reply 1036288 > > > >> > > > >> GETATTR (initiated internally by NFS code?) > > > >> > > > >> READ Call 958464 4096 READ Reply 4096 1036288 > > > >> > > > >> ... (a bunch of READ and WRITE ops that do not extend the file) > > > >> > > > >> GETATTR Call (this corresponds to the fstat() call) > > > >> GETATTR Reply 1036288 > > > >> > > > >> So what appears to have happened here is that three of the > > > >> WRITE operations that the program issued before the truncate() > > > >> call have "bled past" the SETATTR, extending the file further > > > >> than the SETATTR did. Since none of the operations issued > > > >> after SETATTR extends the file further, by the time we get to > > > >> the GETATTR, the file is larger than the test expects. > > > >> > > > >> There are two strange things going on here. The first, > > > >> identified above, is that write()s that were initiated before > > > >> the truncate() call are being processed after the resulting > > > >> SETATTR call. The second is that WRITE operations are being > > > >> initiated while the SETATTR is outstanding. > > > >> > > > >> It seems to me that a size-changing SETATTR operation should > > > >> act essentially as an I/O barrier. It should wait for all outstanding > > > >> read/write requests to complete, then issue the SETATTR, > > > >> wait for the reply, and only then re-enable read/write requests. > > > >> > > > >> In other words, SETATTR should be atomic with respect to other > > > >> I/O operations. > > > > > > nfs_setattr() relies on writeback_single_inode() [ via nfs_setattr() -> > > > nfs_wb_all() -> sync_inode() -> writeback_single_inode() ] to serialize > > > itself with other pending I/O operations. The NFS client already treats > > > SETATTR as an I/O barrier. > > > > > > >> A git bisect indicates that this problem first appeared (or > > > >> was first uncovered) with this commit: > > > >> > > > >> 4f8ad65 writeback: Refactor writeback_single_inode() > > > >> > > > >> It continues to the most recent mainline kernels. > > > >> > > > >> NFS v3 vs. v4 doesn't seem to matter. > > > >> > > > >> Has anyone else seen this? Any pointers you can provide? > > > > > > After commit 4f8ad65, writeback_single_inode() occasionally returns > > > immediately instead of waiting for writeback to finish. The behavior > > > would be difficult or impossible to spot with a local filesystem, but > > > with NFS, a race is clearly exposed. > > That looks like a bug. But the only thing that's changed by 4f8ad65 is > > the additional test: > > if (!(inode->i_state & I_DIRTY)) > > goto out; > > So it seems NFS has inodes that have dirty data but don't have (any of) inode > > dirty bits set? But I've been staring at the code and don't see any > > problem. Any idea Chuck? Otherwise I guess I'll have to reproduce this and > > debug what's going on. > > Hi Jan, > > I'm guessing that the problem is the fact that you can clear the > I_DIRTY_PAGES flag inside __writeback_single_inode() on a non-waiting > call (i.e. one which does not have WB_SYNC_ALL set). > When that happens, the NFS client ends up with inode->i_state == 0 until > one of the WRITEs returns with the "unstable" flag set (which triggers a > call to __mark_inode_dirty(inode, I_DIRTY_DATASYNC) so that we call > write_inode()). > > Basically, if you have WB_SYNC_ALL set, I think that you cannot apply > the test for (inode->i_state & I_DIRTY) until you have called > filemap_fdatawait(). Ah, thanks for explanation. I should have spotted this myself. So something like the attached patch should fix the issue. Dan, can you try whether the attached patch fixes your problem? Honza -- Jan Kara <jack@xxxxxxx> SUSE Labs, CR
>From 18eb76f3c2c7e8f3f57380846677097fe886232e Mon Sep 17 00:00:00 2001 From: Jan Kara <jack@xxxxxxx> Date: Wed, 11 Dec 2013 12:15:57 +0100 Subject: [PATCH] writeback: Fix data corruption on NFS Commit 4f8ad655dbc8 "writeback: Refactor writeback_single_inode()" added a condition to skip clean inode. However this is wrong in WB_SYNC_ALL mode because there we also want to wait for outstanding writeback on possibly clean inode. This was causing occasional data corruption issues on NFS because it uses sync_inode() to make sure all outstanding writes are flushed to the server before truncating the inode and with sync_inode() returning prematurely file was sometimes extended back by an outstanding write after it was truncated. So modify the test to also check for pages under writeback in WB_SYNC_ALL mode. CC: stable@xxxxxxxxxxxxxxx # >= 3.5 Fixes: 4f8ad655dbc82cf05d2edc11e66b78a42d38bf93 Reported-by: Dan Duval <dan.duval@xxxxxxxxxx> Signed-off-by: Jan Kara <jack@xxxxxxx> --- fs/fs-writeback.c | 15 +++++++++------ 1 file changed, 9 insertions(+), 6 deletions(-) diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c index 1f4a10ece2f1..e0259a163f98 100644 --- a/fs/fs-writeback.c +++ b/fs/fs-writeback.c @@ -516,13 +516,16 @@ writeback_single_inode(struct inode *inode, struct bdi_writeback *wb, } WARN_ON(inode->i_state & I_SYNC); /* - * Skip inode if it is clean. We don't want to mess with writeback - * lists in this function since flusher thread may be doing for example - * sync in parallel and if we move the inode, it could get skipped. So - * here we make sure inode is on some writeback list and leave it there - * unless we have completely cleaned the inode. + * Skip inode if it is clean and we have no outstanding writeback in + * WB_SYNC_ALL mode. We don't want to mess with writeback lists in this + * function since flusher thread may be doing for example sync in + * parallel and if we move the inode, it could get skipped. So here we + * make sure inode is on some writeback list and leave it there unless + * we have completely cleaned the inode. */ - if (!(inode->i_state & I_DIRTY)) + if (!(inode->i_state & I_DIRTY) && + (wbc->sync_mode != WB_SYNC_ALL || + !mapping_tagged(inode->i_mapping, PAGECACHE_TAG_WRITEBACK))) goto out; inode->i_state |= I_SYNC; spin_unlock(&inode->i_lock); -- 1.8.1.4