Re: [PATCH v4 24/28] Getattr doesn't require data sync semantics

"Benjamin Coddington" <bcodding@xxxxxxxxxx> · Tue, 26 Jul 2016 13:57:40 -0400

On 26 Jul 2016, at 12:35, Trond Myklebust wrote:

On Jul 26, 2016, at 12:32, Benjamin Coddington <bcodding@xxxxxxxxxx> 
wrote:

On 25 Jul 2016, at 14:41, Benjamin Coddington wrote:

On 25 Jul 2016, at 14:34, Trond Myklebust wrote:

On Jul 25, 2016, at 14:26, Benjamin Coddington 
<bcodding@xxxxxxxxxx> wrote:



On 25 Jul 2016, at 12:39, Trond Myklebust wrote:

On Jul 25, 2016, at 12:26, Benjamin Coddington 
<bcodding@xxxxxxxxxx> wrote:

On 21 Jul 2016, at 9:20, Trond Myklebust wrote:

On Jul 21, 2016, at 09:05, Benjamin Coddington 
<bcodding@xxxxxxxxxx> wrote:

So back to Christoph's point earlier:

On 17 Jul 2016, at 23:48, Christoph Hellwig wrote:
This one breaks xfstests generic/207 on block/scsi layout for 
me.  The
reason for that is that we need a layoutcommit after writing 
out all
data for the file for the file size to be updated on the 
server.

You responded:

On 18 Jul 2016, at 0:32, Trond Myklebust wrote:
I’m not understanding this argument. Why do we care if the 
file size is up
to date on the server if we’re not sending an actual 
GETATTR on the wire
to retrieve the file size?

I guess the answer might be because we can get it back from 
the last
LAYOUTCOMMIT.


The patch that I followed up with should now ensure that we do 
not mark the attribute cache as up to date if there is a 
LAYOUTCOMMIT pending.
IOW: when the pNFS write is done, it is expected to do 2 
things:

1) mark the inode for LAYOUTCOMMIT
2) mark the attribute cache as invalid (because we know the 
change attribute, mtime, ctime need to be updates)

In the case of blocks pNFS write:
The call to pnfs_set_layoutcommit() in pnfs_ld_write_done() 
should take care of (1)
The call to nfs_writeback_update_inode() in 
nfs4_write_done_cb() should take care of (2).

Provided that these 2 calls are performed in the above order, 
then any call to nfs_getattr() which has not been preceded by a 
call to nfs4_proc_layoutcommit() should trigger the call to 
__nfs_revalidate_inode().

I think the problem is that a following nfs_getattr() will fail 
to notice
the size change in the case of a write_completion and 
layoutcommit occuring
after nfs_getattr() has done pnfs_sync_inode() but before it has 
done
nfs_update_inode().

In the failing case there are two threads one is doing writes, 
the other
doing lstat on aio_complete via io_getevents(2).

For each write completion the lstat thread tries to verify the 
file size.

GETATTR Thread                  LAYOUTCOMMIT Thread
--------------                  --------------------
                             write_completion sets LAYOUTCOMMIT 
(4096@0)
--> nfs_getattr

filemap_write_and_wait()

__nfs_revalidate_inode
pnfs_sync_inode

NFS_PROTO(inode)->getattr()

getattr sees 4096
                             write_completion sets LAYOUTCOMMIT 
(4096@4096)
                             sets LAYOUTCOMMITING
                             clears LAYOUTCOMMIT
                             clears LAYOUTCOMMITTING
nfs_refresh_inode
nfs_update_inode size is 4096
<-- nfs_getattr

At this point the cached attributes are seen as up to date, but
aio-dio-extend-stat program expects that second write_completion 
to reflect
in the file size.


Why isn’t the filemap_write_and_wait() above resolving the 
race? I’d
expect that would move your “write completion sets 
LAYOUTCOMMIT” up to
before the pnfs_sync_inode().  In fact, in the patch that 
Christoph sent,
all he was doing was moving the pnfs_sync_inode() to immediately 
after
that filemap_write_and_wait() instead of relying on it in
_nfs_revalidate_inode.

This is O_DIRECT, I've failed to mention yet.  The second write 
hasn't made
it out of __nfs_pageio_add_request() at the time 
filemap_write_and_wait() is
called.  It is sleeping in pnfs_update_layout() waiting on a 
LAYOUTGET and it
doesn't resumes until after filemap_write_and_wait().

Wait, so you have 1 thread doing an O_DIRECT write() and another 
doing a
stat() in parallel? Why would there be an expectation that the 
filesystem
should serialise those system calls?

Not exactly parallel, but synchronized on aio_complete.  A comment 
in
generic/207's src/aio-dio-regress/aio-dio-extend-stat.c:

36 /*
37  * This was originally submitted to
38  * http://bugzilla.kernel.org/show_bug.cgi?id=6831 by
39  * Rafal Wijata <wijata@xxxxxxxxxxxx>.  It caught a race in dio 
aio completion
40  * that would call aio_complete() before the dio callers would 
update i_size.
41  * A stat after io_getevents() would not see the new file size.
42  *
43  * The bug was fixed in the fs/direct-io.c completion reworking 
that appeared
44  * in 2.6.20.  This test should fail on 2.6.19.
45  */

As far as I can see, this check is the whole point of generic/207..

This would fix it up:

diff --git a/fs/nfs/inode.c b/fs/nfs/inode.c
index f108d58101f8..823700f827b6 100644
--- a/fs/nfs/inode.c
+++ b/fs/nfs/inode.c
@@ -661,6 +661,7 @@ int nfs_getattr(struct vfsmount *mnt, struct 
dentry
*dentry, struct kstat *stat)

       trace_nfs_getattr_enter(inode);
       /* Flush out writes to the server in order to update c/mtime.  
*/
+       nfs_start_io_read(inode);
       if (S_ISREG(inode->i_mode)) {
               err = filemap_write_and_wait(inode->i_mapping);
               if (err)
@@ -694,6 +695,7 @@ int nfs_getattr(struct vfsmount *mnt, struct 
dentry
*dentry, struct kstat *stat)
                       stat->blksize = NFS_SERVER(inode)->dtsize;
       }
out:
+       nfs_end_io_read(inode);
       trace_nfs_getattr_exit(inode, err);
       return err;
}

Trond, what do you think?  I'll take any additional silence as a sign 
to go
elsewhere.  :P

No. The above locking excludes all writes as well as O_DIRECT reads… 
That’s worse than we had before.

I’d like rather to understand _why_ the aio_complete() is failing to 
work correctly here. According to the analysis of the test case that 
you quoted yesterday, the O_DIRECT writes should have completed before 
we even call stat().

Cheers
  Trond

The O_DIRECT writes do complete, and every completion signals the other
thread to do stat(), but that completion does not update the size on the
server.  As we know, we need a LAYOUTCOMMIT.  After this patch, we're 
only
going to do a LAYOUTCOMMIT if nfs_need_revalidate_inode(inode).

So what happens is that the first write completes, and the first
nfs_getattr() triggers the first LAYOUTCOMMIT, and then a GETATTR.
Simultaneously, the second write is waiting on a LAYOUTGET.  The GETATTR
completes and sets the size to 4k.

Now the attributes are marked up to date with a size of 4k, and when the
second write completes, nfs_getattr() is called, 
nfs_need_revalidate_inode()
is false, and we don't bother to send another LAYOUTCOMMIT or GETATTR to
correct the cached file size.

Here's a function graph of that: 
http://people.redhat.com/bcodding/borken

We could invalidate the attribute cache every time a write completes.. 
maybe
nfs_writeback_update_inode() isn't doing the job for block layouts (are 
we
not setting res.count?  I'll look at that..)

I think we could also use the LAYOUTCOMMIT results to invalidate the 
cache,
RFC-5661 18.42.3:
   If the metadata server updates the file's size as the
   result of the LAYOUTCOMMIT operation, it must return the new size
   (locr_newsize.ns_size) as part of the results."

I'm not sure you want to bother to try to reproduce -- but if so, you 
don't
need special hardware for SCSI layout.  I wrote up a quick how-to for 
SCSI
layouts on a VM in qemu:
http://people.redhat.com/bcodding/pnfs/nfs/scsi/2016/07/13/pnfs_scsi_setup_for_VMs/

Ben
--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html