Re: [PATCH v4 24/28] Getattr doesn't require data sync semantics

Trond Myklebust <trondmy@xxxxxxxxxxxxxxx> · Tue, 26 Jul 2016 18:07:51 +0000

> On Jul 26, 2016, at 13:57, Benjamin Coddington <bcodding@xxxxxxxxxx> wrote:
> 
> On 26 Jul 2016, at 12:35, Trond Myklebust wrote:
> 
>>> On Jul 26, 2016, at 12:32, Benjamin Coddington <bcodding@xxxxxxxxxx> wrote:
>>> 
>>> On 25 Jul 2016, at 14:41, Benjamin Coddington wrote:
>>> 
>>>> On 25 Jul 2016, at 14:34, Trond Myklebust wrote:
>>>> 
>>>>>> On Jul 25, 2016, at 14:26, Benjamin Coddington <bcodding@xxxxxxxxxx> wrote:
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> On 25 Jul 2016, at 12:39, Trond Myklebust wrote:
>>>>>> 
>>>>>>>> On Jul 25, 2016, at 12:26, Benjamin Coddington <bcodding@xxxxxxxxxx> wrote:
>>>>>>>> 
>>>>>>>> On 21 Jul 2016, at 9:20, Trond Myklebust wrote:
>>>>>>>> 
>>>>>>>>>> On Jul 21, 2016, at 09:05, Benjamin Coddington <bcodding@xxxxxxxxxx> wrote:
>>>>>>>>>> 
>>>>>>>>>> So back to Christoph's point earlier:
>>>>>>>>>> 
>>>>>>>>>> On 17 Jul 2016, at 23:48, Christoph Hellwig wrote:
>>>>>>>>>>> This one breaks xfstests generic/207 on block/scsi layout for me.  The
>>>>>>>>>>> reason for that is that we need a layoutcommit after writing out all
>>>>>>>>>>> data for the file for the file size to be updated on the server.
>>>>>>>>>> 
>>>>>>>>>> You responded:
>>>>>>>>>> 
>>>>>>>>>> On 18 Jul 2016, at 0:32, Trond Myklebust wrote:
>>>>>>>>>>> I’m not understanding this argument. Why do we care if the file size is up
>>>>>>>>>>> to date on the server if we’re not sending an actual GETATTR on the wire
>>>>>>>>>>> to retrieve the file size?
>>>>>>>>>> 
>>>>>>>>>> I guess the answer might be because we can get it back from the last
>>>>>>>>>> LAYOUTCOMMIT.
>>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> The patch that I followed up with should now ensure that we do not mark the attribute cache as up to date if there is a LAYOUTCOMMIT pending.
>>>>>>>>> IOW: when the pNFS write is done, it is expected to do 2 things:
>>>>>>>>> 
>>>>>>>>> 1) mark the inode for LAYOUTCOMMIT
>>>>>>>>> 2) mark the attribute cache as invalid (because we know the change attribute, mtime, ctime need to be updates)
>>>>>>>>> 
>>>>>>>>> In the case of blocks pNFS write:
>>>>>>>>> The call to pnfs_set_layoutcommit() in pnfs_ld_write_done() should take care of (1)
>>>>>>>>> The call to nfs_writeback_update_inode() in nfs4_write_done_cb() should take care of (2).
>>>>>>>>> 
>>>>>>>>> Provided that these 2 calls are performed in the above order, then any call to nfs_getattr() which has not been preceded by a call to nfs4_proc_layoutcommit() should trigger the call to __nfs_revalidate_inode().
>>>>>>>> 
>>>>>>>> I think the problem is that a following nfs_getattr() will fail to notice
>>>>>>>> the size change in the case of a write_completion and layoutcommit occuring
>>>>>>>> after nfs_getattr() has done pnfs_sync_inode() but before it has done
>>>>>>>> nfs_update_inode().
>>>>>>>> 
>>>>>>>> In the failing case there are two threads one is doing writes, the other
>>>>>>>> doing lstat on aio_complete via io_getevents(2).
>>>>>>>> 
>>>>>>>> For each write completion the lstat thread tries to verify the file size.
>>>>>>>> 
>>>>>>>> GETATTR Thread                  LAYOUTCOMMIT Thread
>>>>>>>> --------------                  --------------------
>>>>>>>>                             write_completion sets LAYOUTCOMMIT (4096@0)
>>>>>>>> --> nfs_getattr
>>>>>>> 
>>>>>>> filemap_write_and_wait()
>>>>>>> 
>>>>>>>> __nfs_revalidate_inode
>>>>>>>> pnfs_sync_inode
>>>>>>> 
>>>>>>> NFS_PROTO(inode)->getattr()
>>>>>>> 
>>>>>>>> getattr sees 4096
>>>>>>>>                             write_completion sets LAYOUTCOMMIT (4096@4096)
>>>>>>>>                             sets LAYOUTCOMMITING
>>>>>>>>                             clears LAYOUTCOMMIT
>>>>>>>>                             clears LAYOUTCOMMITTING
>>>>>>>> nfs_refresh_inode
>>>>>>>> nfs_update_inode size is 4096
>>>>>>>> <-- nfs_getattr
>>>>>>>> 
>>>>>>>> At this point the cached attributes are seen as up to date, but
>>>>>>>> aio-dio-extend-stat program expects that second write_completion to reflect
>>>>>>>> in the file size.
>>>>>>>> 
>>>>>>> 
>>>>>>> Why isn’t the filemap_write_and_wait() above resolving the race? I’d
>>>>>>> expect that would move your “write completion sets LAYOUTCOMMIT” up to
>>>>>>> before the pnfs_sync_inode().  In fact, in the patch that Christoph sent,
>>>>>>> all he was doing was moving the pnfs_sync_inode() to immediately after
>>>>>>> that filemap_write_and_wait() instead of relying on it in
>>>>>>> _nfs_revalidate_inode.
>>>>>> 
>>>>>> This is O_DIRECT, I've failed to mention yet.  The second write hasn't made
>>>>>> it out of __nfs_pageio_add_request() at the time filemap_write_and_wait() is
>>>>>> called.  It is sleeping in pnfs_update_layout() waiting on a LAYOUTGET and it
>>>>>> doesn't resumes until after filemap_write_and_wait().
>>>>> 
>>>>> Wait, so you have 1 thread doing an O_DIRECT write() and another doing a
>>>>> stat() in parallel? Why would there be an expectation that the filesystem
>>>>> should serialise those system calls?
>>>> 
>>>> Not exactly parallel, but synchronized on aio_complete.  A comment in
>>>> generic/207's src/aio-dio-regress/aio-dio-extend-stat.c:
>>>> 
>>>> 36 /*
>>>> 37  * This was originally submitted to
>>>> 38  * http://bugzilla.kernel.org/show_bug.cgi?id=6831 by
>>>> 39  * Rafal Wijata <wijata@xxxxxxxxxxxx>.  It caught a race in dio aio completion
>>>> 40  * that would call aio_complete() before the dio callers would update i_size.
>>>> 41  * A stat after io_getevents() would not see the new file size.
>>>> 42  *
>>>> 43  * The bug was fixed in the fs/direct-io.c completion reworking that appeared
>>>> 44  * in 2.6.20.  This test should fail on 2.6.19.
>>>> 45  */
>>>> 
>>>> As far as I can see, this check is the whole point of generic/207..
>>> 
>>> This would fix it up:
>>> 
>>> diff --git a/fs/nfs/inode.c b/fs/nfs/inode.c
>>> index f108d58101f8..823700f827b6 100644
>>> --- a/fs/nfs/inode.c
>>> +++ b/fs/nfs/inode.c
>>> @@ -661,6 +661,7 @@ int nfs_getattr(struct vfsmount *mnt, struct dentry
>>> *dentry, struct kstat *stat)
>>> 
>>>       trace_nfs_getattr_enter(inode);
>>>       /* Flush out writes to the server in order to update c/mtime.  */
>>> +       nfs_start_io_read(inode);
>>>       if (S_ISREG(inode->i_mode)) {
>>>               err = filemap_write_and_wait(inode->i_mapping);
>>>               if (err)
>>> @@ -694,6 +695,7 @@ int nfs_getattr(struct vfsmount *mnt, struct dentry
>>> *dentry, struct kstat *stat)
>>>                       stat->blksize = NFS_SERVER(inode)->dtsize;
>>>       }
>>> out:
>>> +       nfs_end_io_read(inode);
>>>       trace_nfs_getattr_exit(inode, err);
>>>       return err;
>>> }
>>> 
>>> Trond, what do you think?  I'll take any additional silence as a sign to go
>>> elsewhere.  :P
>> 
>> No. The above locking excludes all writes as well as O_DIRECT reads… That’s worse than we had before.
>> 
>> I’d like rather to understand _why_ the aio_complete() is failing to work correctly here. According to the analysis of the test case that you quoted yesterday, the O_DIRECT writes should have completed before we even call stat().
>> 
>> Cheers
>>  Trond
> 
> The O_DIRECT writes do complete, and every completion signals the other
> thread to do stat(), but that completion does not update the size on the
> server.  As we know, we need a LAYOUTCOMMIT.  After this patch, we're only
> going to do a LAYOUTCOMMIT if nfs_need_revalidate_inode(inode).
> 

So how is the completion happening? As far as I know, this is what is supposed to happen:

- bl_write_cleanup() calls pnfs_ld_write_done(),
    - pnfs_ld_write_done then first calls pnfs_set_layoutcommit() and nfs_pgio_result() (which calls nfs_writeback_done() and eventually nfs_writeback_update_inode())
    - pnfs_ld_write_done then calls nfs_pgio_release(), which again calls nfs_direct_write_completion().

Is something setting hdr->pnfs_error and preventing the call to pnfs_set_layoutcommit and nfs_writeback_done()? If not, then why is the call to nfs_writeback_update_inode() not setting the file size?

> So what happens is that the first write completes, and the first
> nfs_getattr() triggers the first LAYOUTCOMMIT, and then a GETATTR.
> Simultaneously, the second write is waiting on a LAYOUTGET.  The GETATTR
> completes and sets the size to 4k.
> 
> Now the attributes are marked up to date with a size of 4k, and when the
> second write completes, nfs_getattr() is called, nfs_need_revalidate_inode()
> is false, and we don't bother to send another LAYOUTCOMMIT or GETATTR to
> correct the cached file size.
> 
> Here's a function graph of that: http://people.redhat.com/bcodding/borken
> 
> We could invalidate the attribute cache every time a write completes.. maybe
> nfs_writeback_update_inode() isn't doing the job for block layouts (are we
> not setting res.count?  I'll look at that..)
> 
> I think we could also use the LAYOUTCOMMIT results to invalidate the cache,
> RFC-5661 18.42.3:
>   If the metadata server updates the file's size as the
>   result of the LAYOUTCOMMIT operation, it must return the new size
>   (locr_newsize.ns_size) as part of the results."
> 
> I'm not sure you want to bother to try to reproduce -- but if so, you don't
> need special hardware for SCSI layout.  I wrote up a quick how-to for SCSI
> layouts on a VM in qemu:
> http://people.redhat.com/bcodding/pnfs/nfs/scsi/2016/07/13/pnfs_scsi_setup_for_VMs/
> 
> Ben

��.n��������+%������w��{.n�����{��w���jg��������ݢj����G�������j:+v���w�m������w�������h�����٥