> On Jul 26, 2016, at 13:57, Benjamin Coddington <bcodding@xxxxxxxxxx> wrote: > > On 26 Jul 2016, at 12:35, Trond Myklebust wrote: > >>> On Jul 26, 2016, at 12:32, Benjamin Coddington <bcodding@xxxxxxxxxx> wrote: >>> >>> On 25 Jul 2016, at 14:41, Benjamin Coddington wrote: >>> >>>> On 25 Jul 2016, at 14:34, Trond Myklebust wrote: >>>> >>>>>> On Jul 25, 2016, at 14:26, Benjamin Coddington <bcodding@xxxxxxxxxx> wrote: >>>>>> >>>>>> >>>>>> >>>>>> On 25 Jul 2016, at 12:39, Trond Myklebust wrote: >>>>>> >>>>>>>> On Jul 25, 2016, at 12:26, Benjamin Coddington <bcodding@xxxxxxxxxx> wrote: >>>>>>>> >>>>>>>> On 21 Jul 2016, at 9:20, Trond Myklebust wrote: >>>>>>>> >>>>>>>>>> On Jul 21, 2016, at 09:05, Benjamin Coddington <bcodding@xxxxxxxxxx> wrote: >>>>>>>>>> >>>>>>>>>> So back to Christoph's point earlier: >>>>>>>>>> >>>>>>>>>> On 17 Jul 2016, at 23:48, Christoph Hellwig wrote: >>>>>>>>>>> This one breaks xfstests generic/207 on block/scsi layout for me. The >>>>>>>>>>> reason for that is that we need a layoutcommit after writing out all >>>>>>>>>>> data for the file for the file size to be updated on the server. >>>>>>>>>> >>>>>>>>>> You responded: >>>>>>>>>> >>>>>>>>>> On 18 Jul 2016, at 0:32, Trond Myklebust wrote: >>>>>>>>>>> I’m not understanding this argument. Why do we care if the file size is up >>>>>>>>>>> to date on the server if we’re not sending an actual GETATTR on the wire >>>>>>>>>>> to retrieve the file size? >>>>>>>>>> >>>>>>>>>> I guess the answer might be because we can get it back from the last >>>>>>>>>> LAYOUTCOMMIT. >>>>>>>>>> >>>>>>>>> >>>>>>>>> The patch that I followed up with should now ensure that we do not mark the attribute cache as up to date if there is a LAYOUTCOMMIT pending. >>>>>>>>> IOW: when the pNFS write is done, it is expected to do 2 things: >>>>>>>>> >>>>>>>>> 1) mark the inode for LAYOUTCOMMIT >>>>>>>>> 2) mark the attribute cache as invalid (because we know the change attribute, mtime, ctime need to be updates) >>>>>>>>> >>>>>>>>> In the case of blocks pNFS write: >>>>>>>>> The call to pnfs_set_layoutcommit() in pnfs_ld_write_done() should take care of (1) >>>>>>>>> The call to nfs_writeback_update_inode() in nfs4_write_done_cb() should take care of (2). >>>>>>>>> >>>>>>>>> Provided that these 2 calls are performed in the above order, then any call to nfs_getattr() which has not been preceded by a call to nfs4_proc_layoutcommit() should trigger the call to __nfs_revalidate_inode(). >>>>>>>> >>>>>>>> I think the problem is that a following nfs_getattr() will fail to notice >>>>>>>> the size change in the case of a write_completion and layoutcommit occuring >>>>>>>> after nfs_getattr() has done pnfs_sync_inode() but before it has done >>>>>>>> nfs_update_inode(). >>>>>>>> >>>>>>>> In the failing case there are two threads one is doing writes, the other >>>>>>>> doing lstat on aio_complete via io_getevents(2). >>>>>>>> >>>>>>>> For each write completion the lstat thread tries to verify the file size. >>>>>>>> >>>>>>>> GETATTR Thread LAYOUTCOMMIT Thread >>>>>>>> -------------- -------------------- >>>>>>>> write_completion sets LAYOUTCOMMIT (4096@0) >>>>>>>> --> nfs_getattr >>>>>>> >>>>>>> filemap_write_and_wait() >>>>>>> >>>>>>>> __nfs_revalidate_inode >>>>>>>> pnfs_sync_inode >>>>>>> >>>>>>> NFS_PROTO(inode)->getattr() >>>>>>> >>>>>>>> getattr sees 4096 >>>>>>>> write_completion sets LAYOUTCOMMIT (4096@4096) >>>>>>>> sets LAYOUTCOMMITING >>>>>>>> clears LAYOUTCOMMIT >>>>>>>> clears LAYOUTCOMMITTING >>>>>>>> nfs_refresh_inode >>>>>>>> nfs_update_inode size is 4096 >>>>>>>> <-- nfs_getattr >>>>>>>> >>>>>>>> At this point the cached attributes are seen as up to date, but >>>>>>>> aio-dio-extend-stat program expects that second write_completion to reflect >>>>>>>> in the file size. >>>>>>>> >>>>>>> >>>>>>> Why isn’t the filemap_write_and_wait() above resolving the race? I’d >>>>>>> expect that would move your “write completion sets LAYOUTCOMMIT” up to >>>>>>> before the pnfs_sync_inode(). In fact, in the patch that Christoph sent, >>>>>>> all he was doing was moving the pnfs_sync_inode() to immediately after >>>>>>> that filemap_write_and_wait() instead of relying on it in >>>>>>> _nfs_revalidate_inode. >>>>>> >>>>>> This is O_DIRECT, I've failed to mention yet. The second write hasn't made >>>>>> it out of __nfs_pageio_add_request() at the time filemap_write_and_wait() is >>>>>> called. It is sleeping in pnfs_update_layout() waiting on a LAYOUTGET and it >>>>>> doesn't resumes until after filemap_write_and_wait(). >>>>> >>>>> Wait, so you have 1 thread doing an O_DIRECT write() and another doing a >>>>> stat() in parallel? Why would there be an expectation that the filesystem >>>>> should serialise those system calls? >>>> >>>> Not exactly parallel, but synchronized on aio_complete. A comment in >>>> generic/207's src/aio-dio-regress/aio-dio-extend-stat.c: >>>> >>>> 36 /* >>>> 37 * This was originally submitted to >>>> 38 * http://bugzilla.kernel.org/show_bug.cgi?id=6831 by >>>> 39 * Rafal Wijata <wijata@xxxxxxxxxxxx>. It caught a race in dio aio completion >>>> 40 * that would call aio_complete() before the dio callers would update i_size. >>>> 41 * A stat after io_getevents() would not see the new file size. >>>> 42 * >>>> 43 * The bug was fixed in the fs/direct-io.c completion reworking that appeared >>>> 44 * in 2.6.20. This test should fail on 2.6.19. >>>> 45 */ >>>> >>>> As far as I can see, this check is the whole point of generic/207.. >>> >>> This would fix it up: >>> >>> diff --git a/fs/nfs/inode.c b/fs/nfs/inode.c >>> index f108d58101f8..823700f827b6 100644 >>> --- a/fs/nfs/inode.c >>> +++ b/fs/nfs/inode.c >>> @@ -661,6 +661,7 @@ int nfs_getattr(struct vfsmount *mnt, struct dentry >>> *dentry, struct kstat *stat) >>> >>> trace_nfs_getattr_enter(inode); >>> /* Flush out writes to the server in order to update c/mtime. */ >>> + nfs_start_io_read(inode); >>> if (S_ISREG(inode->i_mode)) { >>> err = filemap_write_and_wait(inode->i_mapping); >>> if (err) >>> @@ -694,6 +695,7 @@ int nfs_getattr(struct vfsmount *mnt, struct dentry >>> *dentry, struct kstat *stat) >>> stat->blksize = NFS_SERVER(inode)->dtsize; >>> } >>> out: >>> + nfs_end_io_read(inode); >>> trace_nfs_getattr_exit(inode, err); >>> return err; >>> } >>> >>> Trond, what do you think? I'll take any additional silence as a sign to go >>> elsewhere. :P >> >> No. The above locking excludes all writes as well as O_DIRECT reads… That’s worse than we had before. >> >> I’d like rather to understand _why_ the aio_complete() is failing to work correctly here. According to the analysis of the test case that you quoted yesterday, the O_DIRECT writes should have completed before we even call stat(). >> >> Cheers >> Trond > > The O_DIRECT writes do complete, and every completion signals the other > thread to do stat(), but that completion does not update the size on the > server. As we know, we need a LAYOUTCOMMIT. After this patch, we're only > going to do a LAYOUTCOMMIT if nfs_need_revalidate_inode(inode). > So how is the completion happening? As far as I know, this is what is supposed to happen: - bl_write_cleanup() calls pnfs_ld_write_done(), - pnfs_ld_write_done then first calls pnfs_set_layoutcommit() and nfs_pgio_result() (which calls nfs_writeback_done() and eventually nfs_writeback_update_inode()) - pnfs_ld_write_done then calls nfs_pgio_release(), which again calls nfs_direct_write_completion(). Is something setting hdr->pnfs_error and preventing the call to pnfs_set_layoutcommit and nfs_writeback_done()? If not, then why is the call to nfs_writeback_update_inode() not setting the file size? > So what happens is that the first write completes, and the first > nfs_getattr() triggers the first LAYOUTCOMMIT, and then a GETATTR. > Simultaneously, the second write is waiting on a LAYOUTGET. The GETATTR > completes and sets the size to 4k. > > Now the attributes are marked up to date with a size of 4k, and when the > second write completes, nfs_getattr() is called, nfs_need_revalidate_inode() > is false, and we don't bother to send another LAYOUTCOMMIT or GETATTR to > correct the cached file size. > > Here's a function graph of that: http://people.redhat.com/bcodding/borken > > We could invalidate the attribute cache every time a write completes.. maybe > nfs_writeback_update_inode() isn't doing the job for block layouts (are we > not setting res.count? I'll look at that..) > > I think we could also use the LAYOUTCOMMIT results to invalidate the cache, > RFC-5661 18.42.3: > If the metadata server updates the file's size as the > result of the LAYOUTCOMMIT operation, it must return the new size > (locr_newsize.ns_size) as part of the results." > > I'm not sure you want to bother to try to reproduce -- but if so, you don't > need special hardware for SCSI layout. I wrote up a quick how-to for SCSI > layouts on a VM in qemu: > http://people.redhat.com/bcodding/pnfs/nfs/scsi/2016/07/13/pnfs_scsi_setup_for_VMs/ > > Ben ��.n��������+%������w��{.n�����{��w���jg��������ݢj����G�������j:+v���w�m������w�������h�����٥