> On Jul 26, 2016, at 12:32, Benjamin Coddington <bcodding@xxxxxxxxxx> wrote: > > On 25 Jul 2016, at 14:41, Benjamin Coddington wrote: > >> On 25 Jul 2016, at 14:34, Trond Myklebust wrote: >> >>>> On Jul 25, 2016, at 14:26, Benjamin Coddington <bcodding@xxxxxxxxxx> wrote: >>>> >>>> >>>> >>>> On 25 Jul 2016, at 12:39, Trond Myklebust wrote: >>>> >>>>>> On Jul 25, 2016, at 12:26, Benjamin Coddington <bcodding@xxxxxxxxxx> wrote: >>>>>> >>>>>> On 21 Jul 2016, at 9:20, Trond Myklebust wrote: >>>>>> >>>>>>>> On Jul 21, 2016, at 09:05, Benjamin Coddington <bcodding@xxxxxxxxxx> wrote: >>>>>>>> >>>>>>>> So back to Christoph's point earlier: >>>>>>>> >>>>>>>> On 17 Jul 2016, at 23:48, Christoph Hellwig wrote: >>>>>>>>> This one breaks xfstests generic/207 on block/scsi layout for me. The >>>>>>>>> reason for that is that we need a layoutcommit after writing out all >>>>>>>>> data for the file for the file size to be updated on the server. >>>>>>>> >>>>>>>> You responded: >>>>>>>> >>>>>>>> On 18 Jul 2016, at 0:32, Trond Myklebust wrote: >>>>>>>>> I’m not understanding this argument. Why do we care if the file size is up >>>>>>>>> to date on the server if we’re not sending an actual GETATTR on the wire >>>>>>>>> to retrieve the file size? >>>>>>>> >>>>>>>> I guess the answer might be because we can get it back from the last >>>>>>>> LAYOUTCOMMIT. >>>>>>>> >>>>>>> >>>>>>> The patch that I followed up with should now ensure that we do not mark the attribute cache as up to date if there is a LAYOUTCOMMIT pending. >>>>>>> IOW: when the pNFS write is done, it is expected to do 2 things: >>>>>>> >>>>>>> 1) mark the inode for LAYOUTCOMMIT >>>>>>> 2) mark the attribute cache as invalid (because we know the change attribute, mtime, ctime need to be updates) >>>>>>> >>>>>>> In the case of blocks pNFS write: >>>>>>> The call to pnfs_set_layoutcommit() in pnfs_ld_write_done() should take care of (1) >>>>>>> The call to nfs_writeback_update_inode() in nfs4_write_done_cb() should take care of (2). >>>>>>> >>>>>>> Provided that these 2 calls are performed in the above order, then any call to nfs_getattr() which has not been preceded by a call to nfs4_proc_layoutcommit() should trigger the call to __nfs_revalidate_inode(). >>>>>> >>>>>> I think the problem is that a following nfs_getattr() will fail to notice >>>>>> the size change in the case of a write_completion and layoutcommit occuring >>>>>> after nfs_getattr() has done pnfs_sync_inode() but before it has done >>>>>> nfs_update_inode(). >>>>>> >>>>>> In the failing case there are two threads one is doing writes, the other >>>>>> doing lstat on aio_complete via io_getevents(2). >>>>>> >>>>>> For each write completion the lstat thread tries to verify the file size. >>>>>> >>>>>> GETATTR Thread LAYOUTCOMMIT Thread >>>>>> -------------- -------------------- >>>>>> write_completion sets LAYOUTCOMMIT (4096@0) >>>>>> --> nfs_getattr >>>>> >>>>> filemap_write_and_wait() >>>>> >>>>>> __nfs_revalidate_inode >>>>>> pnfs_sync_inode >>>>> >>>>> NFS_PROTO(inode)->getattr() >>>>> >>>>>> getattr sees 4096 >>>>>> write_completion sets LAYOUTCOMMIT (4096@4096) >>>>>> sets LAYOUTCOMMITING >>>>>> clears LAYOUTCOMMIT >>>>>> clears LAYOUTCOMMITTING >>>>>> nfs_refresh_inode >>>>>> nfs_update_inode size is 4096 >>>>>> <-- nfs_getattr >>>>>> >>>>>> At this point the cached attributes are seen as up to date, but >>>>>> aio-dio-extend-stat program expects that second write_completion to reflect >>>>>> in the file size. >>>>>> >>>>> >>>>> Why isn’t the filemap_write_and_wait() above resolving the race? I’d >>>>> expect that would move your “write completion sets LAYOUTCOMMIT” up to >>>>> before the pnfs_sync_inode(). In fact, in the patch that Christoph sent, >>>>> all he was doing was moving the pnfs_sync_inode() to immediately after >>>>> that filemap_write_and_wait() instead of relying on it in >>>>> _nfs_revalidate_inode. >>>> >>>> This is O_DIRECT, I've failed to mention yet. The second write hasn't made >>>> it out of __nfs_pageio_add_request() at the time filemap_write_and_wait() is >>>> called. It is sleeping in pnfs_update_layout() waiting on a LAYOUTGET and it >>>> doesn't resumes until after filemap_write_and_wait(). >>> >>> Wait, so you have 1 thread doing an O_DIRECT write() and another doing a >>> stat() in parallel? Why would there be an expectation that the filesystem >>> should serialise those system calls? >> >> Not exactly parallel, but synchronized on aio_complete. A comment in >> generic/207's src/aio-dio-regress/aio-dio-extend-stat.c: >> >> 36 /* >> 37 * This was originally submitted to >> 38 * http://bugzilla.kernel.org/show_bug.cgi?id=6831 by >> 39 * Rafal Wijata <wijata@xxxxxxxxxxxx>. It caught a race in dio aio completion >> 40 * that would call aio_complete() before the dio callers would update i_size. >> 41 * A stat after io_getevents() would not see the new file size. >> 42 * >> 43 * The bug was fixed in the fs/direct-io.c completion reworking that appeared >> 44 * in 2.6.20. This test should fail on 2.6.19. >> 45 */ >> >> As far as I can see, this check is the whole point of generic/207.. > > This would fix it up: > > diff --git a/fs/nfs/inode.c b/fs/nfs/inode.c > index f108d58101f8..823700f827b6 100644 > --- a/fs/nfs/inode.c > +++ b/fs/nfs/inode.c > @@ -661,6 +661,7 @@ int nfs_getattr(struct vfsmount *mnt, struct dentry > *dentry, struct kstat *stat) > > trace_nfs_getattr_enter(inode); > /* Flush out writes to the server in order to update c/mtime. */ > + nfs_start_io_read(inode); > if (S_ISREG(inode->i_mode)) { > err = filemap_write_and_wait(inode->i_mapping); > if (err) > @@ -694,6 +695,7 @@ int nfs_getattr(struct vfsmount *mnt, struct dentry > *dentry, struct kstat *stat) > stat->blksize = NFS_SERVER(inode)->dtsize; > } > out: > + nfs_end_io_read(inode); > trace_nfs_getattr_exit(inode, err); > return err; > } > > Trond, what do you think? I'll take any additional silence as a sign to go > elsewhere. :P No. The above locking excludes all writes as well as O_DIRECT reads… That’s worse than we had before. I’d like rather to understand _why_ the aio_complete() is failing to work correctly here. According to the analysis of the test case that you quoted yesterday, the O_DIRECT writes should have completed before we even call stat(). Cheers Trond ��.n��������+%������w��{.n�����{��w���jg��������ݢj����G�������j:+v���w�m������w�������h�����٥