> On Jul 25, 2016, at 14:26, Benjamin Coddington <bcodding@xxxxxxxxxx> wrote: > > > > On 25 Jul 2016, at 12:39, Trond Myklebust wrote: > >>> On Jul 25, 2016, at 12:26, Benjamin Coddington <bcodding@xxxxxxxxxx> wrote: >>> >>> On 21 Jul 2016, at 9:20, Trond Myklebust wrote: >>> >>>>> On Jul 21, 2016, at 09:05, Benjamin Coddington <bcodding@xxxxxxxxxx> wrote: >>>>> >>>>> So back to Christoph's point earlier: >>>>> >>>>> On 17 Jul 2016, at 23:48, Christoph Hellwig wrote: >>>>>> This one breaks xfstests generic/207 on block/scsi layout for me. The >>>>>> reason for that is that we need a layoutcommit after writing out all >>>>>> data for the file for the file size to be updated on the server. >>>>> >>>>> You responded: >>>>> >>>>> On 18 Jul 2016, at 0:32, Trond Myklebust wrote: >>>>>> I’m not understanding this argument. Why do we care if the file size is up >>>>>> to date on the server if we’re not sending an actual GETATTR on the wire >>>>>> to retrieve the file size? >>>>> >>>>> I guess the answer might be because we can get it back from the last >>>>> LAYOUTCOMMIT. >>>>> >>>> >>>> The patch that I followed up with should now ensure that we do not mark the attribute cache as up to date if there is a LAYOUTCOMMIT pending. >>>> IOW: when the pNFS write is done, it is expected to do 2 things: >>>> >>>> 1) mark the inode for LAYOUTCOMMIT >>>> 2) mark the attribute cache as invalid (because we know the change attribute, mtime, ctime need to be updates) >>>> >>>> In the case of blocks pNFS write: >>>> The call to pnfs_set_layoutcommit() in pnfs_ld_write_done() should take care of (1) >>>> The call to nfs_writeback_update_inode() in nfs4_write_done_cb() should take care of (2). >>>> >>>> Provided that these 2 calls are performed in the above order, then any call to nfs_getattr() which has not been preceded by a call to nfs4_proc_layoutcommit() should trigger the call to __nfs_revalidate_inode(). >>> >>> I think the problem is that a following nfs_getattr() will fail to notice >>> the size change in the case of a write_completion and layoutcommit occuring >>> after nfs_getattr() has done pnfs_sync_inode() but before it has done >>> nfs_update_inode(). >>> >>> In the failing case there are two threads one is doing writes, the other >>> doing lstat on aio_complete via io_getevents(2). >>> >>> For each write completion the lstat thread tries to verify the file size. >>> >>> GETATTR Thread LAYOUTCOMMIT Thread >>> -------------- -------------------- >>> write_completion sets LAYOUTCOMMIT (4096@0) >>> --> nfs_getattr >> >> filemap_write_and_wait() >> >>> __nfs_revalidate_inode >>> pnfs_sync_inode >> >> NFS_PROTO(inode)->getattr() >> >>> getattr sees 4096 >>> write_completion sets LAYOUTCOMMIT (4096@4096) >>> sets LAYOUTCOMMITING >>> clears LAYOUTCOMMIT >>> clears LAYOUTCOMMITTING >>> nfs_refresh_inode >>> nfs_update_inode size is 4096 >>> <-- nfs_getattr >>> >>> At this point the cached attributes are seen as up to date, but >>> aio-dio-extend-stat program expects that second write_completion to reflect >>> in the file size. >>> >> >> Why isn’t the filemap_write_and_wait() above resolving the race? I’d >> expect that would move your “write completion sets LAYOUTCOMMIT” up to >> before the pnfs_sync_inode(). In fact, in the patch that Christoph sent, >> all he was doing was moving the pnfs_sync_inode() to immediately after >> that filemap_write_and_wait() instead of relying on it in >> _nfs_revalidate_inode. > > This is O_DIRECT, I've failed to mention yet. The second write hasn't made > it out of __nfs_pageio_add_request() at the time filemap_write_and_wait() is > called. It is sleeping in pnfs_update_layout() waiting on a LAYOUTGET and it > doesn't resumes until after filemap_write_and_wait(). Wait, so you have 1 thread doing an O_DIRECT write() and another doing a stat() in parallel? Why would there be an expectation that the filesystem should serialise those system calls? ��.n��������+%������w��{.n�����{��w���jg��������ݢj����G�������j:+v���w�m������w�������h�����٥