On Thu, Mar 09 2017, Trond Myklebust wrote: > On Thu, 2017-03-09 at 15:46 +1100, NeilBrown wrote: >> On Thu, Mar 09 2017, NeilBrown wrote: >> >> > I've been chasing down a problem where a customer has a localhost >> > mount, >> > and the sequence >> > unmount -at nfs,nfs4 >> > stop nfsserver >> > sync >> > >> > hangs on the sync. The 'sync' is trying to write to the NFS >> > filesystem >> > that has just been unmounted. > > So why can't you move the sync up one line? > >> > I have duplicated the problem on a current mainline kernel. >> > >> > There are two important facts that lead to the explanation of this. >> > 1/ whenever a 'struct file' is open, an s_active reference is held >> > on >> > the superblock, via "open_context" calling nfs_sb_active(). >> > This doesn't prevent "unmount" from succeeding (i.e. EBUSY isn't >> > returned), but does prevent the actual unmount from happening >> > (->kill_sb() isn't called). >> > 2/ When a memory mapping of a file is torn down, the file is >> > "released", causing the context to be discarded and the >> > sb_active >> > reference released, but unlike close(2), file_operations- >> > >flush() >> > is not called. >> >> I realised that there is another piece of the puzzle. >> When a page is marked dirty (e.g. nfs_vm_page_mkwrite), >> nfs_updatepage() -> nfs_writepage_setup() >> creates a 'struct nfs_page' request which holds a reference to >> the open context, which in turn holds an active reference to the >> superblock. >> So as long as there are dirty pages, the superblock will not go >> inactive. All the rest still holds. >> >> NeilBrown >> >> >> > >> > Consequently, if you: >> > open an NFS file >> > mmap some pages PROT_WRITE >> > close the file >> > modify the pages >> > unmap the pages >> > unmount the filesystem >> > >> > the filesystem will remain active, and the pages will remain dirty. >> > If you then make the nfs server unavailable - e.g. stop it, or tear >> > down >> > the network connection - and then call 'sync', the sync will hang. >> > >> > This is surprising, at the least :-) >> > >> > I have two ideas how it might be fixed. >> > >> > One is to call nfs_file_flush() from within nfs_file_release(). >> > This is probably simplest (and appears to work). >> > >> > The other is to add a ".close" to nfs_file_vm_ops. This could >> > trigger a >> > (partial) flush whenever a page is unmapped. As closing an NFS >> > file >> > always triggers a flush, it seems reasonable that unmapping a page >> > would trigger a flush of that page. >> > >> > Thoughts? >> > > > All this is working as expected. The only way to ensure that data is > synced to disk by mmap() is to explicitly call msync(). This is well > documented, and should be well understood by application developers. If > those developers are ignoring the documentation, then I suggest votoing > with your feet and getting your software from someone else. Hi Trond, I don't understand how you can see this behaviour as acceptable. For any other filesystem, unmount is a synchronisation point. Assuming there are not bind mount or similar, and that MNT_DETACH is not used, an unmount will flush out any changes. It doesn't matter whether or not the application has called fsync or msync - the flush still happens on unmount. So with any other filesystem you can unmount and then remove the media and be sure that nothing is consistent and that no further writes will be attempted. With NFS as it currently is, unmount does not act as a synchronisation point, and writes can continue after the unmount. How is that acceptable? The patch below explains the problem and provides a simple fix. Thanks, NeilBrown From: NeilBrown <neilb@xxxxxxxx> Date: Fri, 10 Mar 2017 08:19:32 +1100 Subject: [PATCH] NFS: flush out dirty data on final fput(). Any dirty NFS page holds an s_active reference on the superblock, because page_private() references an nfs_page, which references an open context, which references the superblock. So if there are any dirty pages when the filesystem is unmounted, the unmount will act like a "lazy" unmount and not call ->kill_sb(). Background write-back can then write out the pages *after* the filesystem unmount has apparently completed. This contrasts with other filesystems which do not hold extra s_active references, so ->kill_sb() is reliably called on unmount, and generic_shutdown_super() will call sync_filesystem() to flush everything out before the unmount completes. When open/write/close is used to modify files, the final close causes f_op->flush to be called, which flushes all dirty pages. However if open/mmap/close/modify-memory/unmap is used, dirty pages can remain in memory after the application has dropped all references to the file. Fix this by calling vfs_fsync() in nfs_file_release (aka f_op->release()). This means that on the final unmap of a file, all changes are flushed, and ensures that when unmount is requested there will be no dirty pages to delay the final unmount. Without this patch, it is not safe to stop or disconnect the NFS server after all clients have unmounted. They need to unmount and call "sync". Signed-off-by: NeilBrown <neilb@xxxxxxxx> --- fs/nfs/file.c | 6 ++++++ 1 file changed, 6 insertions(+) diff --git a/fs/nfs/file.c b/fs/nfs/file.c index 668213984d68..d20cf9b3019b 100644 --- a/fs/nfs/file.c +++ b/fs/nfs/file.c @@ -81,6 +81,12 @@ nfs_file_release(struct inode *inode, struct file *filp) { dprintk("NFS: release(%pD2)\n", filp); + if (filp->f_mode & FMODE_WRITE) + /* Ensure dirty mmapped pages are flushed + * so there will be no dirty pages to + * prevent an unmount from completing. + */ + vfs_fsync(filp, 0); nfs_inc_stats(inode, NFSIOS_VFSRELEASE); nfs_file_clear_open_context(filp); return 0; -- 2.12.0
Attachment:
signature.asc
Description: PGP signature