On Fri, Aug 18 2017, Trond Myklebust wrote:

> On Fri, 2017-08-18 at 17:12 +1000, NeilBrown wrote:
>> Any dirty NFS page holds an s_active reference on the superblock,
>> because page_private() references an nfs_page, which references an
>> open context, which references the superblock.
>>
>> So if there are any dirty pages when the filesystem is unmounted, the
>> unmount will act like a "lazy" unmount and not call ->kill_sb().
>> Background write-back can then write out the pages *after* the
>> filesystem unmount has apparently completed.
>>
>> This contrasts with other filesystems which do not hold extra
>> s_active references, so ->kill_sb() is reliably called on unmount,
>> and generic_shutdown_super() will call sync_filesystem() to flush
>> everything out before the unmount completes.
>>
>> When open/write/close is used to modify files, the final close causes
>> f_op->flush to be called, which flushes all dirty pages. However, if
>> open/mmap/close/modify-memory/unmap is used, dirty pages can remain
>> in memory after the application has dropped all references to the
>> file. Similarly, a loop-back mount of an NFS file does not "close"
>> the file when the loop device is destroyed, so there is no final
>> flush and dirty page cache pages can be left behind after the last
>> "close".
>>
>> Fix this by calling vfs_fsync() in nfs_file_release (aka
>> f_op->release()). This means that on the final unmap of a file (or
>> destruction of a loop device), all changes are flushed, which ensures
>> that when unmount is requested there will be no dirty pages to delay
>> the final unmount.
>>
>> Without this patch, it is not safe to stop or disconnect the NFS
>> server after all clients have unmounted. They need to unmount and
>> call "sync".
>>
>> Signed-off-by: NeilBrown <neilb@xxxxxxxx>
>> ---
>>  fs/nfs/file.c | 6 ++++++
>>  1 file changed, 6 insertions(+)
>>
>> diff --git a/fs/nfs/file.c b/fs/nfs/file.c
>> index af330c31f627..aa883d8b24e6 100644
>> --- a/fs/nfs/file.c
>> +++ b/fs/nfs/file.c
>> @@ -81,6 +81,12 @@ nfs_file_release(struct inode *inode, struct file *filp)
>>  {
>>  	dprintk("NFS: release(%pD2)\n", filp);
>>
>> +	if (filp->f_mode & FMODE_WRITE)
>> +		/* Ensure dirty mmapped pages are flushed
>> +		 * so there will be no dirty pages to
>> +		 * prevent an unmount from completing.
>> +		 */
>> +		vfs_fsync(filp, 0);
>>  	nfs_inc_stats(inode, NFSIOS_VFSRELEASE);
>>  	nfs_file_clear_open_context(filp);
>>  	return 0;
>
> The right fix here is to ensure that umount() flushes the dirty data,
> and that it aborts the umount attempt if the flush fails. Otherwise, a
> signal to the above fsync call will cause the problem to reoccur.

The only way to ensure that umount flushes dirty data is to revert
commit 1daef0a86837 ("NFS: Clean up nfs_sb_active/nfs_sb_deactive").
As long as we are taking extra references to s_active, umount won't
do what it ought to do.

I do wonder what we should do when you try to unmount a filesystem
mounted from an inaccessible server. Blocking is awkward for scripts
to work with. Failing with EBUSY might be misleading. I don't think
that using MNT_FORCE guarantees success, does it? It would need to
discard any dirty data and abort any other background tasks.

I think I would be in favor of EBUSY rather than a hang, and of making
MNT_FORCE a silver bullet, provided there are no open file descriptors
on the filesystem.

Thanks,
NeilBrown
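
As a user-space illustration of the open/mmap/close/modify-memory/unmap
sequence described above, here is a minimal sketch (the path
/mnt/nfs/testfile is an assumed placeholder for any file on a writable
NFS mount). The page is dirtied only after the last close(), so nothing
flushes it until the final munmap() drops the last reference to the
struct file and f_op->release runs, which is exactly where the patch
adds vfs_fsync():

#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
	/* Placeholder path: any file on a writable NFS mount. */
	int fd = open("/mnt/nfs/testfile", O_RDWR | O_CREAT, 0644);
	if (fd < 0)
		return 1;

	/* Make sure the first page exists so the store won't SIGBUS. */
	if (ftruncate(fd, 4096) < 0)
		return 1;

	char *map = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
			 MAP_SHARED, fd, 0);
	if (map == MAP_FAILED)
		return 1;

	/* close() runs f_op->flush now, while nothing is dirty yet;
	 * the vma still holds a reference to the struct file. */
	close(fd);

	/* Dirty the page after the close: no msync(), so nothing
	 * else will flush it. */
	memcpy(map, "dirty", 5);

	/* Final unmap drops the last file reference (f_op->release).
	 * Without the patch the page stays dirty in the page cache. */
	munmap(map, 4096);
	return 0;
}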