On Mon, 2017-08-21 at 11:35 +1000, NeilBrown wrote:
> On Sat, Aug 19 2017, Trond Myklebust wrote:
> 
> > On Sat, 2017-08-19 at 11:02 +1000, NeilBrown wrote:
> > > On Fri, Aug 18 2017, Trond Myklebust wrote:
> > > 
> > > > On Fri, 2017-08-18 at 17:12 +1000, NeilBrown wrote:
> > > > > Any dirty NFS page holds an s_active reference on the
> > > > > superblock, because page_private() references an nfs_page,
> > > > > which references an open context, which references the
> > > > > superblock.
> > > > > 
> > > > > So if there are any dirty pages when the filesystem is
> > > > > unmounted, the unmount will act like a "lazy" unmount and
> > > > > not call ->kill_sb().  Background write-back can then write
> > > > > out the pages *after* the filesystem unmount has apparently
> > > > > completed.
> > > > > 
> > > > > This contrasts with other filesystems, which do not hold
> > > > > extra s_active references, so ->kill_sb() is reliably called
> > > > > on unmount, and generic_shutdown_super() will call
> > > > > sync_filesystem() to flush everything out before the unmount
> > > > > completes.
> > > > > 
> > > > > When open/write/close is used to modify files, the final
> > > > > close causes f_op->flush to be called, which flushes all
> > > > > dirty pages.  However, if open/mmap/close/modify-memory/unmap
> > > > > is used, dirty pages can remain in memory after the
> > > > > application has dropped all references to the file.
> > > > > Similarly, a loop-back mount of an NFS file does not "close"
> > > > > the file when the loop device is destroyed, so there is no
> > > > > final flush and the file can be left with dirty page cache
> > > > > pages after the last "close".
> > > > > 
> > > > > Fix this by calling vfs_fsync() in nfs_file_release() (aka
> > > > > f_op->release()).  This means that on the final unmap of a
> > > > > file (or destruction of a loop device), all changes are
> > > > > flushed, and ensures that when unmount is requested there
> > > > > will be no dirty pages to delay the final unmount.
> > > > > 
> > > > > Without this patch, it is not safe to stop or disconnect the
> > > > > NFS server after all clients have unmounted.  They need to
> > > > > unmount and call "sync".
> > > > > 
> > > > > Signed-off-by: NeilBrown <neilb@xxxxxxxx>
> > > > > ---
> > > > >  fs/nfs/file.c | 6 ++++++
> > > > >  1 file changed, 6 insertions(+)
> > > > > 
> > > > > diff --git a/fs/nfs/file.c b/fs/nfs/file.c
> > > > > index af330c31f627..aa883d8b24e6 100644
> > > > > --- a/fs/nfs/file.c
> > > > > +++ b/fs/nfs/file.c
> > > > > @@ -81,6 +81,12 @@ nfs_file_release(struct inode *inode, struct file *filp)
> > > > >  {
> > > > >  	dprintk("NFS: release(%pD2)\n", filp);
> > > > >  
> > > > > +	if (filp->f_mode & FMODE_WRITE)
> > > > > +		/* Ensure dirty mmapped pages are flushed
> > > > > +		 * so there will be no dirty pages to
> > > > > +		 * prevent an unmount from completing.
> > > > > +		 */
> > > > > +		vfs_fsync(filp, 0);
> > > > >  	nfs_inc_stats(inode, NFSIOS_VFSRELEASE);
> > > > >  	nfs_file_clear_open_context(filp);
> > > > >  	return 0;
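
As a concrete illustration of the open/mmap/close/modify-memory/unmap
sequence the changelog above describes, here is a minimal userspace
sketch; the mount point and file name are assumptions for
illustration, not taken from the thread:

/* Minimal sketch: dirty a page via a shared mapping after the last
 * close, then drop the mapping without msync()/fsync().  Assumes an
 * NFS mount at /mnt/nfs (hypothetical path).
 */
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
	int fd = open("/mnt/nfs/testfile", O_RDWR | O_CREAT, 0644);
	if (fd < 0)
		return 1;
	if (ftruncate(fd, 4096) < 0)
		return 1;

	char *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
		       MAP_SHARED, fd, 0);
	if (p == MAP_FAILED)
		return 1;

	close(fd);		/* f_op->flush runs; nothing is dirty yet */
	memset(p, 'x', 4096);	/* dirty the page after the last close */
	munmap(p, 4096);	/* final unmap: no flush without the patch */

	/* At exit the dirtied page can still sit in the page cache,
	 * holding an s_active reference that turns a later umount
	 * into an effectively lazy one.
	 */
	return 0;
}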
> > > > 
> > > > The right fix here is to ensure that umount() flushes the
> > > > dirty data, and that it aborts the umount attempt if the
> > > > flush fails.  Otherwise, a signal to the above fsync call
> > > > will cause the problem to reoccur.
> > > 
> > > The only way to ensure that umount flushes dirty data is to
> > > revert commit 1daef0a86837 ("NFS: Clean up
> > > nfs_sb_active/nfs_sb_deactive").  As long as we are taking
> > > extra references to s_active, umount won't do what it ought
> > > to do.
> > 
> > Reverting that patch is not sufficient. You'd still need to call
> > sync_filesystem(), and abort the umount if the latter fails.
> 
> generic_shutdown_super() (which nfs_kill_super() calls) already
> calls sync_filesystem().  However sync_filesystem() cannot fail -
> it just asks the bdi workers to flush data out, then waits for
> completion.
> 
> If the server is working, this should complete correctly.  If it
> isn't, this will block indefinitely, in the same way that a "sync"
> following the umount will block indefinitely today.
> 
> So while I agree that reverting this isn't sufficient, I think it
> might be a step in the right direction.
> 
> I've been experimenting with what happens when the server goes
> down and you try to unmount.  With NFSv4, the state manager can
> still be trying to send a RENEW and not want to let go until it
> gets an answer.  I haven't traced it all through yet to understand
> why or how significant this is.
> 
> > > I do wonder what we should do when you try to unmount a
> > > filesystem mounted from an inaccessible server.
> > > Blocking is awkward for scripts to work with.  Failing with
> > > EBUSY might be misleading.
> > > I don't think that using MNT_FORCE guarantees success, does
> > > it?  It would need to discard any dirty data and abort any
> > > other background tasks.
> > 
> > MNT_FORCE will only abort any outstanding RPC calls. It is
> > constrained by the fact that we don't serialise umount and
> > mount, and that the filesystem on which we are operating may be
> > mounted in several places in the main namespace, and indeed in
> > several private namespaces to boot.
> 
> Yes, that is a problem.
> If we could arrange that SIGKILL *always* kills the process so
> that it drops its reference to the filesystem, we could possibly
> delay the MNT_FORCE handling until deactivate_super(), but at
> present we sometimes need MNT_FORCE to allow KILLed processes to
> die, and while there are processes with file descriptors, umount
> will not get as far as deactivate_super().
> (We need filemap_fdatawait_range() to be killable to do better.)
> 

Indeed. I think we need to consider variants of these that are
killable, and replace the existing calls with the killable variants
in the appropriate places. That's a long-term (and difficult)
project though...

> > > I think I would be in favor of EBUSY rather than a hang, and
> > > of making MNT_FORCE a silver bullet provided there are no open
> > > file descriptors on the filesystem.
> > 
> > All other filesystems trigger sync_filesystem(), and that's what
> > the VFS expects us to do. The problem is that the VFS assumes
> > that call will always succeed, and won't ever return an error.
> 
> Yes, it is not easy to find an "ideal" solution.
> 
> Given that fact, and given that the present patch does address an
> easily demonstrated symptom, would you accept it as a small step
> in a good direction?  When the server is still accessible, and
> when no-one kills processes at awkward times, it definitely helps.
> 
> Even with a SIGKILL, I think the WRITEs and the COMMIT will be
> sent, but the COMMIT won't be waited for...
> 
> Thanks,
> NeilBrown
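
For reference, the umount-time flow under discussion: below is a
condensed, non-authoritative sketch of generic_shutdown_super() from
fs/super.c, with almost everything elided.  The point of interest is
that the return value of sync_filesystem() is discarded, which is
why the VFS cannot abort the umount when the flush fails:

/* Condensed sketch of generic_shutdown_super() (fs/super.c); most
 * of the real function is elided.  sync_filesystem()'s return value
 * is ignored, so a failed flush cannot stop the umount.
 */
void generic_shutdown_super(struct super_block *sb)
{
	if (sb->s_root) {
		shrink_dcache_for_umount(sb);
		sync_filesystem(sb);	/* return value discarded */
		/* ... evict inodes, call ->put_super(), etc. ... */
	}
	/* ... remove sb from the super_blocks list ... */
}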
-- 
Jeff Layton <jlayton@xxxxxxxxxx>