Re: [PATCH 4/8] NFS: flush out dirty data on file fput().

On Sat, 2017-08-19 at 11:02 +1000, NeilBrown wrote:
> On Fri, Aug 18 2017, Trond Myklebust wrote:
> 
> > On Fri, 2017-08-18 at 17:12 +1000, NeilBrown wrote:
> > > Any dirty NFS page holds an s_active reference on the superblock,
> > > because page_private() references an nfs_page, which references an
> > > open context, which references the superblock.
> > > 
> > > So if there are any dirty pages when the filesystem is unmounted,
> > > the unmount will act like a "lazy" unmount and not call
> > > ->kill_sb().  Background write-back can then write out the pages
> > > *after* the filesystem unmount has apparently completed.
> > > 
> > > This contrasts with other filesystems, which do not hold extra
> > > s_active references, so ->kill_sb() is reliably called on unmount,
> > > and generic_shutdown_super() will call sync_filesystem() to flush
> > > everything out before the unmount completes.
> > > 
> > > When open/write/close is used to modify files, the final close
> > > causes f_op->flush to be called, which flushes all dirty pages.
> > > However, if open/mmap/close/modify-memory/unmap is used, dirty
> > > pages can remain in memory after the application has dropped all
> > > references to the file.  Similarly, if a loop mount is done with an
> > > NFS file, there is no final flush when the loop device is destroyed
> > > and the file can still have dirty data after the last "close".
> > > 
> > > Also, a loop-back mount of a device does not "close" the file when
> > > the loop device is destroyed.  This can leave dirty page cache
> > > pages too.
> > > 
> > > Fix this by calling vfs_fsync() in nfs_file_release() (aka
> > > f_op->release()).  This means that on the final unmap of a file
> > > (or destruction of a loop device) all changes are flushed, and it
> > > ensures that when unmount is requested there will be no dirty
> > > pages to delay the final unmount.
> > > 
> > > Without this patch, it is not safe to stop or disconnect the NFS
> > > server after all clients have merely unmounted; they need to
> > > unmount and then call "sync".
> > > 
> > > Signed-off-by: NeilBrown <neilb@xxxxxxxx>
> > > ---
> > >  fs/nfs/file.c |    6 ++++++
> > >  1 file changed, 6 insertions(+)
> > > 
> > > diff --git a/fs/nfs/file.c b/fs/nfs/file.c
> > > index af330c31f627..aa883d8b24e6 100644
> > > --- a/fs/nfs/file.c
> > > +++ b/fs/nfs/file.c
> > > @@ -81,6 +81,12 @@ nfs_file_release(struct inode *inode, struct file *filp)
> > >  {
> > >  	dprintk("NFS: release(%pD2)\n", filp);
> > >  
> > > +	if (filp->f_mode & FMODE_WRITE)
> > > +		/* Ensure dirty mmapped pages are flushed
> > > +		 * so there will be no dirty pages to
> > > +		 * prevent an unmount from completing.
> > > +		 */
> > > +		vfs_fsync(filp, 0);
> > >  	nfs_inc_stats(inode, NFSIOS_VFSRELEASE);
> > >  	nfs_file_clear_open_context(filp);
> > >  	return 0;
> > 
> > The right fix here is to ensure that umount() flushes the dirty
> > data, and that it aborts the umount attempt if the flush fails.
> > Otherwise, a signal to the above fsync call will cause the problem
> > to reoccur.
> 
> The only way to ensure that umount flushes dirty data is to revert
> commit 1daef0a86837 ("NFS: Clean up nfs_sb_active/nfs_sb_deactive").
> As long as we are taking extra references to s_active, umount won't
> do what it ought to do.
> 

Reverting that patch is not sufficient. You'd still need to call
sync_filesystem(), and abort the umount if the latter fails.

> I do wonder what we should do when you try to unmount a filesystem
> mounted from an inaccessible server.  Blocking is awkward for scripts
> to work with.  Failing with EBUSY might be misleading.  I don't think
> that using MNT_FORCE guarantees success, does it?  It would need to
> discard any dirty data and abort any other background tasks.

MNT_FORCE will only abort any outstanding RPC calls. It is constrained
by the fact that we don't serialise umount and mount, and that the
filesystem on which we are operating may be mounted in several places
in the main namespace and indeed in several private namespaces to boot.

> I think I would be in favor of EBUSY rather than a hang, and of
> making MNT_FORCE a silver bullet, provided there are no open file
> descriptors on the filesystem.

All other filesystems trigger sync_filesystem(), and that's what the
VFS expects us to do. The problem is the VFS assumes that call will
always succeed, and won't ever return an error.
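For what it's worth, the open/mmap/close/modify-memory/unmap sequence Neil
describes can be reproduced from userspace with a sketch along these lines
(the helper name and file path are illustrative, not part of the patch):

```c
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Dirty a file through a shared mapping *after* the last close().
 * Before the last unmap there is no fd left, so nothing triggers
 * f_op->flush; with the patch above, the final munmap() drops the
 * last reference to the struct file and f_op->release() runs
 * vfs_fsync().  Returns 0 on success, -1 on error. */
int dirty_after_close(const char *path)
{
	int fd = open(path, O_RDWR | O_CREAT, 0644);
	if (fd < 0)
		return -1;
	if (ftruncate(fd, 4096) < 0) {
		close(fd);
		return -1;
	}
	char *map = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
			 MAP_SHARED, fd, 0);
	if (map == MAP_FAILED) {
		close(fd);
		return -1;
	}
	close(fd);		 /* descriptor gone; mapping still holds
				  * a reference to the struct file */
	memcpy(map, "dirty", 5); /* dirties page cache after close() */
	return munmap(map, 4096); /* last unmap -> f_op->release() */
}
```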

-- 
Trond Myklebust
Linux NFS client maintainer, PrimaryData
trond.myklebust@xxxxxxxxxxxxxxx



