Re: [PATCH/RFC] NFS: add nostatflush mount option.

Trond Myklebust <trondmy@xxxxxxxxxxxxxxx> · Fri, 5 Jan 2018 01:34:02 +0000

Hi Neil,

On Tue, 2018-01-02 at 10:29 +1100, NeilBrown wrote:
> On Sat, Dec 23 2017, Jeff Layton wrote:
> 
> > On Fri, 2017-12-22 at 07:59 +1100, NeilBrown wrote:
> > > On Thu, Dec 21 2017, Trond Myklebust wrote:
> > > 
> > > > On Thu, 2017-12-21 at 10:39 -0500, Chuck Lever wrote:
> > > > > Hi Neil-
> > > > > 
> > > > > 
> > > > > > On Dec 20, 2017, at 9:57 PM, NeilBrown <neilb@xxxxxxxx>
> > > > > > wrote:
> > > > > > 
> > > > > > 
> > > > > > When an i_op->getattr() call is made on an NFS file
> > > > > > (typically from a 'stat' family system call), NFS
> > > > > > will first flush any dirty data to the server.
> > > > > > 
> > > > > > This ensures that the mtime reported is correct and stable,
> > > > > > but has a performance penalty.  'stat' is normally thought
> > > > > > to be a quick operation, and imposing this cost can be
> > > > > > surprising.
> > > > > 
> > > > > To be clear, this behavior is a POSIX requirement.
> > > > > 
> > > > > 
> > > > > > I have seen problems when one process is writing a large
> > > > > > file and another process performs "ls -l" on the containing
> > > > > > directory and is blocked for as long as it take to flush
> > > > > > all the dirty data to the server, which can be minutes.
> > > > > 
> > > > > Yes, a well-known annoyance that cannot be addressed
> > > > > even with a write delegation.
> > > > > 
> > > > > 
> > > > > > I have also seen a legacy application which frequently
> > > > > > calls
> > > > > > "fstat" on a file that it is writing to.  On a local
> > > > > > filesystem (and in the Solaris implementation of NFS) this
> > > > > > fstat call is cheap.  On Linux/NFS, the causes a noticeable
> > > > > > decrease in throughput.
> > > > > 
> > > > > If the preceding write is small, Linux could be using
> > > > > a FILE_SYNC write, but Solaris could be using UNSTABLE.
> > > > > 
> > > > > 
> > > > > > The only circumstances where an application calling
> > > > > > 'stat()'
> > > > > > might get an mtime which is not stable are times when some
> > > > > > other process is writing to the file and the two processes
> > > > > > are not using locking to ensure consistency, or when the
> > > > > > one
> > > > > > process is both writing and stating.  In neither of these
> > > > > > cases is it reasonable to expect the mtime to be stable.
> > > > > 
> > > > > I'm not convinced this is a strong enough rationale
> > > > > for claiming it is safe to disable the existing
> > > > > behavior.
> > > > > 
> > > > > You've explained cases where the new behavior is
> > > > > reasonable, but do you have any examples where the
> > > > > new behavior would be a problem? There must be a
> > > > > reason why POSIX explicitly requires an up-to-date
> > > > > mtime.
> > > > > 
> > > > > What guidance would nfs(5) give on when it is safe
> > > > > to specify the new mount option?
> > > > > 
> > > > > 
> > > > > > In the most common cases where mtime is important
> > > > > > (e.g. make), no other process has the file open, so there
> > > > > > will be no dirty data and the mtime will be stable.
> > > > > 
> > > > > Isn't it also the case that make is a multi-process
> > > > > workload where one process modifies a file, then
> > > > > closes it (which triggers a flush), and then another
> > > > > process stats the file? The new mount option does
> > > > > not change the behavior of close(2), does it?
> > > > > 
> > > > > 
> > > > > > Rather than unilaterally changing this behavior of 'stat',
> > > > > > this patch adds a "nosyncflush" mount option to allow
> > > > > > sysadmins to have applications which are hurt by the
> > > > > > current
> > > > > > behavior to disable it.
> > > > > 
> > > > > IMO a mount option is at the wrong granularity. A
> > > > > mount point will be shared between applications that
> > > > > can tolerate the non-POSIX behavior and those that
> > > > > cannot, for instance.
> > > > 
> > > > Agreed. 
> > > > 
> > > > The other thing to note here is that we now have an embryonic
> > > > statx()
> > > > system call, which allows the application itself to decide
> > > > whether or
> > > > not it needs up to date values for the atime/ctime/mtime. While
> > > > we
> > > > haven't yet plumbed in the NFS side, the intention was always
> > > > to use
> > > > that information to turn off the writeback flushing when
> > > > possible.
> > > 
> > > Yes, if statx() were actually working, we could change the
> > > application
> > > to avoid the flush.  But then if changing the application were an
> > > option, I suspect that - for my current customer issue - we could
> > > just
> > > remove the fstat() calls.  I doubt they are really necessary.
> > > I think programmers often think of stat() (and particularly
> > > fstat()) as
> > > fairly cheap and so they use it whenever convenient.  Only NFS
> > > violates
> > > this expectation.
> > > 
> > > Also statx() is only a real solution if/when it gets widely
> > > used.  Will
> > > "ls -l" default to AT_STATX_DONT_SYNC ??
> > > 
> > 
> > Maybe. Eventually, I could see glibc converting normal
> > stat/fstat/etc.
> > to use a statx() syscall under the hood (similar to how stat
> > syscalls on
> > 32-bit arches will use stat64 in most cases).
> > 
> > With that, we could look at any number of ways to sneak a "don't
> > flush"
> > flag into the call. Maybe an environment variable that causes the
> > stat
> > syscall wrapper to add it? I think there are possibilities there
> > that
> > don't necessarily require recompiling applications.
> 
> Thanks - interesting ideas.
> 
> One possibility would be an LD_PRELOAD which implements fstat() using
> statx().
> That doesn't address the "ls -l is needlessly slow" problem, but it
> would address the "legacy application calls fstat too often" problem.
> 
> This isn't an option for the "enterprise kernel" the customer is
> using
> (statx?  what is statx?), but having a clear view of a credible
> upstream solution is very helpful.
> 
> So thanks - and thanks a lot to Trond and Chuck for your input.  It
> helped clarify my thoughts a lot.
> 
> Is anyone working on proper statx support for NFS, or is it a case of
> "that shouldn't be hard and we should do that, but it isn't a high
> priority for anyone" ??

How about something like the following?

Cheers
  Trond

8<--------------------------------------------------------
From 755b6771deb8d793c90f56fddf7070d7c2ea87b5 Mon Sep 17 00:00:00 2001
From: Trond Myklebust <trond.myklebust@xxxxxxxxxxxxxxx>
Date: Thu, 4 Jan 2018 17:46:09 -0500
Subject: [PATCH] Support statx() mask and query flags parameters

Support the query flags AT_STATX_FORCE_SYNC by forcing an attribute
revalidation, and AT_STATX_DONT_SYNC by returning cached attributes
only.

Use the mask to optimise away server revalidation for attributes
that are not being requested by the user.

Signed-off-by: Trond Myklebust <trond.myklebust@xxxxxxxxxxxxxxx>
---
 fs/nfs/inode.c | 40 ++++++++++++++++++++++++++++++++++------
 1 file changed, 34 insertions(+), 6 deletions(-)

diff --git a/fs/nfs/inode.c b/fs/nfs/inode.c
index b112002dbdb2..a703b1d1500d 100644
--- a/fs/nfs/inode.c
+++ b/fs/nfs/inode.c
@@ -735,12 +735,22 @@ int nfs_getattr(const struct path *path, struct kstat *stat,
 		u32 request_mask, unsigned int query_flags)
 {
 	struct inode *inode = d_inode(path->dentry);
-	int need_atime = NFS_I(inode)->cache_validity & NFS_INO_INVALID_ATIME;
+	unsigned long cache_validity;
+	bool force_sync = query_flags & AT_STATX_FORCE_SYNC;
+	bool dont_sync = !force_sync && query_flags & AT_STATX_DONT_SYNC;
+	bool need_atime = !dont_sync;
+	bool need_cmtime = !dont_sync;
+	bool reval = force_sync;
 	int err = 0;
 
+	if (!(request_mask & STATX_ATIME))
+		need_atime = false;
+	if (!(request_mask & (STATX_CTIME|STATX_MTIME)))
+		need_cmtime = false;
+
 	trace_nfs_getattr_enter(inode);
 	/* Flush out writes to the server in order to update c/mtime.  */
-	if (S_ISREG(inode->i_mode)) {
+	if (S_ISREG(inode->i_mode) && need_cmtime) {
 		err = filemap_write_and_wait(inode->i_mapping);
 		if (err)
 			goto out;
@@ -757,9 +767,22 @@ int nfs_getattr(const struct path *path, struct kstat *stat,
 	 */
 	if ((path->mnt->mnt_flags & MNT_NOATIME) ||
 	    ((path->mnt->mnt_flags & MNT_NODIRATIME) && S_ISDIR(inode->i_mode)))
-		need_atime = 0;
-
-	if (need_atime || nfs_need_revalidate_inode(inode)) {
+		need_atime = false;
+
+	/* Check for whether the cached attributes are invalid */
+	cache_validity = READ_ONCE(NFS_I(inode)->cache_validity);
+	if (need_cmtime)
+		reval |= cache_validity & NFS_INO_REVAL_PAGECACHE;
+	if (need_atime)
+		reval |= cache_validity & NFS_INO_INVALID_ATIME;
+	if (request_mask & (STATX_MODE|STATX_NLINK|STATX_UID|STATX_GID|
+				STATX_ATIME|STATX_MTIME|STATX_CTIME|
+				STATX_SIZE|STATX_BLOCKS))
+		reval |= cache_validity & NFS_INO_INVALID_ATTR;
+	if (dont_sync)
+		reval = false;
+
+	if (reval) {
 		struct nfs_server *server = NFS_SERVER(inode);
 
 		if (!(server->flags & NFS_MOUNT_NOAC))
@@ -767,13 +790,18 @@ int nfs_getattr(const struct path *path, struct kstat *stat,
 		else
 			nfs_readdirplus_parent_cache_hit(path->dentry);
 		err = __nfs_revalidate_inode(server, inode);
-	} else
+	} else if (!dont_sync)
 		nfs_readdirplus_parent_cache_hit(path->dentry);
 	if (!err) {
 		generic_fillattr(inode, stat);
 		stat->ino = nfs_compat_user_ino64(NFS_FILEID(inode));
 		if (S_ISDIR(inode->i_mode))
 			stat->blksize = NFS_SERVER(inode)->dtsize;
+		/* Return only the requested attrs if others may be stale */
+		if (!reval && cache_validity & (NFS_INO_REVAL_PAGECACHE|
+					NFS_INO_INVALID_ATIME|
+					NFS_INO_INVALID_ATTR))
+			stat->result_mask &= request_mask;
 	}
 out:
 	trace_nfs_getattr_exit(inode, err);
-- 
2.14.3


-- 
Trond Myklebust
Linux NFS client maintainer, PrimaryData
trond.myklebust@xxxxxxxxxxxxxxx
Attachment:
signature.asc

Description: This is a digitally signed message part