Hi Neil,

On Tue, 2018-01-02 at 10:29 +1100, NeilBrown wrote:
> On Sat, Dec 23 2017, Jeff Layton wrote:
>
> > On Fri, 2017-12-22 at 07:59 +1100, NeilBrown wrote:
> > > On Thu, Dec 21 2017, Trond Myklebust wrote:
> > >
> > > > On Thu, 2017-12-21 at 10:39 -0500, Chuck Lever wrote:
> > > > > Hi Neil-
> > > > >
> > > > > > On Dec 20, 2017, at 9:57 PM, NeilBrown <neilb@xxxxxxxx> wrote:
> > > > > >
> > > > > > When an i_op->getattr() call is made on an NFS file
> > > > > > (typically from a 'stat' family system call), NFS
> > > > > > will first flush any dirty data to the server.
> > > > > >
> > > > > > This ensures that the mtime reported is correct and stable,
> > > > > > but has a performance penalty. 'stat' is normally thought
> > > > > > to be a quick operation, and imposing this cost can be
> > > > > > surprising.
> > > > >
> > > > > To be clear, this behavior is a POSIX requirement.
> > > > >
> > > > > > I have seen problems when one process is writing a large
> > > > > > file and another process performs "ls -l" on the containing
> > > > > > directory and is blocked for as long as it takes to flush
> > > > > > all the dirty data to the server, which can be minutes.
> > > > >
> > > > > Yes, a well-known annoyance that cannot be addressed
> > > > > even with a write delegation.
> > > > >
> > > > > > I have also seen a legacy application which frequently calls
> > > > > > "fstat" on a file that it is writing to. On a local
> > > > > > filesystem (and in the Solaris implementation of NFS) this
> > > > > > fstat call is cheap. On Linux/NFS, this causes a noticeable
> > > > > > decrease in throughput.
> > > > >
> > > > > If the preceding write is small, Linux could be using
> > > > > a FILE_SYNC write, but Solaris could be using UNSTABLE.
> > > > >
> > > > > > The only circumstances where an application calling 'stat()'
> > > > > > might get an mtime which is not stable are times when some
> > > > > > other process is writing to the file and the two processes
> > > > > > are not using locking to ensure consistency, or when the one
> > > > > > process is both writing and stating. In neither of these
> > > > > > cases is it reasonable to expect the mtime to be stable.
> > > > >
> > > > > I'm not convinced this is a strong enough rationale
> > > > > for claiming it is safe to disable the existing
> > > > > behavior.
> > > > >
> > > > > You've explained cases where the new behavior is
> > > > > reasonable, but do you have any examples where the
> > > > > new behavior would be a problem? There must be a
> > > > > reason why POSIX explicitly requires an up-to-date
> > > > > mtime.
> > > > >
> > > > > What guidance would nfs(5) give on when it is safe
> > > > > to specify the new mount option?
> > > > >
> > > > > > In the most common cases where mtime is important
> > > > > > (e.g. make), no other process has the file open, so there
> > > > > > will be no dirty data and the mtime will be stable.
> > > > >
> > > > > Isn't it also the case that make is a multi-process
> > > > > workload where one process modifies a file, then
> > > > > closes it (which triggers a flush), and then another
> > > > > process stats the file? The new mount option does
> > > > > not change the behavior of close(2), does it?
> > > > >
> > > > > > Rather than unilaterally changing this behavior of 'stat',
> > > > > > this patch adds a "nosyncflush" mount option to allow
> > > > > > sysadmins to have applications which are hurt by the current
> > > > > > behavior to disable it.
> > > > >
> > > > > IMO a mount option is at the wrong granularity. A
> > > > > mount point will be shared between applications that
> > > > > can tolerate the non-POSIX behavior and those that
> > > > > cannot, for instance.
> > > >
> > > > Agreed.
> > > >
> > > > The other thing to note here is that we now have an embryonic
> > > > statx() system call, which allows the application itself to
> > > > decide whether or not it needs up to date values for the
> > > > atime/ctime/mtime. While we haven't yet plumbed in the NFS
> > > > side, the intention was always to use that information to turn
> > > > off the writeback flushing when possible.
> > >
> > > Yes, if statx() were actually working, we could change the
> > > application to avoid the flush. But then if changing the
> > > application were an option, I suspect that - for my current
> > > customer issue - we could just remove the fstat() calls. I doubt
> > > they are really necessary.
> > > I think programmers often think of stat() (and particularly
> > > fstat()) as fairly cheap and so they use it whenever convenient.
> > > Only NFS violates this expectation.
> > >
> > > Also statx() is only a real solution if/when it gets widely
> > > used. Will "ls -l" default to AT_STATX_DONT_SYNC ??
> > >
> >
> > Maybe. Eventually, I could see glibc converting normal
> > stat/fstat/etc. to use a statx() syscall under the hood (similar to
> > how stat syscalls on 32-bit arches will use stat64 in most cases).
> >
> > With that, we could look at any number of ways to sneak a "don't
> > flush" flag into the call. Maybe an environment variable that causes
> > the stat syscall wrapper to add it? I think there are possibilities
> > there that don't necessarily require recompiling applications.
>
> Thanks - interesting ideas.
>
> One possibility would be an LD_PRELOAD which implements fstat() using
> statx().
> That doesn't address the "ls -l is needlessly slow" problem, but it
> would address the "legacy application calls fstat too often" problem.
>
> This isn't an option for the "enterprise kernel" the customer is using
> (statx? what is statx?), but having a clear view of a credible
> upstream solution is very helpful.
>
> So thanks - and thanks a lot to Trond and Chuck for your input. It
> helped clarify my thoughts a lot.
>
> Is anyone working on proper statx support for NFS, or is it a case of
> "that shouldn't be hard and we should do that, but it isn't a high
> priority for anyone" ??

How about something like the following?

Cheers
  Trond

8<--------------------------------------------------------
From 755b6771deb8d793c90f56fddf7070d7c2ea87b5 Mon Sep 17 00:00:00 2001
From: Trond Myklebust <trond.myklebust@xxxxxxxxxxxxxxx>
Date: Thu, 4 Jan 2018 17:46:09 -0500
Subject: [PATCH] Support statx() mask and query flags parameters

Support the query flags AT_STATX_FORCE_SYNC by forcing an attribute
revalidation, and AT_STATX_DONT_SYNC by returning cached attributes
only.

Use the mask to optimise away server revalidation for attributes
that are not being requested by the user.

Signed-off-by: Trond Myklebust <trond.myklebust@xxxxxxxxxxxxxxx>
---
 fs/nfs/inode.c | 40 ++++++++++++++++++++++++++++++++++------
 1 file changed, 34 insertions(+), 6 deletions(-)

diff --git a/fs/nfs/inode.c b/fs/nfs/inode.c
index b112002dbdb2..a703b1d1500d 100644
--- a/fs/nfs/inode.c
+++ b/fs/nfs/inode.c
@@ -735,12 +735,22 @@ int nfs_getattr(const struct path *path, struct kstat *stat,
 		u32 request_mask, unsigned int query_flags)
 {
 	struct inode *inode = d_inode(path->dentry);
-	int need_atime = NFS_I(inode)->cache_validity & NFS_INO_INVALID_ATIME;
+	unsigned long cache_validity;
+	bool force_sync = query_flags & AT_STATX_FORCE_SYNC;
+	bool dont_sync = !force_sync && query_flags & AT_STATX_DONT_SYNC;
+	bool need_atime = !dont_sync;
+	bool need_cmtime = !dont_sync;
+	bool reval = force_sync;
 	int err = 0;
 
+	if (!(request_mask & STATX_ATIME))
+		need_atime = false;
+	if (!(request_mask & (STATX_CTIME|STATX_MTIME)))
+		need_cmtime = false;
+
 	trace_nfs_getattr_enter(inode);
 	/* Flush out writes to the server in order to update c/mtime. */
-	if (S_ISREG(inode->i_mode)) {
+	if (S_ISREG(inode->i_mode) && need_cmtime) {
 		err = filemap_write_and_wait(inode->i_mapping);
 		if (err)
 			goto out;
@@ -757,9 +767,22 @@ int nfs_getattr(const struct path *path, struct kstat *stat,
 	 */
 	if ((path->mnt->mnt_flags & MNT_NOATIME) ||
 	    ((path->mnt->mnt_flags & MNT_NODIRATIME) && S_ISDIR(inode->i_mode)))
-		need_atime = 0;
-
-	if (need_atime || nfs_need_revalidate_inode(inode)) {
+		need_atime = false;
+
+	/* Check for whether the cached attributes are invalid */
+	cache_validity = READ_ONCE(NFS_I(inode)->cache_validity);
+	if (need_cmtime)
+		reval |= cache_validity & NFS_INO_REVAL_PAGECACHE;
+	if (need_atime)
+		reval |= cache_validity & NFS_INO_INVALID_ATIME;
+	if (request_mask & (STATX_MODE|STATX_NLINK|STATX_UID|STATX_GID|
+				STATX_ATIME|STATX_MTIME|STATX_CTIME|
+				STATX_SIZE|STATX_BLOCKS))
+		reval |= cache_validity & NFS_INO_INVALID_ATTR;
+	if (dont_sync)
+		reval = false;
+
+	if (reval) {
 		struct nfs_server *server = NFS_SERVER(inode);
 
 		if (!(server->flags & NFS_MOUNT_NOAC))
@@ -767,13 +790,18 @@ int nfs_getattr(const struct path *path, struct kstat *stat,
 		else
 			nfs_readdirplus_parent_cache_hit(path->dentry);
 		err = __nfs_revalidate_inode(server, inode);
-	} else
+	} else if (!dont_sync)
 		nfs_readdirplus_parent_cache_hit(path->dentry);
 	if (!err) {
 		generic_fillattr(inode, stat);
 		stat->ino = nfs_compat_user_ino64(NFS_FILEID(inode));
 		if (S_ISDIR(inode->i_mode))
 			stat->blksize = NFS_SERVER(inode)->dtsize;
+		/* Return only the requested attrs if others may be stale */
+		if (!reval && cache_validity & (NFS_INO_REVAL_PAGECACHE|
+					NFS_INO_INVALID_ATIME|
+					NFS_INO_INVALID_ATTR))
+			stat->result_mask &= request_mask;
 	}
 out:
 	trace_nfs_getattr_exit(inode, err);
-- 
2.14.3

-- 
Trond Myklebust
Linux NFS client maintainer, PrimaryData
trond.myklebust@xxxxxxxxxxxxxxx
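
For reference, the query flags handled by the patch can be exercised from
userspace roughly as follows. This is only a minimal sketch, assuming
glibc 2.28 or later (which provides the statx() wrapper) and a 4.11 or
later kernel. The helper names query_mtime() and print_mtime() are
hypothetical and are not part of the patch.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>      /* AT_FDCWD, AT_STATX_DONT_SYNC, AT_STATX_FORCE_SYNC */
#include <stdio.h>
#include <sys/stat.h>   /* statx(), struct statx, STATX_* mask bits */

/*
 * Hypothetical helper: fetch size and mtime for `path`.
 * If `cached_ok` is nonzero, pass AT_STATX_DONT_SYNC so the filesystem
 * may answer from cached attributes; otherwise pass AT_STATX_FORCE_SYNC.
 * Returns 0 on success, -1 on failure with errno set by statx().
 */
int query_mtime(const char *path, int cached_ok, struct statx *stx)
{
	int flags = cached_ok ? AT_STATX_DONT_SYNC : AT_STATX_FORCE_SYNC;

	assert(path != NULL && stx != NULL);

	/* Request only the fields we need via the mask argument. */
	return statx(AT_FDCWD, path, flags, STATX_MTIME | STATX_SIZE, stx);
}

/* Hypothetical helper: print the mtime only if the kernel returned it. */
void print_mtime(const char *path, const struct statx *stx)
{
	if (stx->stx_mask & STATX_MTIME)
		printf("%s: mtime %lld.%09u\n", path,
		       (long long)stx->stx_mtime.tv_sec,
		       (unsigned int)stx->stx_mtime.tv_nsec);
	else
		printf("%s: mtime not returned\n", path);
}
```

Calling query_mtime(path, 1, &stx) passes AT_STATX_DONT_SYNC, which with
the patch applied would be expected to let the NFS client answer from its
attribute cache without writing back dirty pages. Passing 0 requests
AT_STATX_FORCE_SYNC instead. Callers should check stx_mask before
trusting any individual field.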