Re: nfs_file_flush() question

Chuck Lever <chuck.lever@xxxxxxxxxx> · Mon, 18 Aug 2008 12:04:01 -0400

On Aug 17, 2008, at 1:04 PM, Trond Myklebust wrote:
On Sat, 2008-08-16 at 19:23 -0500, Quentin Barnes wrote:
I've been coming up to speed on the NFS protocol and its NFS client
support in Linux.  I've been comparing performance of NFS on RHEL4
and RHEL5 vs. FreeBSD 6.2.  (Okay, we're on an old base, but I don't
think it matters here for this question.)

In watching the NFS traffic fly back and forth between the BSD
and Linux clients and an NFS server, I noticed that Linux sends
an extra GETATTR compared with FreeBSD when closing a read-write
file.  I tracked this back to nfs_file_flush(), which calls
__nfs_revalidate_inode() (or, in current kernels,
nfs_revalidate_inode()).  Why do we want nfs_file_flush() to force
a revalidate of an inode we're closing?  Why not instead just
invalidate the inode's attribute cache?

I looked at the FreeBSD 6.2 code.  In its nfs_close(), it does an
"np->n_attrstamp = 0;" to invalidate the inode's attribute cache.

The current Linux kernel code in question in nfs_file_flush() is:
==========
  /* Ensure that data+attribute caches are up to date after close() */
  status = nfs_do_fsync(ctx, inode);
  if (!status)
          nfs_revalidate_inode(NFS_SERVER(inode), inode);
==========

I would imagine this better as:
==========
  /* Ensure that data+attribute caches are up to date after close() */
  status = nfs_do_fsync(ctx, inode);
  if (!status && !(NFS_SERVER(inode)->flags & NFS_MOUNT_NOCTO))
          NFS_I(inode)->cache_validity |= NFS_INO_INVALID_ATTR;
==========

Is there a reason I'm missing that the revalidate and GETATTR are
required?

Yes: It is required for correct close-to-open cache consistency
semantics.

If I don't know the correct mtime attribute of the file when I close
it, then I can't compare it with the mtime of the file when I open it
again.  In that case, close-to-open semantics forbid me from assuming
that my cached data is still valid, and so I have to throw out the
entire page cache contents for that file.
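
To make the comparison described above concrete, here is a minimal sketch of
the open-time check.  It is not the kernel's code: struct nfs_time, struct
cached_file and revalidate_on_open() are hypothetical names invented for
illustration, and the server mtime is assumed to come from a GETATTR issued
at open().

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical timestamp type; stands in for the server's mtime. */
struct nfs_time {
	int64_t  sec;
	uint32_t nsec;
};

/* Hypothetical per-file client state; not the kernel's struct nfs_inode. */
struct cached_file {
	struct nfs_time mtime_at_close; /* mtime recorded when the file was closed */
	bool mtime_known;               /* false if close() did not learn the mtime */
	bool pages_valid;               /* may cached data be reused? */
};

static bool same_time(const struct nfs_time *a, const struct nfs_time *b)
{
	return a->sec == b->sec && a->nsec == b->nsec;
}

/*
 * Called at open() with the mtime the server reports now.  If the mtime at
 * close is unknown, or differs from the current one, the cached pages can
 * no longer be trusted.
 */
static void revalidate_on_open(struct cached_file *f,
			       const struct nfs_time *server_mtime)
{
	if (!f->mtime_known || !same_time(&f->mtime_at_close, server_mtime))
		f->pages_valid = false;

	f->mtime_at_close = *server_mtime;
	f->mtime_known = true;
}
```

In this sketch, merely invalidating the attributes at close() corresponds to
leaving mtime_known false, which forces the cached pages to be discarded at
the next open() even if nobody else touched the file.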

For NFSv3, a WRITE operation can return post-op attributes which  
reflect the updated mtime on the server.  In that case, we get the  
updated mtime for free, and don't need the additional GETATTR.

However, there are corner cases:

1.  The client had to send multiple WRITEs to finish flushing the  
file's dirty data, and the replies returned out of order.  In this  
case, the client can't know which mtime is the final value.

2.  The server chose not to return post-op attributes.
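
As a rough sketch of how those two corner cases might be detected (this is
not a description of what the Linux client actually does), the check below
walks the WRITE replies in arrival order.  struct write_reply and
need_getattr_after_flush() are hypothetical names, and treating "no WRITEs
sent" as requiring a GETATTR is a conservative assumption of the sketch.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical record of one WRITE reply; not a kernel structure. */
struct write_reply {
	unsigned int send_order;  /* position in which the WRITE was sent */
	bool has_post_op_attrs;   /* did the server return post-op attributes? */
};

/*
 * replies[] holds the WRITE replies in the order they arrived.
 * Returns true if the client cannot trust the mtime it has seen and
 * should fall back to an explicit GETATTR.
 */
static bool need_getattr_after_flush(const struct write_reply *replies,
				     size_t n)
{
	if (n == 0)
		return true;  /* assumption: no fresh attributes to rely on */

	for (size_t i = 0; i < n; i++) {
		/* Corner case 2: the server omitted post-op attributes. */
		if (!replies[i].has_post_op_attrs)
			return true;

		/* Corner case 1: a reply arrived out of order. */
		if (i > 0 && replies[i].send_order < replies[i - 1].send_order)
			return true;
	}

	return false;
}
```

With a single WRITE whose reply carries post-op attributes, the function
returns false, which is the case the question below asks about.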

Does the Linux NFS client optimize away the GETATTR when it has sent  
only a single WRITE and the server has returned post-op attributes?   
Using a large wsize with a modern server implementation might make  
this a fairly common scenario.

--
Chuck Lever
chuck[dot]lever[at]oracle[dot]com