On Thu, Jul 08, 2010 at 11:43:46AM -0400, Chuck Lever wrote:
> NFSv3 WCC data, or "weak cache consistency" data, is an attempt to
> reduce the number of on-the-wire transactions needed by NFS clients
> to keep their caches up to date. WCC data consists of two parts:
>
> o  pre-op data, which is a subset of file metadata as it was before
>    a procedure starts, and
>
> o  post-op data, which is a full set of NFSv3 file attribute data as
>    it is after a procedure finishes.
>
> For an NFSv3 procedure that returns wcc_data, an NFSv3 server is free
> to return either, both, or neither of these. Define "full WCC data"
> as a reply that contains both pre-op and post-op data.
>
> To make this data useful, a server must ensure that file metadata is
> captured atomically around the requested operation. If the pre-op
> data in a reply matches the file metadata that the client already has
> cached, the client can assume that no other operation on that file
> occurred while the server was fulfilling the current request, and that
> therefore the post-op metadata is the latest version, and can be
> cached.
>
> Conversely, NFSv3 clients invalidate their metadata caches when they
> receive replies to metadata-altering operations that do not contain
> full WCC data. When a server presents a reply that does not have both
> pre-op and post-op WCC data, clients must employ extra LOOKUP and
> GETATTR requests to ensure their metadata caches are up to date,
> causing performance to suffer needlessly. For example, untarring a
> large tar file can take almost an order of magnitude longer in this
> case, depending on the client implementation.
>
> In the Linux NFS server implementation, to ensure that WCC data
> reflects only changes made during the current file system operation,
> the file's inode mutex is held in order to serialize metadata-altering
> operations on that inode. Our server saves pre-op data for a file
> handle just after the target inode's mutex is taken, and saves post-op
> data just before the inode's mutex is dropped (see fh_lock() and
> fh_unlock()).
>
> In order to return full WCC data to clients, our server must have both
> the saved pre-op and the saved post-op attribute data for a file
> handle filled in before it starts XDR encoding the reply.
> Unfortunately, for the NFSv3 MKDIR, MKNOD, REMOVE, and RMDIR
> procedures, our server does not unlock the parent directory inode
> until well after the reply has been XDR encoded.
>
> In these cases, encode_wcc_data() does have saved pre-op WCC data
> available, since the fh is locked, but does not have saved post-op WCC
> data for the parent directory, since it hasn't yet been unlocked. In
> this situation, encode_wcc_data() simply grabs the parent's current
> metadata, uses that as the post-op WCC data, and returns no pre-op
> WCC data to the client.
>
> By instead unlocking the parent directory file handle immediately
> after the internal operations for each of these NFS procedures are
> complete, saved post-op WCC data for the file handle is filled in
> before XDR encoding starts, so full WCC data for that procedure can
> be returned to clients.
>
> Note that the NFSv4 CREATE and REMOVE procedures already invoke
> fh_unlock() explicitly on the parent directory in order to fill in the
> NFSv4 post change attribute.
>
> Note also that NFSv3 CREATE, RENAME, SETATTR, and SYMLINK already
> perform explicit file handle unlocking.
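
For readers without RFC 1813 at hand, the wcc_data being discussed
looks roughly like the user-space C sketch below. The type and field
names follow the RFC's XDR definitions rather than knfsd's internal
structures (the server keeps its saved copies in the svc_fh, filled in
around fh_lock() and fh_unlock()), and can_cache_post_op() is only a
made-up helper illustrating the client-side check described above, not
code from any real client.

#include <stdbool.h>
#include <stdint.h>

/* NFSv3 time value: seconds and nanoseconds (nfstime3 in the RFC). */
struct nfstime3 {
        uint32_t seconds;
        uint32_t nseconds;
};

/* Pre-op data carries only size, mtime and ctime (wcc_attr). */
struct wcc_attr {
        uint64_t size;
        struct nfstime3 mtime;
        struct nfstime3 ctime;
};

struct fattr3;          /* full post-op attribute set, elided here */

/* wcc_data: an optional "before" and an optional "after". */
struct wcc_data {
        bool pre_present;               /* pre_op_attr discriminant */
        struct wcc_attr pre;            /* valid only if pre_present */
        bool post_present;              /* post_op_attr discriminant */
        const struct fattr3 *post;      /* valid only if post_present */
};

/*
 * The client-side decision described above: post-op attributes can be
 * cached directly only when the reply carries full WCC data and the
 * pre-op data matches what the client already had cached; otherwise
 * the client has to revalidate with LOOKUP/GETATTR.
 */
static bool can_cache_post_op(const struct wcc_data *wcc,
                              const struct wcc_attr *cached)
{
        if (!wcc->pre_present || !wcc->post_present)
                return false;
        return wcc->pre.size == cached->size &&
               wcc->pre.mtime.seconds == cached->mtime.seconds &&
               wcc->pre.mtime.nseconds == cached->mtime.nseconds &&
               wcc->pre.ctime.seconds == cached->ctime.seconds &&
               wcc->pre.ctime.nseconds == cached->ctime.nseconds;
}

That check is also why the pre-op and post-op attributes have to be
captured atomically around the operation: if anything else could slip
in between them, a matching "before" would no longer guarantee that
"after" is current.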
>
> Signed-off-by: Chuck Lever <chuck.lever@xxxxxxxxxx>
> ---
>
> Bruce-
>
> This patch is mechanically the same as the previous one, but the patch
> description has a more accurate and clearly stated rationale for the
> change.
>
> Please use this one instead of the previous one. Thanks.

The first is already committed, though.

I'm not sure if there's a good place for the longer explanation in the
docs, so it may just have to live on in the list archives. (It's the
testcase (untarring a linux kernel from an OS X client) that I was
mainly curious about.)

--b.