On Tue, 2020-09-22 at 13:31 +0100, Daire Byrne wrote:
> Hi,
> 
> I just thought I'd flesh out the other two issues I have found with
> re-exporting that are ultimately responsible for the biggest
> performance bottlenecks. And both of them revolve around the caching
> of metadata file lookups in the NFS client.
> 
> Especially for the case where we are re-exporting a server many
> milliseconds away (i.e. on-premise -> cloud), we want to be able to
> control how much the client caches metadata and file data so that
> its many LAN clients all benefit from the re-export server only
> having to do the WAN lookups once (within a specified coherency
> time).
> 
> Keeping the file data in the vfs page cache or on disk using
> fscache/cachefiles is fairly straightforward, but keeping the
> metadata cached is particularly difficult. And without the cached
> metadata we introduce long delays before we can serve the already
> present and locally cached file data to many waiting clients.
> 
> ----- On 7 Sep, 2020, at 18:31, Daire Byrne daire@xxxxxxxx wrote:
> > 2) If we cache metadata on the re-export server using
> > actimeo=3600,nocto we can cut the network packets back to the
> > origin server to zero for repeated lookups. However, if a client
> > of the re-export server walks paths and memory maps those files
> > (i.e. loading an application), the re-export server starts issuing
> > unexpected calls back to the origin server again,
> > ignoring/invalidating the re-export server's NFS client cache. We
> > worked around this by patching an inode/iversion validity check in
> > inode.c so that the NFS client cache on the re-export server is
> > used. I'm not sure about the correctness of this patch but it
> > works for our corner case.
> 
> If we use actimeo=3600,nocto (say) to mount a remote software volume
> on the re-export server, we can successfully cache the loading of
> applications and walking of paths directly on the re-export server
> such that after a couple of runs, there are practically zero packets
> back to the originating NFS server (great!). But, if we then do the
> same thing on a client which is mounting that re-export server, the
> re-export server now starts issuing lots of calls back to the
> originating server and invalidating its client cache (bad!).
> 
> I'm not exactly sure why, but the iversion of the inode gets changed
> locally (due to atime modification?), most likely via invocation of
> inode_inc_iversion_raw. Each time it gets incremented, the following
> call to validate attributes detects changes, causing it to be
> reloaded from the originating server.
> 
> This patch helps to avoid this when applied to the re-export server,
> but there may be other places where this happens too. I accept that
> this patch is probably not the right/general way to do this, but it
> helps to highlight the issue when re-exporting and it works well for
> our use case:
> 
> --- linux-5.5.0-1.el7.x86_64/fs/nfs/inode.c	2020-01-27 00:23:03.000000000 +0000
> +++ new/fs/nfs/inode.c	2020-02-13 16:32:09.013055074 +0000
> @@ -1869,7 +1869,7 @@
> 
>  	/* More cache consistency checks */
>  	if (fattr->valid & NFS_ATTR_FATTR_CHANGE) {
> -		if (!inode_eq_iversion_raw(inode, fattr->change_attr)) {
> +		if (inode_peek_iversion_raw(inode) < fattr->change_attr) {
>  			/* Could it be a race with writeback? */
>  			if (!(have_writers || have_delegation)) {
>  				invalid |= NFS_INO_INVALID_DATA

There is nothing in the base NFSv4 and NFSv4.1 specs that allows you
to make assumptions about how the change attribute behaves over time.

The only safe way to do something like the above is if the server
supports NFSv4.2 and also advertises support for the 'change_attr_type'
attribute. In that case, you can check at mount time whether or not the
change attribute on this filesystem is one of the monotonic types,
which would allow the above optimisation.

-- 
Trond Myklebust
Linux NFS client maintainer, Hammerspace
trond.myklebust@xxxxxxxxxxxxxxx
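For concreteness, here is a minimal stand-alone C sketch of the
mount-time gating described above, using the change_attr_type4 values
defined in RFC 7862. The helper names and the boolean plumbing are
illustrative assumptions for this example only, not the Linux NFS
client's code, and which of the version-counter types to treat as
orderable is a policy choice:

/*
 * Illustrative sketch only: the enum mirrors change_attr_type4 from
 * RFC 7862; the helpers below are made up for this example and are
 * not part of the Linux NFS client.
 */
#include <stdbool.h>
#include <stdint.h>

enum change_attr_type4 {
	NFS4_CHANGE_TYPE_IS_MONOTONIC_INCR         = 0,
	NFS4_CHANGE_TYPE_IS_VERSION_COUNTER        = 1,
	NFS4_CHANGE_TYPE_IS_VERSION_COUNTER_NOPNFS = 2,
	NFS4_CHANGE_TYPE_IS_TIME_METADATA          = 3,
	NFS4_CHANGE_TYPE_IS_UNDEFINED              = 4,
};

/*
 * Decided once at mount time, after fetching the optional NFSv4.2
 * change_attr_type attribute for the filesystem being mounted. A
 * server that does not advertise it is treated as "undefined".
 */
static bool change_attr_is_orderable(enum change_attr_type4 type)
{
	switch (type) {
	case NFS4_CHANGE_TYPE_IS_MONOTONIC_INCR:
	case NFS4_CHANGE_TYPE_IS_VERSION_COUNTER:
		/* Strictly increases on every change, so ordering is meaningful. */
		return true;
	default:
		/* Possibly a timestamp or an opaque value: only
		 * equality vs. inequality is safe to test. */
		return false;
	}
}

/*
 * The cache-consistency decision the patch above tries to relax:
 * 'cached' is the change attribute held locally, 'remote' is the
 * value the server just returned in post-op attributes.
 */
static bool data_may_have_changed(uint64_t cached, uint64_t remote,
				  bool orderable)
{
	if (orderable)
		/* Values that sort at or below our cached copy can be ignored. */
		return remote > cached;
	/* Base NFSv4/4.1: any difference at all must invalidate. */
	return remote != cached;
}

Whether the NOPNFS version-counter variant also belongs in the
orderable set depends on how the export interacts with pNFS data
writes, so it is left out of this sketch.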