On Tue, Sep 22, 2020 at 01:52:25PM +0000, Trond Myklebust wrote:
> On Tue, 2020-09-22 at 13:31 +0100, Daire Byrne wrote:
> > Hi,
> >
> > I just thought I'd flesh out the other two issues I have found with
> > re-exporting that are ultimately responsible for the biggest
> > performance bottlenecks. And both of them revolve around the caching
> > of metadata file lookups in the NFS client.
> >
> > Especially for the case where we are re-exporting a server many
> > milliseconds away (i.e. on-premise -> cloud), we want to be able to
> > control how much the client caches metadata and file data so that
> > its many LAN clients all benefit from the re-export server only
> > having to do the WAN lookups once (within a specified coherency
> > time).
> >
> > Keeping the file data in the vfs page cache or on disk using
> > fscache/cachefiles is fairly straightforward, but keeping the
> > metadata cached is particularly difficult. And without the cached
> > metadata we introduce long delays before we can serve the already
> > present and locally cached file data to many waiting clients.
> >
> > ----- On 7 Sep, 2020, at 18:31, Daire Byrne daire@xxxxxxxx wrote:
> > > 2) If we cache metadata on the re-export server using
> > > actimeo=3600,nocto we can cut the network packets back to the
> > > origin server to zero for repeated lookups. However, if a client
> > > of the re-export server walks paths and memory maps those files
> > > (i.e. loading an application), the re-export server starts issuing
> > > unexpected calls back to the origin server again,
> > > ignoring/invalidating the re-export server's NFS client cache. We
> > > worked around this by patching an inode/iversion validity check in
> > > inode.c so that the NFS client cache on the re-export server is
> > > used. I'm not sure about the correctness of this patch but it
> > > works for our corner case.
> >
> > If we use actimeo=3600,nocto (say) to mount a remote software volume
> > on the re-export server, we can successfully cache the loading of
> > applications and walking of paths directly on the re-export server
> > such that after a couple of runs, there are practically zero packets
> > back to the originating NFS server (great!). But, if we then do the
> > same thing on a client which is mounting that re-export server, the
> > re-export server now starts issuing lots of calls back to the
> > originating server and invalidating its client cache (bad!).
> >
> > I'm not exactly sure why, but the iversion of the inode gets changed
> > locally (due to atime modification?) most likely via invocation of
> > method inode_inc_iversion_raw. Each time it gets incremented the
> > following call to validate attributes detects changes causing it to
> > be reloaded from the originating server.
> >
> > This patch helps to avoid this when applied to the re-export server
> > but there may be other places where this happens too. I accept that
> > this patch is probably not the right/general way to do this, but it
> > helps to highlight the issue when re-exporting and it works well for
> > our use case:
> >
> > --- linux-5.5.0-1.el7.x86_64/fs/nfs/inode.c 2020-01-27 00:23:03.000000000 +0000
> > +++ new/fs/nfs/inode.c 2020-02-13 16:32:09.013055074 +0000
> > @@ -1869,7 +1869,7 @@
> >
> >  	/* More cache consistency checks */
> >  	if (fattr->valid & NFS_ATTR_FATTR_CHANGE) {
> > -		if (!inode_eq_iversion_raw(inode, fattr->change_attr)) {
> > +		if (inode_peek_iversion_raw(inode) < fattr->change_attr) {
> >  			/* Could it be a race with writeback? */
> >  			if (!(have_writers || have_delegation)) {
> >  				invalid |= NFS_INO_INVALID_DATA
> >
> There is nothing in the base NFSv4 and NFSv4.1 specs that allows you to
> make assumptions about how the change attribute behaves over time.
>
> The only safe way to do something like the above is if the server
> supports NFSv4.2 and also advertises support for the 'change_attr_type'
> attribute. In that case, you can check at mount time for whether or not
> the change attribute on this filesystem is one of the monotonic types
> which would allow the above optimisation.

Looking at https://tools.ietf.org/html/rfc7862#section-12.2.3 .... I
think that would be anything but NFS4_CHANGE_TYPE_IS_UNDEFINED ?

The Linux server's ctime is monotonic and will advertise that with
change_attr_type since 4.19.

So I think it would be easy to patch the client to check
change_attr_type and set an NFS_CAP_MONOTONIC_CHANGE flag in
server->caps, the hard part would be figuring out which optimisations
are OK.

--b.
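
A rough userspace sketch of the idea above, not a kernel patch. It only
models the decision: record at mount time whether the server advertised a
change_attr_type other than NFS4_CHANGE_TYPE_IS_UNDEFINED, and use the
relaxed "<" comparison from the quoted diff only in that case. The enum
values are the ones listed in RFC 7862 section 12.2.3. Everything else is
an assumption for illustration: struct sketch_server and both sketch_*
helpers are hypothetical, the bit value chosen for
NFS_CAP_MONOTONIC_CHANGE is made up, and treating every type except
UNDEFINED as monotonic is the tentative reading from this mail, not a
settled rule.

```c
#include <stdbool.h>
#include <stdint.h>

/* change_attr_type4 values as listed in RFC 7862, section 12.2.3. */
enum change_attr_type4 {
	NFS4_CHANGE_TYPE_IS_MONOTONIC_INCR         = 0,
	NFS4_CHANGE_TYPE_IS_VERSION_COUNTER        = 1,
	NFS4_CHANGE_TYPE_IS_VERSION_COUNTER_NOPNFS = 2,
	NFS4_CHANGE_TYPE_IS_TIME_METADATA          = 3,
	NFS4_CHANGE_TYPE_IS_UNDEFINED              = 4
};

/* Hypothetical capability bit: the name is from this thread, the value is made up. */
#define NFS_CAP_MONOTONIC_CHANGE (1U << 0)

/* Hypothetical stand-in for the per-mount server state; only caps is modelled. */
struct sketch_server {
	unsigned int caps;
};

/*
 * Mount-time step. have_type is false when the server did not return
 * change_attr_type at all (for example, a pre-4.2 server).
 * Assumption: any advertised type other than UNDEFINED counts as monotonic.
 */
static void sketch_set_change_caps(struct sketch_server *server,
				   bool have_type,
				   enum change_attr_type4 type)
{
	if (have_type && type != NFS4_CHANGE_TYPE_IS_UNDEFINED)
		server->caps |= NFS_CAP_MONOTONIC_CHANGE;
}

/*
 * Returns true when the cached data should be treated as stale.
 * cached is the locally held change value, reported is the value the
 * server just returned.
 */
static bool sketch_change_attr_invalidates(const struct sketch_server *server,
					   uint64_t cached, uint64_t reported)
{
	if (server->caps & NFS_CAP_MONOTONIC_CHANGE)
		return cached < reported;	/* relaxed check, as in the quoted diff */
	return cached != reported;		/* strict check: any mismatch invalidates */
}
```

With the flag set, a cached value of 12 against a reported value of 10 (the
locally bumped iversion case described earlier in the thread) no longer
invalidates the cache, while a reported value of 12 against a cached 10
still does. Without the flag, any mismatch invalidates, which matches the
unpatched behaviour. Which other call sites could safely use the relaxed
check is left open, as noted above.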