On Wed, Sep 23, 2020 at 01:09:01PM +0000, Trond Myklebust wrote:
> On Wed, 2020-09-23 at 08:40 -0400, J. Bruce Fields wrote:
> > On Tue, Sep 22, 2020 at 01:52:25PM +0000, Trond Myklebust wrote:
> > > On Tue, 2020-09-22 at 13:31 +0100, Daire Byrne wrote:
> > > > Hi,
> > > >
> > > > I just thought I'd flesh out the other two issues I have found
> > > > with re-exporting that are ultimately responsible for the
> > > > biggest performance bottlenecks. And both of them revolve
> > > > around the caching of metadata file lookups in the NFS client.
> > > >
> > > > Especially for the case where we are re-exporting a server many
> > > > milliseconds away (i.e. on-premise -> cloud), we want to be
> > > > able to control how much the client caches metadata and file
> > > > data so that its many LAN clients all benefit from the
> > > > re-export server only having to do the WAN lookups once (within
> > > > a specified coherency time).
> > > >
> > > > Keeping the file data in the vfs page cache or on disk using
> > > > fscache/cachefiles is fairly straightforward, but keeping the
> > > > metadata cached is particularly difficult. And without the
> > > > cached metadata we introduce long delays before we can serve
> > > > the already present and locally cached file data to many
> > > > waiting clients.
> > > >
> > > > ----- On 7 Sep, 2020, at 18:31, Daire Byrne daire@xxxxxxxx wrote:
> > > > > 2) If we cache metadata on the re-export server using
> > > > > actimeo=3600,nocto we can cut the network packets back to the
> > > > > origin server to zero for repeated lookups. However, if a
> > > > > client of the re-export server walks paths and memory maps
> > > > > those files (i.e. loading an application), the re-export
> > > > > server starts issuing unexpected calls back to the origin
> > > > > server again, ignoring/invalidating the re-export server's
> > > > > NFS client cache. We worked around this by patching an
> > > > > inode/iversion validity check in inode.c so that the NFS
> > > > > client cache on the re-export server is used. I'm not sure
> > > > > about the correctness of this patch but it works for our
> > > > > corner case.
> > > >
> > > > If we use actimeo=3600,nocto (say) to mount a remote software
> > > > volume on the re-export server, we can successfully cache the
> > > > loading of applications and walking of paths directly on the
> > > > re-export server, such that after a couple of runs there are
> > > > practically zero packets back to the originating NFS server
> > > > (great!). But if we then do the same thing on a client which is
> > > > mounting that re-export server, the re-export server now starts
> > > > issuing lots of calls back to the originating server and
> > > > invalidating its client cache (bad!).
> > > >
> > > > I'm not exactly sure why, but the iversion of the inode gets
> > > > changed locally (due to atime modification?), most likely via
> > > > invocation of inode_inc_iversion_raw(). Each time it gets
> > > > incremented, the following call to validate attributes detects
> > > > changes, causing it to be reloaded from the originating server.
> > > >
> > > > This patch helps to avoid this when applied to the re-export
> > > > server, but there may be other places where this happens too.
> > > > I accept that this patch is probably not the right/general way
> > > > to do this, but it helps to highlight the issue when
> > > > re-exporting and it works well for our use case:
> > > >
> > > > --- linux-5.5.0-1.el7.x86_64/fs/nfs/inode.c	2020-01-27 00:23:03.000000000 +0000
> > > > +++ new/fs/nfs/inode.c	2020-02-13 16:32:09.013055074 +0000
> > > > @@ -1869,7 +1869,7 @@
> > > >
> > > >          /* More cache consistency checks */
> > > >          if (fattr->valid & NFS_ATTR_FATTR_CHANGE) {
> > > > -                if (!inode_eq_iversion_raw(inode, fattr->change_attr)) {
> > > > +                if (inode_peek_iversion_raw(inode) < fattr->change_attr) {
> > > >                          /* Could it be a race with writeback? */
> > > >                          if (!(have_writers || have_delegation)) {
> > > >                                  invalid |= NFS_INO_INVALID_DATA
> > >
> > > There is nothing in the base NFSv4 and NFSv4.1 specs that allows
> > > you to make assumptions about how the change attribute behaves
> > > over time.
> > >
> > > The only safe way to do something like the above is if the server
> > > supports NFSv4.2 and also advertises support for the
> > > 'change_attr_type' attribute. In that case, you can check at
> > > mount time whether or not the change attribute on this filesystem
> > > is one of the monotonic types which would allow the above
> > > optimisation.
> >
> > Looking at https://tools.ietf.org/html/rfc7862#section-12.2.3 ....
> > I think that would be anything but NFS4_CHANGE_TYPE_IS_UNDEFINED?
> >
> > The Linux server's ctime is monotonic and will advertise that with
> > change_attr_type since 4.19.
> >
> > So I think it would be easy to patch the client to check
> > change_attr_type and set an NFS_CAP_MONOTONIC_CHANGE flag in
> > server->caps; the hard part would be figuring out which
> > optimisations are OK.
>
> The ctime is *not* monotonic. It can regress under server reboots,
> and it can regress if someone deliberately changes the time.

So, anything other than IS_UNDEFINED or IS_TIME_METADATA?

Though the Linux server is susceptible to some of that even when it
returns MONOTONIC_INCR. If the admin replaces the filesystem by an
older snapshot, there's not much we can do. I'm not sure what degree
of guarantee we need.

--b.

> We have code that tries to handle all these issues (see
> fattr->gencount and nfsi->attr_gencount) because we've hit those
> issues before...
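
For concreteness, here is a rough, untested sketch of the gating
discussed above: record at mount/fsinfo time whether the server's
change_attr_type gives any ordering guarantee, and only then allow the
"older than" comparison from Daire's patch, falling back to today's
strict equality check otherwise. The NFS4_CHANGE_TYPE_IS_* values
follow RFC 7862 section 12.2.3, NFS_CAP_MONOTONIC_CHANGE is the flag
proposed in this thread, and neither the cap nor these helpers exist
in the current client, so treat this purely as an illustration:

/*
 * Illustrative sketch only -- none of this is in the client today.
 * Assumptions: the change_attr_type values follow RFC 7862 section
 * 12.2.3, the client fetches that attribute at mount/fsinfo time and
 * passes it in below, and NFS_CAP_MONOTONIC_CHANGE is the server->caps
 * flag suggested in this thread (bit value chosen arbitrarily here).
 */
#include <linux/iversion.h>
#include <linux/nfs_fs.h>
#include <linux/nfs_fs_sb.h>
#include <linux/nfs_xdr.h>

/* change_attr_type values per RFC 7862, section 12.2.3 */
enum {
	NFS4_CHANGE_TYPE_IS_MONOTONIC_INCR		= 0,
	NFS4_CHANGE_TYPE_IS_VERSION_COUNTER		= 1,
	NFS4_CHANGE_TYPE_IS_VERSION_COUNTER_NOPNFS	= 2,
	NFS4_CHANGE_TYPE_IS_TIME_METADATA		= 3,
	NFS4_CHANGE_TYPE_IS_UNDEFINED			= 4,
};

#define NFS_CAP_MONOTONIC_CHANGE	(1U << 30)	/* hypothetical */

/* Record, per server, whether the change attribute can be ordered. */
static void nfs_set_change_attr_cap(struct nfs_server *server,
				    unsigned int change_attr_type)
{
	switch (change_attr_type) {
	case NFS4_CHANGE_TYPE_IS_UNDEFINED:
	case NFS4_CHANGE_TYPE_IS_TIME_METADATA:
		/* No usable ordering guarantee from the server. */
		break;
	default:
		server->caps |= NFS_CAP_MONOTONIC_CHANGE;
	}
}

/*
 * True when the change attribute returned in 'fattr' should be
 * treated as a change (roughly the check Daire's patch modifies).
 */
static bool nfs_change_attr_invalidates(struct inode *inode,
					const struct nfs_fattr *fattr)
{
	if (NFS_SERVER(inode)->caps & NFS_CAP_MONOTONIC_CHANGE)
		/* Monotonic: only a strictly newer value matters. */
		return inode_peek_iversion_raw(inode) < fattr->change_attr;
	/* No guarantee: any difference must be treated as a change. */
	return !inode_eq_iversion_raw(inode, fattr->change_attr);
}

With something like that in place, the hunk quoted above would call
nfs_change_attr_invalidates() instead of open-coding the comparison,
and servers that only report IS_TIME_METADATA or IS_UNDEFINED would
keep the existing behaviour.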