Re: help with understanding match_fsid() errors in nfs-utils

Trond Myklebust <trondmy@xxxxxxxxxxxxxxx> · Wed, 11 Oct 2023 14:33:41 +0000

On Wed, 2023-10-11 at 10:38 +1100, NeilBrown wrote:
> On Thu, 28 Sep 2023, Trond Myklebust wrote:
> > On Wed, 2023-09-27 at 12:10 +1000, NeilBrown wrote:
> > > 
> > > hi Trond,
> > >  I'm trying to understand
> > > 
> > >  Commit 76c21e3f70a8 ("mountd: Check the stat() return values in
> > > match_fsid()")
> > > 
> > >  in nfs-utils.
> > > 
> > >  The effect of this patch is that if a 'stat' of any path in
> > >  /etc/exports or any mountpoint below any path marked crossmnt
> > > fails
> > >  with an error other than one of a small set, then the fsid to
> > > path
> > >  lookup aborts without reporting anything to the kernel, so the
> > > kernel
> > >  doesn't reply to the client and the mount attempt blocks
> > > indefinitely.
> > > 
> > >  I have seen this happen when "/" is exported crossmnt, and when
> > > a
> > > stat
> > >  of /run/user/1000/doc returns EACCES.  This is a "fuse" mount
> > > for
> > > user
> > >  1000, and presumably it rejects any access from any other user.
> > > 
> > >  Could you please help me understand what this patch was trying
> > > to
> > >  achieve?  What sorts of errors were you expecting this to catch?
> > >  Would it make sense to silently ignore the stat failure for
> > > paths
> > > that
> > >  were found when scanning the mount table, and only take the more
> > >  drastic action for paths explicitly listed in /etc/exports ??
> > > 
> > > 
> > 
> > The point is that if the filesystem is re-exported NFS, and you
> > mount
> > as 'softerr' in order to avoid hanging your knfsd server on
> > ephemeral
> > errors, then you have to be more careful about which errors are
> > fatal,
> > and hence need to be propagated to the kernel export/path cache,
> > and
> > which ones are just temporary due to a network partition or server
> > reboot.
> > 
> > Hence the creation of path_lookup_error() which differentiates
> > between
> > the two cases.
> 
> Thanks for the explanation - and sorry about the delay in following
> up -
> leave and stuff....
> 
> If I understand the 'softerr' option correctly, the particular error
> code you need to catch in this case is ETIMEDOUT.  That is the error
> the
> mountd would get when stating an NFS mount that wasn't responding.
> So it would be safe for me to add EACCES to the set of errors that
> path_lookup_error() checks - just as long as I don't add ETIMEDOUT.
> Do you agree?
> 

I think that should be safe, but it is worth testing. I don't remember
exactly why I excluded EACCES from the list.

> But I think there is another issue here.  If the 'stat' of the
> mountpoint returns ETIMEDOUT we want a transient failure so that the
> kernel will ask again - presumably when the client asks again.
> That doesn't happen with the current code.  When the kernel queues a
> request to mountd, it assumes that the request WILL be answered.  It
> never resends the same request.
> 
> So mountd must reply with something.  I think that returning a
> negative
> result with an expiry that has already passed might have the desired
> result, but we would need to test that.
> 

Ideally, we probably normally want to pass the filesystem error down
into knfsd so that it can do the right thing. For ETIMEDOUT, that
should result in an NFS4ERR_DELAY/NFS3ERR_JUKEBOX being sent to the
client, which is what we want here.

I'm not sure about caching these errors. It seems to me that while
ETIMEDOUT might allow a time-bounded cache, the EACCES is a per-user
error, so can't really be cached unless you are able to cache it as a
per-user thing in knfsd.

-- 
Trond Myklebust
Linux NFS client maintainer, Hammerspace
trond.myklebust@xxxxxxxxxxxxxxx