Re: help with understanding match_fsid() errors in nfs-utils

"NeilBrown" <neilb@xxxxxxx> · Wed, 11 Oct 2023 10:38:23 +1100

On Thu, 28 Sep 2023, Trond Myklebust wrote:
> On Wed, 2023-09-27 at 12:10 +1000, NeilBrown wrote:
> > 
> > hi Trond,
> >  I'm trying to understand
> > 
> >  Commit 76c21e3f70a8 ("mountd: Check the stat() return values in
> > match_fsid()")
> > 
> >  in nfs-utils.
> > 
> >  The effect of this patch is that if a 'stat' of any path in
> >  /etc/exports or any mountpoint below any path marked crossmnt fails
> >  with an error other than one of a small set, then the fsid to path
> >  lookup aborts without reporting anything to the kernel, so the
> > kernel
> >  doesn't reply to the client and the mount attempt blocks
> > indefinitely.
> > 
> >  I have seen this happen when "/" is exported crossmnt, and when a
> > stat
> >  of /run/user/1000/doc returns EACCES.  This is a "fuse" mount for
> > user
> >  1000, and presumably it rejects any access from any other user.
> > 
> >  Could you please help me understand what this patch was trying to
> >  achieve?  What sorts of errors were you expecting this to catch?
> >  Would it make sense to silently ignore the stat failure for paths
> > that
> >  were found when scanning the mount table, and only take the more
> >  drastic action for paths explicitly listed in /etc/exports ??
> > 
> > 
> 
> The point is that if the filesystem is re-exported NFS, and you mount
> as 'softerr' in order to avoid hanging your knfsd server on ephemeral
> errors, then you have to be more careful about which errors are fatal,
> and hence need to be propagated to the kernel export/path cache, and
> which ones are just temporary due to a network partition or server
> reboot.
> 
> Hence the creation of path_lookup_error() which differentiates between
> the two cases.

Thanks for the explanation - and sorry about the delay in following up -
leave and stuff....

If I understand the 'softerr' option correctly, the particular error
code you need to catch in this case is ETIMEDOUT.  That is the error the
mountd would get when stating an NFS mount that wasn't responding.
So it would be safe for me to add EACCES to the set of errors that
path_lookup_error() checks - just as long as I don't add ETIMEDOUT.
Do you agree?

But I think there is another issue here.  If the 'stat' of the
mountpoint returns ETIMEDOUT we want a transient failure so that the
kernel will ask again - presumably when the client asks again.
That doesn't happen with the current code.  When the kernel queues a
request to mountd, it assumes that the request WILL be answered.  It
never resends the same request.

So mountd must reply with something.  I think that returning a negative
result with an expiry that has already passed might have the desired
result, but we would need to test that.

Thanks,
NeilBrown