Re: NFS server (round-robin IP) times out: How does autofs behave? How can we fix that on the client side?

Frank Thommen <list.autofs@xxxxxxxxxx> · Mon, 24 Dec 2018 23:33:36 +0100

On 23/12/18 00:49, Ian Kent wrote:
On Fri, 2018-12-21 at 16:15 +0100, Frank Thommen wrote:
On 12/21/18 11:02 AM, Frank Thommen wrote:
Dear all.

At work we are struggling with NFS server timeouts and subsequently
missing mounts on the clients:

[...]
Dec 21 10:12:20 XXX kernel: nfs: server SRV not responding, timed out
Dec 21 10:12:20 XXX automount[41879]: mount(nfs): nfs: mount failure SRV:/a/b/c on /d/e/f
[...]

The server timing out is a storage cluster with multiple IPs, served in
round-robin mode.  In case of connectivity problems, does autofs try to
resolve the server name multiple times - and then maybe get a "good" IP
- or is it "stuck" on the IP it gets when the initial mount request is
made?

If autofs does not re-resolve server names: Is there a way to provide
autofs with multiple names/IPs, all of which it tries in order to find
a working head node?  How would this have to be configured?

I found the "Replicated Server" feature. How does autofs use the
different entries? Does it do "round-robin" on its own?  And how
does autofs behave if one of the multiple entries is not reachable or
the NFS server times out?

Because autofs has no control over what happens to an NFS mount once
it is mounted it can't do any "fail-over" of active NFS mounts.

This feature would need to be implemented in NFS itself not autofs.

In fact our problem is not that active mounts time out, but that the
mount request itself times out.  The result is then no active mount at all.

All autofs can do, when given a list of replicated servers on
which it can find the same file system, is to try each of them at
initial mount time until it gets one that works.

That's exactly the feature we are looking for.

I would have to look at the code but I think I do the same thing
when an NFS server name resolves to multiple addresses via DNS.

If that is the case, then we most likely have another problem with our
storage.  I'd be really interested to know if autofs does this anyway.
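
To check from a client which addresses a round-robin name resolves to,
and whether each of them answers at all, something like the following
sketch might help.  It is only a diagnostic aid and makes no claim about
what autofs does internally; the function names are made up for this
example, and the assumption is that the server accepts NFS over TCP on
the standard port 2049.

```python
import socket

NFS_PORT = 2049  # assumption: NFS over TCP on the standard port


def resolve_all(host, port=NFS_PORT):
    """Return every distinct address the name resolves to, in resolver order."""
    infos = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
    addresses = []
    for _family, _socktype, _proto, _canonname, sockaddr in infos:
        addr = sockaddr[0]
        if addr not in addresses:
            addresses.append(addr)
    return addresses


def tcp_probe(addr, port=NFS_PORT, timeout=3.0):
    """Return True if a TCP connection to addr:port succeeds within timeout."""
    try:
        with socket.create_connection((addr, port), timeout=timeout):
            return True
    except OSError:
        return False


def first_reachable(addresses, probe=tcp_probe):
    """Try each address in turn; return the first one that answers, else None."""
    for addr in addresses:
        if probe(addr):
            return addr
    return None
```

Calling first_reachable(resolve_all("SRV")) on an affected client, with
SRV replaced by the real server name, would show whether at least one of
the round-robin addresses accepts connections at that moment.  A
successful TCP connection does not prove that a mount would succeed.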

Note that even if the kernel NFS folks implemented fail-over it
would likely be for "read-only" replicated file systems only due
to the problems of cache-coherency between servers for the
writeable case (read as file system corruption risk).

I understand this, but in a round-robin scenario all IPs point to the
same physical server anyway.  If autofs doesn't try all round-robin IPs
in turn, then we could mimic this behaviour with the replicated
server feature.
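
As a rough illustration of that workaround: autofs(5) describes
replicated servers sharing one path as a comma-separated host list
followed by ":/path".  The small helper below (hypothetical, not part of
autofs) builds such a map entry from a list of addresses.  The key "f"
and the addresses are placeholders, and it assumes IPv4 addresses or
host names, since a bare IPv6 address would clash with the ":" separator.

```python
def replicated_entry(key, hosts, export, options=None):
    """Build an autofs map line of the form: key [-options] h1,h2:/export."""
    if not hosts:
        raise ValueError("at least one host is required")
    fields = [key]
    if options:
        fields.append("-" + options)
    # All hosts share the same export path, so they are joined with commas.
    fields.append(",".join(hosts) + ":" + export)
    return " ".join(fields)


# Placeholder addresses from the documentation range, not real servers.
print(replicated_entry("f", ["192.0.2.10", "192.0.2.11"], "/a/b/c"))
```

With these placeholder values the helper prints
"f 192.0.2.10,192.0.2.11:/a/b/c", which would then go into the relevant
map file by hand.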

autofs does this because it isn't implemented in the NFS client
and it doesn't check and enforce the "read-only" requirement
as it can get away with that because it does it only at mount
time.

And that is a good thing :-)

Thanks a lot
frank

Ian