Re: NFS server (round-robin IP) times out: How does autofs behave? How can we fix that on the client side?

Ian Kent <raven@xxxxxxxxxx> · Sun, 23 Dec 2018 07:49:09 +0800

On Fri, 2018-12-21 at 16:15 +0100, Frank Thommen wrote:
> On 12/21/18 11:02 AM, Frank Thommen wrote:
> > Dear all.
> > 
> > @work we are struggling with NFS server timeouts and subsequently 
> > missing mounts on the clients:
> > 
> > [...]
> > Dec 21 10:12:20 XXX kernel: nfs: server SRV not responding, timed out
> > Dec 21 10:12:20 XXX automount[41879]: mount(nfs): nfs: mount failure 
> > SRV:/a/b/c on /d/e/f
> > [...]
> > 
> > The server timing out is a storage cluster with multiple IPs, served in 
> > round-robin mode.  Does autofs in cases of connectivity problems try to 
> > resolve the server name multiple times - and then maybe get a "good" IP 
> > - or is it "stuck" on the IP it gets when the initial mount request is 
> > made?
> > 
> > If autofs does not re-resolve server names: Is there a way to provide 
> > autofs with multiple names/IPs, all of which it tries in order to find 
> > a working head node?  How would this have to be configured?
> 
> I found the "Replicated Server" feature. How does autofs use the 
> different entries? Does it make a "round-robin" on its own?  And how 
> does autofs behave, if one of the multiple entries is not reachable or 
> the NFS server times out?

Because autofs has no control over what happens to an NFS mount once
it is mounted, it can't do any "fail-over" of active NFS mounts.

That feature would need to be implemented in NFS itself, not in autofs.

All autofs can do, when given a list of replicated servers on which
it can find the same file system, is try each of them at initial
mount time until it gets one that works.
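
The mount-time selection described above can be sketched roughly as
follows. This is only an illustration in Python, not autofs's actual
code (autofs is written in C); `first_working_server` and the
`try_mount` callback are hypothetical names, and the real selection
logic may order or weight the candidates differently.

```python
def first_working_server(servers, try_mount):
    """Return the first server for which try_mount(server) succeeds.

    `try_mount` is a hypothetical callback standing in for the real
    mount attempt; it returns True on success and False on failure.
    """
    for server in servers:
        if try_mount(server):
            return server  # stop at the first server that works
    return None  # every candidate failed, so the mount fails


if __name__ == "__main__":
    # Pretend that only "srv2" answers the mount request.
    print(first_working_server(["srv1", "srv2", "srv3"],
                               lambda s: s == "srv2"))
```

Note that the choice is made once: nothing in this loop runs again
after the mount has succeeded, which is why a later server timeout is
not handled here.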

I would have to look at the code but I think I do the same thing
when an NFS server name resolves to multiple addresses via DNS.
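
To see which candidates a round-robin DNS name provides, here is a
small Python sketch that lists every distinct address a name resolves
to. It uses the standard `socket.getaddrinfo` call; `resolve_all` is a
made-up helper name, port 2049 is assumed as the usual NFS port, and
this is not meant to show how autofs itself performs the lookup.

```python
import socket


def resolve_all(host, port=2049, resolver=socket.getaddrinfo):
    """Return the distinct addresses `host` resolves to, in resolver order.

    `resolver` defaults to socket.getaddrinfo; it is a parameter only so
    the function can be exercised without a network.
    """
    addresses = []
    for _family, _type, _proto, _canon, sockaddr in resolver(
            host, port, proto=socket.IPPROTO_TCP):
        addr = sockaddr[0]  # first element is the IP for IPv4 and IPv6
        if addr not in addresses:
            addresses.append(addr)
    return addresses


if __name__ == "__main__":
    # "localhost" is used only so the example runs anywhere.
    for addr in resolve_all("localhost"):
        print(addr)
```

Each address in the returned list is a separate candidate that could
be tried at mount time.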

Note that even if the kernel NFS folks implemented fail-over, it
would likely be for "read-only" replicated file systems only, due
to the problems of cache coherency between servers in the
writeable case (read: risk of file system corruption).

autofs does this server selection itself because the NFS client
doesn't implement it. It doesn't check or enforce the "read-only"
requirement, and it can get away with that because it makes the
selection only at mount time.

Ian