Re: NFS server (round-robin IP) times out: How does autofs behave? How can we fix that on the client side?

Ian Kent <raven@xxxxxxxxxx> · Wed, 30 Jan 2019 08:52:58 +0800

On Tue, 2019-01-29 at 16:32 +0100, Frank Thommen wrote:
> Top-posted, scrolling-avoiding summary: Thanks for your help and hints. 
> I will keep them in mind in case the issue hits us again.  However 
> central IT didn't like the idea to add "autofs host selection logic" 
> over their own DNS round robin.  In the meantime the server-side issue 
> has been solved and autofs mounts happily ever after.

Thanks for letting me know the result of your investigation.

Of course if your distribution updates its autofs you would
get this behaviour whether you want it or not.

If that happens ensure that your autofs also has the
configuration option "use_hostname_for_mounts" which should,
more or less, disable the availability probe.
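
As a rough way to check for this, something like the following would
show whether the option is set in a given configuration file. It is
only a sketch: the function name hostname_for_mounts is made up for
illustration, and the path in the usage comment assumes the
configuration lives in /etc/autofs.conf, which may not be true on your
distribution. In the file itself the setting would read
"use_hostname_for_mounts = yes".

```shell
# Hypothetical helper: print the value of use_hostname_for_mounts from
# the file given as $1. Prints nothing if the option is absent or
# commented out.
hostname_for_mounts() {
    sed -n -E \
        's/^[[:space:]]*use_hostname_for_mounts[[:space:]]*=[[:space:]]*"?([^"[:space:]]*)"?.*$/\1/p' \
        "$1"
}

# Usage (path is an assumption, adjust for your system):
#   hostname_for_mounts /etc/autofs.conf
```

An empty result means the option is not set, so the default applies.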

If you find yourself in this situation let me know and I'll
verify the behaviour of this configuration option.

Ian

> 
> Cheers
> frank
> 
> 
> On 12/25/18 1:17 AM, Ian Kent wrote:
> > On Mon, 2018-12-24 at 23:55 +0100, Frank Thommen wrote:
> > > On 23/12/18 04:30, Ian Kent wrote:
> > > > On Fri, 2018-12-21 at 11:02 +0100, Frank Thommen wrote:
> > > > > Dear all.
> > > > > 
> > > > > > @work we are struggling with NFS server timeouts and subsequently
> > > > > > missing mounts on the clients:
> > > > 
> > > > Sorry for the multiple posts on this but things often occur to me
> > > > as I think about what's been written upon re-reading questions.
> > > > 
> > > > > 
> > > > > [...]
> > > > > Dec 21 10:12:20 XXX kernel: nfs: server SRV not responding, timed out
> > > > > > Dec 21 10:12:20 XXX automount[41879]: mount(nfs): nfs: mount failure SRV:/a/b/c on /d/e/f
> > > > > [...]
> > > > > 
> > > > > > The server timing out is a storage cluster with multiple IPs, served
> > > > > > in round-robin mode.  Does autofs in cases of connectivity problems
> > > > > > try to resolve the server name multiple times - and then maybe get a
> > > > > > "good" IP - or is it "stuck" on the IP it gets when the initial mount
> > > > > > request is made?
> > > > 
> > > > Another possibility comes to mind.
> > > > 
> > > > If the problem is related purely to server selection for mount there
> > > > was a problem with that in the past.
> > > > 
> > > > It occurred specifically when the server name resolved to multiple
> > > > addresses.
> > > > 
> > > > The availability probe would be done to select a host for mounting but
> > > > because there was a round-robin DNS in place the subsequent mount would
> > > > end up using a different address, possibly of a host that was no longer
> > > > responding.
> > > > 
> > > > That problem was resolved by using IP address instead of host name for
> > > > this case. Some people didn't much like that because the use of IP
> > > > address made it more difficult to work out what was going on when
> > > > looking at logs.
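
The idea of "probe first, then mount the address that answered" can be
sketched by hand as below. This is not how autofs implements its
availability probe; it only mirrors the idea. The function names are
made up for illustration, and the probe shown (a plain TCP connect to
port 2049 using bash's /dev/tcp and the timeout command) is an
assumption that fits NFS over TCP on the standard port.

```shell
# Print the first address for which the probe succeeds.
# $1 is the name of a probe command or function, the remaining
# arguments are the candidate addresses, tried in the order given.
first_responding() {
    probe=$1
    shift
    for addr in "$@"; do
        if "$probe" "$addr"; then
            printf '%s\n' "$addr"
            return 0
        fi
    done
    return 1
}

# Hypothetical probe: try a TCP connect to the NFS port (2049),
# giving up after 2 seconds. Needs bash and timeout(1).
probe_nfs_port() {
    timeout 2 bash -c 'exec 3<>"/dev/tcp/$1/2049"' _ "$1" 2>/dev/null
}

# Usage, with the example addresses from this thread:
#   ip=$(first_responding probe_nfs_port 1.2.3.1 1.2.3.2 1.2.3.3) &&
#       mount -t nfs "$ip:/export/share" /mnt
```

The point is that the mount goes to the address that was probed, so a
round-robin lookup between the probe and the mount cannot hand back a
different host.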
> > > 
> > > I normally don't like IP addresses in any configuration for various
> > > reasons, but in the current case they could effectively help us, as the
> > > `mount` timeout message would report the actual IP of the used head node
> > > and not the hostname of the storage cluster.  So instead of
> > > 
> > >     mymount  our.storage.server:/export/share
> > > 
> > > we would have
> > > 
> > >     mymount  1.2.3.1,1.2.3.2,1.2.3.3:/export/share
> > > 
> > > so that `mount` would target individual IP numbers instead of global
> > > storage cluster names.
> > 
> > > Indeed, I didn't like having to use IP addresses either but it's
> > > the only way to ensure a mount is performed to a specific host when
> > > the name resolves to multiple addresses ;)
> > 
> > You shouldn't need to change your maps to make this work but that
> > depends on what your autofs is doing .... see below ...
> > 
> > > 
> > > 
> > > > The trick here is first checking that autofs is doing the availability
> > > > probe for the map entry you're using (which it might not be) and then
> > > > checking mount attempts are using IP address at mount time, not host
> > > > name.
> > > 
> > > I'm not sure I understand this statement.
> > 
> > Paraphrasing what I said "we need to check the (full debug) log to
> > see what your version of autofs is doing".
> > 
> > > 
> > > 
> > > > So we would need to check the functionality of the autofs you are using
> > > > if you think it's worth going further with this.
> > > 
> > > If you think that the replicated server setup should work, then we will
> > > try it.  However due to bank holidays & co. we will not be able to
> > > implement this in the next two weeks (and hence I will not be able to
> > > > report success or failure very soon).
> > 
> > Changing your maps to use IP addresses should work even if your
> > version of autofs doesn't do the right thing (and it sounds like
> > it might not) but there are other things that can go wrong too
> > so we need to check what autofs is doing in your case.
> > 
> > > Enabling debug logging can be a bit of a pain depending on whether
> > > you're using systemd or not.
> > 
> > First you need to check where your autofs configuration is located.
> > 
> > The autofs configuration has been located at /etc/autofs.conf for
> > quite a while now but it can be at other locations.
> > 
> > In Fedora and RHEL it was previously at /etc/sysconfig/autofs. I'm
> > not sure about Debian based distributions, it might have been in
> > /etc/default/autofs.
> > 
> > Once you locate it set "logging = debug" (or "LOGGING=debug" in older
> > configurations or DEFAULT_LOGGING="debug" in very old configs).
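
To find which of the locations mentioned above is in use and what the
logging setting currently is, a small sketch like this may help. The
function names are made up for illustration, and the candidate paths
are just the ones named in this thread, so your system may use yet
another location.

```shell
# Print the first argument that names an existing regular file.
first_existing() {
    for f in "$@"; do
        if [ -f "$f" ]; then
            printf '%s\n' "$f"
            return 0
        fi
    done
    return 1
}

# Show any active (not commented out) logging setting in file $1,
# whichever of the three spellings the file uses.
show_logging() {
    grep -E -i '^[[:space:]]*(DEFAULT_)?logging[[:space:]]*=' "$1"
}

# Usage:
#   conf=$(first_existing /etc/autofs.conf /etc/sysconfig/autofs \
#       /etc/default/autofs) && show_logging "$conf"
```

If show_logging prints nothing the setting is absent or commented out
and needs to be added by hand.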
> > 
> > > If you are using a systemd based distribution you should only need
> > > to use journalctl to capture the log output, e.g.
> > 
> > "journalctl -f | tee autofs.log"
> > 
> > should do the trick.
> > 
> > > If you're using a syslog based system then you need to ensure facility
> > daemon is recording log level debug and above, that can be done by
> > adding something like:
> > 
> > daemon.*                   /var/log/daemon.debug
> > 
> > to the syslog configuration.
> > 
> > Then try a mount and save the log.
> > 
> > Generally it's best to look at a log that has everything from
> > startup until after the problem, the mount in this case, so
> > all the information that might be useful is present.
> > 
> > Ian
> > 
> 
>