Hi Ian, list,
On 2/2/19 2:16 AM, Ian Kent wrote:
On Thu, 2019-01-31 at 22:26 +0100, Frank Thommen wrote:
We are running autofs 5.0.7, release 70.el7_4.1 on CentOS 7.4.1708.
Updating the CentOS release is not possible due to hardware and
software constraints.
Before I go burning lots of time on trying to reproduce this
you should check if this happens with the latest CentOS autofs
package, revision 90 (the CentOS repo doesn't look like it
retains older revisions).
There were a couple of regressions fixed in 7.5, among other
things.
I don't think there were updates to dependent packages that
would cause problems in the subsequent RHEL releases (in fact
there shouldn't be).
Another question.
When you see the problem has occurred, did you check that
automount is actually still running (IOW, did you check whether
it had crashed)?
Ian
Oops, already > three months and I haven't replied yet. I'm very sorry
for that, because I really appreciate your responsiveness and helpfulness.
Unfortunately other IT problems have taken precedence over this one.
Summary: In the end we "flattened" the automounter file structure, so
that instead of using

  auto.master:  /base /etc/auto.base browse
  auto.base:    sub1 /sub11 -fstype=autofs,vers=3 file:/etc/auto.sub11
  auto.sub11:   sub11-1 server:/export1
                sub11-2 server:/export2

we now use

  auto.master:  /base/sub1 /etc/auto.sub1 browse
  auto.sub1:    sub11 -fstype=nfs,vers=3 \
                  sub11-1 server:/export1 \
                  sub11-2 server:/export2
This solution is as manageable as the first one, and the problems
described in my original post have been gone since then. It "works for
us", even though we don't understand what didn't work. Since the issue
- as we have learned in the meantime - overlapped with networking
problems of the central storage, the described behaviour /could/ simply
have been an unfortunate coincidence triggered by those network problems.
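For the record, a rough way to sanity-check such a map restructuring
(not something we have scripted in exactly this form; the path below is
just my reading of the example entries above, and would need to be
adjusted to the real map names):

  automount --dumpmaps             # show the maps as automount parsed them
  ls /base/sub1/sub11/sub11-1      # trigger one of the mounts by hand
  mount | grep 'server:/export'    # check that the NFS mount really appeared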
For the sake of completeness and documentation I'll answer your last and
still unanswered questions:
Is it always the same directory that becomes unresponsive?
It's all the directories managed by this table.
My original reading of the problem description made me think
that only certain automount points became unresponsive.
If "all" the automounts become unresponsive that's a very different
problem.
Only /certain/ directories got lost, but not always the same ones, and
not the same ones on all hosts.
Does the problem also occur if you use a HUP signal to re-read the
maps?
Haven't tried this yet. We usually just restart autofs.
I think this is another misunderstanding of the problem I have.
The description sounded like it was the restart with a modified
map that resulted in the problem but based on this and your later
reply it sounds like the restart fixes the problem.
That implies that modifying the map results in this automount
becoming unresponsive at some later time after the map change.
Have I got it right now?
Not quite :-) An automounter restart with the /un/modified map always
solved the issue... for some time, until some of the directories became
unavailable again...
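Should it ever come back, my plan would be to try the HUP route before
a full restart; roughly the following (untested on our side so far; the
systemd reload assumes the stock CentOS 7 autofs unit, which as far as
I remember just sends a HUP to the daemon):

  # re-read the maps without restarting the daemon
  kill -HUP "$(pidof automount)"
  # or, on systemd hosts, the rough equivalent:
  systemctl reload autofs

  # what we have been doing instead: a full restart
  systemctl restart autofs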
Is there anything in the debug log about a map re-read (and
following log entries from that) between the time the map is
deployed and when the problem occurs?
Are you sure you're getting all the debug logging?
If you're assuming that setting "logging = debug" in the autofs
configuration and using syslog with a default configuration is
enough, you might not be. How have you set up collection of the
debug log?
I haven't looked into the details of complete autofs debugging. Basically
the daemon is running with "-d --foreground --dont-check-daemon" (set in
/etc/sysconfig/autofs as 'OPTIONS="-d"').
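If we ever need the full debug trail, my understanding (from the
documentation, not from something we have actually deployed yet) is
that it would take roughly the following; the log file name is just an
example:

  # 1. enable debug logging in the autofs configuration, i.e. the
  #    "logging = debug" setting you mentioned (the exact file depends
  #    on the autofs release, e.g. /etc/autofs.conf or /etc/sysconfig/autofs)
  logging = debug

  # 2. make rsyslog actually keep daemon-level debug messages, e.g. in
  #    a drop-in like /etc/rsyslog.d/autofs-debug.conf:
  daemon.debug    /var/log/autofs-debug.log

  # 3. restart both daemons so they pick up the changes
  systemctl restart rsyslog autofs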
Again thank you very much for your efforts
frank