Re: Issue with automounter file breaks cluster

Ian Kent <raven@xxxxxxxxxx> · Thu, 16 May 2019 08:31:06 +0800

On Wed, 2019-05-15 at 17:01 +0200, Frank Thommen wrote:
> Hi Ian, list,
> 
> On 2/2/19 2:16 AM, Ian Kent wrote:
> > On Thu, 2019-01-31 at 22:26 +0100, Frank Thommen wrote:
> > > 
> > > We are running autofs 5.0.7, release 70.el7_4.1 on CentOS 7.4.1708.
> > > Updating the CentOS release is not possible due to hardware and
> > > software constraints.
> > > 
> > 
> > Before I go burning lots of time on trying to reproduce this
> > you should check if this happens with the latest CentOS autofs
> > package, revision 90 (the CentOS repo doesn't look like it
> > retains older revisions).
> > 
> > There were a couple of regressions fixed in 7.5, among other
> > things.
> > 
> > I don't think there were updates to dependent packages that
> > would cause problems in the subsequent RHEL releases (in fact
> > there shouldn't be).
> > 
> > Another question.
> > 
> > When you see that the problem has occurred, did you check that
> > automount is actually still running (IOW, did you check if
> > it had crashed)?
> > 
> > Ian
> > 
> 
> Oops, already > three months and I haven't replied yet.  I'm very sorry 
> for that, because I really appreciate your reactiveness and helpfulness. 
>   Unfortunately other IT problems have outpaced this one.

Understood.

The difficulty here is that this sort of problem pops up occasionally
in (possibly) slightly different scenarios and usually defies debugging
efforts.

I try as best I can and from time to time some unrelated bug gets
resolved and it occurs to me that it might have been something that
contributed to this.

But in reality, whatever the problem (or problems) is, it's really hard
to work out what causes it.

> 
> Summary: In the end we "flattened" the automounter file structure, so 
> that instead of using
> 
>    auto.master: /base /etc/auto.base browse
>    auto.base:   sub1 /sub11 -fstype=autofs,vers=3 file:/etc/auto.sub11
>    auto.sub11:  sub11-1 server:/export1
>                 sub11-2 server:/export2
> 
> we now use
> 
>    auto.master: /base/sub1 /etc/auto.sub1 browse
>    auto.sub1:   sub11 -fstype=nfs,vers=3 \
>                    sub11-1 server:/export1 \
>                    sub11-2 server:/export2
> 
> this solution is as manageable as the first one and the problems 
> described in my original post have gone since then. It "works for us", 
> even though we don't understand what didn't work.  Since the issue - as 
> we have learned in the meantime - overlapped with networking problems of 
> the central storage, the problem /could/ have been an unfortunate 
> coincidence, triggering the described problem.

Right, at least you have a workable solution.

> 
> For the sake of completeness and documentation I'll answer your last and 
> still unanswered questions:
> 
> > > > Is it always the same directory that becomes unresponsive?
> > > 
> > > It's all the directories managed by this table.
> > 
> > My original reading of the problem description made me think
> > that only certain automount points became unresponsive.
> > 
> > If "all" the automounts become unresponsive that's a very different
> > problem.
> > 
> 
> only /certain/ directories got lost, but not always the same ones and 
> not on all hosts the same ones.

This is an example of the difficulty: that behaviour isn't consistent
with what I think the problem is, so I'm immediately stuck wondering
what could be going on.

> 
> 
> > > > Does the problem also occur if you use a HUP signal to re-read the
> > > > maps?
> > > 
> > > Haven't tried this yet. We usually just restart autofs.
> > 
> > I think this is another misunderstanding of the problem I have.
> > 
> > The description sounded like it was the restart with a modified
> > map that resulted in the problem but based on this and your later
> > reply it sounds like the restart fixes the problem.
> > 
> > That implies that modifying the map results in this automount
> > becoming unresponsive at some later time after the map change.
> > 
> > Have I got it right now?
> 
> Not quite :-)  An automounter restart with the /un/modified map always 
> solved the issue...for some time until some of the directories became 
> unavailable again...

Again not consistent with what I think could cause this.
It's hard to work out what's happening here.

> 
> 
> > Is there anything in the debug log about a map re-read (and
> > following log entries from that) between the time the map is
> > deployed and when the problem occurs?
> > 
> > Are you sure you're getting all the debug logging?
> > If you're assuming that setting "logging = debug" in the autofs
> > configuration and using syslog with a default configuration is
> > enough, you might not be. How have you set up collection of the
> > debug log?
> 
> I haven't looked into the details of complete autofs debugging.  Basically 
> the daemon is running with "-d --foreground --dont-check-daemon" (set in 
> /etc/sysconfig/autofs as 'OPTIONS="-d"')

I only mentioned it because the log looked like it was missing
some entries. If you're using some sort of syslog implementation,
its configuration can ignore certain log levels, so log entries
go missing.
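
To illustrate the point about log levels, here is a minimal sketch in
Python. It is not autofs or syslog code: the helper name
kept_by_selector and the sample messages are made up for illustration.
It only assumes the usual syslog convention that a lower severity
number is more severe and that a traditional "facility.<level>"
selector keeps that level and anything more severe.

```python
# Syslog severities as numbered in RFC 5424: lower number = more severe.
SEVERITY = {
    "emerg": 0, "alert": 1, "crit": 2, "err": 3,
    "warning": 4, "notice": 5, "info": 6, "debug": 7,
}


def kept_by_selector(records, minimum):
    """Return the records a 'facility.<minimum>' style selector would keep.

    Hypothetical helper: keeps every record whose severity is at least as
    severe as `minimum` and silently drops the rest.
    """
    limit = SEVERITY[minimum]
    return [(level, msg) for level, msg in records if SEVERITY[level] <= limit]


# Invented sample messages, not real automount output.
records = [
    ("info", "re-reading map"),
    ("debug", "lookup for key"),
    ("err", "mount failed"),
]

kept_info = kept_by_selector(records, "info")    # debug entry is dropped
kept_debug = kept_by_selector(records, "debug")  # everything is kept

print(len(kept_info), "of", len(records), "records kept at info")
print(len(kept_debug), "of", len(records), "records kept at debug")
```

With a threshold of "info" the debug line never reaches the log file,
which is the kind of gap that makes a debug log look incomplete.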

> 
> 
> Again thank you very much for your efforts

Ha, although I couldn't actually help!

Ian