Re: Issue with automounter file breaks cluster

Ian Kent <raven@xxxxxxxxxx> · Sat, 02 Feb 2019 08:42:34 +0800

On Fri, 2019-02-01 at 09:05 +0100, Frank Thommen wrote:
> On 01/02/19 00:54, Ian Kent wrote:
> > On Thu, 2019-01-31 at 22:26 +0100, Frank Thommen wrote:
> > > Dear all,
> > > 
> > > We have a weird issue with an automounter file, where a one-line change
> > > leads to the situation, that after a while (10-30 minutes?, maybe less),
> > > some of the mounts stop working on some(!) nodes (not always the same)
> > > of our HPC cluster.  This issue currently breaks our complete HPC
> > > cluster.  To make it worse, it is not really reproducible. However after
> > > three days of checking back and forth we are desperate and hope to find
> > > some helpful hint through this mailing list.
> > > 
> > > The symptom is, that trying to access one of the "virtual" automounted
> > > directories is answered with "No such file or directory" (see below).
> > > autofs doesn't seem to even try to mount the respective filesystem and I
> > > cannot recognize any previous problem in the log (debugging mode).
> > 
> > Is it always the same directory that becomes unresponsive?
> 
> It's all the directories managed by this table.

My original reading of the problem description made me think
that only certain automount points became unresponsive.

If "all" the automounts become unresponsive that's a very different
problem.

> 
> 
> > Do you see anything at all in the log when accessing the directory
> > and getting the ENOENT?
> 
> nope.  Absolutely nothing.

For any automount under this tree (just to be absolutely sure)?

> 
> 
> > Does the problem also occur if you use a HUP signal to re-read the
> > maps?
> 
> Haven't tried this yet. We usually just restart autofs.

I think this is another misunderstanding of the problem on my part.

The description sounded like it was the restart with a modified
map that resulted in the problem but based on this and your later
reply it sounds like the restart fixes the problem.

That implies that modifying the map results in this automount
becoming unresponsive at some later time after the map change.

Have I got it right now?

Is there anything in the debug log about a map re-read (and
following log entries from that) between the time the map is
deployed and when the problem occurs?
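
To narrow down that window, something like the following could log
the time at which an automount first returns ENOENT, so it can be
lined up against the debug log. This is only a rough sketch: the
function names and the probed paths are made up for illustration, and
it assumes that listing a directory is enough to trigger the mount.
Note that regular probing also keeps the mounts in use, which may
itself change the behaviour.

```python
import errno
import os
import time


def probe(path):
    """Try to list `path`; return None on success, else the errno."""
    try:
        os.listdir(path)
        return None
    except OSError as exc:
        return exc.errno


def watch(paths, interval=60, rounds=None, log=print):
    """Probe each path every `interval` seconds, logging failures.

    `rounds=None` means run until interrupted (hypothetical helper,
    not part of autofs).
    """
    done = 0
    while rounds is None or done < rounds:
        for path in paths:
            err = probe(path)
            if err is not None:
                stamp = time.strftime("%Y-%m-%d %H:%M:%S")
                name = errno.errorcode.get(err, str(err))
                log(f"{stamp} {path}: {name}")
        done += 1
        if rounds is None or done < rounds:
            time.sleep(interval)
```

For example, watch(["/base/sub1/sub12", "/base/sub17"], interval=300)
would check two of the paths from the layout above every five minutes.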

Are you sure you're getting all the debug logging?
If you're assuming that setting "logging = debug" in the autofs
configuration and using syslog with a default configuration is
enough, you might not be. How have you set things up to collect the
debug log?
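
As a quick sanity check of the autofs side only, a small sketch like
this reads the configuration text and reports the value of the
"logging" option, which catches a misspelt or commented-out setting.
It assumes a "key = value" style file such as /etc/autofs.conf; the
function name is made up for illustration. It says nothing about
whether syslog actually keeps debug-level messages.

```python
def autofs_logging_level(conf_text):
    """Return the value of the 'logging' option, or None if unset."""
    level = None
    for raw in conf_text.splitlines():
        line = raw.strip()
        # Skip blanks, comments and section headers like "[ autofs ]".
        if not line or line.startswith("#") or line.startswith("["):
            continue
        key, sep, value = line.partition("=")
        if sep and key.strip().lower() == "logging":
            level = value.strip().strip('"').lower()
    return level
```

Calling it on the contents of the configuration file should return
"debug" if the setting is really in effect, and None if the option
name is misspelt or the line is commented out.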

> 
> 
> > I have seen situations where an automount point becomes unresponsive
> > even though the kernel dentry flags appeared as they should to trigger
> > the automount.
> > 
> > I could never work out how this occurred but one possibility occurred
> > to me recently, see below.
> > 
> > > 
> > > We were initially mounting the following directory structure
> > > 
> > > (1)
> > > /base
> > >       +--sub1
> > >           +-- sub11
> > >           |    +-- [83 subdirectories via auto.sub11]
> > >           +-- sub12
> > >           +-- sub13
> > >           +-- sub14
> > >           +-- sub15
> > >           +-- sub16
> > >           |    +-- sub161
> > >           +-- sub17
> > > 
> > > 
> > > but needed to change this to
> > > 
> > > (2)
> > > /base
> > >       +--sub1
> > >       |   +-- sub11
> > >       |   |    +-- [83 subdirectories via auto.sub11]
> > >       |   +-- sub12
> > >       |   +-- sub13
> > >       |   +-- sub14
> > >       |   +-- sub15
> > >       |   +-- sub16
> > >       |        +-- sub161
> > >       +-- sub17
> > > 
> > > 
> > > In /etc/auto.master we have:
> > > 
> > > --------------------------
> > > /base         /etc/auto.base         browse
> > > --------------------------
> > > 
> > > (there are others, but only the mounts configured through /etc/auto.base
> > > are affected by the problem)
> > > 
> > > 
> > > There are two variants of /etc/auto.base.
> > > 
> > > This variant works fine and doesn't seem to trigger any errors.  It
> > > represents the directory structure (1) which we needed to change :
> > > --------------------------
> > > sub1 /sub11        -fstype=autofs,vers=3,sec=sys file:/etc/auto.sub11 \
> > >      /sub12        -fstype=nfs,vers=3,sec=sys    share.big-fs1:/ifs/data/group/base/sub12 \
> > >      /sub13        -fstype=nfs,vers=3,sec=krb5   pool3.fast-fs1:/ifs/data/group1-sub13 \
> > >      /sub14        -fstype=nfs,vers=3,sec=sys    share.big-fs2:/ifs/data/group/base/sub14 \
> > >      /sub15        -fstype=nfs,vers=3,sec=sys    share.big-fs2:/ifs/data/group/base/sub15 \
> > >      /sub17        -fstype=nfs,vers=3,sec=sys    sub17.big-fs2:/ifs/sub17/data/sub17 \
> > >      /sub16/sub161 -fstype=nfs,vers=3,sec=sys    share.big-fs2:/ifs/data/group/base/sub16
> > > #        /imaging file:/etc/auto.imaging
> > > # sub12 is share.big-fs1:/ifs/data/group/base/sub12
> > > --------------------------
> > > 
> > > 
> > > This one results in autofs being broken after a while on some nodes.  It
> > > represents the directory structure (2) (see above):
> > > --------------------------
> > > sub1 /sub11        -fstype=autofs,vers=3,sec=sys file:/etc/auto.sub11 \
> > >      /sub12        -fstype=nfs,vers=3,sec=sys    share.big-fs1:/ifs/data/group/base/sub12 \
> > >      /sub13        -fstype=nfs,vers=3,sec=krb5   pool3.fast-fs1:/ifs/data/group1-sub13 \
> > >      /sub14        -fstype=nfs,vers=3,sec=sys    share.big-fs2:/ifs/data/group/base/sub14 \
> > >      /sub15        -fstype=nfs,vers=3,sec=sys    share.big-fs2:/ifs/data/group/base/sub15 \
> > >      /sub16/sub161 -fstype=nfs,vers=3,sec=sys    share.big-fs2:/ifs/data/group/base/sub16
> > > 
> > > sub17              -fstype=nfs,vers=3,sec=sys    sub17.big-fs2:/ifs/sub17/data/sub17
> > > --------------------------
> 
> Overnight I had this version of the file active:
> 
> --------------------------
> sub1 /sub11        -fstype=autofs,vers=3,sec=sys file:/etc/auto.sub11 \
>      /sub12        -fstype=nfs,vers=3,sec=sys    share.big-fs1:/ifs/data/group/base/sub12 \
>      /sub13        -fstype=nfs,vers=3,sec=krb5   pool3.fast-fs1:/ifs/data/group1-sub13 \
>      /sub14        -fstype=nfs,vers=3,sec=sys    share.big-fs2:/ifs/data/group/base/sub14 \
>      /sub15        -fstype=nfs,vers=3,sec=sys    share.big-fs2:/ifs/data/group/base/sub15 \
>      /sub17        -fstype=nfs,vers=3,sec=sys    sub17.big-fs2:/ifs/sub17 \
>      /sub16/sub161 -fstype=nfs,vers=3,sec=sys    share.big-fs2:/ifs/data/group/base/sub16
> 
> sub17              -fstype=nfs,vers=3,sec=sys    sub17.big-fs2:/ifs/sub17/data/sub17
> --------------------------
> 
> note that "sub17" appears twice in two locations.  This version didn't 
> trigger the issue so far.

As long as I understand the problem correctly I can try to reproduce
it, so let's go down that path.
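
When comparing map variants for a reproducer it can help to see at a
glance which keys and multi-mount offsets each one defines. The
following is a simplified sketch, not the parser autofs itself uses:
it joins backslash-continued lines, skips comment lines, and treats
any token starting with "/" after the key as an offset (which would
be wrong for locations that are plain local paths). The function name
is made up for illustration.

```python
def parse_map(text):
    """Return a list of (key, [offsets]) tuples for a file map.

    The offsets list is empty for a plain single-mount entry.
    """
    entries = []
    pending = ""
    for raw in text.splitlines():
        line = raw.rstrip()
        if line.lstrip().startswith("#"):
            continue  # comment line
        if line.endswith("\\"):
            # Continuation: keep collecting the logical line.
            pending += line[:-1] + " "
            continue
        tokens = (pending + line).split()
        pending = ""
        if not tokens:
            continue  # blank line
        key, rest = tokens[0], tokens[1:]
        offsets = [tok for tok in rest if tok.startswith("/")]
        entries.append((key, offsets))
    return entries
```

Run against the two variants of /etc/auto.base, this should show a
single "sub1" key carrying a "/sub17" offset for the first, and a
"sub1" key plus a separate "sub17" key for the second.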

Setting up a Kerberos test environment is rather painful, and I'm not
sure I'll be able to do that here at home, so I'll try without it to
start with.

Ian