Issue with automounter file breaks cluster

Dear all,

We have a strange issue with an automounter map file: after a one-line change, some of the mounts stop working after a while (roughly 10-30 minutes, maybe less) on some(!) nodes of our HPC cluster, and not always on the same nodes. This currently breaks our entire HPC cluster. To make it worse, it is not reliably reproducible. After three days of checking back and forth we are desperate and hope to find a helpful hint through this mailing list.

The symptom is that an attempt to access one of the "virtual" automounted directories fails with "No such file or directory" (see below). autofs doesn't even seem to try to mount the respective filesystem, and I cannot find any preceding problem in the log (debugging mode).

We were initially mounting the following directory structure

(1)
/base
    +--sub1
        +-- sub11
        |    +-- [83 subdirectories via auto.sub11]
        +-- sub12
        +-- sub13
        +-- sub14
        +-- sub15
        +-- sub16
        |    +-- sub161
        +-- sub17


but needed to change this to

(2)
/base
    +--sub1
    |   +-- sub11
    |   |    +-- [83 subdirectories via auto.sub11]
    |   +-- sub12
    |   +-- sub13
    |   +-- sub14
    |   +-- sub15
    |   +-- sub16
    |        +-- sub161
    +-- sub17


In /etc/auto.master we have:

--------------------------
/base         /etc/auto.base         browse
--------------------------

(there are others, but only the mounts configured through /etc/auto.base are affected by the problem)


There are two variants of /etc/auto.base.

This variant works fine and doesn't seem to trigger any errors. It represents directory structure (1), which we needed to change:
--------------------------
sub1 /sub11 -fstype=autofs,vers=3,sec=sys file:/etc/auto.sub11 \
     /sub12 -fstype=nfs,vers=3,sec=sys share.big-fs1:/ifs/data/group/base/sub12 \
     /sub13 -fstype=nfs,vers=3,sec=krb5 pool3.fast-fs1:/ifs/data/group1-sub13 \
     /sub14 -fstype=nfs,vers=3,sec=sys share.big-fs2:/ifs/data/group/base/sub14 \
     /sub15 -fstype=nfs,vers=3,sec=sys share.big-fs2:/ifs/data/group/base/sub15 \
     /sub17 -fstype=nfs,vers=3,sec=sys sub17.big-fs2:/ifs/sub17/data/sub17 \
     /sub16/sub161 -fstype=nfs,vers=3,sec=sys share.big-fs2:/ifs/data/group/base/sub16
#        /imaging file:/etc/auto.imaging
# sub12 is share.big-fs1:/ifs/data/group/base/sub12
--------------------------


This variant causes autofs to break after a while on some nodes. It represents directory structure (2) (see above):
--------------------------
sub1 /sub11 -fstype=autofs,vers=3,sec=sys file:/etc/auto.sub11 \
     /sub12 -fstype=nfs,vers=3,sec=sys share.big-fs1:/ifs/data/group/base/sub12 \
     /sub13 -fstype=nfs,vers=3,sec=krb5 pool3.fast-fs1:/ifs/data/group1-sub13 \
     /sub14 -fstype=nfs,vers=3,sec=sys share.big-fs2:/ifs/data/group/base/sub14 \
     /sub15 -fstype=nfs,vers=3,sec=sys share.big-fs2:/ifs/data/group/base/sub15 \
     /sub16/sub161 -fstype=nfs,vers=3,sec=sys share.big-fs2:/ifs/data/group/base/sub16

sub17       -fstype=nfs,vers=3,sec=sys   sub17.big-fs2:/ifs/sub17/data/sub17
--------------------------

I fail to see any syntax problems in either of these two file variants.

/etc/auto.sub11 is a list of 83 subdirectories for /base/sub1/sub11/. No further mount options are defined in /etc/auto.sub11. It's a plain "key server:share" list.

After deploying the "bad" file and restarting the automounter, everything works fine at first. After a while, problems start to appear:

--------------------------
$ ls /base/sub1/
sub11  sub16  sub14  sub13  sub17  sub12  sub15
$ ls /base/sub1/sub11/
ls: cannot open directory /base/sub1/sub11/: No such file or directory
$
--------------------------
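
To find affected nodes, the symptom above can be tested per directory. The following is only a minimal sketch, assuming a POSIX shell; the function name check_offsets is hypothetical and not part of autofs. It reports every entry that the parent directory lists but that cannot be opened. Note that the check itself accesses each directory and therefore triggers mounts.

```shell
#!/bin/sh
# check_offsets PARENT   (hypothetical helper, not an autofs tool)
# Print every entry that PARENT lists but that cannot be opened,
# and return non-zero if at least one such entry was found.
# Assumes PARENT contains only directories (regular files would be reported too).
check_offsets() {
    parent=$1
    bad=0
    for entry in "$parent"/*; do
        # An unmatched glob stays literal; skip it.
        [ "$entry" = "$parent/*" ] && continue
        # The trailing slash forces the directory itself to be resolved,
        # as in the failing `ls /base/sub1/sub11/` above.
        if ! ls "$entry/" >/dev/null 2>&1; then
            echo "$(uname -n): cannot open $entry"
            bad=1
        fi
    done
    return "$bad"
}

# Example:
# check_offsets /base/sub1
```

On a healthy node the function prints nothing and returns 0.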

We found that if we constantly run `\ls /base/sub1/sub11`, e.g. in a loop once per minute, the problem doesn't seem to occur.
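
For illustration, such a keep-alive loop could look roughly like the sketch below, assuming a POSIX shell. The function name keep_mount_alive and its interval and iteration arguments are hypothetical; this is a workaround sketch, not a fix.

```shell
#!/bin/sh
# keep_mount_alive DIR [INTERVAL_SECONDS] [ITERATIONS]   (hypothetical helper)
# List DIR at a fixed interval so the automounted directory is accessed regularly.
# ITERATIONS=0 (the default) means loop until interrupted.
keep_mount_alive() {
    dir=$1
    interval=${2:-60}
    iterations=${3:-0}
    n=0
    while :; do
        # The listing output is discarded; only the access matters.
        # A failed listing is logged to stderr with a timestamp.
        if ! ls "$dir" >/dev/null 2>&1; then
            echo "$(date '+%F %T') lookup failed: $dir" >&2
        fi
        n=$((n + 1))
        if [ "$iterations" -gt 0 ] && [ "$n" -ge "$iterations" ]; then
            break
        fi
        sleep "$interval"
    done
    return 0
}

# Example (runs until interrupted):
# keep_mount_alive /base/sub1/sub11 60
```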

We are running autofs 5.0.7, release 70.el7_4.1, on CentOS 7.4.1708. Updating the CentOS release is not possible due to hardware and software constraints.


I can provide the debug log covering the period from shortly after the automounter restart (everything OK) until the problem occurs.


Any hint is greatly appreciated.
frank




