Re: [PATCH/RFC] autofs: the documentation I wanted to read

Randy Dunlap <rdunlap@xxxxxxxxxxxxx> · Wed, 30 Jul 2014 11:37:14 -0700

On 07/28/14 19:00, NeilBrown wrote:
> 
> 
> This documents autofs from the perspective of what the module actually
> supports rather than how automount is expected to use it.
> It is based mostly on code review and very little on testing so it
> may be inaccurate in some places.
> 
> The document assumes the functionality added by the RCU-walk patches
> that I posted recently.
> 
> It is formatted using "markdown" and works best with Markdown.pl
> (markdown_py doesn't like some constructs).
> 
> 
> Signed-off-by: NeilBrown <neilb@xxxxxxx>
> 
> diff --git a/Documentation/filesystems/autofs4.txt b/Documentation/filesystems/autofs4.txt
> new file mode 100644
> index 000000000000..45f67c83d713
> --- /dev/null
> +++ b/Documentation/filesystems/autofs4.txt
> @@ -0,0 +1,503 @@
> +<head>
> +<style> p { max-width:50em} ol, ul {max-width: 40em}</style>
> +</head>
> +
> +autofs - how it works
> +=====================
> +
> +Purpose
> +-------
> +
> +The goal of autofs is to provide on-demand mounting and race free
> +automatic unmounting of various other filesystems.  This provides two
> +key advantages:
> +
> +1. There is no need to delay boot until all filesystems that
> +   might be needed are mounted.  Processes that try to access those
> +   slow filesystems might be delayed but other processes can
> +   continue freely.  This is particularly important for
> +   network filesystems (e.g. NFS) or filesystems stored on
> +   media with a media-changing robot.
> +
> +2. The names and locations of filesystems can be stored in
> +   a remote database and can change at any time.  The content
> +   in that data base at the time of access will be used to provide
> +   a target for the access.  The interpretation of names in the
> +   filesystem can even be programatic rather than database-backed,

                             programmatic

> +   allowing wildcards for example, and can vary based on the user who
> +   first accessed a name.
> +
> +Context
> +-------
> +
> +The "autofs4" filesystem module is only one part of an autofs system.
> +There also needs to be a user-space program which looks up names
> +and mounts filesystems.  This will often be the "automount" program,
> +though other tools including "systemd" can make use of "autofs4".
> +This document describes only the kernel module and the interactions
> +required with any user-space program.  Subsequent text refers to this
> +as the "automount daemon" or simply "the daemon".
> +
> +"autofs4" is a Linux kernel module with provides the "autofs"
> +filesystem type.  Several "autofs" filesystems can be mounted and they
> +can each be managed separately, or all managed by the same daemon.
> +
> +Content
> +-------
> +
> +An autofs filesystem can contain 3 sorts of objects: directories,
> +symbolic links and mount traps.  Mount traps are directories with
> +extra properties as described in the next section.
> +
> +Objects can only be created by the automount daemon: symlinks are
> +created with a regular `symlink` systemcall, while directories and
> +mount traps are created with `mkdir`.  The determination of whether a
> +directory should be a mount trap or not is quite _ad hoc_, largely for
> +historical reasons, and is determined in part the

                                         in part by the

> +*direct*/*indirect*/*offset* mount options, and the *maxproto* mount option.
> +
> +If neither the *direct* or *offset* mount options are given (so the
> +mount is considered to be *indirect*), then the root directory is
> +always a regular directory, otherwise it is a mount trap when it is
> +empty and a regular directory when not empty.  Note that *direct* and
> +*offset* are treated identically so a concise summary is that the root
> +directory is a mount trap only if the filesystem is mounted *direct*
> +and the root is empty.
> +
> +Directories created in the root directory are mount traps only if the
> +filesystem is mounted  *indirect* and they are empty.
> +
> +Directories further down the tree depend on the *max_proto* mount

                                                    max_proto or maxproto?  or either?
check/fix other places also.

> +option and particularly whether it is less than five or not.
> +When *max_proto* is five, no directories further down the
> +tree are ever mount traps, they are always regular directories.  When
> +the *max_proto* is four (or three), these directories are mount traps
> +precisely when they are empty.
> +
> +So: non-empty (i.e. non-leaf) directories are never mount traps. Empty
> +directories are sometimes mount traps, and sometimes not depending on
> +where in the tree they are (root, top level, or lower) the *maxproto*,

                                                or lower), the

> +and whether the mount was *indirect* or not.
> +
> +Mount Traps
> +---------------
> +
> +A core element of the implementation of autofs is the Mount Traps
> +which are provided by the Linux VFS.  Any directory provided by a
> +filesystem can be designated as a trap.  This involves two separate
> +features that work together to allow autofs to do its job.
> +
> +**DCACHE_NEED_AUTOMOUNT**
> +
> +If a dentry has the DCACHE_NEED_AUTOMOUNT flag set (which gets set if
> +the inode has S_AUTOMOUNT set, or can be set directly) then it is
> +(potentially) a mount trap.  Any access to this directory beyond a
> +"`stat`" will (normally) cause the `d_op->d_automount()` dentry operation
> +to be called. The task of this method is to find the filesystem that
> +should be mounted on the directory and to return it.  The VFS is
> +responsibly for actually mounting the root of this filesystem on the

   responsible

> +directory.
> +
> +autofs doesn't find the filesystem itself but sends a message to the
> +automount daemon asking it to find and mount the filesystem.  The
> +autofs `d_automount` method then waits for the daemon to report that
> +everything is ready.  It will then return "`NULL`" indicating that the
> +mount has already happened.  The VFS doesn't try to mount anything but
> +follows down the mount that is already there.
> +
> +This functionality is sufficient for some users of mount traps such
> +as NFS which creates traps so that mountpoints on the server can be
> +reflected on the client.  However it is not sufficient for autofs.  As
> +mounting onto a directory is considered to be "beyond a `stat`", the
> +automount daemon would not be able to mount a filesystem on the 'trap'
> +directory without some way to avoid getting caught in the trap.  For
> +that purpose there is another flag.
> +
> +**DCACHE_MANAGE_TRANSIT**
> +
> +If a dentry has DCACHE_MANAGE_TRANSIT set then two very different but
> +related behaviors are invoked, both using the `d_op->d_manage()`
> +dentry operation.
> +
> +Firstly, before checking to see if any filesystem is mounted on the
> +directory, d_manage() will be called with the `rcu_walk` parameter set
> +to `false`.  It may return one of three things:
> +
> +-  A return value of zero indicates that there is nothing special
> +   about this dentry and normal checks for mounts and automounts
> +   should proceed.
> +
> +   autofs normally returns zero, but first waits for any
> +   expiry (automatic unmounting of the mounted filesystem) to
> +   complete.  This avoids races.
> +
> +-  A return value of `-EISDIR` tells the VFS to ignore any mounts
> +   on the directory and to not consider calling `->d_automount()`.
> +   This effectively disables the **DCACHE_NEED_AUTOMOUNT** flag
> +   causing the directory not be a mount trap after all.
> +
> +   autofs returns this if it detects that the process performing the
> +   lookup is the automount daemon and that the mount has been
> +   requested but has not yet completed.  How it determines this is
> +   discussed later.  This allows the automount daemon not to get
> +   caught in the mount trap.
> +
> +   There is a subtlety here.  It is possibly that a second autofs

                                       possible

> +   filesystem can be mounted below the first and for both of them to
> +   be managed by the same daemon.  For the daemon to be able to mount
> +   something on the second it must be able to "walk" down past the
> +   first.  This means that d_manage cannot *always* return -EISDIR for
> +   the automount daemon.  It must only return it when a mount has
> +   been requested, but has not yet completed.
> +
> +   `d_manage` also returns `-EISDIR` if the dentry shouldn't be a
> +   mount trap, either because it is a symbolic link or because it is
> +   not empty.
> +
> +-  Any other negative value is treated as an error and returned
> +   to the caller.
> +
> +   autofs can return
> +
> +   - -ENOENT if the automount daemon failed to mount anything,
> +   - -ENOMEM if it ran out of memory,
> +   - -EINTR if a signal arrived while waiting for expiry to
> +     complete
> +   - or any other error sent down by the automount daemon.
> +
> +
> +The second use case only occurs during an "RCU-walk" and so `rcu_walk`
> +will be set.
> +
> +An RCU-walk is a fast and light weight process for walking down a

                             lightweight

> +filename path (i.e. it is like running on tip-toes).  RCU-walk cannot
> +cope with all situations so when it finds a difficulty it falls back
> +to "REF-walk", which is slower but more robust.
> +
> +RCU-walk will never call `->d_automount`, the filesystems must already

                            `->d_automount`; the

> +be mounted or RCU-walk cannot handle the path.
> +To determine if a mount-trap is safe for RCU-walk mode it calls
> +`->d_manage()` with `rcu_walk` set to `true`.
> +
> +In this case `d_manage()` must avoid blocking and should avoid taking
> +spinlocks if at all possible.  Its sole purpose is to determine if it
> +would be safe to follow down into any mounted directory and the only
> +reason that it might not be is if an expiry of the mount is
> +underway.
> +
> +In the `rcu_walk` case, `d_manage()` cannot return -EISDIR to tell the
> +VFS that this is a directory that doesn't require d_automount.  If
> +`rcu_walk` sees a dentry with DCACHE_NEED_AUTOMOUNT set but nothing
> +mounted, it *will* fall back to REF-walk.  `d_manage()` cannot make the
> +VFS remain in RCU-walk mode, but can only tell it to get out of
> +RCU-walk mode by returning `-ECHILD`.
> +
> +So `d_manage()`, when called with `rcu_walk` set, should either return
> +-ECHILD if there is any reason to believe it is unsafe to end the
> +mounted filesystem, and otherwise should return 0.
> +
> +autofs will return `-ECHILD` if an expiry of the filesystem has been
> +initiated or is being considered, otherwise it returns 0.
> +
> +
> +Mountpoint expiry
> +-----------------
> +
> +The VFS has a mechansim for automatically expiring unused mounts,
> +much as it can expire any unused dentry information from the dcache.
> +This is guided by the MNT_SHRINKABLE flag.  This  only applies to
> +mounts that were created by `d_automount()` returning a filesystem to be
> +mounted.  As autofs doesn't return such a filesystem be leaves the

                                                        but ??

> +mounting to the automount daemon, it must involve the automount daemon
> +in unmounting as well.  This also means that autofs has more control
> +of expiry.
> +
> +The VFS also supports "expiry" of mounts using the MNT_EXPIRE flag to
> +the `umount` systemcall.  Unmounting with MNT_EXPIRE will fail unless

I would spell:  system call.

> +a previous attempt had been made, and the filesystem was been inactive

                                                        has

> +and untouched since that previous attempt.  autofs4 does not depend on
> +this but has its own internal tracking of whether filesystems were
> +recently used.  This allows individual names in the autofs directory
> +to expire separately.
> +
> +With version 4 of the protocol, the automount daemon can try to
> +unmount any filesystems mounted on the autofs filesystem or remove any
> +symbolic links or empty directories any time it likes.  If the unmount
> +or removal is successful the filesystem will be returned to the state
> +it was before the mount or creation, so that any access of the name
> +will trigger normal auto-mount processing.  In particlar, `rmdir` and
> +`unlink` do not leave negative entries in the dcache as a normal
> +filesystem would, so an attempt to access a recently-removed object is
> +passed to autofs for handling.
> +
> +With version 5, this is not safe except for unmounting from top-level
> +directories.  As lower-level directories are never mount traps, other
> +processes will see an empty directory as soon as the filesystem is
> +unmounted.  So it is generally safest to use the autofs expiry
> +protocol described below.
> +
> +Normally the daemon only wants to remove entries which haven't been
> +used for a while.  For this purpose autofs maintains a "`last_used`"
> +time stamp on each directory or symlink.  For symlinks it genuinely
> +does record the last time the symlink was "used" or followed to find
> +out where it points to.  For directories the field is a slight
> +misnomer.  It actually records the last time that autofs checked if
> +the directory or one of its descendents was busy and found that it
> +was.  This is just as useful and doesn't require updating the field so
> +often.
> +
> +The daemon is able to ask autofs if anything is due to be expired,
> +using an `ioctl` as discussed later.  For a *direct* mount, autofs
> +considers if the entire mount-tree can be unmounted or not.  For an
> +*indirect* mount, autofs considers each of the names in the top level
> +directory to determine if any of those can be unmounted and cleaned
> +up.
> +
> +There is an option with indirect mounts to consider each of the leaves
> +that has been mounted on instead of considering the top-level names.
> +This is intended for compatability with version 4 of autofs and should
> +be considered as deprecated.
> +
> +When autofs considers a directory it checks the `last_used` time and
> +compares it with the "timeout" value set when the filesystem was
> +mounted, though this check is ignored in some cases. It also checks if
> +the directory or anything below it is in use.  For symbolic links,
> +only the `last_used` time is ever considered.
> +
> +If both appear to support expiring the directory or symlink, an action
> +is taken.
> +
> +There are two ways to ask autofs to consider expiry.  The first is to
> +use the **AUTOFS_IOC_EXPIRE** ioctl.  This only works for indirect
> +mounts.  If it finds something in the root directory to expire it will
> +return the name of that thing.  Once a name has been returned the
> +automount daemon needs to unmount any filesystems mounted below the
> +name normally.  As described above, this is unsafe for non-toplevel
> +mounts in a version-5 autofs.  For this reason the current `automountd`
> +does not use this ioctl.
> +
> +The second mechanism uses either the **AUTOFS_DEV_IOCTL_EXPIRE_CMD** or
> +the **AUTOFS_IOC_EXPIRE_MULTI** ioctl.  This will work for both direct and
> +indirect mounts.  If it selects an object to expire, it will notify
> +the daemon using the notification mechanism described below.  This
> +will block until the daemon acknowledges the expiry notification.
> +This implies that the "`EXPIRE`" ioctl must be sent from a different
> +thread than the one which handles notification.
> +
> +While the ioctl is blocking, the entry is marked as "expiring" and
> +`d_manage` will block until the daemon affirms that the unmount has
> +completed (together with removing any directories that might have been
> +necessary), or has been aborted.
> +
> +Communicating with autofs: detecting the daemon
> +-----------------------------------------------
> +
> +There are several forms of communication between the automount daemon
> +and the filesystem.  As we have already seen, the daemon can create and
> +remove directories and symlinks using normal filesystem operations.
> +autofs knows whether a process requesting some operation is the daemon
> +or not based on it's process-group id number (see getpgid(1)).

                   its

> +
> +When an autofs filesystem it mounted the pgid of the mounting
> +processes is recorded unless that "pgrp=" option is given, in which

                                 the

> +case that number is recorded instead.  Any request arriving from a
> +process in that process group is considered to come from the daemon.
> +If the daemon ever has to be stopped and restarted a new pgid can be
> +provided through an ioctl as will be described below.
> +
> +Communicating with autofs: the event pipe
> +-----------------------------------------
> +
> +When an autofs filesystem is mounted, the 'write' end of a pipe must
> +be passed using the 'fd=' mount option.  autofs will write
> +notification messages to this pipe for the daemon to respond to.
> +For version 5, the format of the message is:
> +
> +	struct autofs_v5_packet {
> +		int proto_version;		/* Protocol version */
> +		int type;			/* Type of packet */
> +		autofs_wqt_t wait_queue_token;
> +		__u32 dev;
> +		__u64 ino;
> +		__u32 uid;
> +		__u32 gid;
> +		__u32 pid;
> +		__u32 tgid;
> +		__u32 len;
> +		char name[NAME_MAX+1];
> +	};
> +
> +where the type is one of
> +
> +	autofs_ptype_missing_indirect
> +	autofs_ptype_expire_indirect
> +	autofs_ptype_missing_direct
> +	autofs_ptype_expire_direct
> +
> +so messages can indicate that a name is missing (something tried to
> +access it but it isn't there) or that it has been selected for expiry.
> +
> +The pipe will be set to "packet mode" (equivalent to passing
> +`O_DIRECT`) to _pipe2(2)_ so that a read from the pipe will return at
> +most one packet, and any unread portion of a packet will be discarded.
> +
> +The `wait_queue_token` is a unique number which can identify a
> +particular request to be acknowledged.  When a message is sent over
> +the pipe the affected dentry is marked as either "active" or
> +"expiring" and other accesses to it block until the messages is

                                                       message

> +acknowledged using one of the ioctls below and the relevant
> +`wait_queue_token`.
> +
> +Communicating with autofs: root directory ioctls
> +------------------------------------------------
> +
> +The root directory of an autofs filesystem will respond to a number of
> +ioctls.   The process issuing the ioctl must have the CAP_SYS_ADMIN
> +capability, or must be the automount daemon.
> +
> +The available ioctl commands are:
> +
> +- **AUTOFS_IOC_READY**: a notification has been handled.  The argument
> +    to the ioctl command is the "wait_queue_token" number
> +    corresponding to the notification being acknowledged.
> +- **AUTOFS_IOC_FAIL**: similar to above, but indicates failure with
> +    the error code `ENOENT`.
> +- **AUTOFS_IOC_CATATONIC**: Causes the autofs to enter "catatonic"
> +    mode meaning that it stops sending notifications to the daemon.
> +    This mode is also entered if a write to the pipe fails.
> +- **AUTOFS_IOC_PROTOVER**:  This returns the protocol version in use.
> +- **AUTOFS_IOC_PROTOSUBVER**: Returns the protocol sub-version which
> +    is really a version number for the implementation.  It is
> +    currently 2.
> +- **AUTOFS_IOC_SETTIMEOUT**:  This passes a pointer to an unsigned
> +    long.  The value is used to set the timeout for expiry, and
> +    the current timeout value is stored back through the pointer.
> +- **AUTOFS_IOC_ASKUMOUNT**:  Returns, in the pointed-to `int`, 1 if
> +    the filesystem could be unmounted.  This is only a hint as
> +    the situation could change at any instant.  This call can be
> +    use to avoid a more expensive full unmount attempt.
> +- **AUTOFS_IOC_EXPIRE**: as described above, this asks if there is
> +    anything suitable to expire.  A pointer to a packet:
> +
> +        struct autofs_packet_expire_multi {
> +        	int proto_version;              /* Protocol version */
> +        	int type;                       /* Type of packet */
> +        	autofs_wqt_t wait_queue_token;
> +        	int len;
> +        	char name[NAME_MAX+1];
> +        };
> +
> +     is required.  This is filled in with the name of something
> +     that can be unmounted or removed.  If nothing can be expired,
> +     `errno` is set to `EAGAIN`.  Even though a `wait_queue_token`
> +     is present in the structure, not "wait queue" is established

                                     no

> +     and no acknowledgment is needed.
> +- **AUTOFS_IOC_EXPIRE_MULTI**:  This is similar to
> +     **AUTOFS_IOC_EXPIRE** except that it causes notification to be
> +     sent to the daemon, and it blocks until the daemon acknowledges.
> +     The argument is an integer which can contain two different flags.
> +
> +     **AUTOFS_EXP_IMMEDIATE** causes `last_used` time to be ignored
> +     and objects are expired if the are not in use.
> +
> +     **AUTOFS_EXP_LEAVES** will select a leaf rather than a top-level
> +     name to expire.  This is only safe when *maxproto* is 4.
> +
> +Communicating with autofs: char-device ioctls
> +---------------------------------------------
> +
> +It is not always possible to open the root of an autofs filesystem,
> +particularly a *direct* mounted filesystem.  If the automount daemon
> +is restarted there is no way for it to regain control of existing
> +mounts using any of the above communication channels.  To address this
> +need there is a "miscellaneous" character device (major 10, minor 235)
> +which can be used to communicate directly with the autofs filesystem.
> +It requires CAP_SYS_ADMIN for access.
> +
> +The `ioctl`s that can be used on this device a described in a separate

                                                are

> +document `autofs4-mount-control.txt`, and are summarized briefly here.
> +Each ioctl is passed a pointer to an `autofs_dev_ioctl` structure:
> +
> +	struct autofs_dev_ioctl {
> +		__u32 ver_major;
> +		__u32 ver_minor;
> +		__u32 size;             /* total size of data passed in
> +		                         * including this struct */
> +		__s32 ioctlfd;          /* automount command fd */
> +
> +		__u32 arg1;             /* Command parameters */
> +		__u32 arg2;
> +
> +		char path[0];
> +	};
> +
> +For the **OPEN_MOUNT** and **IS_MOUNTPOINT** commands, the target
> +filesystem is identified by the `path`.  All other commands identify
> +the filesystem by the `ioctlfd` which is a file descriptor open on the
> +root, and which can be returned by **OPEN_MOUNT**.
> +
> +The `ver_major` and `ver_minor` are in/out parameters which check that
> +the requested version is supported, and report the maximum version
> +that the kernel module can support.
> +
> +Commands are:
> +
> +- **AUTOFS_DEV_IOCTL_VERSION_CMD**: does nothing, except validate and
> +    set version numbers.
> +- **AUTOFS_DEV_IOCTL_OPENMOUNT_CMD**: return an open file descriptor
> +    on the root of an autofs filesystem.  The filesystem is identified
> +    by name a device number, which is stored in `arg1`.  Device

               and ??

> +    numbers for existing filesystems can be found in
> +    `/proc/self/mountinfo`.
> +- **AUTOFS_DEV_IOCTL_CLOSEMOUNT_CMD**: same as `close(ioctlfd)`.
> +- **AUTOFS_DEV_IOCTL_SETPIPEFD_CMD**: if the  filesystem is in
> +    catatonic mode, this can provide the write end of a new pipe
> +    in `arg1` to re-establish communication with a daemon.  The
> +    process group of the calling process is used to identify the
> +    daemon.
> +- **AUTOFS_DEV_IOCTL_REQUESTER_CMD**: `path` should be a
> +    name within the filesystem that as been auto-mounted on.

                                       has

> +    arg1 is the dev number of the underlying autofs.  On successful
> +    return, `arg1` and `arg2` will be the UID and GID of the process
> +    which triggered that mount.
> +
> +- **AUTOFS_DEV_IOCTL_ISMOUNTPOINT_CMD**: Check if path is a
> +    mountpoint of a particular type - see separate documentation for
> +    details.
> +
> +- **AUTOFS_DEV_IOCTL_PROTOVER_CMD**:
> +- **AUTOFS_DEV_IOCTL_PROTOSUBVER_CMD**:
> +- **AUTOFS_DEV_IOCTL_READY_CMD**:
> +- **AUTOFS_DEV_IOCTL_FAIL_CMD**:
> +- **AUTOFS_DEV_IOCTL_CATATONIC_CMD**:
> +- **AUTOFS_DEV_IOCTL_TIMEOUT_CMD**:
> +- **AUTOFS_DEV_IOCTL_EXPIRE_CMD**:
> +- **AUTOFS_DEV_IOCTL_ASKUMOUNT_CMD**:  These all have the same
> +    function as the similarly named **AUTOFS_IOC** ioctls, except
> +    that **FAIL** can be given an explicit error number in `arg1`
> +    instead of assuming `ENOENT`, and this **EXPIRE** command
> +    corresponds to **AUTOFS_IOC_EXPIRE_MULTI**.
> +
> +Catatonic mode
> +--------------
> +
> +As mentioned, an autofs mount can enter "catatonic" mode.  This
> +happens if a write to the notification pipe fails, or if it is
> +explicitly requested by an `ioctl`.
> +
> +When entering catatonic mode, the pipe is closed and any pending
> +notifications are acknowledged with the error `ENOENT`.
> +
> +Once in catatonic mode attempts to access non-existing names will
> +result in `ENOENT` while attempts to access existing directories will
> +be treated in the same way is if they came from the daemon, so mount

                              as

> +traps will not fire.
> +
> +When the filesystem is mounted a _uid_ and _gid_ can be given which
> +set the ownership of directories and symbolic links.  When the
> +filesystem is in catatonic mode, any process with a matching UID can
> +create directories or symlink in the root directory, but not in other

                         symlinks

> +directories.
> +
> +Catatonic mode can only be left via the
> +**AUTOFS_DEV_IOCTL_OPENMOUNT_CMD** ioctl on the `/dev/autofs`.
> 

Thanks.  Nice job.

-- 
~Randy
--
To unsubscribe from this list: send the line "unsubscribe autofs" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html