Hello all, I'm looking for review input for the pivot_root(2) manual page, which I have substantially rewritten. The original page was written 19 years ago, and has seen little revision since that time. It contains a number of errors. Even at the time it was first released, the manual page already had some inaccuracies, since it was written before the final release of the system call, whose implementation was subsequently changed, but the manual page was not updated to reflect those changes. The revised page is more than 2.5 times the size of the previous page, and now includes an example program. As well as fixing a number of errors and adding many missing details, the page also adds a description of the pivot_root(".", ".") technique. I would be happy to receive error corrections and notes on missing details that should be added to the page. The rendered page is shown below. The page source can be found in the Git repo at https://git.kernel.org/pub/scm/docs/man-pages/man-pages.git One area of the page that I'm still not really happy with is the "vague" wording in the second paragraph and the note in the third paragraph about the system call possibly changing. These pieces survive (in somewhat modified form) from the original page, which was written before the system call was released, and it seems there was some question about whether the system call might still change its behavior with respect to the root directory and current working directory of other processes. However, after 19 years, nothing has changed, and surely it will not in the future, since that would constitute an ABI breakage. I'm considering to rewrite these pieces to exactly describe what the system call does (which I already do in the third paragraph) and remove the "may or may not" pieces in the second paragraph. I'd welcome comments on making that change. The rendered page is shown below. The page source is at https://git.kernel.org/pub/scm/docs/man-pages/man-pages.git/tree/man2/pivot_root.2 in the Git repo at https://git.kernel.org/pub/scm/docs/man-pages/man-pages.git Thanks, Michael NAME pivot_root - change the root filesystem SYNOPSIS int pivot_root(const char *new_root, const char *put_old); Note: There is no glibc wrapper for this system call; see NOTES. DESCRIPTION pivot_root() changes the root filesystem in the mount namespace of the calling process. More precisely, it moves the root filesystem to the directory put_old and makes new_root the new root filesys‐ tem. The calling process must have the CAP_SYS_ADMIN capability in the user namespace that owns the caller's mount namespace. pivot_root() may or may not change the current root and the cur‐ rent working directory of any processes or threads that use the old root directory and which are in the same mount namespace as the caller of pivot_root(). The caller of pivot_root() should ensure that processes with root or current working directory at the old root operate correctly in either case. An easy way to ensure this is to change their root and current working directory to new_root before invoking pivot_root(). Note also that pivot_root() may or may not affect the calling process's current working directory. It is therefore recommended to call chdir("/") immediately after pivot_root(). The paragraph above is intentionally vague because at the time when pivot_root() was first implemented, it was unclear whether its affect on other process's root and current working directo‐ ries—and the caller's current working directory—might change in the future. However, the behavior has remained consistent since this system call was first implemented: pivot_root() changes the root directory and the current working directory of each process or thread in the same mount namespace to new_root if they point to the old root directory. (See also NOTES.) On the other hand, pivot_root() does not change the caller's current working direc‐ tory (unless it is on the old root directory), and thus it should be followed by a chdir("/") call. The following restrictions apply: - new_root and put_old must be directories. - new_root and put_old must not be on the same filesystem as the current root. In particular, new_root can't be "/" (but can be a bind mounted directory on the current root filesystem). - put_old must be at or underneath new_root; that is, adding a nonnegative number of /.. to the string pointed to by put_old must yield the same directory as new_root. - new_root must be a mount point. (If it is not otherwise a mount point, it suffices to bind mount new_root on top of itself.) - The propagation type of the parent mount of new_root and the parent mount of the current root directory must not be MS_SHARED; similarly, if put_old is an existing mount point, its propagation type must not be MS_SHARED. These restrictions ensure that pivot_root() never propagates any changes to another mount namespace. - The current root directory must be a mount point. RETURN VALUE On success, zero is returned. On error, -1 is returned, and errno is set appropriately. ERRORS pivot_root() may fail with any of the same errors as stat(2). Additionally, it may fail with the following errors: EBUSY new_root or put_old is on the current root filesystem. (This error covers the pathological case where new_root is "/".) EINVAL new_root is not a mount point. EINVAL put_old is not underneath new_root. EINVAL The current root directory is not a mount point (because of an earlier chroot(2)). EINVAL The current root is on the rootfs (initial ramfs) filesys‐ tem; see NOTES. EINVAL Either the mount point at new_root, or the parent mount of that mount point, has propagation type MS_SHARED. EINVAL put_old is a mount point and has the propagation type MS_SHARED. ENOTDIR new_root or put_old is not a directory. EPERM The calling process does not have the CAP_SYS_ADMIN capa‐ bility. VERSIONS pivot_root() was introduced in Linux 2.3.41. CONFORMING TO pivot_root() is Linux-specific and hence is not portable. NOTES Glibc does not provide a wrapper for this system call; call it using syscall(2). A command-line interface for this system call is provided by pivot_root(8). pivot_root() allows the caller to switch to a new root filesystem while at the same time placing the old root mount at a location under new_root from where it can subsequently be unmounted. (The fact that it moves all processes that have a root directory or current working directory on the old root filesystem to the new root filesystem frees the old root filesystem of users, allowing it to be unmounted more easily.) A typical use of pivot_root() is during system startup, when the system mounts a temporary root filesystem (e.g., an initrd), then mounts the real root filesystem, and eventually turns the latter into the current root of all relevant processes or threads. A modern use is to set up a root filesystem during the creation of a container. The fact that pivot_root() modifies process root and current work‐ ing directories in the manner noted in DESCRIPTION is necessary in order to prevent kernel threads from keeping the old root direc‐ tory busy with their root and current working directory, even if they never access the filesystem in any way. new_root and put_old may be the same directory. In particular, the following sequence allows a pivot-root operation without need‐ ing to create and remove a temporary directory: chdir(new_root); mount("", ".", MS_SLAVE | MS_REC, NULL); /* Or: MS_PRIVATE | MS_REC */ pivot_root(".", "."); umount2(".", MNT_DETACH); This sequence succeeds because the pivot_root() call stacks the old root mount point (old_root) on top of the new root mount point at /. At that point, the calling process's root directory and current working directory refer to the new root mount point (new_root). During the subsequent umount() call, resolution of "." starts with new_root and then moves up the list of mounts stacked at /, with the result that old_root is unmounted. The rootfs (initial ramfs) cannot be pivot_root()ed. The recom‐ mended method of changing the root filesystem in this case is to delete everything in rootfs, overmount rootfs with the new root, attach stdin/stdout/stderr to the new /dev/console, and exec the new init(1). Helper programs for this process exist; see switch_root(8). EXAMPLE The program below demonstrates the use of pivot_root() inside a mount namespace that is created using clone(2). After pivoting to the root directory named in the program's first command-line argu‐ ment, the child created by clone(2) then executes the program named in the remaining command-line arguments. We demonstrate the program by creating a directory that will serve as the new root filesystem and placing a copy of the (statically linked) busybox(1) executable in that directory. $ mkdir /tmp/rootfs $ ls -id /tmp/rootfs # Show inode number of new root directory 319459 /tmp/rootfs $ cp $(which busybox) /tmp/rootfs $ PS1='bbsh$ ' sudo ./pivot_root_demo /tmp/rootfs /busybox sh bbsh$ PATH=/ bbsh$ busybox ln busybox ln bbsh$ ln busybox echo bbsh$ ln busybox ls bbsh$ ls busybox echo ln ls bbsh$ ls -id / # Compare with inode number above 319459 / bbsh$ echo 'hello world' hello world Program source /* pivot_root_demo.c */ #define _GNU_SOURCE #include <sched.h> #include <stdio.h> #include <stdlib.h> #include <unistd.h> #include <sys/wait.h> #include <sys/syscall.h> #include <sys/mount.h> #include <sys/stat.h> #include <limits.h> #define errExit(msg) do { perror(msg); exit(EXIT_FAILURE); \ } while (0) static int pivot_root(const char *new_root, const char *put_old) { return syscall(SYS_pivot_root, new_root, put_old); } #define STACK_SIZE (1024 * 1024) static int /* Startup function for cloned child */ child(void *arg) { char **args = arg; char *new_root = args[0]; const char *put_old = "/oldrootfs"; char path[PATH_MAX]; /* Ensure that 'new_root' and its parent mount don't have shared propagation (which would cause pivot_root() to return an error), and prevent propagation of mount events to the initial mount namespace */ if (mount(NULL, "/", NULL, MS_REC | MS_PRIVATE, NULL) == 1) errExit("mount-MS_PRIVATE"); /* Ensure that 'new_root' is a mount point */ if (mount(new_root, new_root, NULL, MS_BIND, NULL) == -1) errExit("mount-MS_BIND"); /* Create directory to which old root will be pivoted */ snprintf(path, sizeof(path), "%s/%s", new_root, put_old); if (mkdir(path, 0777) == -1) errExit("mkdir"); /* And pivot the root filesystem */ if (pivot_root(new_root, path) == -1) errExit("pivot_root"); /* Switch the current working working directory to "/" */ if (chdir("/") == -1) errExit("chdir"); /* Unmount old root and remove mount point */ if (umount2(put_old, MNT_DETACH) == -1) perror("umount2"); if (rmdir(put_old) == -1) perror("rmdir"); /* Execute the command specified in argv[1]... */ execv(args[1], &args[1]); errExit("execv"); } int main(int argc, char *argv[]) { /* Create a child process in a new mount namespace */ char *stack = malloc(STACK_SIZE); if (stack == NULL) errExit("malloc"); if (clone(child, stack + STACK_SIZE, CLONE_NEWNS | SIGCHLD, &argv[1]) == -1) errExit("clone"); /* Parent falls through to here; wait for child */ if (wait(NULL) == -1) errExit("wait"); exit(EXIT_SUCCESS); } SEE ALSO chdir(2), chroot(2), mount(2), stat(2), initrd(4), mount_names‐ paces(7), pivot_root(8), switch_root(8) -- Michael Kerrisk Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/ Linux/UNIX System Programming Training: http://man7.org/training/