Hello Eric, Thank you for you response. On 8/17/21 5:51 PM, Eric W. Biederman wrote: > "Michael Kerrisk (man-pages)" <mtk.manpages@xxxxxxxxx> writes: > >> Hi Eric, >> >> Thanks for your feedback! >> >> On 8/16/21 6:03 PM, Eric W. Biederman wrote: >>> Michael Kerrisk <mtk.manpages@xxxxxxxxx> writes: >>> >>>> For a long time, this manual page has had a brief discussion of >>>> "locked" mounts, without clearly saying what this concept is, or >>>> why it exists. Expand the discussion with an explanation of what >>>> locked mounts are, why mounts are locked, and some examples of the >>>> effect of locking. >>>> >>>> Thanks to Christian Brauner for a lot of help in understanding >>>> these details. >>>> >>>> Reported-by: Christian Brauner <christian.brauner@xxxxxxxxxx> >>>> Signed-off-by: Michael Kerrisk <mtk.manpages@xxxxxxxxx> >>>> --- >>>> >>>> Hello Eric and others, >>>> >>>> After some quite helpful info from Chrstian Brauner, I've expanded >>>> the discussion of locked mounts (a concept I didn't really have a >>>> good grasp on) in the mount_namespaces(7) manual page. I would be >>>> grateful to receive review comments, acks, etc., on the patch below. >>>> Could you take a look please? >>>> >>>> Cheers, >>>> >>>> Michael >>>> >>>> man7/mount_namespaces.7 | 73 +++++++++++++++++++++++++++++++++++++++++ >>>> 1 file changed, 73 insertions(+) >>>> >>>> diff --git a/man7/mount_namespaces.7 b/man7/mount_namespaces.7 >>>> index e3468bdb7..97427c9ea 100644 >>>> --- a/man7/mount_namespaces.7 >>>> +++ b/man7/mount_namespaces.7 >>>> @@ -107,6 +107,62 @@ operation brings across all of the mounts from the original >>>> mount namespace as a single unit, >>>> and recursive mounts that propagate between >>>> mount namespaces propagate as a single unit.) >>>> +.IP >>>> +In this context, "may not be separated" means that the mounts >>>> +are locked so that they may not be individually unmounted. >>>> +Consider the following example: >>>> +.IP >>>> +.RS >>>> +.in +4n >>>> +.EX >>>> +$ \fBsudo mkdir /mnt/dir\fP >>>> +$ \fBsudo sh \-c \(aqecho "aaaaaa" > /mnt/dir/a\(aq\fP >>>> +$ \fBsudo mount \-\-bind -o ro /some/path /mnt/dir\fP >>>> +$ \fBls /mnt/dir\fP # Former contents of directory are invisible >>> >>> Do we want a more motivating example such as a /proc/sys? >>> >>> It has been common to mount over /proc files and directories that can be >>> written to by the global root so that users in a mount namespace may not >>> touch them. >> >> Seems reasonable. But I want to check one thing. Can you please >> define "global root". I'm pretty sure I know what you mean, but >> I'd like to know your definition. > > I mean uid 0 in the initial user namespace. (Good. That's what I thought you meant. So far, that term is not described in the manual pages. I just now added a definition of the term to user_namespaces(7).) > This uid owns most of files in /proc. > > Container systems that don't want to use user namespaces frequently > mount over files in proc to prevent using some of the root privileges > that come simply by having uid 0. > > Another use is mounting over files on virtual filesystems like proc > to reduce the attack surface. Thanks for the background. I think for the moment I will go with Christian's alternative suggestion (an example using /etc/shadow). > For reducing what the root user in a container can do, I think using user > namespaces and using a uid other than 0 in the initial user namespace. > > >>>> +.EE >>>> +.in >>>> +.RE >>>> +.IP >>>> +The above steps, performed in a more privileged user namespace, >>>> +have created a (read-only) bind mount that >>>> +obscures the contents of the directory >>>> +.IR /mnt/dir . >>>> +For security reasons, it should not be possible to unmount >>>> +that mount in a less privileged user namespace, >>>> +since that would reveal the contents of the directory >>>> +.IR /mnt/dir . >>> > +.IP >>>> +Suppose we now create a new mount namespace >>>> +owned by a (new) subordinate user namespace. >>>> +The new mount namespace will inherit copies of all of the mounts >>>> +from the previous mount namespace. >>>> +However, those mounts will be locked because the new mount namespace >>>> +is owned by a less privileged user namespace. >>>> +Consequently, an attempt to unmount the mount fails: >>>> +.IP >>>> +.RS >>>> +.in +4n >>>> +.EX >>>> +$ \fBsudo unshare \-\-user \-\-map\-root\-user \-\-mount \e\fP >>>> + \fBstrace \-o /tmp/log \e\fP >>>> + \fBumount /mnt/dir\fP >>>> +umount: /mnt/dir: not mounted. >>>> +$ \fBgrep \(aq^umount\(aq /tmp/log\fP >>>> +umount2("/mnt/dir", 0) = \-1 EINVAL (Invalid argument) >>>> +.EE >>>> +.in >>>> +.RE >>>> +.IP >>>> +The error message from >>>> +.BR mount (8) >>>> +is a little confusing, but the >>>> +.BR strace (1) >>>> +output reveals that the underlying >>>> +.BR umount2 (2) >>>> +system call failed with the error >>>> +.BR EINVAL , >>>> +which is the error that the kernel returns to indicate that >>>> +the mount is locked. >>> >>> Do you want to mention that you can unmount the entire subtree? Either >>> with pivot_root if it is locked to "/" or with >>> "umount -l /path/to/propagated/directory". >> >> Yes, I wondered about that, but hadn't got round to devising >> the scenario. How about this: >> >> [[ >> * Following on from the previous point, note that it is possible >> to unmount an entire tree of mounts that propagated as a unit > ^^^^^ subtree? Yes, probably better, to prevent misunderstandings. Changed (and in a few other places also). >> into a mount namespace that is owned by a less privileged user >> namespace, as illustrated in the following example. > >> >> First, we create new user and mount namespaces using >> unshare(1). In the new mount namespace, the propagation type >> of all mounts is set to private. We then create a shared bind >> mount at /mnt, and a small hierarchy of mount points underneath >> that mount point. >> >> $ PS1='ns1# ' sudo unshare --user --map-root-user \ >> --mount --propagation private bash >> ns1# echo $$ # We need the PID of this shell later >> 778501 >> ns1# mount --make-shared --bind /mnt /mnt >> ns1# mkdir /mnt/x >> ns1# mount --make-private -t tmpfs none /mnt/x >> ns1# mkdir /mnt/x/y >> ns1# mount --make-private -t tmpfs none /mnt/x/y >> ns1# grep /mnt /proc/self/mountinfo | sed 's/ - .*//' >> 986 83 8:5 /mnt /mnt rw,relatime shared:344 >> 989 986 0:56 / /mnt/x rw,relatime >> 990 989 0:57 / /mnt/x/y rw,relatime >> >> Continuing in the same shell session, we then create a second >> shell in a new mount namespace and a new subordinate (and thus >> less privileged) user namespace and check the state of the >> propagated mount points rooted at /mnt. >> >> ns1# PS1='ns2# unshare --user --map-root-user \ >> --mount --propagation unchanged bash >> ns2# grep /mnt /proc/self/mountinfo | sed 's/ - .*//' >> 1239 1204 8:5 /mnt /mnt rw,relatime master:344 >> 1240 1239 0:56 / /mnt/x rw,relatime >> 1241 1240 0:57 / /mnt/x/y rw,relatime >> >> Of note in the above output is that the propagation type of the >> mount point /mnt has been reduced to slave, as explained near >> the start of this subsection. This means that submount events >> will propagate from the master /mnt in "ns1", but propagation >> will not occur in the opposite direction. >> >> From a separate terminal window, we then use nsenter(1) to >> enter the mount and user namespaces corresponding to "ns1". In >> that terminal window, we then recursively bind mount /mnt/x at >> the location /mnt/ppp. >> >> $ PS1='ns3# ' sudo nsenter -t 778501 --user --mount >> ns3# mount --rbind --make-private /mnt/x /mnt/ppp >> ns3# grep /mnt /proc/self/mountinfo | sed 's/ - .*//' >> 986 83 8:5 /mnt /mnt rw,relatime shared:344 >> 989 986 0:56 / /mnt/x rw,relatime >> 990 989 0:57 / /mnt/x/y rw,relatime >> 1242 986 0:56 / /mnt/ppp rw,relatime >> 1243 1242 0:57 / /mnt/ppp/y rw,relatime shared:518 >> >> Because the propagation type of the parent mount, /mnt, was >> shared, the recursive bind mount propagated a small tree of >> mounts under the slave mount /mnt into "ns2", as can be >> verified by executing the following command in that shell >> session: >> >> ns2# grep /mnt /proc/self/mountinfo | sed 's/ - .*//' >> 1239 1204 8:5 /mnt /mnt rw,relatime master:344 >> 1240 1239 0:56 / /mnt/x rw,relatime >> 1241 1240 0:57 / /mnt/x/y rw,relatime >> 1244 1239 0:56 / /mnt/ppp rw,relatime >> 1245 1244 0:57 / /mnt/ppp/y rw,relatime master:518 >> >> While it is not possible to unmount a part of that propagated >> subtree (/mnt/ppp/y), it is possible to unmount the entire >> tree, as shown by the following commands: >> >> ns2# umount /mnt/ppp/y >> umount: /mnt/ppp/y: not mounted. >> ns2# umount -l /mnt/ppp | sed 's/ - .*//' # Succeeds... >> ns2# grep /mnt /proc/self/mountinfo >> 1239 1204 8:5 /mnt /mnt rw,relatime master:344 >> 1240 1239 0:56 / /mnt/x rw,relatime >> 1241 1240 0:57 / /mnt/x/y rw,relatime >> ]] >> >> ? > > Yes. > > It is worth noting that in ns2 it is also possible to mount on top of > /mnt/ppp/y and umount from /mnt/ppp/y. Yes, good point. I've added some text, and an example for that case. Cheers, Michael -- Michael Kerrisk Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/ Linux/UNIX System Programming Training: http://man7.org/training/