Hello Christian,

On 8/12/21 10:38 AM, Christian Brauner wrote:
> On Thu, Aug 12, 2021 at 07:36:54AM +0200, Michael Kerrisk (man-pages) wrote:
>> [CC += Eric, in case he has a comment on the last piece]

[...]

>>> That's really splitting hairs.
>>
>> To be clear, I'm not trying to split hairs :-). It's just that
>> I'm struggling a little to understand. (In particular, the notion
>> of locked mounts is one where my understanding is weak.)
>>
>> And think of it like this: I am the first line of defense for the
>> user-space reader. If I am having trouble understanding the text,
>> I won't be alone. And often, the problem is not so much that the
>> text is "wrong", it's that there's a difference in background
>> knowledge between what you know and what the reader (in this case
>> me) knows. Part of my task is to fill that gap, by adding info
>> that I think is necessary to the page (with the happy side
>> effect that I learn along the way.)
>
> All very good points.
> I didn't mean to complain btw. Sorry that it seemed that way. :)

No problem. I need to think more carefully about my words
sometimes in mails too :-)

>>> Of course this means that we're
>>> propagating into a mount namespace that is owned by a different user
>>> namespace though "crossing user namespaces" might have been the better
>>> choice.
>>
>> This is a perfect example of the point I make above. You say "of course",
>> but I don't have the background knowledge that you do :-). From my
>> perspective, I want to make sure that I understand your meaning, so
>> that that meaning can (IMHO) be made easier for the average reader
>> of the manual page.
>>
>>>>     the aforementioned flags to protect these sensitive
>>>>     properties from being altered.
>>>>
>>>>   • A new mount and user namespace pair is created. This
>>>>     happens for example when specifying CLONE_NEWUSER |
>>>>     CLONE_NEWNS in unshare(2), clone(2), or clone3(2).
>>>>     The aforementioned flags become locked to protect user
>>>>     namespaces from altering sensitive mount properties.
>>>>
>>>> Again, this seems imprecise. Should it say something like:
>>>> "... to prevent changes to sensitive mount properties in the new
>>>> mount namespace" ? Or perhaps you have a better wording.
>>>
>>> That's not imprecise.
>>
>> Okay -- poor choice of wording on my part:
>>
>> s/this seems imprecise/I'm having trouble understanding this/
>>
>>> What you want to protect against is altering
>>> sensitive mount properties from within a user namespace irrespective of
>>> whether or not the user namespace actually owns the mount namespace,
>>> i.e. even if you own the mount namespace you shouldn't be able to alter
>>> those properties. I concede though that "protect" should've been
>>> "prevent".
>>
>> Can I check my education here please. The point is this:
>>
>> * The mount point was created in a mount NS that was owned by
>>   a more privileged user NS (e.g., the initial user NS).
>> * A CLONE_NEWUSER|CLONE_NEWNS step occurs to create a new (user and)
>>   mount NS.
>> * In the new mount NS, the mounts become locked.
>>
>> And, help me here: is it correct that the reason the properties
>> need to be locked is because they are shared between the mounts?
>
> Yes, basically.

Yes, but that last sentence of mine was wrong, wasn't it? The properties
are not actually shared between the mounts, right? (Earlier, I had done
an experiment which misled me into thinking there was sharing, but now
it looks to me like there is not.)

> The new mount namespace contains a copy of all the mounts in the
> previous mount namespace. So they are separate mounts which you can best
> see when you do unshare --mount --propagation=private. An unmount in the
> new mount namespace won't affect the mount in the previous mount
> namespace. Which can only nicely work if they are separate mounts.
> Propagation relies (among other things) on the fact that mount
> namespaces have copies of the mounts.
>
> The copied mounts in the new mount namespace will have inherited all
> properties they had at the time when copy_namespaces() and specifically
> copy_mnt_ns() was called. Which calls into copy_tree() and ultimately
> into the appropriately named clone_mnt(). This is the low-level routine
> that is responsible for cloning the mounts including their mount
> properties.
>
> Some mount properties such as read-only, nodev, noexec, nosuid, atime -
> while arguably not per se security mechanisms - are used for protection
> or as security measures in userspace applications. The most obvious one
> might be the read-only property. One wouldn't want to expose a set of
> files as read-only only for someone else to trivially gain write access
> to them. An example of where that could happen is when creating a new
> mount namespace and user namespace pair where the new mount namespace
> is owned by the new user namespace in which the caller is privileged and
> thus the caller would also be able to alter the new mount namespace. So
> without locking flags all it would take to turn a read-only into a
> read-write mount is:
>
>     unshare -U --map-root --mount --propagation=private -- mount -o remount,rw /some/mnt
>
> Locking such flags prevents that from happening.

Thanks for the detailed explanation; it's very helpful.

>>> You could probably say:
>>>
>>>     A new mount and user namespace pair is created. This
>>>     happens for example when specifying CLONE_NEWUSER |
>>>     CLONE_NEWNS in unshare(2), clone(2), or clone3(2).
>>>     The aforementioned flags become locked in the new mount
>>>     namespace to prevent sensitive mount properties from being
>>>     altered.
>>>     Since the newly created mount namespace will be owned by the
>>>     newly created user namespace a caller privileged in the newly
>>>     created user namespace would be able to alter sensitive
>>>     mount properties.
>>>     For example, without locking the read-only
>>>     property for the mounts in the new mount namespace such a caller
>>>     would be able to remount them read-write.
>>
>> So, I've now made the text:
>>
>>    EPERM  One of the mounts had at least one of MOUNT_ATTR_NOATIME,
>>           MOUNT_ATTR_NODEV, MOUNT_ATTR_NODIRATIME, MOUNT_ATTR_NOEXEC,
>>           MOUNT_ATTR_NOSUID, or MOUNT_ATTR_RDONLY set and the flag is
>>           locked. Mount attributes become locked on a mount if:
>>
>>           • A new mount or mount tree is created causing mount
>>             propagation across user namespaces (i.e., propagation to
>>             a mount namespace owned by a different user namespace).
>>             The kernel will lock the aforementioned flags to prevent
>>             these sensitive properties from being altered.
>>
>>           • A new mount and user namespace pair is created. This
>>             happens for example when specifying CLONE_NEWUSER |
>>             CLONE_NEWNS in unshare(2), clone(2), or clone3(2). The
>>             aforementioned flags become locked in the new mount
>>             namespace to prevent sensitive mount properties from
>>             being altered. Since the newly created mount namespace
>>             will be owned by the newly created user namespace, a
>>             calling process that is privileged in the new user
>>             namespace would—in the absence of such locking—be able
>>             to alter sensitive mount properties (e.g., to remount a
>>             mount that was marked read-only as read-write in the new
>>             mount namespace).
>>
>> Okay?
>
> Sounds good.

Okay.

>>> (Fwiw, in this scenario there's a bit of (moderately sane) strangeness.
>>> A CLONE_NEWUSER | CLONE_NEWNS will cause even stronger protection to
>>> kick in. For all mounts not marked as expired MNT_LOCKED will be set
>>> which means that a umount() on any such mount copied from the previous
>>> mount namespace will yield EINVAL implying from userspace's perspective
>>> it's not mounted - granted EINVAL is the ioctl() of multiplexing errnos
>>> - whereas a remount to alter a locked flag will yield EPERM.)
>>
>> Thanks for educating me!
>> So, is that what we are seeing below?

(Was your silence to the above question an implicit "yes"?)

>>     $ sudo umount /mnt/m1
>>     $ sudo mount -t tmpfs none /mnt/m1
>>     $ sudo unshare -pf -Ur -m --mount-proc strace -o /tmp/log umount /mnt/m1
>>     umount: /mnt/m1: not mounted.
>>     $ grep ^umount /tmp/log
>>     umount2("/mnt/m1", 0) = -1 EINVAL (Invalid argument)
>>
>> The mount_namespaces(7) page has for a long time had this text:
>>
>>     * Mounts that come as a single unit from a more privileged mount
>>       namespace are locked together and may not be separated in a
>>       less privileged mount namespace. (The unshare(2) CLONE_NEWNS
>>       operation brings across all of the mounts from the original
>>       mount namespace as a single unit, and recursive mounts that
>>       propagate between mount namespaces propagate as a single unit.)
>>
>> I have had trouble understanding that. But maybe you just helped.
>> Is that text relevant to what you just wrote above? In particular,
>> I have trouble understanding what "separated" means. But, perhaps
>
> The text gives the "how" not the "why".

Yes, that's a big problem :-}.

> Consider a more elaborate mount tree where e.g., you have bind-mounted a
> mount over a subdirectory of another mount:
>
>     sudo mount -t tmpfs tmpfs /mnt
>     sudo mkdir /mnt/my-dir/
>     sudo touch /mnt/my-dir/my-file
>     sudo mount --bind /opt /mnt/my-dir
>
> The files underneath /mnt/my-dir are now hidden. Consider what would
> happen if one were allowed to address those mounts separately. A user
> could then do:
>
>     unshare -U --map-root --mount
>     umount /mnt/my-dir
>     cat /mnt/my-dir/my-file
>
> giving them access to what's in my-dir.
>
> Treating such mount trees as a unit in less privileged mount namespaces
> (cf. [1]) prevents that, i.e., prevents revealing files and directories
> that were overmounted.

Got it!

> Treating such mounts as a unit is also relevant when e.g. bind-mounting
> a mount tree containing locked mounts.
> Sticking with the example above:
>
>     unshare -U --map-root --mount
>
>     # non-recursive bind-mount will fail
>     mount --bind /mnt /tmp
>
>     # recursive bind-mount will succeed
>     mount --rbind /mnt /tmp
>
> The reason is again that the mount tree at /mnt is treated as a mount
> unit because it is locked. If one were to allow non-recursively
> bind-mounting /mnt somewhere it would mean revealing what's underneath
> the mount at my-dir. (This is in some sense the inverse of preventing a
> filesystem from being mounted that isn't fully visible, i.e. contains
> hidden or over-mounted mounts.)

Got it!

> These semantics, in addition to being security relevant, also allow a
> more privileged mount namespace to create a restricted view of the
> filesystem hierarchy that can't be circumvented in a less privileged
> mount namespace. (Otherwise pivot_root would have to be used which can
> also be used to guarantee a restricted view on the filesystem hierarchy
> especially when combined with a separate rootfs.)

Okay.

Christian, thanks for so generously taking the time to write this up.
It really helped me a lot! I will do some work on the mount namespaces
manual page, to cover at least part of what you said.

Thanks,

Michael

> Christian
>
> [1]: I'll avoid jumping through the hoops of speaking about ownership
>      all the time now for the sake of brevity. Otherwise I'll still sit
>      here at lunchtime.
>

--
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/