"Michael Kerrisk (man-pages)" <mtk.manpages@xxxxxxxxx> writes: > Hello Eric, > > On 9/30/19 2:42 PM, Eric W. Biederman wrote: >> "Michael Kerrisk (man-pages)" <mtk.manpages@xxxxxxxxx> writes: >> >>> Hello Eric, >>> >>> A ping on my question below. Could you take a look please? >>> >>> Thanks, >>> >>> Michael >>> >>>>>>> The concern from our conversation at the container mini-summit was that >>>>>>> there is a pathology if in your initial mount namespace all of the >>>>>>> mounts are marked MS_SHARED like systemd does (and is almost necessary >>>>>>> if you are going to use mount propagation), that if new_root itself >>>>>>> is MS_SHARED then unmounting the old_root could propagate. >>>>>>> >>>>>>> So I believe the desired sequence is: >>>>>>> >>>>>>>>>> chdir(new_root); >>>>>>> +++ mount("", ".", MS_SLAVE | MS_REC, NULL); >>>>>>>>>> pivot_root(".", "."); >>>>>>>>>> umount2(".", MNT_DETACH); >>>>>>> >>>>>>> The change to new new_root could be either MS_SLAVE or MS_PRIVATE. So >>>>>>> long as it is not MS_SHARED the mount won't propagate back to the >>>>>>> parent mount namespace. >>>>>> >>>>>> Thanks. I made that change. >>>>> >>>>> For what it is worth. The sequence above without the change in mount >>>>> attributes will fail if it is necessary to change the mount attributes >>>>> as "." is both put_old as well as new_root. >>>>> >>>>> When I initially suggested the change I saw "." was new_root and forgot >>>>> "." was also put_old. So I thought there was a silent danger without >>>>> that sequence. >>>> >>>> So, now I am a little confused by the comments you added here. Do you >>>> now mean that the >>>> >>>> mount("", ".", MS_SLAVE | MS_REC, NULL); >>>> >>>> call is not actually necessary? >> >> Apologies for being slow getting back to you. > > NP. Thanks for your reply. > >> To my knowledge there are two cases where pivot_root is used. >> - In the initial mount namespace from a ramdisk when mounting root. 
>>   This is the original use case and somewhat historical as rootfs
>>   (aka an initial ramfs) may not be unmounted.
>>
>> - When setting up a new mount namespace to jettison all of the mounts
>>   you don't need.
>>
>> The sequence:
>>
>>     chdir(new_root);
>>     pivot_root(".", ".");
>>     umount2(".", MNT_DETACH);
>>
>> is perfect for both use cases (as nothing needs to be known about the
>> directory layout of the new root filesystem).
>>
>> In the case when you are setting up a new mount namespace, propagating
>> changes in the mount layout to another mount namespace is fatal. But
>> that is not a concern for using that pivot_root sequence above because
>> pivot_root will fail deterministically if
>> 'mount("", ".", MS_SLAVE | MS_REC, NULL)' is needed but not specified.
>>
>> So I would document the above sequence of three system calls in the
>> man-page.
>
> Okay. I've changed the example to be just those three calls.
>
>> I would document that pivot_root will fail if propagation would occur.
>
> Yep. That's in the page already.
>
>> I would document in pivot_root or under unshare(CLONE_NEWNS) that if
>> mount propagation is enabled (the default with systemd) that you
>> need to call 'mount("", "/", MS_SLAVE | MS_REC, NULL);' or
>> 'mount("", "/", MS_PRIVATE | MS_REC, NULL);' after creating a mount
>> namespace. Or mounts will propagate backwards, which is usually
>> not what people want.
>
> Thanks. Instead, I have added the following text to
> mount_namespaces(7), the page that is referred to by both clone(2) and
> unshare(2) in their discussions of CLONE_NEWNS:
>
>     An application that creates a new mount namespace
>     directly using clone(2) or unshare(2) may desire to
>     prevent propagation of mount events to other mount
>     namespaces (as is done by unshare(1)). This can be done by
>     changing the propagation type of mount points in the new
>     namespace to either MS_SLAVE or MS_PRIVATE,
>     using a call such as the following:
>
>         mount(NULL, "/", MS_SLAVE | MS_REC, NULL);

Yes.

>> Creation of a mount namespace in a user namespace automatically does
>> 'mount("", "/", MS_SLAVE | MS_REC, NULL);' if the starting mount
>> namespace was not created in that user namespace. AKA creating
>> a mount namespace in a user namespace does the unshare for you.
>
> Oh -- I had forgotten that detail. But it is documented
> (by you, I think) in mount_namespaces(7):
>
>     * A mount namespace has an owner user namespace. A
>       mount namespace whose owner user namespace is different
>       from the owner user namespace of its parent mount
>       namespace is considered a less privileged mount
>       namespace.
>
>     * When creating a less privileged mount namespace,
>       shared mounts are reduced to slave mounts. (Shared
>       and slave mounts are discussed below.) This ensures
>       that mappings performed in less privileged mount
>       namespaces will not propagate to more privileged mount
>       namespaces.
>
> There's one point in that description that troubles me. There is a
> reference to "parent mount namespace", but as I understand things
> there is no parental relationship among mount namespace instances
> (or am I wrong?). Should that wording not be rather something
> like "the mount namespace of the process that created this mount
> namespace"?

How about "the mount namespace this mount namespace started as a copy of"?

You are absolutely correct there is no relationship between mount
namespaces. There is just the propagation tree between mounts. (Which
acts similarly to a parent/child relationship but is not at all the same
thing).

Eric
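
For reference, the calls discussed in this thread could be sketched in C
roughly as follows. This is only a minimal sketch, not the man-page
example. The helper names isolate_mount_namespace() and pivot_into() are
made up for illustration. The mount() call is written with the five
arguments that mount(2) actually takes (the thread abbreviates it to
four), and pivot_root is invoked through syscall(2) because glibc
provides no wrapper for it. The sketch assumes Linux, sufficient
privilege (CAP_SYS_ADMIN), and that new_root is already a mount point.

```c
#define _GNU_SOURCE
#include <sched.h>        /* unshare(), CLONE_NEWNS */
#include <sys/mount.h>    /* mount(), umount2(), MS_*, MNT_DETACH */
#include <sys/syscall.h>  /* SYS_pivot_root */
#include <unistd.h>       /* chdir(), syscall() */

/*
 * Hypothetical helper: create a new mount namespace and mark every mount
 * in it as a slave, so mount events no longer propagate back to the
 * namespace it was copied from. MS_PRIVATE would also work here.
 * Returns 0 on success, -1 with errno set on failure.
 */
int isolate_mount_namespace(void)
{
    if (unshare(CLONE_NEWNS) == -1)
        return -1;

    /* source, filesystemtype and data are ignored when only the
     * propagation type is being changed, so they are passed as NULL. */
    return mount(NULL, "/", NULL, MS_SLAVE | MS_REC, NULL);
}

/*
 * Hypothetical helper: the three-call sequence from the thread.
 * "." serves as both new_root and put_old, so nothing needs to be known
 * about the directory layout of the new root filesystem.
 * Returns 0 on success, -1 with errno set on failure.
 */
int pivot_into(const char *new_root)
{
    if (chdir(new_root) == -1)
        return -1;

    /* Stacks the old root on top of the new root at "/". */
    if (syscall(SYS_pivot_root, ".", ".") == -1)
        return -1;

    /* Lazily detach the old root, which is now mounted at ".". */
    return umount2(".", MNT_DETACH);
}
```

A caller would typically run isolate_mount_namespace() first and then
pivot_into(new_root), checking errno whenever either returns -1.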