Re: pivot_root(".", ".") and the fchdir() dance

ebiederm@xxxxxxxxxxxx (Eric W. Biederman) · Mon, 07 Oct 2019 10:46:31 -0500

"Michael Kerrisk (man-pages)" <mtk.manpages@xxxxxxxxx> writes:

> Hello Eric,
>
> On 9/30/19 2:42 PM, Eric W. Biederman wrote:
>> "Michael Kerrisk (man-pages)" <mtk.manpages@xxxxxxxxx> writes:
>>
>>> Hello Eric,
>>>
>>> A ping on my question below. Could you take a look please?
>>>
>>> Thanks,
>>>
>>> Michael
>>>
>>>>>>> The concern from our conversation at the container mini-summit was that
>>>>>>> there is a pathology if in your initial mount namespace all of the
>>>>>>> mounts are marked MS_SHARED like systemd does (and is almost necessary
>>>>>>> if you are going to use mount propagation), that if new_root itself
>>>>>>> is MS_SHARED then unmounting the old_root could propagate.
>>>>>>>
>>>>>>> So I believe the desired sequence is:
>>>>>>>
>>>>>>>>>>             chdir(new_root);
>>>>>>> +++            mount("", ".", MS_SLAVE | MS_REC, NULL);
>>>>>>>>>>             pivot_root(".", ".");
>>>>>>>>>>             umount2(".", MNT_DETACH);
>>>>>>>
>>>>>>> The change to new_root could be either MS_SLAVE or MS_PRIVATE.  So
>>>>>>> long as it is not MS_SHARED the mount won't propagate back to the
>>>>>>> parent mount namespace.
>>>>>>
>>>>>> Thanks. I made that change.
>>>>>
>>>>> For what it is worth.  The sequence above without the change in mount
>>>>> attributes will fail if it is necessary to change the mount attributes
>>>>> as "." is both put_old as well as new_root.
>>>>>
>>>>> When I initially suggested the change I saw "." was new_root and forgot
>>>>> "." was also put_old.  So I thought there was a silent danger without
>>>>> that sequence.
>>>>
>>>> So, now I am a little confused by the comments you added here. Do you
>>>> now mean that the
>>>>
>>>> mount("", ".", MS_SLAVE | MS_REC, NULL);
>>>>
>>>> call is not actually necessary?
>>
>> Apologies for being slow getting back to you.
>
> NP. Thanks for your reply.
>
>> To my knowledge there are two cases where pivot_root is used.
>> - In the initial mount namespace from a ramdisk when mounting root.
>>    This is the original use case and somewhat historical as rootfs
>>    (aka an initial ramfs) may not be unmounted.
>>
>> - When setting up a new mount namespace to jettison all of the mounts
>>    you don't need.
>>
>> The sequence:
>>
>>         chdir(new_root);
>>         pivot_root(".", ".");
>>         umount2(".", MNT_DETACH);
>>
>> is perfect for both use cases (as nothing needs to be known about the
>> directory layout of the new root filesystem).
>>
>> In the case when you are setting up a new mount namespace, propagating
>> changes in the mount layout to another mount namespace is fatal.  But
>> that is not a concern for using that pivot_root sequence above because
>> pivot_root will fail deterministically if
>> 'mount("", ".", MS_SLAVE | MS_REC, NULL)' is needed but not specified.
>>
>> So I would document the above sequence of three system calls in the
>> man-page.
>
> Okay. I've changed the example to be just those three calls.
>
>> I would document that pivot_root will fail if propagation would occur.
>
> Yep. That's in the page already.
>
>> I would document in pivot_root or under unshare(CLONE_NEWNS) that if
>> mount propagation is enabled (the default with systemd) that you
>> need to call 'mount("", "/", MS_SLAVE | MS_REC, NULL);' or
>> 'mount("", "/", MS_PRIVATE | MS_REC, NULL);' after creating a mount
>> namespace.  Otherwise mounts will propagate backwards, which is
>> usually not what people want.
>
> Thanks. Instead, I have added the following text to
> mount_namespaces(7), the page that is referred to by both clone(2) and
> unshare(2) in their discussions of CLONE_NEWNS:
>
>        An application that creates a new mount namespace directly
>        using clone(2) or unshare(2) may desire to prevent
>        propagation of mount events to other mount namespaces (as
>        is done by unshare(1)).  This can be done by changing the
>        propagation type of mount points in the new namespace to
>        either MS_SLAVE or MS_PRIVATE, using a call such as the
>        following:
>
>                   mount(NULL, "/", MS_SLAVE | MS_REC, NULL);

Yes.
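
For concreteness, here is a minimal C sketch of the steps discussed in
this thread: create the mount namespace, turn off propagation back to
the original namespace, then do the three-call pivot_root sequence.
The function names (make_root_rslave, pivot_into, enter_new_root) are
made up for illustration.  Two details differ from the shorthand used
in the mails: mount(2) takes five arguments (the calls quoted here
leave out the filesystemtype argument, which is passed as NULL below),
and pivot_root(2) has no glibc wrapper, so it is invoked via
syscall(2).  Treat this as a sketch under those assumptions, not as
the text that belongs in the man page.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <sched.h>
#include <sys/mount.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Hypothetical helper: mark every mount in this namespace as a slave,
 * so mount events here no longer propagate to the peer mounts in the
 * namespace we were copied from.  MS_PRIVATE | MS_REC would also do. */
int make_root_rslave(void)
{
    return mount(NULL, "/", NULL, MS_SLAVE | MS_REC, NULL);
}

/* Hypothetical helper: the three-call sequence from this thread.
 * new_root must be a mount point.  Returns 0 on success, or -1 with
 * errno set by the call that failed. */
int pivot_into(const char *new_root)
{
    assert(new_root != NULL);

    if (chdir(new_root) == -1)
        return -1;

    /* "." is both new_root and put_old: the old root ends up
     * stacked at the same location as the new root. */
    if (syscall(SYS_pivot_root, ".", ".") == -1)
        return -1;

    /* Lazily detach the old root. */
    if (umount2(".", MNT_DETACH) == -1)
        return -1;

    return 0;
}

/* Hypothetical helper: the full sequence for the "new mount
 * namespace" use case.  Needs CAP_SYS_ADMIN. */
int enter_new_root(const char *new_root)
{
    if (unshare(CLONE_NEWNS) == -1)
        return -1;
    if (make_root_rslave() == -1)
        return -1;
    return pivot_into(new_root);
}
```

A caller would invoke enter_new_root() with the path of a mount point
that holds the new root filesystem and check errno on a -1 return.
Each step stops at the first failing call, so a pivot_root() that is
refused leaves the old root mounted.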

>> Creation of a mount namespace in a user namespace automatically does
>> 'mount("", "/", MS_SLAVE | MS_REC, NULL);' if the starting mount
>> namespace was not created in that user namespace.  AKA creating
>> a mount namespace in a user namespace does the unshare for you.
>
> Oh -- I had forgotten that detail. But it is documented
> (by you, I think) in mount_namespaces(7):
>
>        *  A  mount  namespace  has  an  owner user namespace.  A
>           mount namespace whose owner user namespace is  differ‐
>           ent  from the owner user namespace of its parent mount
>           namespace is considered a less privileged mount names‐
>           pace.
>
>        *  When  creating  a  less  privileged  mount  namespace,
>           shared mounts are reduced to  slave  mounts.   (Shared
>           and  slave  mounts are discussed below.)  This ensures
>           that  mappings  performed  in  less  privileged  mount
>           namespaces will not propagate to more privileged mount
>           namespaces.
>
> There's one point in that description that troubles me. There is a
> reference to "parent mount namespace", but as I understand things
> there is no parental relationship among mount namespace instances
> (or am I wrong?). Should that wording not be rather something
> like "the mount namespace of the process that created this mount
> namespace"?

How about "the mount namespace this mount namespace started as a copy of"?

You are absolutely correct: there is no relationship between mount
namespaces.  There is just the propagation tree between mounts (which
acts similarly to a parent/child relationship but is not at all the
same thing).
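
Since propagation is a property of each mount rather than of the
namespace, it can be inspected per mount.  The sketch below reads the
optional fields of a single /proc/[pid]/mountinfo line ("shared:N",
"master:N", "unbindable", as described in proc(5)) and reports them as
bit flags.  The function name and the PROP_* constants are invented
for this example, and it assumes a line fits in 4096 bytes.

```c
#define _POSIX_C_SOURCE 200809L
#include <string.h>

/* Hypothetical flag values for this example only. */
enum { PROP_SHARED = 1, PROP_SLAVE = 2, PROP_UNBINDABLE = 4 };

/* Parse one mountinfo line.  The first six fields are fixed; zero or
 * more optional fields follow, terminated by a lone "-".  Returns a
 * bitmask of PROP_* flags (0 means private), or -1 if the line is too
 * long or has no "-" separator. */
int propagation_flags(const char *line)
{
    char buf[4096];
    char *save = NULL;
    int field = 0;
    int flags = 0;

    if (strlen(line) >= sizeof(buf))
        return -1;
    strcpy(buf, line);

    /* Spaces inside paths are escaped as \040 in mountinfo, so
     * splitting on spaces is safe. */
    for (char *tok = strtok_r(buf, " \n", &save); tok != NULL;
         tok = strtok_r(NULL, " \n", &save)) {
        field++;
        if (field <= 6)
            continue;               /* skip the fixed fields */
        if (strcmp(tok, "-") == 0)
            return flags;           /* end of optional fields */
        if (strncmp(tok, "shared:", 7) == 0)
            flags |= PROP_SHARED;
        else if (strncmp(tok, "master:", 7) == 0)
            flags |= PROP_SLAVE;
        else if (strcmp(tok, "unbindable") == 0)
            flags |= PROP_UNBINDABLE;
    }
    return -1;                      /* separator never seen */
}
```

Feeding it each line of /proc/self/mountinfo before and after the
MS_SLAVE | MS_REC call would show "shared:N" entries turning into
"master:N" entries.  A mount can carry both tags at once, which is why
the result is a bitmask rather than a single value.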

Eric