Re: For review: rewritten pivot_root(2) manual page

ebiederm@xxxxxxxxxxxx (Eric W. Biederman) · Wed, 09 Oct 2019 11:00:53 -0500

"Michael Kerrisk (man-pages)" <mtk.manpages@xxxxxxxxx> writes:

> Hello Eric,
>
> Thank you. I was hoping you might jump in on this thread.
>
> Please see below.
>
> On 10/9/19 10:46 AM, Eric W. Biederman wrote:
>> "Michael Kerrisk (man-pages)" <mtk.manpages@xxxxxxxxx> writes:
>> 
>>> Hello Philipp,
>>>
>>> My apologies that it has taken a while to reply. (I had been hoping
>>> and waiting that a few more people might weigh in on this thread.)
>>>
>>> On 9/23/19 3:42 PM, Philipp Wendler wrote:
>>>> Hello Michael,
>>>>
>>>> Am 23.09.19 um 14:04 schrieb Michael Kerrisk (man-pages):
>>>>
>>>>> I'm considering to rewrite these pieces to exactly
>>>>> describe what the system call does (which I already
>>>>> do in the third paragraph) and remove the "may or may not"
>>>>> pieces in the second paragraph. I'd welcome comments
>>>>> on making that change.
>
> What did you think about my proposal above? To put it in context,
> this was my initial comment in the mail:
>
> [[
> One area of the page that I'm still not really happy with
> is the "vague" wording in the second paragraph and the note
> in the third paragraph about the system call possibly
> changing. These pieces survive (in somewhat modified form)
> from the original page, which was written before the
> system call was released, and it seems there was some
> question about whether the system call might still change
> its behavior with respect to the root directory and current
> working directory of other processes. However, after 19
> years, nothing has changed, and surely it will not in the
> future, since that would constitute an ABI breakage.
> I'm considering to rewrite these pieces to exactly
> describe what the system call does (which I already
> do in the third paragraph) and remove the "may or may not"
> pieces in the second paragraph. I'd welcome comments
> on making that change.
> ]]
>
> And the second and third paragraphs of the manual page currently
> read:
>
> [[
>        pivot_root()  may  or may not change the current root and the cur‐
>        rent working directory of any processes or threads  that  use  the
>        old  root  directory  and which are in the same mount namespace as
>        the caller of pivot_root().  The  caller  of  pivot_root()  should
>        ensure  that  processes  with root or current working directory at
>        the old root operate correctly in either case.   An  easy  way  to
>        ensure  this is to change their root and current working directory
>        to  new_root  before  invoking  pivot_root().   Note   also   that
>        pivot_root()  may  or may not affect the calling process's current
>        working directory.  It is therefore recommended to call chdir("/")
>        immediately after pivot_root().
>
>        The  paragraph  above  is  intentionally vague because at the time
>        when pivot_root() was first implemented, it  was  unclear  whether
>        its  affect  on  other process's root and current working directo‐
>        ries—and the caller's current working  directory—might  change  in
>        the  future.   However, the behavior has remained consistent since
>        this system call was first implemented: pivot_root()  changes  the
>        root  directory  and the current working directory of each process
>        or thread in the same mount namespace to new_root if they point to
>        the  old  root  directory.   (See also NOTES.)  On the other hand,
>        pivot_root() does not change the caller's current  working  direc‐
>        tory  (unless it is on the old root directory), and thus it should
>        be followed by a chdir("/") call.
> ]]

Apologies I saw that concern I didn't realize it was a questio

I think it is very reasonable to remove warning the behavior might
change.  We have pivot_root(8) in common use that to use it requires
the semantic of changing processes other than the current process.
Which means any attempt to noticably change the behavior of
pivot_root(2) will break userspace.

Now the documented semantics in behavior above are not quite what
pivot_root(2) does.  It walks all processes on the system and if the
working directory or the root directory refer to the root mount that is
being replaced, then pivot_root(2) will update them.

In practice the above is limited to a mount namespace.  But something as
simple as "cd /proc/<somepid>/root" can allow a process to have a
working directory in a different mount namespace.

Because ``unprivileged'' users can now use pivot_root(2) we may want to
rethink the implementation at some point to be cheaper than a global
process walk.  So far that process walk has not been a problem in
practice.

If we had to write pivot_root(2) from scratch limiting it to just
changing the root directory of the process that calls pivot_root(2)
would have been the superior semantic.  That would have required run
pivot_root(8) like: "exec pivot_root . . -- /bin/bash ..."  but it would
not have required walking every thread in the system.

>>>> I think that it would make the man page significantly easier to
>>>> understand if if the vague wording and the meta discussion about it are
>>>> removed.
>>>
>>> It is my inclination to make this change, but I'd love to get more
>>> feedback on this point.
>>>
>>>>> DESCRIPTION
>>>> [...]>        pivot_root()  changes  the
>>>>>        root  directory  and the current working directory of each process
>>>>>        or thread in the same mount namespace to new_root if they point to
>>>>>        the  old  root  directory.   (See also NOTES.)  On the other hand,
>>>>>        pivot_root() does not change the caller's current  working  direc‐
>>>>>        tory  (unless it is on the old root directory), and thus it should
>>>>>        be followed by a chdir("/") call.
>>>>
>>>> There is a contradiction here with the NOTES (cf. below).
>>>
>>> See below.
>>>
>>>>>        The following restrictions apply:
>>>>>
>>>>>        -  new_root and put_old must be directories.
>>>>>
>>>>>        -  new_root and put_old must not be on the same filesystem as  the
>>>>>           current root.  In particular, new_root can't be "/" (but can be
>>>>>           a bind mounted directory on the current root filesystem).
>>>>
>>>> Wouldn't "must not be on the same mountpoint" or something similar be
>>>> more clear, at least for new_root? The note in parentheses indicates
>>>> that new_root can actually be on the same filesystem as the current
>>>> note. However, ...
>>>
>>> For 'put_old', it really is "filesystem".
>> 
>> If we are going to be pedantic "filesystem" is really the wrong concept
>> here.  The section about bind mount clarifies it, but I wonder if there
>> is a better term.
>
> Thanks. My aim was to try to distinguish "mount point" from
> "a mount somewhere inside the file system associated with a
> certain mount point"--in other words, I wanted to make it clear
> that 'put_old' (and 'new_root') could not be subdirectories
> under the current root mount point (which is correct, right?).
>
> Using "mount" does seem better. (My only concern is that some
> people may take it to mean "the mount point", but perhaps that
> just my own confusion.)

I am open to better terms.  But mount or vfsmount is what we are using
internal to the kernel and is really a distinct concept from filesystem.
And it is starting to leak out in system calls like move_mount.

>> I think I would say: "new_root and put_old must not be on the same mount
>> as the current root."
>
> I've made that change.
>
>> I think using "mount" instead of "filesystem" keeps the concepts less
>> confusing.
>> 
>> As I am reading through this email and seeing text that is trying to be
>> precise and clear then hitting the term "filesystem" is a bit jarring.
>> pivot_root doesn't care a thing for file systems.  pivot_root only cares
>> about mounts.
>> 
>> And by a "mount" I mean the thing that you get when you create a bind
>> mount or you call mount normally.
>
> Thanks for the above comments.
>
> Hmm, doI need to make similar changes in the initial paragraph of
> the manual page as well? It currently reads:
>
>        pivot_root() changes the root filesystem in the mount namespace of
>        the calling process.  More precisely, it moves the root filesystem
>        to  the directory put_old and makes new_root the new root filesys‐
>        tem.  The calling process must have the  CAP_SYS_ADMIN  capability
>        in the user namespace that owns the caller's mount namespace.
>
> Furthermore the one line NAME of the man page reads:
>
>        pivot_root - change the root filesystem
>
> Is a change needed there also?

Yes please.  Both locations.

>> Michael do you have man pages for the new mount api yet?
>
> David Howells wrote pages in mid-2018, well before the syscalls got
> merged in the kernel (in mid-2019). I did not merge them because
> the code was not yet in the kernel, and lacking time, I never chased
> David when the syscalls did get merged to see if the pages were still
> up to date. I pinged David just now.

Good.  I was thinking of them because the concept of "mount" matters more
there.

>>>
>>>>>        -  put_old must be at or underneath new_root; that  is,  adding  a
>>>>>           nonnegative  number  of /.. to the string pointed to by put_old
>>>>>           must yield the same directory as new_root.
>>>>>
>>>>>        -  new_root must be a mount point.  (If  it  is  not  otherwise  a
>>>>>           mount  point,  it  suffices  to  bind  mount new_root on top of
>>>>>           itself.)
>>>>
>>>> ... this item actually makes the above item almost redundant regarding
>>>> new_root (except for the "/") case. So one could replace this item with
>>>> something like this:
>>>>
>>>> - new_root must be a mount point different from "/". (If it is not
>>>>   otherwise a mount point, it suffices  to bind mount new_root on top
>>>>   of itself.)
>>>>
>>>> The above item would then only mention put_old (and maybe use clarified
>>>> wording on whether actually a different file system is necessary for
>>>> put_old or whether a different mount point is enough).
>>>
>>> Thanks. That's a good suggestion. I simplified the earlier bullet
>>> point as you suggested, and changed the text here to say:
>>>
>>>        -  new_root must be a mount point, but can't be "/".  If it is not
>>>           otherwise  a mount point, it suffices to bind mount new_root on
>>>           top of itself.  (new_root can be a bind  mounted  directory  on
>>>           the current root filesystem.)
>> 
>> How about:
>>           - new_root must be the path to a mount, but can't be "/".  Any
>
> Surely here it must be "mount point" not "mount"? (See my discussion
> above.)

Sigh.  I have had my head in the code to long, where new_root is
used to refer to the mount that is mounted on that mount point as well.

>
>>           path that is not already a mount can be converted into one by
>>           bind mounting the path onto itself.
>>>>> NOTES
>>>> [...]
>>>>>        pivot_root() allows the caller to switch to a new root  filesystem
>>>>>        while  at  the  same time placing the old root mount at a location
>>>>>        under new_root from where it can subsequently be unmounted.   (The
>>>>>        fact  that  it  moves  all processes that have a root directory or
>>>>>        current working directory on the old root filesystem  to  the  new
>>>>>        root  filesystem  frees the old root filesystem of users, allowing
>>>>>        it to be unmounted more easily.)
>>>>
>>>> Here is the contradiction:
>>>> The DESCRIPTION says that root and current working dir are only changed
>>>> "if they point to the old root directory". Here in the NOTES it says
>>>> that any root or working directories on the old root file system (i.e.,
>>>> even if somewhere below the root) are changed.
>>>>
>>>> Which is correct?
>>>
>>> The first text is correct. I must have accidentally inserted
>>> "filesystem" into the paragraph just here during a global edit.
>>> Thanks for catching that.
>>>
>>>> If it indeed affects all processes with root and/or current working
>>>> directory below the old root, the text here does not clearly state what
>>>> the new root/current working directory of theses processes is.
>>>> E.g., if a process is at /foo and we pivot to /bar, will the process be
>>>> moved to /bar (i.e., at / after pivot_root), or will the kernel attempt
>>>> to move it to some location like /bar/foo? Because the latter might not
>>>> even exist, I suspect that everything is just moved to new_root, but
>>>> this could be stated explicitly by replacing "to the new root
>>>> filesystem" in the above paragraph with "to the new root directory"
>>>> (after checking whether this is true).
>>>
>>> The text here now reads:
>>>
>>>        pivot_root() allows the caller to switch to a new root  filesystem
>>>        while  at  the  same time placing the old root mount at a location
>>>        under new_root from where it can subsequently be unmounted.   (The
>>>        fact  that  it  moves  all processes that have a root directory or
>>>        current working directory on the old root  directory  to  the  new
>>>        root  frees the old root directory of users, allowing the old root
>>>        filesystem to be unmounted more easily.)
>> 
>> 
>> Please "mount" instead of "filesystem".
>
> Changed.
>
>
>>>>> EXAMPLE>        The program below demonstrates the use of  pivot_root()  inside  a
>>>>>        mount namespace that is created using clone(2).  After pivoting to
>>>>>        the root directory named in the program's first command-line argu‐
>>>>>        ment,  the  child  created  by  clone(2) then executes the program
>>>>>        named in the remaining command-line arguments.
>>>>
>>>> Why not use the pivot_root(".", ".") in the example program?
>>>> It would make the example shorter, and also works if the process cannot
>>>> write to new_root (e..g., in a user namespace).
>>>
>>> I'm not sure. Some people have a bit of trouble to wrap their head
>>> around the pivot_root(".", ".") idea. (I possibly am one of them.)
>>> I'd be quite keen to hear other opinions on this. Unfortunately,
>>> few people have commented on this manual page rewrite.
>> 
>> I am happy as long as it is pivot_root(".", ".") is documented
>> somewhere.  There is real code that uses it so it is not going away.
>> Plus pivot_root(".", ".") is really what is desired in a lot of
>> situations where the caller of pivot_root is an intermediary and
>> does not control the new root filesystem.  At which point the only
>> path you can be guaranteed to exit on the new root filesystem is "/".
>
> Good. There is documentation of pivot_root(".", ".") i the page!

Yeah!

Eric