"Michael Kerrisk (man-pages)" <mtk.manpages@xxxxxxxxx> writes: > Hello Eric, > > Thank you. I was hoping you might jump in on this thread. > > Please see below. > > On 10/9/19 10:46 AM, Eric W. Biederman wrote: >> "Michael Kerrisk (man-pages)" <mtk.manpages@xxxxxxxxx> writes: >> >>> Hello Philipp, >>> >>> My apologies that it has taken a while to reply. (I had been hoping >>> and waiting that a few more people might weigh in on this thread.) >>> >>> On 9/23/19 3:42 PM, Philipp Wendler wrote: >>>> Hello Michael, >>>> >>>> Am 23.09.19 um 14:04 schrieb Michael Kerrisk (man-pages): >>>> >>>>> I'm considering to rewrite these pieces to exactly >>>>> describe what the system call does (which I already >>>>> do in the third paragraph) and remove the "may or may not" >>>>> pieces in the second paragraph. I'd welcome comments >>>>> on making that change. > > What did you think about my proposal above? To put it in context, > this was my initial comment in the mail: > > [[ > One area of the page that I'm still not really happy with > is the "vague" wording in the second paragraph and the note > in the third paragraph about the system call possibly > changing. These pieces survive (in somewhat modified form) > from the original page, which was written before the > system call was released, and it seems there was some > question about whether the system call might still change > its behavior with respect to the root directory and current > working directory of other processes. However, after 19 > years, nothing has changed, and surely it will not in the > future, since that would constitute an ABI breakage. > I'm considering to rewrite these pieces to exactly > describe what the system call does (which I already > do in the third paragraph) and remove the "may or may not" > pieces in the second paragraph. I'd welcome comments > on making that change. > ]] > > And the second and third paragraphs of the manual page currently > read: > > [[ > pivot_root() may or may not change the current root and the cur‐ > rent working directory of any processes or threads that use the > old root directory and which are in the same mount namespace as > the caller of pivot_root(). The caller of pivot_root() should > ensure that processes with root or current working directory at > the old root operate correctly in either case. An easy way to > ensure this is to change their root and current working directory > to new_root before invoking pivot_root(). Note also that > pivot_root() may or may not affect the calling process's current > working directory. It is therefore recommended to call chdir("/") > immediately after pivot_root(). > > The paragraph above is intentionally vague because at the time > when pivot_root() was first implemented, it was unclear whether > its affect on other process's root and current working directo‐ > ries—and the caller's current working directory—might change in > the future. However, the behavior has remained consistent since > this system call was first implemented: pivot_root() changes the > root directory and the current working directory of each process > or thread in the same mount namespace to new_root if they point to > the old root directory. (See also NOTES.) On the other hand, > pivot_root() does not change the caller's current working direc‐ > tory (unless it is on the old root directory), and thus it should > be followed by a chdir("/") call. > ]] Apologies I saw that concern I didn't realize it was a questio I think it is very reasonable to remove warning the behavior might change. We have pivot_root(8) in common use that to use it requires the semantic of changing processes other than the current process. Which means any attempt to noticably change the behavior of pivot_root(2) will break userspace. Now the documented semantics in behavior above are not quite what pivot_root(2) does. It walks all processes on the system and if the working directory or the root directory refer to the root mount that is being replaced, then pivot_root(2) will update them. In practice the above is limited to a mount namespace. But something as simple as "cd /proc/<somepid>/root" can allow a process to have a working directory in a different mount namespace. Because ``unprivileged'' users can now use pivot_root(2) we may want to rethink the implementation at some point to be cheaper than a global process walk. So far that process walk has not been a problem in practice. If we had to write pivot_root(2) from scratch limiting it to just changing the root directory of the process that calls pivot_root(2) would have been the superior semantic. That would have required run pivot_root(8) like: "exec pivot_root . . -- /bin/bash ..." but it would not have required walking every thread in the system. >>>> I think that it would make the man page significantly easier to >>>> understand if if the vague wording and the meta discussion about it are >>>> removed. >>> >>> It is my inclination to make this change, but I'd love to get more >>> feedback on this point. >>> >>>>> DESCRIPTION >>>> [...]> pivot_root() changes the >>>>> root directory and the current working directory of each process >>>>> or thread in the same mount namespace to new_root if they point to >>>>> the old root directory. (See also NOTES.) On the other hand, >>>>> pivot_root() does not change the caller's current working direc‐ >>>>> tory (unless it is on the old root directory), and thus it should >>>>> be followed by a chdir("/") call. >>>> >>>> There is a contradiction here with the NOTES (cf. below). >>> >>> See below. >>> >>>>> The following restrictions apply: >>>>> >>>>> - new_root and put_old must be directories. >>>>> >>>>> - new_root and put_old must not be on the same filesystem as the >>>>> current root. In particular, new_root can't be "/" (but can be >>>>> a bind mounted directory on the current root filesystem). >>>> >>>> Wouldn't "must not be on the same mountpoint" or something similar be >>>> more clear, at least for new_root? The note in parentheses indicates >>>> that new_root can actually be on the same filesystem as the current >>>> note. However, ... >>> >>> For 'put_old', it really is "filesystem". >> >> If we are going to be pedantic "filesystem" is really the wrong concept >> here. The section about bind mount clarifies it, but I wonder if there >> is a better term. > > Thanks. My aim was to try to distinguish "mount point" from > "a mount somewhere inside the file system associated with a > certain mount point"--in other words, I wanted to make it clear > that 'put_old' (and 'new_root') could not be subdirectories > under the current root mount point (which is correct, right?). > > Using "mount" does seem better. (My only concern is that some > people may take it to mean "the mount point", but perhaps that > just my own confusion.) I am open to better terms. But mount or vfsmount is what we are using internal to the kernel and is really a distinct concept from filesystem. And it is starting to leak out in system calls like move_mount. >> I think I would say: "new_root and put_old must not be on the same mount >> as the current root." > > I've made that change. > >> I think using "mount" instead of "filesystem" keeps the concepts less >> confusing. >> >> As I am reading through this email and seeing text that is trying to be >> precise and clear then hitting the term "filesystem" is a bit jarring. >> pivot_root doesn't care a thing for file systems. pivot_root only cares >> about mounts. >> >> And by a "mount" I mean the thing that you get when you create a bind >> mount or you call mount normally. > > Thanks for the above comments. > > Hmm, doI need to make similar changes in the initial paragraph of > the manual page as well? It currently reads: > > pivot_root() changes the root filesystem in the mount namespace of > the calling process. More precisely, it moves the root filesystem > to the directory put_old and makes new_root the new root filesys‐ > tem. The calling process must have the CAP_SYS_ADMIN capability > in the user namespace that owns the caller's mount namespace. > > Furthermore the one line NAME of the man page reads: > > pivot_root - change the root filesystem > > Is a change needed there also? Yes please. Both locations. >> Michael do you have man pages for the new mount api yet? > > David Howells wrote pages in mid-2018, well before the syscalls got > merged in the kernel (in mid-2019). I did not merge them because > the code was not yet in the kernel, and lacking time, I never chased > David when the syscalls did get merged to see if the pages were still > up to date. I pinged David just now. Good. I was thinking of them because the concept of "mount" matters more there. >>> >>>>> - put_old must be at or underneath new_root; that is, adding a >>>>> nonnegative number of /.. to the string pointed to by put_old >>>>> must yield the same directory as new_root. >>>>> >>>>> - new_root must be a mount point. (If it is not otherwise a >>>>> mount point, it suffices to bind mount new_root on top of >>>>> itself.) >>>> >>>> ... this item actually makes the above item almost redundant regarding >>>> new_root (except for the "/") case. So one could replace this item with >>>> something like this: >>>> >>>> - new_root must be a mount point different from "/". (If it is not >>>> otherwise a mount point, it suffices to bind mount new_root on top >>>> of itself.) >>>> >>>> The above item would then only mention put_old (and maybe use clarified >>>> wording on whether actually a different file system is necessary for >>>> put_old or whether a different mount point is enough). >>> >>> Thanks. That's a good suggestion. I simplified the earlier bullet >>> point as you suggested, and changed the text here to say: >>> >>> - new_root must be a mount point, but can't be "/". If it is not >>> otherwise a mount point, it suffices to bind mount new_root on >>> top of itself. (new_root can be a bind mounted directory on >>> the current root filesystem.) >> >> How about: >> - new_root must be the path to a mount, but can't be "/". Any > > Surely here it must be "mount point" not "mount"? (See my discussion > above.) Sigh. I have had my head in the code to long, where new_root is used to refer to the mount that is mounted on that mount point as well. > >> path that is not already a mount can be converted into one by >> bind mounting the path onto itself. >>>>> NOTES >>>> [...] >>>>> pivot_root() allows the caller to switch to a new root filesystem >>>>> while at the same time placing the old root mount at a location >>>>> under new_root from where it can subsequently be unmounted. (The >>>>> fact that it moves all processes that have a root directory or >>>>> current working directory on the old root filesystem to the new >>>>> root filesystem frees the old root filesystem of users, allowing >>>>> it to be unmounted more easily.) >>>> >>>> Here is the contradiction: >>>> The DESCRIPTION says that root and current working dir are only changed >>>> "if they point to the old root directory". Here in the NOTES it says >>>> that any root or working directories on the old root file system (i.e., >>>> even if somewhere below the root) are changed. >>>> >>>> Which is correct? >>> >>> The first text is correct. I must have accidentally inserted >>> "filesystem" into the paragraph just here during a global edit. >>> Thanks for catching that. >>> >>>> If it indeed affects all processes with root and/or current working >>>> directory below the old root, the text here does not clearly state what >>>> the new root/current working directory of theses processes is. >>>> E.g., if a process is at /foo and we pivot to /bar, will the process be >>>> moved to /bar (i.e., at / after pivot_root), or will the kernel attempt >>>> to move it to some location like /bar/foo? Because the latter might not >>>> even exist, I suspect that everything is just moved to new_root, but >>>> this could be stated explicitly by replacing "to the new root >>>> filesystem" in the above paragraph with "to the new root directory" >>>> (after checking whether this is true). >>> >>> The text here now reads: >>> >>> pivot_root() allows the caller to switch to a new root filesystem >>> while at the same time placing the old root mount at a location >>> under new_root from where it can subsequently be unmounted. (The >>> fact that it moves all processes that have a root directory or >>> current working directory on the old root directory to the new >>> root frees the old root directory of users, allowing the old root >>> filesystem to be unmounted more easily.) >> >> >> Please "mount" instead of "filesystem". > > Changed. > > >>>>> EXAMPLE> The program below demonstrates the use of pivot_root() inside a >>>>> mount namespace that is created using clone(2). After pivoting to >>>>> the root directory named in the program's first command-line argu‐ >>>>> ment, the child created by clone(2) then executes the program >>>>> named in the remaining command-line arguments. >>>> >>>> Why not use the pivot_root(".", ".") in the example program? >>>> It would make the example shorter, and also works if the process cannot >>>> write to new_root (e..g., in a user namespace). >>> >>> I'm not sure. Some people have a bit of trouble to wrap their head >>> around the pivot_root(".", ".") idea. (I possibly am one of them.) >>> I'd be quite keen to hear other opinions on this. Unfortunately, >>> few people have commented on this manual page rewrite. >> >> I am happy as long as it is pivot_root(".", ".") is documented >> somewhere. There is real code that uses it so it is not going away. >> Plus pivot_root(".", ".") is really what is desired in a lot of >> situations where the caller of pivot_root is an intermediary and >> does not control the new root filesystem. At which point the only >> path you can be guaranteed to exit on the new root filesystem is "/". > > Good. There is documentation of pivot_root(".", ".") i the page! Yeah! Eric