On Fri, Apr 14, 2023 at 03:30:30PM +0000, Trond Myklebust wrote: > > > > On Apr 14, 2023, at 11:13, Christian Brauner <brauner@xxxxxxxxxx> wrote: > > > > On Fri, Apr 14, 2023 at 02:21:00PM +0000, Trond Myklebust wrote: > >> > >> > >>> On Apr 14, 2023, at 09:41, Christian Brauner <brauner@xxxxxxxxxx> wrote: > >>> > >>> On Fri, Apr 14, 2023 at 06:06:38AM -0400, Jeff Layton wrote: > >>>> On Fri, 2023-04-14 at 03:43 +0100, Al Viro wrote: > >>>>> On Fri, Apr 14, 2023 at 08:41:03AM +1000, NeilBrown wrote: > >>>>> > >>>>>> The path name that appears in /proc/mounts is the key that must be used > >>>>>> to find and unmount a filesystem. When you do that "find"ing you are > >>>>>> not looking up a name in a filesystem, you are looking up a key in the > >>>>>> mount table. > >>>>> > >>>>> No. The path name in /proc/mounts is *NOT* a key - it's a best-effort > >>>>> attempt to describe the mountpoint. Pathname resolution does not work > >>>>> in terms of "the longest prefix is found and we handle the rest within > >>>>> that filesystem". > >>>>> > >>>>>> We could, instead, create an api that is given a mount-id (first number > >>>>>> in /proc/self/mountinfo) and unmounts that. Then /sbin/umount could > >>>>>> read /proc/self/mountinfo, find the mount-id, and unmount it - all > >>>>>> without ever doing path name lookup in the traditional sense. > >>>>>> > >>>>>> But I prefer your suggestion. LOOKUP_MOUNTPOINT could be renamed > >>>>>> LOOKUP_CACHED, and it only finds paths that are in the dcache, never > >>>>>> revalidates, at most performs simple permission checks based on cached > >>>>>> content. > >>>>> > >>>>> umount /proc/self/fd/42/barf/something > >>>>> > >>>> > >>>> Does any of that involve talking to the server? I don't necessarily see > >>>> a problem with doing the above. If "something" is in cache then that > >>>> should still work. > >>>> > >>>> The main idea here is that we want to avoid communicating with the > >>>> backing store during the umount(2) pathwalk. > >>>> > >>>>> Discuss. > >>>>> > >>>>> OTON, umount-by-mount-id is an interesting idea, but we'll need to decide > >>>>> what would the right permissions be for it. > >>>>> > >>>>> But please, lose the "mount table is a mapping from path prefix to filesystem" > >>>>> notion - it really, really is not. IIRC, there are systems that work that way, > >>>>> but it's nowhere near the semantics used by any Unices, all variants of Linux > >>>>> included. > >>>> > >>>> I'm not opposed to something by umount-by-mount-id either. All of this > >>>> seems like something that should probably rely on CAP_SYS_ADMIN. > >>> > >>> The permission model needs to account for the fact that mount ids are > >>> global and as such you could in principle unmount any mount in any mount > >>> namespace. IOW, you can circumvent lookup restrictions completely. > >>> > >>> So we could resolve the mnt-id to an FMODE_PATH and then very roughly > >>> with no claim to solving everything: > >>> > >>> may_umount_by_mnt_id(struct path *opath) > >>> { > >>> struct path root; > >>> bool reachable; > >>> > >>> // caller in principle able to circumvent lookup restrictions > >>> if (!may_cap_dac_readsearch()) > >>> return false; > >>> > >>> // caller can mount in their mountns > >>> if (!may_mount()) > >>> return false; > >>> > >>> // target mount and caller in the same mountns > >>> if (!check_mnt()) > >>> return false; > >>> > >>> // caller could in principle reach mount from it's root > >>> get_fs_root(current->fs, &root); > >>> reachable = is_path_reachable(real_mount(opath->mnt), opath->dentry, &root); > >>> path_put(&root); > >>> > >>> return reachable; > >>> } > >>> > >>> However, that still means that we have laxer restrictions on unmounting > >>> by mount-id then on unmount with lookup as for lookup just having > >>> CAP_DAC_READ_SEARCH isn't enough. Usually - at least for filesytems > >>> without custom permission handlers - we also establish that the inode > >>> can be mapped into the caller's idmapping. > >>> > >>> So that would mean that unmounting by mount-id would allow you to > >>> unmount mounts in cases where you wouldn't with umount. That might be > >>> fine though as that's ultimately the goal here in a way. > >>> > >>> One could also see a very useful feature in this where you require > >>> capable(CAP_DAC_READ_SEARCH) and capable(CAP_SYS_ADMIN) and then allow > >>> unmounting any mount in the system by mount-id. This would obviously be > >>> very useful for privileged service managers but I haven't thought this > >>> Through. > >> > >> That is exactly why having a separate syscall to do the lookup of the mount-id is good: it provides separation of privilege. > >> > >> The conversion of mount-id to an O_PATH file descriptor is just akin to a path lookup, so only needs CAP_DAC_READ_SEARCH (since you require privilege only to bypass the ACL directory read and lookup restrictions). The resulting O_PATH file descriptor has no special properties that require any further privilege. > >> > >> Then use that resulting file descriptor for the umount, which normally requires CAP_SYS_ADMIN. > > > > There's a difference between unmounting directly by providing a mount id > > and getting an O_PATH file descriptor from a mnt-id. If you can simply > > unmount by mount-id it's useful for users that have CAP_DAC_READ_SEARCH > > in a container. Without it you likely need to require > > capable(CAP_DAC_READ_SEARCH) aka system level privileges just like > > open_to_handle_at() which makes this interface way less generic and > > usable. Otherwise you'd be able to get an O_PATH fd to something that > > you wouldn't be able to access through normal path lookup. > > > Being able to convert into an O_PATH descriptor gives you more options > than just unmounting. It should allow you to syncfs() before > unmounting. It should allow you to call open_tree() so you can > manipulate the filesystem that is no longer accessible by path walk > (e.g. so you can bind it elsewhere or move it). I'm not saying it's wrong. I'm just saying there are trade-offs. Sure this is useful but it'll need to be a pretty privileged api which might be fine.