> On Apr 14, 2023, at 11:30, Trond Myklebust <trondmy@xxxxxxxxxxxxxxx> wrote: > > > >> On Apr 14, 2023, at 11:13, Christian Brauner <brauner@xxxxxxxxxx> wrote: >> >> On Fri, Apr 14, 2023 at 02:21:00PM +0000, Trond Myklebust wrote: >>> >>> >>>> On Apr 14, 2023, at 09:41, Christian Brauner <brauner@xxxxxxxxxx> wrote: >>>> >>>> On Fri, Apr 14, 2023 at 06:06:38AM -0400, Jeff Layton wrote: >>>>> On Fri, 2023-04-14 at 03:43 +0100, Al Viro wrote: >>>>>> On Fri, Apr 14, 2023 at 08:41:03AM +1000, NeilBrown wrote: >>>>>> >>>>>>> The path name that appears in /proc/mounts is the key that must be used >>>>>>> to find and unmount a filesystem. When you do that "find"ing you are >>>>>>> not looking up a name in a filesystem, you are looking up a key in the >>>>>>> mount table. >>>>>> >>>>>> No. The path name in /proc/mounts is *NOT* a key - it's a best-effort >>>>>> attempt to describe the mountpoint. Pathname resolution does not work >>>>>> in terms of "the longest prefix is found and we handle the rest within >>>>>> that filesystem". >>>>>> >>>>>>> We could, instead, create an api that is given a mount-id (first number >>>>>>> in /proc/self/mountinfo) and unmounts that. Then /sbin/umount could >>>>>>> read /proc/self/mountinfo, find the mount-id, and unmount it - all >>>>>>> without ever doing path name lookup in the traditional sense. >>>>>>> >>>>>>> But I prefer your suggestion. LOOKUP_MOUNTPOINT could be renamed >>>>>>> LOOKUP_CACHED, and it only finds paths that are in the dcache, never >>>>>>> revalidates, at most performs simple permission checks based on cached >>>>>>> content. >>>>>> >>>>>> umount /proc/self/fd/42/barf/something >>>>>> >>>>> >>>>> Does any of that involve talking to the server? I don't necessarily see >>>>> a problem with doing the above. If "something" is in cache then that >>>>> should still work. >>>>> >>>>> The main idea here is that we want to avoid communicating with the >>>>> backing store during the umount(2) pathwalk. >>>>> >>>>>> Discuss. >>>>>> >>>>>> OTON, umount-by-mount-id is an interesting idea, but we'll need to decide >>>>>> what would the right permissions be for it. >>>>>> >>>>>> But please, lose the "mount table is a mapping from path prefix to filesystem" >>>>>> notion - it really, really is not. IIRC, there are systems that work that way, >>>>>> but it's nowhere near the semantics used by any Unices, all variants of Linux >>>>>> included. >>>>> >>>>> I'm not opposed to something by umount-by-mount-id either. All of this >>>>> seems like something that should probably rely on CAP_SYS_ADMIN. >>>> >>>> The permission model needs to account for the fact that mount ids are >>>> global and as such you could in principle unmount any mount in any mount >>>> namespace. IOW, you can circumvent lookup restrictions completely. >>>> >>>> So we could resolve the mnt-id to an FMODE_PATH and then very roughly >>>> with no claim to solving everything: >>>> >>>> may_umount_by_mnt_id(struct path *opath) >>>> { >>>> struct path root; >>>> bool reachable; >>>> >>>> // caller in principle able to circumvent lookup restrictions >>>> if (!may_cap_dac_readsearch()) >>>> return false; >>>> >>>> // caller can mount in their mountns >>>> if (!may_mount()) >>>> return false; >>>> >>>> // target mount and caller in the same mountns >>>> if (!check_mnt()) >>>> return false; >>>> >>>> // caller could in principle reach mount from it's root >>>> get_fs_root(current->fs, &root); >>>> reachable = is_path_reachable(real_mount(opath->mnt), opath->dentry, &root); >>>> path_put(&root); >>>> >>>> return reachable; >>>> } >>>> >>>> However, that still means that we have laxer restrictions on unmounting >>>> by mount-id then on unmount with lookup as for lookup just having >>>> CAP_DAC_READ_SEARCH isn't enough. Usually - at least for filesytems >>>> without custom permission handlers - we also establish that the inode >>>> can be mapped into the caller's idmapping. >>>> >>>> So that would mean that unmounting by mount-id would allow you to >>>> unmount mounts in cases where you wouldn't with umount. That might be >>>> fine though as that's ultimately the goal here in a way. >>>> >>>> One could also see a very useful feature in this where you require >>>> capable(CAP_DAC_READ_SEARCH) and capable(CAP_SYS_ADMIN) and then allow >>>> unmounting any mount in the system by mount-id. This would obviously be >>>> very useful for privileged service managers but I haven't thought this >>>> Through. >>> >>> That is exactly why having a separate syscall to do the lookup of the mount-id is good: it provides separation of privilege. >>> >>> The conversion of mount-id to an O_PATH file descriptor is just akin to a path lookup, so only needs CAP_DAC_READ_SEARCH (since you require privilege only to bypass the ACL directory read and lookup restrictions). The resulting O_PATH file descriptor has no special properties that require any further privilege. >>> >>> Then use that resulting file descriptor for the umount, which normally requires CAP_SYS_ADMIN. >> >> There's a difference between unmounting directly by providing a mount id >> and getting an O_PATH file descriptor from a mnt-id. If you can simply >> unmount by mount-id it's useful for users that have CAP_DAC_READ_SEARCH >> in a container. Without it you likely need to require >> capable(CAP_DAC_READ_SEARCH) aka system level privileges just like >> open_to_handle_at() which makes this interface way less generic and >> usable. Otherwise you'd be able to get an O_PATH fd to something that >> you wouldn't be able to access through normal path lookup. > > > Being able to convert into an O_PATH descriptor gives you more options than just unmounting. It should allow you to syncfs() before unmounting. It should allow you to call open_tree() so you can manipulate the filesystem that is no longer accessible by path walk (e.g. so you can bind it elsewhere or move it). > One more thing it might allow us to do, which I’ve been wanting for a while in NFS: allow us to flip the mount type from being “hard” to “soft” before doing the lazy unmount, so that any application that might still retry I/O after the call to umount_begin() completes will start timing out with an I/O error, and free up the resources it might otherwise hold forever. _________________________________ Trond Myklebust Linux NFS client maintainer, Hammerspace trond.myklebust@xxxxxxxxxxxxxxx