Re: allowing for a completely cached umount(2) pathwalk

Trond Myklebust <trondmy@xxxxxxxxxxxxxxx> · Fri, 14 Apr 2023 15:57:34 +0000

> On Apr 14, 2023, at 11:30, Trond Myklebust <trondmy@xxxxxxxxxxxxxxx> wrote:
> 
> 
> 
>> On Apr 14, 2023, at 11:13, Christian Brauner <brauner@xxxxxxxxxx> wrote:
>> 
>> On Fri, Apr 14, 2023 at 02:21:00PM +0000, Trond Myklebust wrote:
>>> 
>>> 
>>>> On Apr 14, 2023, at 09:41, Christian Brauner <brauner@xxxxxxxxxx> wrote:
>>>> 
>>>> On Fri, Apr 14, 2023 at 06:06:38AM -0400, Jeff Layton wrote:
>>>>> On Fri, 2023-04-14 at 03:43 +0100, Al Viro wrote:
>>>>>> On Fri, Apr 14, 2023 at 08:41:03AM +1000, NeilBrown wrote:
>>>>>> 
>>>>>>> The path name that appears in /proc/mounts is the key that must be used
>>>>>>> to find and unmount a filesystem.  When you do that "find"ing you are
>>>>>>> not looking up a name in a filesystem, you are looking up a key in the
>>>>>>> mount table.
>>>>>> 
>>>>>> No.  The path name in /proc/mounts is *NOT* a key - it's a best-effort
>>>>>> attempt to describe the mountpoint.  Pathname resolution does not work
>>>>>> in terms of "the longest prefix is found and we handle the rest within
>>>>>> that filesystem".
>>>>>> 
>>>>>>> We could, instead, create an api that is given a mount-id (first number
>>>>>>> in /proc/self/mountinfo) and unmounts that.  Then /sbin/umount could
>>>>>>> read /proc/self/mountinfo, find the mount-id, and unmount it - all
>>>>>>> without ever doing path name lookup in the traditional sense.
>>>>>>> 
>>>>>>> But I prefer your suggestion.  LOOKUP_MOUNTPOINT could be renamed
>>>>>>> LOOKUP_CACHED, and it only finds paths that are in the dcache, never
>>>>>>> revalidates, at most performs simple permission checks based on cached
>>>>>>> content.
>>>>>> 
>>>>>> umount /proc/self/fd/42/barf/something
>>>>>> 
>>>>> 
>>>>> Does any of that involve talking to the server? I don't necessarily see
>>>>> a problem with doing the above. If "something" is in cache then that
>>>>> should still work.
>>>>> 
>>>>> The main idea here is that we want to avoid communicating with the
>>>>> backing store during the umount(2) pathwalk.
>>>>> 
>>>>>> Discuss.
>>>>>> 
>>>>>> OTON, umount-by-mount-id is an interesting idea, but we'll need to decide
>>>>>> what would the right permissions be for it.
>>>>>> 
>>>>>> But please, lose the "mount table is a mapping from path prefix to filesystem"
>>>>>> notion - it really, really is not.  IIRC, there are systems that work that way,
>>>>>> but it's nowhere near the semantics used by any Unices, all variants of Linux
>>>>>> included.
>>>>> 
>>>>> I'm not opposed to something by umount-by-mount-id either. All of this
>>>>> seems like something that should probably rely on CAP_SYS_ADMIN.
>>>> 
>>>> The permission model needs to account for the fact that mount ids are
>>>> global and as such you could in principle unmount any mount in any mount
>>>> namespace. IOW, you can circumvent lookup restrictions completely.
>>>> 
>>>> So we could resolve the mnt-id to an FMODE_PATH and then very roughly
>>>> with no claim to solving everything:
>>>> 
>>>> may_umount_by_mnt_id(struct path *opath)
>>>> {
>>>> struct path root;
>>>> bool reachable;
>>>> 
>>>> // caller in principle able to circumvent lookup restrictions
>>>>      if (!may_cap_dac_readsearch())
>>>> return false;
>>>> 
>>>> // caller can mount in their mountns
>>>> if (!may_mount())
>>>> return false;
>>>> 
>>>> // target mount and caller in the same mountns
>>>> if (!check_mnt())
>>>> return false;
>>>> 
>>>> // caller could in principle reach mount from it's root
>>>> get_fs_root(current->fs, &root);
>>>>      reachable = is_path_reachable(real_mount(opath->mnt), opath->dentry, &root);
>>>> path_put(&root);
>>>> 
>>>> return reachable;
>>>> }
>>>> 
>>>> However, that still means that we have laxer restrictions on unmounting
>>>> by mount-id then on unmount with lookup as for lookup just having
>>>> CAP_DAC_READ_SEARCH isn't enough. Usually - at least for filesytems
>>>> without custom permission handlers - we also establish that the inode
>>>> can be mapped into the caller's idmapping.
>>>> 
>>>> So that would mean that unmounting by mount-id would allow you to
>>>> unmount mounts in cases where you wouldn't with umount. That might be
>>>> fine though as that's ultimately the goal here in a way.
>>>> 
>>>> One could also see a very useful feature in this where you require
>>>> capable(CAP_DAC_READ_SEARCH) and capable(CAP_SYS_ADMIN) and then allow
>>>> unmounting any mount in the system by mount-id. This would obviously be
>>>> very useful for privileged service managers but I haven't thought this
>>>> Through.
>>> 
>>> That is exactly why having a separate syscall to do the lookup of the mount-id is good: it provides separation of privilege.
>>> 
>>> The conversion of mount-id to an O_PATH file descriptor is just akin to a path lookup, so only needs CAP_DAC_READ_SEARCH (since you require privilege only to bypass the ACL directory read and lookup restrictions). The resulting O_PATH file descriptor has no special properties that require any further privilege.
>>> 
>>> Then use that resulting file descriptor for the umount, which normally requires CAP_SYS_ADMIN.
>> 
>> There's a difference between unmounting directly by providing a mount id
>> and getting an O_PATH file descriptor from a mnt-id. If you can simply
>> unmount by mount-id it's useful for users that have CAP_DAC_READ_SEARCH
>> in a container. Without it you likely need to require
>> capable(CAP_DAC_READ_SEARCH) aka system level privileges just like
>> open_to_handle_at() which makes this interface way less generic and
>> usable. Otherwise you'd be able to get an O_PATH fd to something that
>> you wouldn't be able to access through normal path lookup.
> 
> 
> Being able to convert into an O_PATH descriptor gives you more options than just unmounting. It should allow you to syncfs() before unmounting. It should allow you to call open_tree() so you can manipulate the filesystem that is no longer accessible by path walk (e.g. so you can bind it elsewhere or move it).
> 

One more thing it might allow us to do, which I’ve been wanting for a while in NFS: allow us to flip the mount type from being “hard” to “soft” before doing the lazy unmount, so that any application that might still retry I/O after the call to umount_begin() completes will start timing out with an I/O error, and free up the resources it might otherwise hold forever.

_________________________________
Trond Myklebust
Linux NFS client maintainer, Hammerspace
trond.myklebust@xxxxxxxxxxxxxxx