On Fri, May 24, 2024 at 2:35 PM Amir Goldstein <amir73il@xxxxxxxxx> wrote: > > On Fri, May 24, 2024 at 1:19 PM Christian Brauner <brauner@xxxxxxxxxx> wrote: > > > > A current limitation of open_by_handle_at() is that it's currently not possible > > to use it from within containers at all because we require CAP_DAC_READ_SEARCH > > in the initial namespace. That's unfortunate because there are scenarios where > > using open_by_handle_at() from within containers. > > > > Two examples: > > > > (1) cgroupfs allows to encode cgroups to file handles and reopen them with > > open_by_handle_at(). > > (2) Fanotify allows placing filesystem watches they currently aren't usable in > > containers because the returned file handles cannot be used. > > Christian, Follow up question: Now that open_by_handle_at(2) is supported from non-root userns, What about this old patch to allow sb/mount watches from non-root userns? https://lore.kernel.org/linux-fsdevel/20230416060722.1912831-1-amir73il@xxxxxxxxx/ Is it useful for any of your use cases? Should I push it forward? I have rebased and sanity tested it: https://github.com/amir73il/linux/commits/fanotify_userns/ But I have not written any userns fanotify test yet. If you are interested, please let me know. If you can test the patch for your use case even better. BTW, the prep patch that I refactored per Jan's review request is almost identical to your patch from 2019: https://lore.kernel.org/linux-fsdevel/20190522163150.16849-1-christian@xxxxxxxxxx/ But with additional BUILD_BUG_ON and a commit message to explain it. > > Here's a proposal for relaxing the permission check for open_by_handle_at(). > > > > (1) Opening file handles when the caller has privileges over the filesystem > > (1.1) The caller has an unobstructed view of the filesystem. > > (1.2) The caller has permissions to follow a path to the file handle. > > > > This doesn't address the problem of opening a file handle when only a portion > > of a filesystem is exposed as is common in containers by e.g., bind-mounting a > > subtree. The proposal to solve this use-case is: > > > > (2) Opening file handles when the caller has privileges over a subtree > > (2.1) The caller is able to reach the file from the provided mount fd. > > (2.2) The caller has permissions to construct an unobstructed path to the > > file handle. > > (2.3) The caller has permissions to follow a path to the file handle. > > > > The relaxed permission checks are currently restricted to directory file > > handles which are what both cgroupfs and fanotify need. Handling disconnected > > non-directory file handles would lead to a potentially non-deterministic api. > > If a disconnected non-directory file handle is provided we may fail to decode > > a valid path that we could use for permission checking. That in itself isn't a > > problem as we would just return EACCES in that case. However, confusion may > > arise if a non-disconnected dentry ends up in the cache later and those opening > > the file handle would suddenly succeed. > > > > * It's potentially possible to use timing information (side-channel) to infer > > whether a given inode exists. I don't think that's particularly > > problematic. Thanks to Jann for bringing this to my attention. > > > > * An unrelated note (IOW, these are thoughts that apply to > > open_by_handle_at() generically and are unrelated to the changes here): > > Jann pointed out that we should verify whether deleted files could > > potentially be reopened through open_by_handle_at(). I don't think that's > > possible though. > > AFAICT, the only thing that currently makes this impossible in this patch > is the O_DIRECTORY limitation. > Because there could be non-directory non-hashed aliases of deleted > files in dcache. > This limitation, that you added to my request, has implications on fanotify (TLDR): file handles reported in fanotify events extra info of type FAN_EVENT_INFO_TYPE_DFID{,_NAME} due to fanotify_inide() flag FAN_REPORT_DIR_FID are eligible for open_by_handle_at(2) inside non-root userns and FAN_REPORT_DFID_NAME is indeed the most expected use case - namely directory file handle can always be used to find the parent's path and the name of the child entry is provided in the event. HOWEVER, with fanotify_init() flags FAN_REPORT_{TARGET_,}FID the child file handle is reported additionally, using extra info type FAN_EVENT_INFO_TYPE_FID and this file handle may not be eligible for open_by_handle_at(2) inside non-root userns, because it could be resolved to a disconnected dentry, for which permission cannot be checked. I don't think this is a problem - the child fid is supposed to be used as an extra check to verify with name_to_handle_at() that the parent dir and child name refer to the same object as the one reported with the child fid. But it can be a bit confusing to users that some file handles (FAN_EVENT_INFO_TYPE_DFID) can be decoded and some (FAN_EVENT_INFO_TYPE_FID) cannot, so this hairy detail will need to be documented. In retrospect, for unpriv fanotify, we should not have allowed the basic mode FAN_REPORT_FID, which does not make a distinction between directory and non-dir file handles. Jan, do you think we should deny the FAN_REPORT_FID mode with non-root userns sb/mount watches to reduce some unneeded rows in the test matrix? Any other thoughts regarding non-root userns sb/mount watches? Thanks, Amir.