On Tue, Oct 15, 2024 at 4:01 PM Christian Brauner <brauner@xxxxxxxxxx> wrote:
>
> On Sun, Oct 13, 2024 at 06:34:18PM +0200, Amir Goldstein wrote:
> > On Fri, May 24, 2024 at 2:35 PM Amir Goldstein <amir73il@xxxxxxxxx> wrote:
> > >
> > > On Fri, May 24, 2024 at 1:19 PM Christian Brauner <brauner@xxxxxxxxxx> wrote:
> > > >
> > > > A current limitation of open_by_handle_at() is that it's currently not possible
> > > > to use it from within containers at all because we require CAP_DAC_READ_SEARCH
> > > > in the initial namespace. That's unfortunate because there are scenarios where
> > > > using open_by_handle_at() from within containers.
> > > >
> > > > Two examples:
> > > >
> > > > (1) cgroupfs allows to encode cgroups to file handles and reopen them with
> > > >     open_by_handle_at().
> > > > (2) Fanotify allows placing filesystem watches they currently aren't usable in
> > > >     containers because the returned file handles cannot be used.
> > > >
> >
> > Christian,
> >
> > Follow up question:
> > Now that open_by_handle_at(2) is supported from non-root userns,
> > What about this old patch to allow sb/mount watches from non-root userns?
> > https://lore.kernel.org/linux-fsdevel/20230416060722.1912831-1-amir73il@xxxxxxxxx/
> >
> > Is it useful for any of your use cases?
> > Should I push it forward?
>
> Dammit, I answered that message already yesterday but somehow it didn't
> get sent or lost in some other way.
>
> I personally don't have a use-case for it but the systemd folks might
> and it would be best to just rope them in.

Lennart,

I must have asked this question before, but enough time has passed
so I am going to ask it again.

Now that Christian has added support for open_by_handle_at(2)
by non-root userns admin, it is very low-hanging fruit to support
fanotify sb/mount watches inside userns with this simple patch [1],
which was last posted in 2023.

My question is whether this is useful, because there are still
a few limitations.
I will start with what is possible with this patch:
1. Watch an entire tmpfs filesystem that was mounted inside userns
2. Watch an entire overlayfs filesystem that was mounted [*] inside userns
3. Watch an entire mount [**] of any [***] filesystem that was
   idmapped mounted into userns

Now the fine print:
[*] Overlayfs sb/mount CAN be watched, but decoding the file handles in
     events to paths only works if overlayfs is mounted with mount option
     nfs_export=on, which conflicts with mount option metacopy=on,
     which is often used in containers (e.g. podman)
[**] Watching a mount is only possible with the legacy set of fanotify
     events (i.e. open,close,access,modify), so this is less useful for
     directory tree change tracking
[***] Watching an idmapped mount has the same limitations as watching
     an sb/mount in the root userns, namely, the filesystem needs to have
     a non-zero fsid (so not FUSE) and the filesystem needs to have a
     uniform fsid (so not a btrfs subvolume), although with some stretch,
     I could make watching an idmapped mount of a btrfs subvol work.

The lack of support for watching btrfs subvols and overlayfs with
metacopy=on reduces the attractiveness for containers, but perhaps there
are still use cases where watching an idmapped mount or a userns-private
tmpfs is useful?

To try out this patch inside your favorite container/userns, you can
build fsnotifywait with a patch to support watching inside userns [2].
It's actually only the one-line O_DIRECTORY patch that is needed for
the basic tmpfs userns mount case.

Jan,

If we do not get any buy-in from potential consumers now,
do you think that we should go through with the patch and advertise
the new supported use cases, so that users may come later on?

Thanks,
Amir.

[1] https://github.com/amir73il/linux/commits/fanotify_userns/
[2] https://github.com/amir73il/inotify-tools/commits/fanotify_userns/
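
For illustration only, here is a minimal C sketch of what a consumer of case 1 above (an sb watch on a tmpfs mounted inside the userns) might do, together with a pre-check for the non-zero fsid limitation from [***]. The function names watch_filesystem() and has_nonzero_fsid() are hypothetical and are not taken from the patch or from fsnotifywait; the event mask is just one plausible choice. Whether fanotify_mark() succeeds inside a userns depends on the kernel carrying the patch, so the caller should expect an error such as -EPERM otherwise.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <sys/fanotify.h>
#include <sys/types.h>
#include <sys/vfs.h>
#include <unistd.h>

/*
 * Hypothetical helper: returns 1 if the filesystem containing `path`
 * reports a non-zero fsid, 0 if the fsid is all zeroes, or -errno if
 * statfs() fails. This cannot detect a non-uniform fsid (the btrfs
 * subvolume case); it only covers the "non-zero fsid" limitation.
 */
int has_nonzero_fsid(const char *path)
{
    struct statfs st;
    fsid_t zero;

    if (statfs(path, &st) < 0)
        return -errno;

    memset(&zero, 0, sizeof(zero));
    return memcmp(&st.f_fsid, &zero, sizeof(zero)) != 0;
}

/*
 * Hypothetical helper: try to place an fanotify filesystem (sb) watch on
 * the filesystem containing `path`. Returns the fanotify fd on success,
 * or -errno on failure.
 */
int watch_filesystem(const char *path)
{
    int fd, err;

    /* FAN_REPORT_FID: events carry fsid + file handle, not an open fd */
    fd = fanotify_init(FAN_CLASS_NOTIF | FAN_REPORT_FID | FAN_CLOEXEC,
                       O_RDONLY);
    if (fd < 0)
        return -errno;

    /* Directory entry events are only available with sb/inode marks */
    if (fanotify_mark(fd, FAN_MARK_ADD | FAN_MARK_FILESYSTEM,
                      FAN_CREATE | FAN_DELETE | FAN_MODIFY | FAN_ONDIR,
                      AT_FDCWD, path) < 0) {
        err = errno;
        close(fd);
        return -err;
    }

    return fd;
}
```

A caller would typically check has_nonzero_fsid() on the mount point first, then call watch_filesystem() and read events from the returned fd. Turning the file handles in those events back into paths would go through open_by_handle_at(2), which is where the overlayfs nfs_export=on caveat from [*] applies.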