On Wed, Mar 17, 2021 at 7:45 PM Christian Brauner <christian.brauner@xxxxxxxxxx> wrote:
>
> On Wed, Mar 17, 2021 at 02:19:57PM +0200, Amir Goldstein wrote:
> > On Wed, Mar 17, 2021 at 1:42 PM Jan Kara <jack@xxxxxxx> wrote:
> > >
> > > On Wed 17-03-21 13:01:35, Amir Goldstein wrote:
> > > > On Tue, Mar 16, 2021 at 5:55 PM Jan Kara <jack@xxxxxxx> wrote:
> > > > >
> > > > > On Thu 04-03-21 13:29:19, Amir Goldstein wrote:
> > > > > > Jan,
> > > > > >
> > > > > > These patches try to implement a minimal set and least controversial
> > > > > > functionality that we can allow for unprivileged users as a starting
> > > > > > point.
> > > > > >
> > > > > > The patches were tested on top of v5.12-rc1 and the fanotify_merge
> > > > > > patches using the unprivileged listener LTP tests written by Matthew
> > > > > > and another LTP tests I wrote to test the sysfs tunable limits [1].
> > > > >
> > > > > Thanks. I've added both patches to my tree.
> > > >
> > > > Great!
> > > > I'll go post the LTP tests and work on the man page updates.
> > > >
> > > > BTW, I noticed that you pushed the aggregating for_next branch,
> > > > but not the fsnotify topic branch.
> > > >
> > > > Is this intentional?
> > >
> > > Not really, pushed now. Thanks for reminder.
> > >
> > > > I am asking because I am usually basing my development branches
> > > > off of your fsnotify branch, but I can base them on the unpushed branch.
> > > >
> > > > Heads up. I am playing with extra privileges we may be able to
> > > > allow an ns_capable user.
> > > > For example, watching a FS_USERNS_MOUNT filesystem that the user
> > > > itself has mounted inside userns.
> > > >
> > > > Another feature I am investigating is how to utilize the new idmapped
> > > > mounts to get a subtree watch functionality. This requires attaching a
> > > > userns to the group on fanotify_init().
> > > >
> > > > <hand waving>
> > > > If the group's userns are the same or below the idmapped mount userns,
> > > > then all the objects accessed via that idmapped mount are accessible
> > > > to the group's userns admin. We can use that fact to filter events very
> > > > early based on their mnt_userns and the group's userns, which should be
> > > > cheaper than any subtree permission checks.
> > > > </hand waving>
> > >
> > > Yeah, I agree this should work. Just it seems to me the userbase for this
> > > functionality will be (at least currently) rather limited. While full
> >
> > That may change when systemd home dirs feature starts to use
> > idmapped mounts.
> > Being able to watch the user's entire home directory is a big win
> > already.
>
> Hey Amir,
> Hey Jan,
>
> I think so too.
> >
> > > subtree watches would be IMO interesting to much more users.
> >
> > Agreed.
>
> We have a use-case for subtree watches: One feature for containers we
> have is that users can e.g. tell us that they want the container manager
> to hotplug an arbitrary unix or block device into the container whenever
> the relevant device shows up on the system. For example they could
> instruct the container manager to plug in some new driver device when it
> shows up in /dev. That works nicely because of uevents. But users quite
> often also instruct us to plug in a path once it shows up in some
> directory in the filesystem hierarchy and unplug it once it is removed.
> Right now we're maintaining an inotify-based hand-rolled recursive watch to
> make this work so we detect that add and remove event. I would be wildly
> excited if we could get rid of some of that complexity by using subtree
> watches. The container manager on the host will be unaffected by this
> feature since it will usually have root privileges and manage
> unprivileged containers.
> The unprivileged (userns use-case specifically here) subtree watches
> will be necessary and really good to have to make this work for
> container workloads and nested containers, i.e. where the container
> manager itself runs in a container and starts new containers. Since the
> subtree feature would be interesting for systemd itself and since our
> container manager (ChromeOS etc.) runs systemd inside unprivileged
> containers on a large scale it would be good if subtree watches could
> work in userns too.
>

I don't understand the subtree watch use case.
You will have to walk me through it.

What exactly is the container manager trying to detect?
That a subdir of a specific name/path was created/deleted?
It doesn't sound like a recursive watch is needed for that.
What am I missing?

As for nested container managers (and systemd), my thinking is that
if all the mounts that manager is watching for serving its containers
are idmapped to that manager's userns (is that a viable option?), then
it shouldn't be a problem to set up userns-filtered watches in order to
be notified of all the events that happen via those idmapped mounts,
and filtering by "subtree" is not needed.
I am clearly far from understanding the big picture.

> >
> > I was looking into that as well, using the example of nfsd_acceptable()
> > to implement the subtree permission check.
> >
> > The problem here is that even if unprivileged users cannot compromise
> > security, they can still cause significant CPU overhead either queueing
> > events or filtering events and that is something I haven't been able to
> > figure out a way to escape from.
> >
> > BUT, if you allow userns admin to setup subtree watches (a.k.a filtered
> > filesystem marks) on a userns filesystem/idmapped mount, now users
>
> I think that sounds reasonable.
> If the mount really is idmapped, it might be interesting to consider
> checking for privilege in the mnt_userns in addition to the regular
> permission checks that fanotify performs. My (equally handwavy) thinking
> is that this might allow for a nice feature where the creator of the
> mount (e.g. systemd) can block the creation of subtree watches by
> attaching a mnt_userns to the mnt that the user has no privilege in.
> (Just a thought.).
>

Currently, (upstream) only init_userns CAP_SYS_ADMIN can set up
fanotify watches.
In linux-next, an unprivileged user can already set up inode watches
(i.e. like inotify).

So I am not sure what you are referring to by "block the creation of
subtree watches".

If systemd were to idmap my home dir to a mnt_userns where my user has
CAP_SYS_ADMIN, then allowing my user to set up a watch for all events
on that mount should not be too hard.

If you think that is useful and you want to play with this feature,
I can provide a WIP branch soon.

Thanks,
Amir.
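
For illustration only, here is a rough userspace sketch of the
"inode watch (i.e. like inotify)" case mentioned above. It is not taken
from the patches. The helper name watch_dir_inode() is made up, and I am
assuming that an unprivileged group has to be created in fid mode
(FAN_REPORT_FID) and may only use inode marks; please check the
fanotify_init(2) and fanotify_mark(2) man pages for the kernel you run.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <sys/fanotify.h>
#include <unistd.h>

/*
 * Hypothetical helper, not part of the patch set: create an fanotify
 * group and put an inode mark on one directory, asking for events when
 * entries are created or deleted directly inside it.
 *
 * Assumption: the caller may be unprivileged, so the group uses
 * FAN_REPORT_FID and no mount/filesystem marks are requested.
 *
 * Returns the group fd on success, or -1 with errno set.
 */
int watch_dir_inode(const char *path)
{
	int fd, saved_errno;

	fd = fanotify_init(FAN_CLASS_NOTIF | FAN_REPORT_FID | FAN_CLOEXEC,
			   O_RDONLY);
	if (fd < 0)
		return -1;

	/* Inode mark only: no FAN_MARK_MOUNT or FAN_MARK_FILESYSTEM. */
	if (fanotify_mark(fd, FAN_MARK_ADD,
			  FAN_CREATE | FAN_DELETE | FAN_ONDIR,
			  AT_FDCWD, path) < 0) {
		saved_errno = errno;
		close(fd);
		errno = saved_errno;
		return -1;
	}

	return fd;
}
```

A caller would read struct fanotify_event_metadata records from the
returned fd. On a kernel without unprivileged fanotify support the helper
is expected to fail (probably with EPERM) for a non-root user, so callers
should be ready to fall back to inotify.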