On Mon, May 17, 2021 at 12:09 PM Jan Kara <jack@xxxxxxx> wrote:
>
> On Sat 15-05-21 17:28:27, Amir Goldstein wrote:
> > On Fri, May 14, 2021 at 4:56 PM Christian Brauner
> > <christian.brauner@xxxxxxxxxx> wrote:
> > > > for changes with idmap-filtered mark, then it won't see notification for
> > > > those changes because A presumably runs in a different namespace than B, am
> > > > I imagining this right? So mark which filters events based on namespace of
> > > > the originating process won't be usable for such usecase AFAICT.
> > >
> > > Idmap filtered marks won't cover that use-case as envisioned now. Though
> > > I'm not sure they really need to as the semantics are related to mount
> > > marks.
> >
> > We really need to refer to those as filesystem marks. They are definitely
> > NOT mount marks. We are trying to design a better API that will not share
> > as many flaws with mount marks...
>
> I agree. I was pondering about this usecase exactly because the problem with
> changes done through mount A and visible through mount B which didn't get
> a notification were source of complaints about fanotify in the past and the
> reason why you came up with filesystem marks.
>
> > > A mount mark would allow you to receive events based on the
> > > originating mount. If two mounts A and B are separate but expose the
> > > same files you wouldn't see events caused by B if you're watching A.
> > > Similarly you would only see events from mounts that have been delegated
> > > to you through the idmapped userns. I find this acceptable especially if
> > > clearly documented.
> > >
> >
> > The way I see it, we should delegate all the decisions over to userspace,
> > but I agree that the current "simple" proposal may not provide a good
> > enough answer to the case of a subtree that is shared with the host.
> >
> > IMO, it should be a container manager decision whether changes done by
> > the host are:
> > a) Not visible to containerized application
> > b) Watched in host via recursive inode watches
> > c) Watched in host by filesystem mark filtered in userspace
> > d) Watched in host by an "noop" idmapped mount in host, through
> >     which all relevant apps in host access the shared folder
> >
> > We can later provide the option of "subtree filtered filesystem mark"
> > which can be choice (e). It will incur performance overhead on the system
> > that is higher than option (d) but lower than option (c).
>
> But won't b) and c) require the container manager to inject events into the
> event stream observed by the containerized fanotify user? Because in both
> these cases the manager needs to consume generated events and decide what
> to do with them.
>

With (b), the manager does not need to inject events.
The manager intercepts fanotify_init() and installs an actual
fanotify group fd in the requesting process's fd table.

Later, when the manager intercepts fanotify_mark() with an idmapped mark
request, it can take care of setting up the recursive inode watches,
but the requesting process will get the events itself, because it holds
a clone of the fanotify group fd.

With (c), I guess the intercepted fanotify_init() can return an open pipe,
and the manager can proxy the stream of events read from the actual
fanotify fd, filtering out the irrelevant events.

I hope we can provide some form of kernel subtree filtering so userspace
will not need to resort to this sort of practice.

Thanks,
Amir.
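
P.S. To make the userspace filtering in (c) a bit more concrete, here is a
rough sketch of the check a manager could run on each event it reads from
the real fanotify fd. This is only an illustration under assumptions: the
names path_in_subtree() and count_subtree_events() are made up for this
mail, the group is assumed to report events with an open fd (no FAN_REPORT_FID),
and the paths are the ones seen from the manager's mount namespace. The
forwarding step itself is left out.

```c
#define _GNU_SOURCE
#include <limits.h>
#include <stdio.h>
#include <string.h>
#include <sys/fanotify.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: 1 if path is root itself or lies below it, else 0. */
int path_in_subtree(const char *root, const char *path)
{
	size_t n = strlen(root);

	/* Drop trailing slashes so "/srv/shared/" behaves like "/srv/shared". */
	while (n > 1 && root[n - 1] == '/')
		n--;
	if (strncmp(root, path, n) != 0)
		return 0;
	if (n == 1 && root[0] == '/')
		return 1;	/* every absolute path is below "/" */
	/* Require a component boundary, so "/srv/shared2" does not match. */
	return path[n] == '\0' || path[n] == '/';
}

/*
 * Hypothetical helper: read one batch of events from fan_fd and count
 * those whose object lies under root. A real proxy would forward the
 * matching events to the container at the marked spot instead of only
 * counting them. Returns the match count, or -1 if read() fails.
 */
int count_subtree_events(int fan_fd, const char *root)
{
	struct fanotify_event_metadata buf[256];
	struct fanotify_event_metadata *ev = buf;
	char link[64], path[PATH_MAX];
	ssize_t len, n;
	int matched = 0;

	len = read(fan_fd, buf, sizeof(buf));
	if (len < 0)
		return -1;

	for (; FAN_EVENT_OK(ev, len); ev = FAN_EVENT_NEXT(ev, len)) {
		if (ev->fd < 0)
			continue;	/* e.g. FAN_NOFD on queue overflow */
		/* Resolve the event fd to a path, as seen by the manager. */
		snprintf(link, sizeof(link), "/proc/self/fd/%d", ev->fd);
		n = readlink(link, path, sizeof(path) - 1);
		if (n >= 0) {
			path[n] = '\0';
			if (path_in_subtree(root, path))
				matched++;	/* forward to the container here */
		}
		close(ev->fd);
	}
	return matched;
}
```

The fan_fd here would come from fanotify_init() plus a fanotify_mark() call
with FAN_MARK_ADD | FAN_MARK_FILESYSTEM in the manager, which needs
CAP_SYS_ADMIN. Note that the event fds live in the manager's fd table, so
a plain pipe cannot carry them to the container as-is; passing them over a
unix socket with SCM_RIGHTS would be one way around that. The per-event
readlink() is also where the overhead of (c) comes from.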