Re: [RFC][PATCH] fanotify: introduce filesystem view mark

Christian Brauner <christian.brauner@xxxxxxxxxx> · Wed, 5 May 2021 15:25:57 +0200

On Mon, May 03, 2021 at 06:53:15PM +0200, Jan Kara wrote:
> On Wed 28-04-21 21:28:18, Amir Goldstein wrote:
> > On Thu, Nov 26, 2020 at 1:17 PM Jan Kara <jack@xxxxxxx> wrote:
> > > On Thu 26-11-20 05:42:01, Amir Goldstein wrote:
> > > > On Wed, Nov 25, 2020 at 1:01 PM Jan Kara <jack@xxxxxxx> wrote:
> > > > > On Tue 24-11-20 16:47:41, Amir Goldstein wrote:
> > > > > > On Tue, Nov 24, 2020 at 3:49 PM Jan Kara <jack@xxxxxxx> wrote:
> > > > > > > On Mon 09-11-20 20:00:16, Amir Goldstein wrote:
> > > > > > > > A filesystem view is a subtree of a filesystem accessible from a specific
> > > > > > > > mount point.  When marking an FS view, user expects to get events on all
> > > > > > > > inodes that are accessible from the marked mount, even if the events
> > > > > > > > were generated from another mount.
> > > > > > > >
> > > > > > > > In particular, the events such as FAN_CREATE, FAN_MOVE, FAN_DELETE that
> > > > > > > > are not delivered to a mount mark can be delivered to an FS view mark.
> > > > > > > >
> > > > > > > > One example of a filesystem view is btrfs subvolume, which cannot be
> > > > > > > > marked with a regular filesystem mark.
> > > > > > > >
> > > > > > > > Another example of a filesystem view is a bind mount, not on the root of
> > > > > > > > the filesystem, such as the bind mounts used for containers.
> > > > > > > >
> > > > > > > > A filesystem view mark is composed of a heads sb mark and an sb_view mark.
> > > > > > > > The filesystem view mark is connected to the head sb mark and the head
> > > > > > > > sb mark is connected to the sb object. The mask of the head sb mask is
> > > > > > > > a cumulative mask of all the associated sb_view mark masks.
> > > > > > > >
> > > > > > > > Filesystem view marks cannot co-exist with a regular filesystem mark on
> > > > > > > > the same filesystem.
> > > > > > > >
> > > > > > > > When an event is generated on the head sb mark, fsnotify iterates the
> > > > > > > > list of associated sb_view marks and filter events that happen outside
> > > > > > > > of the sb_view mount's root.
> > > > > > > >
> > > > > > > > Signed-off-by: Amir Goldstein <amir73il@xxxxxxxxx>
> > > > > > >
> > > > > > > I gave this just a high-level look (no detailed review) and here are my
> > > > > > > thoughts:
> > > > > > >
> > > > > > > 1) I like the functionality. IMO this is what a lot of people really want
> > > > > > > when looking for "filesystem wide fs monitoring".
> > > > > > >
> > > > > > > 2) I don't quite like the API you propose though. IMO it exposes details of
> > > > > > > implementation in the API. I'd rather like to have API the same as for
> > > > > > > mount marks but with a dedicated mark type flag in the API - like
> > > > > > > FAN_MARK_FILESYSTEM_SUBTREE (or we can keep VIEW if you like it but I think
> > > > > > > the less terms the better ;).
> > > > > >
> > > > > > Sure, FAN_MARK_FS_VIEW is a dedicated mark type.
> > > > > > The fact that is it a bitwise OR of MOUNT and FILESYSTEM is just a fun fact.
> > > > > > Sorry if that wasn't clear.
> > > > > > FAN_MARK_FILESYSTEM_SUBTREE sounds better for uapi.
> > > > > >
> > > > > > But I suppose you also meant that we should not limit the subtree root
> > > > > > to bind mount points?
> > > > > >
> > > > > > The reason I used a reference to mnt for a sb_view and not dentry
> > > > > > is because we have fsnotify_clear_marks_by_mount() callback to
> > > > > > handle cleanup of the sb_view marks (which I left as TODO).
> > > > > >
> > > > > > Alternatively, we can play cache pinning games with the subtree root dentry
> > > > > > like the case with inode mark, but I didn't want to get into that nor did I know
> > > > > > if we should - if subtree mark requires CAP_SYS_ADMIN anyway, why not
> > > > > > require a bind mount as its target, which is something much more visible to
> > > > > > admins.
> > > > >
> > > > > Yeah, I don't have problems with bind mounts in particular. Just I was
> > > > > thinking that concievably we could make these marks less priviledged (just
> > > > > with CAP_DAC_SEARCH or so) and then mountpoints may be unnecessarily
> > > > > restricting. I don't think pinning of subtree root dentry would be
> > > > > problematic as such - inode marks pin the inode anyway, this is not
> > > > > substantially different - if we can make it work reliably...
> > > > >
> > > > > In fact I was considering for a while that we could even make subtree
> > > > > watches completely unpriviledged - when we walk the dir tree anyway, we
> > > > > could also check permissions along the way. Due to locking this would be
> > > > > difficult to do when generating the event but it might be actually doable
> > > > > if we perform the permission check when reporting the event to userspace.
> > > > > Just a food for thought...
> > > > >
> > > >
> > > > I think unprivileged subtree watches are something nice for the future, but
> > > > for these FS_VIEW (or whatnot) marks, there is a lower hanging opportunity -
> > > > make them require privileges relative to userns.
> > >
> > > Agreed, that's a middle step.
> > >
> > > > We don't need to relax that right from the start and it may requires some
> > > > more work, but it could allow  unprivileged container user to set a
> > > > filesystem-like watch on a filesystem where user is privileged relative
> > > > to s_user_ns and that is a big win already.
> > >
> > > Yep, I'd prefer to separate these two problems. I.e., first handle the
> > > subtree watches on their own (just keeping in mind we might want to make
> > > them less priviledged eventually), when that it working, we can look in all
> > > the implications of making fanotify accessible to less priviledged tasks.
> > >
> > > > It may also be possible in the future to allow setting this mark on a
> > > > "unserns contained" mount - I'm not exactly sure of the details of idmapped
> > > > mounts [1], but if mount has a userns associated with it to map fs uids then
> > > > in theory we can check the view-ability of the event either at event read time
> > > > or at event generation time - it requires that all ancestors have uid/gid that
> > > > are *mapped* to the mount userns and nothing else, because we know
> > > > that the listener process has CAP_DAC_SEARCH (or more) in the target
> > > > userns.
> > >
> > > Event read is *much* simpler for permission checks IMO. First due to
> > > locking necessary for permission checks (i_rwsem, xattr locks etc.), second
> > > so that you don't have to mess with credentials used for checking.
> > >
> > 
> > Jan,
> > 
> > I've lost track of all the "subtree mark" related threads ;-)
> 
> Yeah, me as well :)
> 
> > Getting back to this old thread, because the "fs view" concept that
> > it presented is very close to two POCs I tried out recently which leverage
> > the availability of mnt_userns in most of the call sites for fsnotify hooks.
> > 
> > The first POC was replacing the is_subtree() check with in_userns()
> > which is far less expensive:
> > 
> > https://github.com/amir73il/linux/commits/fanotify_in_userns
> > 
> > This approach reduces the cost of check per mark, but there could
> > still be a significant number of sb marks to iterate for every fs op
> > in every container.
> > 
> > The second POC is based off the first POC but takes the reverse
> > approach - instead of marking the sb object and filtering by userns,
> > it places a mark on the userns object and filters by sb:
> > 
> > https://github.com/amir73il/linux/commits/fanotify_idmapped
> > 
> > The common use case is a single host filesystem which is
> > idmapped via individual userns objects to many containers,
> > so normally, fs operations inside containers would have to
> > iterate a single mark.
> > 
> > I am well aware of your comments about trying to implement full
> > blown subtree marks (up this very thread), but the userns-sb
> > join approach is so much more low hanging than full blown
> > subtree marks. And as a by-product, it very naturally provides
> > the correct capability checks so users inside containers are
> > able to "watch their world".
> > 
> > Patches to allow resolving file handles inside userns with the
> > needed permission checks are also available on the POC branch,
> > which makes the solution a lot more useful.
> > 
> > In that last POC, I introduced an explicit uapi flag
> > FAN_MARK_IDMAPPED in combination with
> > FAN_MARK_FILESYSTEM it provides the new capability.
> > This is equivalent to a new mark type, it was just an aesthetic
> > decision.
> 
> So in principle, I have no problem with allowing mount marks for ns-capable
> processes. Also FAN_MARK_FILESYSTEM marks filtered by originating namespace
> look OK to me (although if we extended mount marks to support directory
> events as you try elsewhere, would there be still be a compeling usecase for
> this?).
> 
> My main concern is creating a sane API so that if we expand the
> functionality in the future we won't create a mess out of all
> possibilities.

Yeah, this is most certainly the main challenge here.

> 
> So I think there are two, relatively orthogonal decicions to make:
> 
> 1) How the API should look like? For mounts there's no question I guess.
> It's a mount mark as any other and we just relax the permission checks.
> For FAN_MARK_FILESYSTEM marks we have to me more careful - I think
> restricting mark to events generated only from a particular userns has to
> be an explicit flag when adding the mark. Otherwise process that is
> CAP_SYS_ADMIN in init_user_ns has no way of using these ns-filtered marks.
> But this is also the reason why I'd like to think twice before adding this
> event filtering if we can cover similar usecases by expanding mount marks
> capabilities instead (it would certainly better fit overall API design).
> 
> 2) Whether to internally attach marks to sb or to userns and how to
> efficiently process them when generating events. This is an internal
> decision of fsnotify and so I'm not concerned too much about it. We can
> always tweak it in the future if the usecases show the CPU overhead is
> significant. E.g. we could attach filtered marks to sb but hash it by
> userns (or have rbtree ordered by userns in sb) to lower the CPU overhead
> if there will be many sb marks expected. Attaching to userns as you suggest
> in POC2 is certainly an option as well although I guess I sligthly prefer
> to keep things in the sb so that we don't have to create yet another place
> to attach marks to and all the handling associated with that.

I would need to look at Amir's implementation again but if there's a way
to keep all the state we need for this within the sb that would be good.
I agree with that. It feels like the correct place.

Christian