Re: [RFC] Volatile fanotify marks

On Wed, Mar 2, 2022 at 12:04 PM Tycho Kirchner <tychokirchner@xxxxxxx> wrote:
>
>
>
> Am 01.03.22 um 17:58 schrieb Amir Goldstein:
> > On Tue, Mar 1, 2022 at 2:26 PM Tycho Kirchner <tychokirchner@xxxxxxx> wrote:
> >>
> >>
> >>
> >>>>> I wanted to get your feedback on an idea I have been playing with.
> >>>>> It started as a poor man's alternative to the old subtree watch problem.
> >>
> >>
> >>> I do agree that we should NOT add "subtree filter" functionality to fanotify
> >>> (or any other filter) and that instead, we should add support for attaching an
> >>> eBPF program that implements is_subdir().
> >>> I found this [1] conversation with Tycho where you had suggested this idea.
> >>> I wonder if Tycho got to explore this path further?
> >>>
> >>> [1] https://lore.kernel.org/linux-fsdevel/20200828084603.GA7072@xxxxxxxxxxxxxx/
> >>
> >> Hi Amir, Hi Jan,
> >> Thanks for pinging me back. Indeed I did "explore this path further".
> >> In my project
> >> https://github.com/tycho-kirchner/shournal
> >>
> >> the goal is to track read/written files of a process tree and all its child processes and connect this data to a given shell command. In fact, after my last correspondence with Amir I implemented a kernel module which instruments ftrace and tracepoints to trace fput events (kernel/event_handler.c:event_handler_fput) of specific tasks, which are then further processed in a dedicated kernel thread. I considered eBPF for this task but found no satisfying approach to have dynamic, different filter rules (e.g. include-paths) for each process tree of each user.
> >>
> >>
> >> Regarding improvement of fanotify let's discriminate two cases: system-monitoring and tracing.
> >> Regarding system-monitoring: I'm not sure how exactly FAN_MARK_VOLATILE would work (Amir, could you please elaborate?)
> >
> > FAN_MARK_VOLATILE is not a solution for "include" filters.
> > It is a solution for "exclude" filters implemented in userspace.
> > If the monitoring program gets an event and decides that its path should be
> > excluded, it may set a "volatile" exclude mark on that directory that will
> > suppress further events from that directory for as long as the directory
> > inode remains in the inode cache.
> > After the directory inode has not been accessed for a while and has been
> > evicted from the inode cache, the monitoring program can get an event in
> > that directory again, and then it can re-install the volatile ignore mark
> > if it wants to.
> >
> Thanks for this explanation. For a few exclude directories this sounds useful.
> However, if filesystem events for a whole directory tree are to be excluded, I guess
> the performance benefit will be rather small. A benchmark may clarify this
> (I have some as-yet-unpublished code ready, in case you are interested).

Code for what?

> If an efficient algorithm can be found, I would rather vote for "include" dirs with unlimited depth.

As I said, this is desirable, difficult, and completely orthogonal to the
FAN_MARK_VOLATILE functionality.

> Btw. similar to the process-filter approach via unshared mount namespaces, about which
> I wrote in our last correspondence, you may be able to exclude your .private/ directory
> by bind-mounting over it and then marking only those mounts of interest
> instead of the entire filesystem. But yeah, this is kinda messy.
>

Marking bind mounts could be a good option for some use cases.
But unlike your monitoring app, my app needs to track create/unlink/rename as
well, and those events are not currently available for mount marks.
I have made several attempts to tackle that, but they have not worked out yet,
partly because marking a bind mount is not as useful as filtering by subtree.

> >> but what do you think about the following approach to solve the subtree watch problem:
> >> - Store the include/exclude paths of interest as *strings* in a hashset.
> >> - On an fsevent, look up the path by calling d_path() only once and cache whether
> >>   events for the given path are of interest. This can either happen with a reference
> >>   on the path (clear older paths periodically in a work queue) or with a time limit
> >>   within which potentially wrong paths are accepted (path pointer freed and address
> >>   reused). The second approach I use myself in kernel/event_consumer_cache.c. See
> >>   also kpathtree.c for a somewhat efficient subpath lookup.
> >
> > I would implement filtering with is_subdir() and not with d_path(),
> > but there are advantages to either approach.
> > In any case, I see there is BPF_FUNC_d_path, so why can't your approach
> > be implemented using an eBPF program?
>
> It seems that bpf_d_path was introduced with v5.10
> (6e22ab9da79343532cd3cde39df25e5a5478c692); however, shournal must still
> run on older kernels (e.g. openSUSE Leap with v5.3.18). Further, as far
> as I remember, at least in Linux 4.19 there was quite some overhead just
> to install the fd into the eBPF user-space process, but I have to
> re-check that once that functionality is more widespread.
>

There is no need to install any fd.
The program should hook a function that has access to a struct path.

>
> >>
> >> Regarding tracing, I think fanotify would really benefit from a FAN_MARK_PID (with an optional follow-fork mode). That way, one of the first filter steps would be whether events for the given task are of interest at all, so there is no performance problem for all other tasks. The possibility to mark specific processes would also have another substantial benefit: fanotify could be used without root privileges by only allowing the user to mark his/her own processes.
> >> That way existing inotify users could finally switch to the cleaner/more powerful fanotify.
> >
> > We already have partial support for unprivileged fanotify.
> > Which features are you missing with unprivileged fanotify?
> > And why do you think that filtering by process tree would allow those
> > features to be enabled?
>
>
> I am missing the ability to filter for (close-)events of large directory trees in a race-free manner, so that no events are lost on newly created dirs. Even without the race, monitoring my home directory is impossible (without privileges), as I have far more than 8192 directories (393941 as of writing (; ).
> Monitoring mounts solves these problems but introduces two others:
> First, it requires privileges; second, a potentially large number of events *not of interest* have to be copied to user space (unless unshared mount namespaces are used). Allowing a user to only monitor his/her own processes would make mark_mount privileges unnecessary (please correct me if I'm wrong). While events above the directory of interest are still reported, at least events from other users are filtered beforehand.
>

I don't know. Security models are hard.
What do you mean by "his/her own processes"? Processes owned by the same uid?
At first glance it sounds right, but other security policies may be in play
(e.g. sepolicy), which can grant different processes owned by the same user
different file access permissions, and not every process may be allowed to
ptrace another process.
A userns has clearer semantics, so monitoring all processes/mounts inside
an unprivileged userns may be easier to prove correct.

> > A child process may well have more privileges to read directories than
> > its parent.
> >
> Similar to ptrace, fanotify should then not follow suid programs, so this case should not occur.
>
> After all, I totally understand that you do not want to feature-bloat fanotify, and maybe my use case is already too far from the one casual users have. On the other hand, pid- or path-filtering is maybe basic enough, and fanotify does offer the ability to filter for paths - it is just quite limited due to the mark concept. I think it should not be necessary, in order to monitor a directory tree, to touch every single directory inside it beforehand. Maybe a hybrid solution fits best here: hard-code pid-filtering as a security feature into fanotify, allow marking of mounts for the user's own processes, and allow for eBPF filter rules afterwards.
>

This is a bit too hand-wavy for me to understand.
In the end, it's all in the details.
I need to see a whole design and/or implementation to be able to say
what its benefits are and how doable it is.

Thanks,
Amir.



