Re: [RFC] Volatile fanotify marks

Am 04.05.22 um 08:13 schrieb Amir Goldstein:
On Mon, May 2, 2022 at 12:13 PM Tycho Kirchner <tychokirchner@xxxxxxx> wrote:

All right, I thought a bit more about that and returned to your
original BPF idea you mentioned on 2020-08-28:

I was thinking that we could add a BPF hook to fanotify_handle_event()
(similar to what's happening in packet filtering code) and you could attach
BPF programs to this hook to do filtering of events. That way we don't have
to introduce new group flags for various filtering options. The question is
whether eBPF is strong enough so that filters useful for fanotify users
could be implemented with it but this particular check seems implementable.

                                                               Honza

Instead of changing fanotify's filesystem notification functionality,
I suggest instead **adding a tracing mode (fantrace)**.

The synchronous handling of syscalls via ptrace is of course required
for debugging purposes, but it introduces a major slowdown (even with
seccomp-bpf filters). There are a number of cases, including [1-3],
where asynchronous processing of the file events of specific tasks
would be fine but is not readily available in Linux. Fanotify already
ships important infrastructure in this regard: it provides very fast
event buffering and, by using file descriptors instead of resolved
paths, a clean and race-free API for processing the events later.
However, as already stated, fanotify does not provide a clean way to
monitor only a subset of tasks. Therefore please consider the
following proposed architecture of fantrace:

Each task gets its own struct fsnotify_group. Within
fsnotify.c:fsnotify() it is checked whether the given task has an
fsnotify_group attached, in which events of interest are buffered as
usual. Note that this is an additional hook: sysadmins subscribed to
filesystem events rather than task filesystem events are notified as
usual, so in that case two hooks may run. The fsnotify_group is
extended by a field optionally pointing to a BPF program, which allows
custom filters to be run.

Some implementation details:
- To let the tracee return quickly, run the BPF filter program in
    tracer context during read(fan_fd), but before events are copied
    to userspace
- only one fantracer per task; attaching a new one overrides an
    existing one
- task->fsnotify_group refcount is incremented on fork and decremented
    on exit (run after exit_files(tsk) so final close events are not
    missed). When the last task has exited, send EOF to the listener.
- on exec of setuid programs the fsnotify_group is cleared (as in
    ptrace)
- check lazily when an event occurs whether the listener is still
    alive (refcount > 1)
- to begin with, to keep things simple and to "solve" the cleanup of
    filesystem marks, I suggest disabling i_fsnotify_marks for fantrace
    (i.e. only allowing FAN_MARK_FILESYSTEM), as that functionality can
    be implemented within the user-provided BPF program.


Maybe I am slow, but I did not understand the need for this per-task
fsnotify_group.

What's wrong with Jan's suggestion (adding a BPF hook to
fanotify_handle_event())? That hook is supposed to filter by pid, so
why all this extra complexity?

We may consider the option of having another BPF hook when reading
events if there is good justification, but subtree filters will have
to be in handle_event().

Thanks,
Amir.

To be a reasonable async replacement for ptrace (see e.g. the
mentioned reprozip), file events from all paths have to be reported,
which is difficult using i_fsnotify_marks, because
- marking whole mountpoints requires privileges
- marking the whole filesystem via directory marks is infeasible

However, we need a quick way to find out whether a file event is of
interest at all (i.e. to find its belonging fsnotify_group). For the
purpose of tracing it appears reasonable to consider all file events
of a traced task as "interesting" in the first place. This way a user
may trace the file events of their own tasks without slowing down
other, non-traced tasks.

After all, it comes down to the order in which the filters run: first
inode, then pid, or the reverse. With my proposed architecture for
tracing, I would hand the inode filter to the user in the form of an
optional BPF hook. Performance-wise that is also the "fair" solution.
Assume we allowed marking the whole filesystem (via mountpoints): the
BPF pid-filter code would then have to run for every single file event
(of all users!), and if multiple users traced the filesystem, multiple
hooks would have to run, slowing down the whole system.


Thanks,
Tycho


