Re: [RFC/PATCH v2 bpf-next fanotify 7/7] selftests/bpf: Add test for BPF based fanotify fastpath handler

Amir Goldstein <amir73il@xxxxxxxxx> · Tue, 19 Nov 2024 08:59:30 +0100

On Tue, Nov 19, 2024 at 2:10 AM Song Liu <songliubraving@xxxxxxxx> wrote:
>
> Hi Alexei,
>
> > On Nov 18, 2024, at 4:10 PM, Alexei Starovoitov <alexei.starovoitov@xxxxxxxxx> wrote:
> [...]
> >>>
> >>> Agreed. This is actually something I have been thinking
> >>> since the beginning of this work: Shall it be fanotify-bpf
> >>> or fsnotify-bpf. Given we have more materials, this is a
> >>> good time to have broader discussions on this.
> >>>
> >>> @all, please chime in whether we should redo this as
> >>> fsnotify-bpf. AFAICT:
> >>>
> >>> Pros of fanotify-bpf:
> >>> - There is existing user space that we can leverage/reuse.
> >>>
> >>> Pros of fsnotify-bpf:
> >>> - Faster fast path.
> >>>
> >>> Another major pros/cons did I miss?

Sorry, I forgot to address this earlier.

First and foremost, I don't like the brand name "fast path" for this
feature.  The word "fast" implies that avoiding an indirect call is
meaningful and I don't think this is the case here.

I would rather rebrand this feature as "fanotify filter program".
I am going to make a controversial argument:
I anticipate that in the more common use of kernel filter for fanotify
the filter will be *likely* to skip sending events to userspace,
for example, when watching a small subtree in a large filesystem.

In that case, the *likely* result is that a context switch to userspace
is avoided, hence, the addition of one indirect call is insignificant
in comparison.

If we later decide that my assumption is wrong and it is important
to optimize out the indirect call when skipping userspace, we could
do that - it is not a problem, but for now, this discussion seems
like premature optimization to me.

> >>
> >> Adding more thoughts on this: I think it makes more sense to
> >> go with fanotify-bpf. This is because one of the benefits of
> >> fsnotify/fanotify over LSM solutions is the built-in event
> >> filtering of events. While this call chain is a bit long:
> >>
> >> fsnotify_open_perm->fsnotify->send_to_group->fanotify_handle_event.
> >>
> >> There are built-in filtering in fsnotify() and
> >> send_to_group(), so logics in the call chain are useful.
> >
> > fsnotify_marks based filtering happens in fsnotify.
> > No need to do more indirect calls to get to fanotify.
> >
> > I would add the bpf struct_ops hook right before send_to_group
> > or inside of it.
> > Not sure whether fsnotify_group concept should be reused
> > or avoided.
> > Per inode mark/mask filter should stay.
>
> We still need fsnotify_group. It matches each fanotify
> file descriptor.
>

We need the filter to be per fsnotify_group.
Unlike LSM, fsnotify/fanotify is a pubsub service for many
fs event consumers (a.k.a groups).
The feature is a filter per event consumer, not a global filter
for all event consumers.

> Moving struct_ops hook inside send_to_group does save
> us an indirect call. But this also means we need to
> introduce the fastpath concept to both fsnotify and
> fanotify. I personally don't really like duplications
> like this (see the big BUILD_BUG_ON array in
> fanotify_handle_event).
>
> OTOH, maybe the benefit of one fewer indirect call
> justifies the extra complexity. Let me think more
> about it.
>

I need to explain something about fsnotify vs. fanotify
in order to argue why the feature should be "fanotify", but the
bottom line is that is should not be too hard to avoid the indirect
call even if the feature is introduced through fanotify API as I think
that it should be.

TLDR:
The fsnotify_backend abstraction has become somewhat
of a theater of abstraction over time, because the feature
distance between fanotify backend and all the rest has grew
quite large.

The logic in send_to_group() is *seemingly* the generic fsnotify
logic, but not really, because only fanotify has ignore masks
and only fanotify has mark types (inode,mount,sb).

This difference is encoded by the group->ops->handle_event()
operation that only fanotify implements.
All the rest of the backends implement the simpler ->handle_inode_event().

Similarly, the group->private union is always dominated by the size
of group->fanotify_data, so there is no big difference if we place
group->fp_hook (or ->filter_hook) outside of fanotify_data, so that
we can query and call it from send_to_group() saving the indirect call
to ->handle_event().

That still leaves the question if we need to call fanotify_group_event_mask()
before the filter hook.

fanotify_group_event_mask() does several things, but I think that
the only thing relevant before the filter hook is this line:
                /*
                 * Send the event depending on event flags in mark mask.
                 */
                if (!fsnotify_mask_applicable(mark->mask, ondir, type))
                        continue;

This code is related to the two "built-in fanotify filters", namely
FAN_ONDIR and FAN_EVENT_ON_CHILD.
These built-in filters are so lame that they serve as a good example
why a programmable filter is a better idea.
For example, users need to opt-in for events on directories, but they
cannot request events only on directories.

Historically, the "generic" abstraction in send_to_group() has dealt
with the non-generic fanotify ignore mask, but has not dealt with
these non-generic fanotify built-in filters.

However, since commit 31a371e419c8 ("fanotify: prepare for setting
event flags in ignore mask"), send_to_group() is already aware of those
fanotify built-in filters.

So unless I am missing something, if we align the marks iteration
loop in send_to_group() to look exactly like the marks iteration loop in
fanotify_group_event_mask(), there should be no problem to call
the filter hook directly, right before calling ->handle_event().

Hope this brain dump helps.

Thanks,
Amir.