Re: [RFC PATCH bpf-next 10/17] bpf: Add support to attach program to multiple trampolines

Andrii Nakryiko <andrii.nakryiko@xxxxxxxxx> · Thu, 25 Aug 2022 19:35:44 -0700

On Thu, Aug 25, 2022 at 10:44 AM Alexei Starovoitov
<alexei.starovoitov@xxxxxxxxx> wrote:
>
> On Thu, Aug 25, 2022 at 9:08 AM Jiri Olsa <olsajiri@xxxxxxxxx> wrote:
> >
> > On Tue, Aug 23, 2022 at 06:22:37PM -0700, Alexei Starovoitov wrote:
> > > On Mon, Aug 08, 2022 at 04:06:19PM +0200, Jiri Olsa wrote:
> > > > Adding support to attach program to multiple trampolines
> > > > with new attach/detach interface:
> > > >
> > > >   int bpf_trampoline_multi_attach(struct bpf_tramp_prog *tp,
> > > >                                   struct bpf_tramp_id *id)
> > > >   int bpf_trampoline_multi_detach(struct bpf_tramp_prog *tp,
> > > >                                   struct bpf_tramp_id *id)
> > > >
> > > > The program is passed as bpf_tramp_prog object and trampolines to
> > > > attach it to are passed as bpf_tramp_id object.
> > > >
> > > > The interface creates new bpf_trampoline object which is initialized
> > > > as 'multi' trampoline and stored separately from standard trampolines.
> > > >
> > > > There are following rules how the standard and multi trampolines
> > > > go along:
> > > >   - multi trampoline can attach on top of existing single trampolines,
> > > >     which creates 2 types of function IDs:
> > > >
> > > >       1) single-IDs - functions that are attached within existing single
> > > >          trampolines
> > > >       2) multi-IDs  - functions that were 'free' and are now taken by new
> > > >          'multi' trampoline
> > > >
> > > >   - we allow overlapping of 2 'multi' trampolines if they are attached
> > > >     to same IDs
> > > >   - we do not allow any other overlapping of 2 'multi' trampolines
> > > >   - any new 'single' trampoline cannot attach to existing multi-IDs.
> > > >
> > > > Maybe better explained on following example:
> > > >
> > > >    - you want to attach program P to functions A,B,C,D,E,F
> > > >      via bpf_trampoline_multi_attach
> > > >
> > > >    - D,E,F already have standard trampoline attached
> > > >
> > > >    - the bpf_trampoline_multi_attach will create new 'multi' trampoline
> > > >      which spans over A,B,C functions and attach program P to single
> > > >      trampolines D,E,F
> > > >
> > > >    - A,B,C functions are now 'not attachable' by any trampoline
> > > >      until the above 'multi' trampoline is released
> > >
> > > This restriction is probably too severe.
> > > Song added support for trampoline and KLPs to co-exist on the same function.
> > > This multi trampoline restriction will resurface the same issue.
> > > afaik this restriction is only because multi trampoline image
> > > is the same for A,B,C. This memory optimization is probably going too far.
> > > How about we keep existing logic of one tramp image per function.
> > > Pretend that multi-prog P matches BTF of the target function,
> > > create normal tramp for it and attach prog P there.
> > > The prototype of P allows six u64. The args are potentially reading
> > > garbage, but there are no safety issues, since multi progs don't know BTF types.
> > >
> > > We still need a single bpf_link_multi to contain btf_ids of all functions,
> > > but it can point to many bpf tramps. One for each attach function.
> > >
> > > iirc we discussed something like this long ago, but I don't remember
> > > why we didn't go that route.
> > > arch_prepare_bpf_trampoline is fast.
> > > bpf_tramp_image_alloc is fast too.
> > > So attaching one multi-prog to thousands of btf_id-s should be fast too.
> > > The destroy part is interesting.
> > > There we will be doing thousands of bpf_tramp_image_put,
> > > but it's all async now. We used to have synchronize_rcu() which could
> > > be the reason why this approach was slow.
> > > Or is this unregister_fentry that slows it down?
> > > But register_ftrace_direct_multi() interface should have solved it
> > > for both register and unregister?
> >
> > I think it's the synchronize_rcu_tasks at the end of each ftrace update,
> > that's why we added un/register_ftrace_direct_multi that makes the changes
> > for multiple ips and syncs once at the end
>
> hmm. Can synchronize_rcu_tasks be made optional?
> For ftrace_direct that points to bpf tramps is it really needed?
>
> > un/register_ftrace_direct_multi will attach/detach multiple ips
> > to single address (trampoline), so for this approach we would need to add new
> > ftrace direct api that would allow to set multiple ips to multiple trampolines
> > within one call..
>
> right
>
> > I was already checking on that and looks doable
>
> awesome.
>
> > another problem might be that this update function will need to be called with
> > all related trampoline locks, which in this case would be thousands
>
> sure. but these will be newly allocated trampolines and
> brand new mutexes, so no contention.
> But thousands of cmpxchg-s will take time. Would be good to measure
> though. It might not be that bad.

What about the memory overhead of thousands of trampolines and
trampoline images? It seems very wasteful to create one per attachment
when, in general, every attachment will be identical.

If I remember correctly, last time we were also discussing creating a
generic BPF trampoline that would save all 6 input registers,
regardless of function's BTF signature. Such BPF trampoline should
support calling both generic fentry/fexit programs and typed ones,
because all the necessary data is stored on the stack correctly.
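To make the "same stack layout serves both" point concrete, here is a
minimal userspace sketch. It is not kernel code: struct tramp_ctx,
generic_prog() and typed_prog() are hypothetical names made up for
illustration, and the target signature int f(int a, long b) is assumed.

```c
#include <assert.h>
#include <stdint.h>

#define MAX_ARGS 6

/* Hypothetical stand-in for the stack area a generic trampoline fills:
 * all six argument slots are always saved, whatever the target's BTF says. */
struct tramp_ctx {
	uint64_t ip;             /* address of the traced function */
	uint64_t nr_args;        /* how many slots are meaningful */
	uint64_t args[MAX_ARGS]; /* unused slots may hold garbage */
};

/* Generic program: has no type information, treats every used slot as u64. */
static uint64_t generic_prog(const struct tramp_ctx *ctx)
{
	uint64_t sum = 0;

	assert(ctx->nr_args <= MAX_ARGS);
	for (uint64_t i = 0; i < ctx->nr_args; i++)
		sum += ctx->args[i];
	return sum;
}

/* Typed program for an assumed target int f(int a, long b): it reads the
 * same slots, only narrowing them to the types it expects. */
static long typed_prog(const struct tramp_ctx *ctx)
{
	int a = (int)ctx->args[0];
	long b = (long)ctx->args[1];

	return a + b;
}
```

Both programs consume the same saved area; the typed one simply ignores
the slots beyond its signature.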

For the case when typed (non-generic) BPF trampoline is already
attached to a function and now we are attaching generic fentry, why
can't we "upgrade" existing BPF trampoline to become generic, and then
just add generic multi-fentry program to its trampoline image? Once
that multi-fentry is detached, we might choose to convert trampoline
back to typed BPF trampoline (i.e., save only necessary registers, not
all 6 of them), but that's more like an optimization, it doesn't have
to happen.

Or is there something that would make such generic trampoline impossible?

If we go with this approach, then each multi-fentry attachment will
create the minimum number of trampolines, determined by all the
combinations of attached programs at that point. If after we attach
multi-fentry to some set of functions we need to attach another
multi-fentry or typed fentry, we'd potentially need to split
trampolines and create a bit more of them. But while that sounds a bit
complicated, we do all that under locks so there isn't much problem in
doing that, no?
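A toy model of that counting, as a userspace sketch only. It assumes
functions with an identical set of attached programs could share one
generic trampoline image and ignores every per-function detail a real
trampoline has; prog_mask_t, attach_prog() and count_images() are
hypothetical names, not kernel APIs.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_FUNCS 64
#define MAX_PROGS 32

/* Bit i of a mask is set when (hypothetical) program i is attached. */
typedef uint32_t prog_mask_t;

/* Attach program 'prog' to every function index listed in 'ids'. */
static void attach_prog(prog_mask_t *funcs, const size_t *ids,
			size_t n_ids, unsigned int prog)
{
	assert(prog < MAX_PROGS);
	for (size_t i = 0; i < n_ids; i++)
		funcs[ids[i]] |= (prog_mask_t)1 << prog;
}

/* Count distinct non-empty program sets; in this model that is the
 * number of trampoline images needed. */
static size_t count_images(const prog_mask_t *funcs, size_t n_funcs)
{
	prog_mask_t seen[MAX_FUNCS];
	size_t n_seen = 0;

	assert(n_funcs <= MAX_FUNCS);
	for (size_t i = 0; i < n_funcs; i++) {
		size_t j;

		if (!funcs[i])
			continue; /* nothing attached, no trampoline */
		for (j = 0; j < n_seen; j++)
			if (seen[j] == funcs[i])
				break;
		if (j == n_seen)
			seen[n_seen++] = funcs[i];
	}
	return n_seen;
}
```

In this model, attaching one multi-fentry to six functions needs a
single image; attaching a second program to two of them splits that
into two images, one per distinct program set.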

But in general, I agree with Alexei that this restriction on not being
able to attach to a function once multi-attach trampoline is attached
to it is a really-really bad restriction in production, where we can't
control exactly what BPF apps run and in which order.

P.S. I think this generic typeless BPF trampoline is a useful thing in
itself and we are half-way there already with bpf_get_func_ip() and
bpf_get_func_arg_cnt() helpers and storing such "parameters" on the
stack, so tbh, we can probably split the problem into two and try to
address a somewhat simpler and more straightforward generic BPF
trampoline first. Such generic type-less BPF trampoline will make
fentry a better and more generic alternative to kprobe, by being less
demanding about specifying BTF ID (even if we don't care about input
argument types) yet faster to trigger than kprobe.