Re: [RFC PATCH bpf-next 10/17] bpf: Add support to attach program to multiple trampolines

Andrii Nakryiko <andrii.nakryiko@xxxxxxxxx> · Fri, 26 Aug 2022 22:15:40 -0700

On Fri, Aug 26, 2022 at 7:20 AM Jiri Olsa <olsajiri@xxxxxxxxx> wrote:
>
> On Thu, Aug 25, 2022 at 07:35:44PM -0700, Andrii Nakryiko wrote:
> > On Thu, Aug 25, 2022 at 10:44 AM Alexei Starovoitov
> > <alexei.starovoitov@xxxxxxxxx> wrote:
> > >
> > > On Thu, Aug 25, 2022 at 9:08 AM Jiri Olsa <olsajiri@xxxxxxxxx> wrote:
> > > >
> > > > On Tue, Aug 23, 2022 at 06:22:37PM -0700, Alexei Starovoitov wrote:
> > > > > On Mon, Aug 08, 2022 at 04:06:19PM +0200, Jiri Olsa wrote:
> > > > > > Adding support to attach program to multiple trampolines
> > > > > > with new attach/detach interface:
> > > > > >
> > > > > >   int bpf_trampoline_multi_attach(struct bpf_tramp_prog *tp,
> > > > > >                                   struct bpf_tramp_id *id)
> > > > > >   int bpf_trampoline_multi_detach(struct bpf_tramp_prog *tp,
> > > > > >                                   struct bpf_tramp_id *id)
> > > > > >
> > > > > > The program is passed as bpf_tramp_prog object and trampolines to
> > > > > > attach it to are passed as bpf_tramp_id object.
> > > > > >
> > > > > > The interface creates new bpf_trampoline object which is initialized
> > > > > > > as 'multi' trampoline and stored separately from standard trampolines.
> > > > > >
> > > > > > > The following rules govern how the standard and multi trampolines
> > > > > > > coexist:
> > > > > >   - multi trampoline can attach on top of existing single trampolines,
> > > > > >     which creates 2 types of function IDs:
> > > > > >
> > > > > >       1) single-IDs - functions that are attached within existing single
> > > > > >          trampolines
> > > > > >       2) multi-IDs  - functions that were 'free' and are now taken by new
> > > > > >          'multi' trampoline
> > > > > >
> > > > > >   - we allow overlapping of 2 'multi' trampolines if they are attached
> > > > > >     to same IDs
> > > > > > >   - we do not allow any other overlapping of 2 'multi' trampolines
> > > > > > >   - any new 'single' trampoline cannot attach to existing multi-IDs.
> > > > > >
> > > > > > > Maybe better explained with the following example:
> > > > > >
> > > > > >    - you want to attach program P to functions A,B,C,D,E,F
> > > > > >      via bpf_trampoline_multi_attach
> > > > > >
> > > > > >    - D,E,F already have standard trampoline attached
> > > > > >
> > > > > >    - the bpf_trampoline_multi_attach will create new 'multi' trampoline
> > > > > >      which spans over A,B,C functions and attach program P to single
> > > > > >      trampolines D,E,F
> > > > > >
> > > > > >    - A,B,C functions are now 'not attachable' by any trampoline
> > > > > >      until the above 'multi' trampoline is released
> > > > >
> > > > > This restriction is probably too severe.
> > > > > Song added support for trampoline and KLPs to co-exist on the same function.
> > > > > This multi trampoline restriction will resurface the same issue.
> > > > > > afaik this restriction is only because multi trampoline image
> > > > > is the same for A,B,C. This memory optimization is probably going too far.
> > > > > How about we keep existing logic of one tramp image per function.
> > > > > Pretend that multi-prog P matches BTF of the target function,
> > > > > create normal tramp for it and attach prog P there.
> > > > > > The prototype of P allows six u64. The args are potentially reading
> > > > > garbage, but there are no safety issues, since multi progs don't know BTF types.
> > > > >
> > > > > > We still need single bpf_link_multi to contain btf_ids of all functions,
> > > > > but it can point to many bpf tramps. One for each attach function.
> > > > >
> > > > > iirc we discussed something like this long ago, but I don't remember
> > > > > why we didn't go that route.
> > > > > arch_prepare_bpf_trampoline is fast.
> > > > > bpf_tramp_image_alloc is fast too.
> > > > > So attaching one multi-prog to thousands of btf_id-s should be fast too.
> > > > > The destroy part is interesting.
> > > > > There we will be doing thousands of bpf_tramp_image_put,
> > > > > but it's all async now. We used to have synchronize_rcu() which could
> > > > > be the reason why this approach was slow.
> > > > > Or is this unregister_fentry that slows it down?
> > > > > But register_ftrace_direct_multi() interface should have solved it
> > > > > for both register and unregister?
> > > >
> > > > I think it's the synchronize_rcu_tasks at the end of each ftrace update,
> > > > that's why we added un/register_ftrace_direct_multi that makes the changes
> > > > for multiple ips and syncs once at the end
> > >
> > > hmm. Can synchronize_rcu_tasks be made optional?
> > > For ftrace_direct that points to bpf tramps is it really needed?
> > >
> > > > un/register_ftrace_direct_multi will attach/detach multiple ips
> > > > to single address (trampoline), so for this approach we would need to add new
> > > > ftrace direct api that would allow to set multiple ips to multiple trampolines
> > > > within one call..
> > >
> > > right
> > >
> > > > I was already checking on that and looks doable
> > >
> > > awesome.
> > >
> > > > another problem might be that this update function will need to be called with
> > > > all related trampoline locks, which in this case would be thousands
> > >
> > > sure. but these will be newly allocated trampolines and
> > > brand new mutexes, so no contention.
> > > But thousands of cmpxchg-s will take time. Would be good to measure
> > > though. It might not be that bad.
> >
> > What about the memory overhead of thousands of trampolines and
> > trampoline images? Seems very wasteful to create one per each attach,
> > when each attachment in general will be identical.
> >
> >
> > If I remember correctly, last time we were also discussing creating a
> > generic BPF trampoline that would save all 6 input registers,
> > regardless of function's BTF signature. Such BPF trampoline should
> > support calling both generic fentry/fexit programs and typed ones,
> > because all the necessary data is stored on the stack correctly.
> >
> > For the case when typed (non-generic) BPF trampoline is already
> > attached to a function and now we are attaching generic fentry, why
> > can't we "upgrade" existing BPF trampoline to become generic, and then
> > just add generic multi-fentry program to its trampoline image? Once
> > that multi-fentry is detached, we might choose to convert trampoline
> > back to typed BPF trampoline (i.e., save only necessary registers, not
> > all 6 of them), but that's more like an optimization, it doesn't have
> > to happen.
> >
> > Or is there something that would make such generic trampoline impossible?
> >
> > If we go with this approach, then each multi-fentry attachment will be
> > creating minimum amount of trampolines, determined by all the
> > combinations of attached programs at that point. If after we attach
> > multi-fentry to some set of functions we need to attach another
> > multi-fentry or typed fentry, we'd potentially need to split
> > trampolines and create a bit more of them. But while that sounds a bit
> > complicated, we do all that under locks so there isn't much problem in
> > doing that, no?
> >
> > But in general, I agree with Alexei that this restriction on not being
> > able to attach to a function once multi-attach trampoline is attached
> > to it is a really-really bad restriction in production, where we can't
> > control exactly what BPF apps run and in which order.
>
> ah ok.. attaching single trampoline on top of attached multi trampoline
> should be possible to add.. as long as one side of the problem is single
> trampoline it should be doable, I'll check
>
> leaving the restriction only to attaching one multi trampoline over
> another (not equal) attached multi trampoline
>
> would that be acceptable?

I guess I'm missing what's fundamentally different between
multi-trampoline + single trampoline vs multi-tramp + multi-tramp?
Multi-tramp is already saving all registers, so can "host" other
generic fentry/fexit. So why this multi + multi restriction?
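
To make the question concrete, here is a toy user-space model, not kernel
code. The names toy_tramp, toy_attach_single and toy_attach_multi are made
up for illustration and do not exist in the kernel. It assumes the premise
above holds: a trampoline that saves all six argument registers can host
any number of generic programs. Under that assumption, two non-equal multi
attachments simply share the per-function trampolines where they overlap:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_FUNCS 8 /* toy limit on traced functions */
#define MAX_PROGS 4 /* toy limit on programs per trampoline */

/* Hypothetical stand-in for a per-function trampoline. */
struct toy_tramp {
	bool generic;         /* true once it saves all 6 arg registers */
	int nr_progs;         /* number of programs currently attached */
	int progs[MAX_PROGS]; /* ids of the attached programs */
};

/* One trampoline slot per function id, zero-initialized. */
static struct toy_tramp tramps[MAX_FUNCS];

/* Attach a typed (single) program; the trampoline layout is unchanged. */
static int toy_attach_single(int func, int prog)
{
	struct toy_tramp *t;

	assert(func >= 0 && func < MAX_FUNCS);
	t = &tramps[func];
	if (t->nr_progs == MAX_PROGS)
		return -1;
	t->progs[t->nr_progs++] = prog;
	return 0;
}

/*
 * Attach one generic program to a set of functions. Every trampoline in
 * the set is "upgraded" to generic, whether it was free, typed, or
 * already generic. Toy simplification: no rollback on failure.
 */
static int toy_attach_multi(const int *funcs, size_t cnt, int prog)
{
	size_t i;

	for (i = 0; i < cnt; i++) {
		struct toy_tramp *t;

		assert(funcs[i] >= 0 && funcs[i] < MAX_FUNCS);
		t = &tramps[funcs[i]];
		if (t->nr_progs == MAX_PROGS)
			return -1;
		t->generic = true;
		t->progs[t->nr_progs++] = prog;
	}
	return 0;
}
```

In this model, attaching P1 to {0,1,2} and then P2 to {1,2,3} leaves
functions 1 and 2 with both programs on one generic trampoline, and a
typed program already sitting on function 2 stays attached. If the real
trampolines cannot behave like this, that difference is what I'm asking
about.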

>
> >
> > P.S. I think this generic typeless BPF trampoline is a useful thing in
> > itself and we are half-way there already with bpf_get_func_ip() and
> > bpf_get_func_arg_cnt() helpers and storing such "parameters" on the
> > stack, so tbh, we can probably split the problem into two and try to
> > address a somewhat simpler and more straightforward generic BPF
> > trampoline first. Such generic type-less BPF trampoline will make
> > fentry a better and more generic alternative to kprobe, by being less
> > demanding about specifying BTF ID (even if we don't care about input
> > argument types) yet faster to trigger than kprobe.
>
> yes, with the help of those helpers the only 'generic' thing for
> trampoline is its BTF type
>
> jirka