On Fri, Aug 26, 2022 at 7:20 AM Jiri Olsa <olsajiri@xxxxxxxxx> wrote:
>
> On Thu, Aug 25, 2022 at 07:35:44PM -0700, Andrii Nakryiko wrote:
> > On Thu, Aug 25, 2022 at 10:44 AM Alexei Starovoitov
> > <alexei.starovoitov@xxxxxxxxx> wrote:
> > >
> > > On Thu, Aug 25, 2022 at 9:08 AM Jiri Olsa <olsajiri@xxxxxxxxx> wrote:
> > > >
> > > > On Tue, Aug 23, 2022 at 06:22:37PM -0700, Alexei Starovoitov wrote:
> > > > > On Mon, Aug 08, 2022 at 04:06:19PM +0200, Jiri Olsa wrote:
> > > > > > Adding support to attach program to multiple trampolines
> > > > > > with new attach/detach interface:
> > > > > >
> > > > > >   int bpf_trampoline_multi_attach(struct bpf_tramp_prog *tp,
> > > > > >                                   struct bpf_tramp_id *id)
> > > > > >   int bpf_trampoline_multi_detach(struct bpf_tramp_prog *tp,
> > > > > >                                   struct bpf_tramp_id *id)
> > > > > >
> > > > > > The program is passed as bpf_tramp_prog object and trampolines to
> > > > > > attach it to are passed as bpf_tramp_id object.
> > > > > >
> > > > > > The interface creates new bpf_trampoline object which is initialized
> > > > > > as 'multi' trampoline and stored separately from standard trampolines.
> > > > > >
> > > > > > There are following rules how the standard and multi trampolines
> > > > > > go along:
> > > > > >   - multi trampoline can attach on top of existing single trampolines,
> > > > > >     which creates 2 types of function IDs:
> > > > > >
> > > > > >       1) single-IDs - functions that are attached within existing single
> > > > > >          trampolines
> > > > > >       2) multi-IDs  - functions that were 'free' and are now taken by new
> > > > > >          'multi' trampoline
> > > > > >
> > > > > >   - we allow overlapping of 2 'multi' trampolines if they are attached
> > > > > >     to same IDs
> > > > > >   - we do not allow any other overlapping of 2 'multi' trampolines
> > > > > >   - any new 'single' trampoline cannot attach to existing multi-IDs.
> > > > > >
> > > > > > Maybe better explained on following example:
> > > > > >
> > > > > >   - you want to attach program P to functions A,B,C,D,E,F
> > > > > >     via bpf_trampoline_multi_attach
> > > > > >
> > > > > >   - D,E,F already have standard trampoline attached
> > > > > >
> > > > > >   - the bpf_trampoline_multi_attach will create new 'multi' trampoline
> > > > > >     which spans over A,B,C functions and attach program P to single
> > > > > >     trampolines D,E,F
> > > > > >
> > > > > >   - A,B,C functions are now 'not attachable' by any trampoline
> > > > > >     until the above 'multi' trampoline is released
> > > > >
> > > > > This restriction is probably too severe.
> > > > > Song added support for trampoline and KLPs to co-exist on the same function.
> > > > > This multi trampoline restriction will resurface the same issue.
> > > > > afaik this restriction is only because multi trampoline image
> > > > > is the same for A,B,C. This memory optimization is probably going too far.
> > > > > How about we keep existing logic of one tramp image per function.
> > > > > Pretend that multi-prog P matches BTF of the target function,
> > > > > create normal tramp for it and attach prog P there.
> > > > > The prototype of P allows six u64. The args are potentially reading
> > > > > garbage, but there are no safety issues, since multi progs don't know BTF types.
> > > > >
> > > > > We still need single bpf_link_multi to contain btf_ids of all functions,
> > > > > but it can point to many bpf tramps. One for each attach function.
> > > > >
> > > > > iirc we discussed something like this long ago, but I don't remember
> > > > > why we didn't go that route.
> > > > > arch_prepare_bpf_trampoline is fast.
> > > > > bpf_tramp_image_alloc is fast too.
> > > > > So attaching one multi-prog to thousands of btf_id-s should be fast too.
> > > > > The destroy part is interesting.
> > > > > There we will be doing thousands of bpf_tramp_image_put,
> > > > > but it's all async now.
> > > > > We used to have synchronize_rcu() which could
> > > > > be the reason why this approach was slow.
> > > > > Or is this unregister_fentry that slows it down?
> > > > > But register_ftrace_direct_multi() interface should have solved it
> > > > > for both register and unregister?
> > > >
> > > > I think it's the synchronize_rcu_tasks at the end of each ftrace update,
> > > > that's why we added un/register_ftrace_direct_multi that makes the changes
> > > > for multiple ips and syncs once at the end
> > >
> > > hmm. Can synchronize_rcu_tasks be made optional?
> > > For ftrace_direct that points to bpf tramps is it really needed?
> > >
> > > > un/register_ftrace_direct_multi will attach/detach multiple ips
> > > > to single address (trampoline), so for this approach we would need to add new
> > > > ftrace direct api that would allow to set multiple ips to multiple trampolines
> > > > within one call..
> > >
> > > right
> > >
> > > > I was already checking on that and looks doable
> > >
> > > awesome.
> > >
> > > > another problem might be that this update function will need to be called with
> > > > all related trampoline locks, which in this case would be thousands
> > >
> > > sure. but these will be newly allocated trampolines and
> > > brand new mutexes, so no contention.
> > > But thousands of cmpxchg-s will take time. Would be good to measure
> > > though. It might not be that bad.
> >
> > What about the memory overhead of thousands of trampolines and
> > trampoline images? Seems very wasteful to create one per each attach,
> > when each attachment in general will be identical.
> >
> >
> > If I remember correctly, last time we were also discussing creating a
> > generic BPF trampoline that would save all 6 input registers,
> > regardless of function's BTF signature. Such BPF trampoline should
> > support calling both generic fentry/fexit programs and typed ones,
> > because all the necessary data is stored on the stack correctly.
> >
> > For the case when typed (non-generic) BPF trampoline is already
> > attached to a function and now we are attaching generic fentry, why
> > can't we "upgrade" existing BPF trampoline to become generic, and then
> > just add generic multi-fentry program to its trampoline image? Once
> > that multi-fentry is detached, we might choose to convert trampoline
> > back to typed BPF trampoline (i.e., save only necessary registers, not
> > all 6 of them), but that's more like an optimization, it doesn't have
> > to happen.
> >
> > Or is there something that would make such generic trampoline impossible?
> >
> > If we go with this approach, then each multi-fentry attachment will be
> > creating minimum amount of trampolines, determined by all the
> > combinations of attached programs at that point. If after we attach
> > multi-fentry to some set of functions we need to attach another
> > multi-fentry or typed fentry, we'd potentially need to split
> > trampolines and create a bit more of them. But while that sounds a bit
> > complicated, we do all that under locks so there isn't much problem in
> > doing that, no?
> >
> > But in general, I agree with Alexei that this restriction on not being
> > able to attach to a function once multi-attach trampoline is attached
> > to it is a really-really bad restriction in production, where we can't
> > control exactly what BPF apps run and in which order.
>
> ah ok.. attaching single trampoline on top of attached multi trampoline
> should be possible to add.. as long as one side of the problem is single
> trampoline it should be doable, I'll check
>
> leaving the restriction only to attaching one multi trampoline over
> another (not equal) attached multi trampoline
>
> would that be acceptable?

I guess I'm missing what's fundamentally different between
multi-trampoline + single trampoline vs multi-tramp + multi-tramp?
Multi-tramp is already saving all registers, so can "host" other
generic fentry/fexit.
So why this multi + multi restriction?

>
> >
> > P.S. I think this generic typeless BPF trampoline is a useful thing in
> > itself and we are half-way there already with bpf_get_func_ip() and
> > bpf_get_func_arg_cnt() helpers and storing such "parameters" on the
> > stack, so tbh, we can probably split the problem into two and try to
> > address a somewhat simpler and more straightforward generic BPF
> > trampoline first. Such generic type-less BPF trampoline will make
> > fentry a better and more generic alternative to kprobe, by being less
> > demanding about specifying BTF ID (even if we don't care about input
> > argument types) yet faster to trigger than kprobe.
>
> yes, with the help of those helpers the only 'generic' thing for
> trampoline is its BTF type
>
> jirka
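
To make the "save all 6 registers regardless of BTF signature" idea
from the P.S. a bit more concrete, here is a toy userspace model. It
is only a sketch, not kernel code: struct toy_tramp_ctx and all toy_*
functions are hypothetical names made up for illustration. The only
thing taken from the discussion is the layout idea, i.e. the
trampoline always stores all six argument slots plus the function IP
and the argument count, so a type-less program can walk the arguments
without knowing the BTF type.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical toy model: a generic trampoline always saves six slots. */
#define TOY_MAX_ARGS 6

struct toy_tramp_ctx {
	uint64_t args[TOY_MAX_ARGS];	/* all six slots, even if unused */
	uint64_t nr_args;		/* how many slots are meaningful */
	uint64_t ip;			/* address of the traced function */
};

/* Models the trampoline entry: copy every register, typed or not. */
static void toy_tramp_save(struct toy_tramp_ctx *ctx, uint64_t ip,
			   const uint64_t regs[TOY_MAX_ARGS],
			   uint64_t nr_args)
{
	assert(nr_args <= TOY_MAX_ARGS);
	for (int i = 0; i < TOY_MAX_ARGS; i++)
		ctx->args[i] = regs[i];
	ctx->nr_args = nr_args;
	ctx->ip = ip;
}

/* Index-checked read: refuses slots beyond the recorded count. */
static int toy_get_func_arg(const struct toy_tramp_ctx *ctx, uint32_t n,
			    uint64_t *value)
{
	if (n >= ctx->nr_args)
		return -1;
	*value = ctx->args[n];
	return 0;
}

/* A type-less "program": sums the valid arguments as raw u64 values. */
static uint64_t toy_generic_prog(const struct toy_tramp_ctx *ctx)
{
	uint64_t sum = 0, v;

	for (uint32_t i = 0; i < ctx->nr_args; i++)
		if (toy_get_func_arg(ctx, i, &v) == 0)
			sum += v;
	return sum;
}
```

In this model the slots past nr_args still hold whatever was in the
registers, which is the "potentially reading garbage" case mentioned
earlier in the thread. The checked accessor is what keeps a generic
program from depending on them.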