On Fri, Aug 26, 2022 at 10:15:40PM -0700, Andrii Nakryiko wrote:
> On Fri, Aug 26, 2022 at 7:20 AM Jiri Olsa <olsajiri@xxxxxxxxx> wrote:
> >
> > On Thu, Aug 25, 2022 at 07:35:44PM -0700, Andrii Nakryiko wrote:
> > > On Thu, Aug 25, 2022 at 10:44 AM Alexei Starovoitov
> > > <alexei.starovoitov@xxxxxxxxx> wrote:
> > > >
> > > > On Thu, Aug 25, 2022 at 9:08 AM Jiri Olsa <olsajiri@xxxxxxxxx> wrote:
> > > > >
> > > > > On Tue, Aug 23, 2022 at 06:22:37PM -0700, Alexei Starovoitov wrote:
> > > > > > On Mon, Aug 08, 2022 at 04:06:19PM +0200, Jiri Olsa wrote:
> > > > > > > Adding support to attach program to multiple trampolines
> > > > > > > with new attach/detach interface:
> > > > > > >
> > > > > > >   int bpf_trampoline_multi_attach(struct bpf_tramp_prog *tp,
> > > > > > >                                   struct bpf_tramp_id *id)
> > > > > > >   int bpf_trampoline_multi_detach(struct bpf_tramp_prog *tp,
> > > > > > >                                   struct bpf_tramp_id *id)
> > > > > > >
> > > > > > > The program is passed as bpf_tramp_prog object and trampolines to
> > > > > > > attach it to are passed as bpf_tramp_id object.
> > > > > > >
> > > > > > > The interface creates new bpf_trampoline object which is initialized
> > > > > > > as 'multi' trampoline and stored separately from standard trampolines.
> > > > > > >
> > > > > > > There are following rules how the standard and multi trampolines
> > > > > > > go along:
> > > > > > >   - multi trampoline can attach on top of existing single trampolines,
> > > > > > >     which creates 2 types of function IDs:
> > > > > > >
> > > > > > >       1) single-IDs - functions that are attached within existing single
> > > > > > >                       trampolines
> > > > > > >       2) multi-IDs  - functions that were 'free' and are now taken by new
> > > > > > >                       'multi' trampoline
> > > > > > >
> > > > > > >   - we allow overlapping of 2 'multi' trampolines if they are attached
> > > > > > >     to same IDs
> > > > > > >   - we do not allow any other overlapping of 2 'multi' trampolines
> > > > > > >   - any new 'single' trampoline cannot attach to existing multi-IDs.
> > > > > > >
> > > > > > > Maybe better explained on following example:
> > > > > > >
> > > > > > >   - you want to attach program P to functions A,B,C,D,E,F
> > > > > > >     via bpf_trampoline_multi_attach
> > > > > > >
> > > > > > >   - D,E,F already have standard trampoline attached
> > > > > > >
> > > > > > >   - the bpf_trampoline_multi_attach will create new 'multi' trampoline
> > > > > > >     which spans over A,B,C functions and attach program P to single
> > > > > > >     trampolines D,E,F
> > > > > > >
> > > > > > >   - A,B,C functions are now 'not attachable' by any trampoline
> > > > > > >     until the above 'multi' trampoline is released
> > > > > >
> > > > > > This restriction is probably too severe.
> > > > > > Song added support for trampoline and KLPs to co-exist on the same function.
> > > > > > This multi trampoline restriction will resurface the same issue.
> > > > > > afaik this restriction is only because multi trampoline image
> > > > > > is the same for A,B,C. This memory optimization is probably going too far.
> > > > > > How about we keep existing logic of one tramp image per function.
> > > > > > Pretend that multi-prog P matches BTF of the target function,
> > > > > > create normal tramp for it and attach prog P there.
> > > > > > The prototype of P allows six u64. The args are potentially reading
> > > > > > garbage, but there are no safety issues, since multi progs don't know BTF types.
> > > > > >
> > > > > > We still need single bpf_link_multi to contain btf_ids of all functions,
> > > > > > but it can point to many bpf tramps. One for each attach function.
> > > > > >
> > > > > > iirc we discussed something like this long ago, but I don't remember
> > > > > > why we didn't go that route.
> > > > > > arch_prepare_bpf_trampoline is fast.
> > > > > > bpf_tramp_image_alloc is fast too.
> > > > > > So attaching one multi-prog to thousands of btf_id-s should be fast too.
> > > > > > The destroy part is interesting.
> > > > > > There we will be doing thousands of bpf_tramp_image_put,
> > > > > > but it's all async now. We used to have synchronize_rcu() which could
> > > > > > be the reason why this approach was slow.
> > > > > > Or is this unregister_fentry that slows it down?
> > > > > > But register_ftrace_direct_multi() interface should have solved it
> > > > > > for both register and unregister?
> > > > >
> > > > > I think it's the synchronize_rcu_tasks at the end of each ftrace update,
> > > > > that's why we added un/register_ftrace_direct_multi that makes the changes
> > > > > for multiple ips and syncs once at the end
> > > >
> > > > hmm. Can synchronize_rcu_tasks be made optional?
> > > > For ftrace_direct that points to bpf tramps is it really needed?
> > > >
> > > > > un/register_ftrace_direct_multi will attach/detach multiple ips
> > > > > to single address (trampoline), so for this approach we would need to add new
> > > > > ftrace direct api that would allow to set multiple ips to multiple trampolines
> > > > > within one call..
> > > >
> > > > right
> > > >
> > > > > I was already checking on that and looks doable
> > > >
> > > > awesome.
> > > >
> > > > > another problem might be that this update function will need to be called with
> > > > > all related trampoline locks, which in this case would be thousands
> > > >
> > > > sure. but these will be newly allocated trampolines and
> > > > brand new mutexes, so no contention.
> > > > But thousands of cmpxchg-s will take time. Would be good to measure
> > > > though. It might not be that bad.
> > >
> > > What about the memory overhead of thousands of trampolines and
> > > trampoline images? Seems very wasteful to create one per each attach,
> > > when each attachment in general will be identical.
> > >
> > >
> > > If I remember correctly, last time we were also discussing creating a
> > > generic BPF trampoline that would save all 6 input registers,
> > > regardless of function's BTF signature. Such BPF trampoline should
> > > support calling both generic fentry/fexit programs and typed ones,
> > > because all the necessary data is stored on the stack correctly.
> > >
> > > For the case when typed (non-generic) BPF trampoline is already
> > > attached to a function and now we are attaching generic fentry, why
> > > can't we "upgrade" existing BPF trampoline to become generic, and then
> > > just add generic multi-fentry program to its trampoline image? Once
> > > that multi-fentry is detached, we might choose to convert trampoline
> > > back to typed BPF trampoline (i.e., save only necessary registers, not
> > > all 6 of them), but that's more like an optimization, it doesn't have
> > > to happen.
> > >
> > > Or is there something that would make such generic trampoline impossible?
> > >
> > > If we go with this approach, then each multi-fentry attachment will be
> > > creating minimum amount of trampolines, determined by all the
> > > combinations of attached programs at that point. If after we attach
> > > multi-fentry to some set of functions we need to attach another
> > > multi-fentry or typed fentry, we'd potentially need to split
> > > trampolines and create a bit more of them. But while that sounds a bit
> > > complicated, we do all that under locks so there isn't much problem in
> > > doing that, no?
> > >
> > > But in general, I agree with Alexei that this restriction on not being
> > > able to attach to a function once multi-attach trampoline is attached
> > > to it is a really-really bad restriction in production, where we can't
> > > control exactly what BPF apps run and in which order.
> >
> > ah ok.. attaching single trampoline on top of attached multi trampoline
> > should be possible to add.. as long as one side of the problem is single
> > trampoline it should be doable, I'll check
> >
> > leaving the restriction only to attaching one multi trampoline over
> > another (not equal) attached multi trampoline
> >
> > would that be acceptable?
>
> I guess I'm missing what's fundamentally different between
> multi-trampoline + single trampoline vs multi-tramp + multi-tramp?
> Multi-tramp is already saving all registers, so can "host" other
> generic fentry/fexit. So why this multi + multi restriction?

so I did not find good generic solution for multi trampoline being
attached on top of already attached multi trampolines

say we have following multi trampolines:

  multi_a [1,2,3]  P1
  multi_b [4,5,6]  P2

and want to add another multi trampoline:

  multi_c [1,4,7]  P3

you end up with 5 new trampolines:

  multi_1 [1]    P1,P3
  multi_2 [2,3]  P1
  multi_3 [4]    P2,P3
  multi_4 [5,6]  P2
  multi_5 [7]    P3

pain points were:

  - lookup for intersection on top of BTF ids, doable but tricky
    https://lore.kernel.org/bpf/20211118112455.475349-20-jolsa@xxxxxxxxxx/

  - splitting existing trampolines and rollback in case of error,
    because the image update and ip are 2 separate things but we
    do them together
    https://lore.kernel.org/bpf/20211118112455.475349-20-jolsa@xxxxxxxxxx/

  - trampoline can't be stored and managed in link, because it can be
    split into multiple new trampolines, so I added new layer to keep them
    https://lore.kernel.org/bpf/20211118112455.475349-14-jolsa@xxxxxxxxxx/

  - all this must be locked.. all involved trampolines or one big lock

  - any new attachment of single trampoline is possibly splitting
    existing multi trampoline

  - when multi_c trampoline is detached we don't roll back to get
    original multi_a and multi_b - we keep the split trampolines,
    so each new attachment is making more trampolines and makes
    the new attachment possibly slower

the RFC for this is here:
  https://lore.kernel.org/bpf/20211118112455.475349-1-jolsa@xxxxxxxxxx/

it did not bring too much attention so I simplified it down to the
current version ;-)

so far I could not think of better way than start with basic
functionality and add/rethink the complex multi/multi stuff later
if needed, hopefully with some better idea how to do that

note this was no problem for kprobe_multi which uses fprobe/ftrace_ops
code that takes care of this - we just say this function should be
called from set of ips and ftrace machinery does all the merging with
existing attachments

but bpf trampolines use ftrace direct interface, which only
attaches trampoline to given function without any other logic

jirka
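
The multi_a/multi_b/multi_c splitting described above can be sketched as a small userspace model. This is only an illustration under stated assumptions, not the RFC code: struct model_tramp, struct model_state, model_multi_attach() and model_progs_for() are hypothetical names, function ids and programs are modelled as bits in a 64-bit mask, and existing trampolines are assumed to cover disjoint id sets. It also sidesteps the rollback pain point by reserving space up front, which a real implementation could not do so cheaply.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_TRAMPS 64
#define BIT(n) (1ULL << (n))

/* Hypothetical model, not the kernel's struct bpf_trampoline. */
struct model_tramp {
    uint64_t ids;   /* bit n set: function id n is covered */
    uint64_t progs; /* bit n set: program Pn is attached */
};

struct model_state {
    struct model_tramp tramps[MAX_TRAMPS];
    size_t nr;
};

static void model_add(struct model_state *s, uint64_t ids, uint64_t progs)
{
    assert(s->nr < MAX_TRAMPS); /* guaranteed by the caller's check */
    s->tramps[s->nr].ids = ids;
    s->tramps[s->nr].progs = progs;
    s->nr++;
}

/*
 * Attach program mask 'prog' to the function ids in 'ids', splitting
 * any existing trampoline that only partially overlaps with 'ids'.
 * Returns 0 on success, -1 if the model could run out of slots.
 */
int model_multi_attach(struct model_state *s, uint64_t ids, uint64_t prog)
{
    size_t i, nr_old = s->nr;
    uint64_t left = ids; /* ids not covered by any existing trampoline */

    /* Worst case: every old trampoline splits, plus one brand new one. */
    if (s->nr + nr_old + 1 > MAX_TRAMPS)
        return -1;

    for (i = 0; i < nr_old; i++) {
        struct model_tramp *t = &s->tramps[i];
        uint64_t common = t->ids & ids;

        if (!common)
            continue;
        left &= ~common;
        if (common == t->ids) {
            /* Full overlap: the trampoline just hosts one more program. */
            t->progs |= prog;
            continue;
        }
        /* Partial overlap: move the shared ids into a new trampoline. */
        model_add(s, common, t->progs | prog);
        t->ids &= ~common;
    }
    if (left)
        model_add(s, left, prog);
    return 0;
}

/* Return the program mask attached to function 'id', 0 if none. */
uint64_t model_progs_for(const struct model_state *s, unsigned int id)
{
    size_t i;

    for (i = 0; i < s->nr; i++)
        if (s->tramps[i].ids & BIT(id))
            return s->tramps[i].progs;
    return 0;
}
```

Attaching [1,2,3] with P1, then [4,5,6] with P2, then [1,4,7] with P3 leaves this model with five trampolines, the same split as in the example above. Note that the model never merges trampolines back on detach, which matches the "we keep the split trampolines" point.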