On Tue, Aug 31, 2021 at 04:51:18PM -0700, Andrii Nakryiko wrote: > On Thu, Aug 26, 2021 at 12:41 PM Jiri Olsa <jolsa@xxxxxxxxxx> wrote: > > > > When we have multi func program attached, the trampoline > > switched to the function model of the multi func program. > > > > This breaks already attached standard programs, for example > > when we attach following program: > > > > SEC("fexit/bpf_fentry_test2") > > int BPF_PROG(test1, int a, __u64 b, int ret) > > > > the trampoline pushes on stack args 'a' and 'b' and return > > value 'ret'. > > > > When following multi func program is attached to bpf_fentry_test2: > > > > SEC("fexit.multi/bpf_fentry_test*") > > int BPF_PROG(test2, __u64 a, __u64 b, __u64 c, __u64 d, > > __u64 e, __u64 f, int ret) > > > > the trampoline takes this program model and pushes all 6 args > > and return value on stack. > > > > But we still have the original 'test1' program attached, that > > expects 'ret' value where there's 'c' argument now: > > > > test1(a, b, c) > > > > To fix that we simply overwrite 'c' argument with 'ret' value, > > so test1 is called as expected and test2 gets called as: > > > > test2(a, b, ret, d, e, f, ret) > > > > which is ok, because 'c' is not defined for bpf_fentry_test2 > > anyway. > > > > What if we change the order on the stack to be the return value first, > followed by input arguments. That would get us a bit closer to > unifying multi-trampoline and the normal one, right? BPF verifier > should be able to rewrite access to the last argument (i.e., return > value) for fexit programs to actually be at offset 0, and shift all > other arguments by 8 bytes. For fentry, if that helps to keep things > more aligned, we'd just skip the first 8 bytes on the stack and store > all the input arguments in the same offsets. So BPF verifier rewriting > logic stays consistent (except offset 0 will be disallowed). nice idea, with this in place we could cut that args re-arranging code > > Basically, I'm thinking how we can make normal and multi trampolines > more interoperable to remove those limitations that two > multi-trampolines can't be attached to the same function, which seems > like a pretty annoying limitation which will be easy to hit in > practice. Alexei previously proposed (as an optimization) to group all > to-be-attached functions into groups by number of arguments, so that > we can have up to 6 different trampolines tailored to actual functions > being attached. So that we don't save unnecessary extra input > arguments saving, which will be even more important once we allow more > than 6 arguments in the future. > > With such logic, we should be able to split all the functions into > multiple underlying trampolines, so it seems like it should be > possible to also allow multiple multi-fentry programs to be attached > to the same function by having a separate bpf_trampoline just for > those functions. It will be just an extension of the above "just 6 > trampolines" strategy to "as much as we need trampolines". I'm probably missing something here.. say we have 2 functions with single argument: foo1(int a) foo2(int b) then having 2 programs: A - attaching to foo1 B - attaching to foo2 then you need to have 2 different trampolines instead of single 'generic-1-argument-trampoline' > > It's just a vague idea, sorry, I don't understand all the code yet. > But the limitation outlined in one of the previous patches seems very > limiting and unpleasant. I can totally see that some 24/7 running BPF > tracing app uses multi-fentry for tracing a small subset of kernel > functions non-stop, and then someone is trying to use bpftrace or > retsnoop to trace overlapping set of functions. And it immediately > fails. Very frustrating. so the current approach is to some extent driven by the direct ftrace batch API: you have ftrace_ops object and set it up with functions you want to change with calling: ftrace_set_filter_ip(ops, ip1); ftrace_set_filter_ip(ops, ip2); ... and then register trampoline with those functions: register_ftrace_direct_multi(ops, tramp_addr); and with this call being the expensive one (it does the actual work and sync waiting), my objective was to call it just once for update now with 2 intersecting multi trampolines we end up with 3 functions sets: A - functions for first multi trampoline B - functions for second multi trampoline C - intersection of them each set needs different trampoline: tramp A - calls program for first multi trampoline tramp B - calls program for second multi trampoline tramp C - calls both programs so we need to call register_ftrace_direct_multi 3 times if we allow also standard trampolines being attached, it makes it even more complicated and ultimatelly gets broken to 1-function/1-trampoline pairs, ending up with attach speed that we have now ... I have test code for ftrace direct interface that would allow to register/change separate function/addr pairs, so in one call you can change multiple ips each to different tramp addresss but even with that, I ended up with lot of new complexity on bpf side keeping track of multi trampolines intersections, so I thought I'd start with something limited and simpler perhaps I should move back to that approach and see how bad it ends ;-) or this could be next step on top of current work, that should get simpler with the args re-arranging you proposed jirka