On Wed, Feb 7, 2024 at 7:35 AM Jiri Olsa <jolsa@xxxxxxxxxx> wrote:
>
> hi,
> adding support to attach both entry and return bpf program on single
> kprobe multi link.
>
> Having entry together with return probe for given function is common
> use case for tetragon, bpftrace and most likely for others.
>
> At the moment if we want both entry and return probe to execute bpf
> program we need to create two (entry and return probe) links. The link
> for return probe creates extra entry probe to setup the return probe.
> The extra entry probe execution could be omitted if we had a way to
> use just single link for both entry and exit probe.
>
> In addition it's possible to control the execution of the return probe
> with the return value of the entry bpf program. If the entry program
> returns 0 the return probe is installed and executed, otherwise it's
> skipped.
>

In general, I think this is a very useful ability to have a combined
entry/return program. But the way you implement it, with an extra flag
and an extra fd parameter, makes it harder to have nice high-level
support in libbpf (and presumably other BPF loader libraries) for this.

When I was thinking about doing something like this, I was considering
adding a new program type, actually. That way it's possible to define
this "let's skip the return probe" protocol without backwards
compatibility concerns, and it's easier to use declaratively in libbpf.
You just declare SEC("kprobe.wrap/...") (or whatever the name, something
to designate that it's both an entry and exit probe) as one program, and
in the code there would be some way to determine whether we are in entry
mode or exit mode (a helper or a field in a custom context type; the
latter would be faster and more usable, but that's probably not
critical).

Another frequently requested feature and a very common use case is
measuring the duration of a function. If we have a custom context type,
we can have a field recording the entry timestamp and let the user
calculate the duration, if necessary. Without this, users have to pay a
bunch of extra overhead to record the timestamp and put it into a
hashmap (keyed by thread id) or into thread local storage (even if they
have no other use for thread local storage). A rough sketch of both
ideas is below.

Also, consider that a similar concept is applicable to uprobes, and we
should do that as well, in a similar fashion. The above approach works
for both the kprobe/kretprobe and uprobe/uretprobe cases, because they
have the same pt_regs data structure as a context (even if for exit
probes most of the values in pt_regs are not that meaningful).

So anyways, great feature, but let's discuss the end-to-end usability
story before we finalize the implementation?

> I'm still working on the tetragon change, so I'll be sending non-RFC
> version once that's ready, meanwhile any ideas and feedback on the
> approach would be great.
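Here is that rough sketch, just to make the idea concrete. To be clear,
SEC("kprobe.wrap/") and struct bpf_wrap_ctx are made-up names for the
proposed program type and its custom context, and the is_exit/entry_ts
fields are assumptions about what the kernel could provide; nothing like
this exists today:

  /* rough sketch only: SEC("kprobe.wrap/") and struct bpf_wrap_ctx are
   * made-up names for the proposed program type and its custom context
   */
  #include "vmlinux.h"
  #include <bpf/bpf_helpers.h>

  char LICENSE[] SEC("license") = "GPL";

  /* hypothetical context the kernel would pass to the combined program */
  struct bpf_wrap_ctx {
          struct pt_regs *regs;   /* same pt_regs as kprobe/kretprobe today */
          __u64 entry_ts;         /* entry timestamp, valid in exit mode */
          bool is_exit;           /* false on function entry, true on return */
  };

  SEC("kprobe.wrap/ksys_write")
  int handle_ksys_write(struct bpf_wrap_ctx *ctx)
  {
          if (!ctx->is_exit) {
                  /* entry mode: returning non-zero here would tell the
                   * kernel to skip the return probe for this invocation
                   */
                  return 0;
          }

          /* exit mode: duration without a tid-keyed hashmap or
           * thread local storage
           */
          __u64 dur_ns = bpf_ktime_get_ns() - ctx->entry_ts;

          bpf_printk("ksys_write took %llu ns", dur_ns);
          return 0;
  }

With something like this the duration use case is a single field read on
the exit side, there is no extra attach step in user space, and libbpf
can auto-attach it like any other declaratively defined program.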
>
> The change in bpftrace [1] using the new interface shows speed increase
> with tracing perf bench messaging:
>
>   # perf bench sched messaging -l 100000
>
> having system wide bpftrace:
>
>   # bpftrace -e 'kprobe:ksys_write { }, kretprobe:ksys_write { }'
>
> without bpftrace:
>
>   # Running 'sched/messaging' benchmark:
>   # 20 sender and receiver processes per group
>   # 10 groups == 400 processes run
>
>        Total time: 119.595 [sec]
>
>   Performance counter stats for 'perf bench sched messaging -l 100000':
>
>         102,419,967,282      cycles:u
>       5,652,444,107,001      cycles:k
>       5,782,645,019,612      cycles
>          22,187,151,206      instructions:u    # 0.22 insn per cycle
>       2,979,040,498,455      instructions:k    # 0.53 insn per cycle
>
>           119.671169829 seconds time elapsed
>
>            94.959198000 seconds user
>          1815.371616000 seconds sys
>
> with current bpftrace:
>
>   # Running 'sched/messaging' benchmark:
>   # 20 sender and receiver processes per group
>   # 10 groups == 400 processes run
>
>        Total time: 221.153 [sec]
>
>   Performance counter stats for 'perf bench sched messaging -l 100000':
>
>         125,292,164,504      cycles:u

btw, why +25% in user space?... this looks weird

>      10,315,020,393,735      cycles:k
>      10,501,379,274,042      cycles
>          22,187,583,545      instructions:u    # 0.18 insn per cycle
>       4,856,893,111,303      instructions:k    # 0.47 insn per cycle
>
>           221.229234283 seconds time elapsed
>
>           103.792498000 seconds user
>          3432.643302000 seconds sys
>
> with bpftrace using the new interface:
>
>   # Running 'sched/messaging' benchmark:
>   # 20 sender and receiver processes per group
>   # 10 groups == 400 processes run
>
>        Total time: 157.825 [sec]
>
>   Performance counter stats for 'perf bench sched messaging -l 100000':
>
>         102,423,112,279      cycles:u
>       7,450,856,354,744      cycles:k
>       7,584,769,726,693      cycles
>          22,187,270,661      instructions:u    # 0.22 insn per cycle
>       3,985,522,383,425      instructions:k    # 0.53 insn per cycle
>
>           157.900787760 seconds time elapsed
>
>            97.953898000 seconds user
>          2425.314753000 seconds sys
>
> thanks,
> jirka
>
>
> [1] https://github.com/bpftrace/bpftrace/pull/2984
> ---
> Jiri Olsa (4):
>       fprobe: Add entry/exit callbacks types
>       bpf: Add return prog to kprobe multi
>       libbpf: Add return_prog_fd to kprobe multi opts
>       selftests/bpf: Add kprobe multi return prog test
>
>  include/linux/fprobe.h                                        |  18 ++++++++++------
>  include/uapi/linux/bpf.h                                      |   4 +++-
>  kernel/trace/bpf_trace.c                                      |  50 ++++++++++++++++++++++++++++++++-----------
>  tools/include/uapi/linux/bpf.h                                |   4 +++-
>  tools/lib/bpf/bpf.c                                           |   1 +
>  tools/lib/bpf/bpf.h                                           |   1 +
>  tools/lib/bpf/libbpf.c                                        |   5 +++++
>  tools/lib/bpf/libbpf.h                                        |   6 +++++-
>  tools/testing/selftests/bpf/prog_tests/kprobe_multi_test.c    |  53 +++++++++++++++++++++++++++++++++++++++++++++
>  tools/testing/selftests/bpf/progs/kprobe_multi_return_prog.c  | 105 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>  10 files changed, 226 insertions(+), 21 deletions(-)
>  create mode 100644 tools/testing/selftests/bpf/progs/kprobe_multi_return_prog.c
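PS, for contrast, my understanding is that with the series as posted the
user space side would look roughly like the sketch below. The
.return_prog_fd opt name is taken from the title of patch 3, so the exact
spelling is an assumption, and entry_prog/return_prog are whatever two
programs the application loaded:

  #include <bpf/libbpf.h>

  /* sketch of attaching entry + return programs over one kprobe multi
   * link with the opt proposed in this series
   */
  static struct bpf_link *attach_entry_and_return(struct bpf_program *entry_prog,
                                                  struct bpf_program *return_prog)
  {
          LIBBPF_OPTS(bpf_kprobe_multi_opts, opts,
                  /* field added by patch 3; not in libbpf today */
                  .return_prog_fd = bpf_program__fd(return_prog),
          );

          /* a single link drives both the entry and the return program */
          return bpf_program__attach_kprobe_multi_opts(entry_prog,
                                                       "ksys_write", &opts);
  }

A single SEC("kprobe.wrap/...") program would avoid this two-program
plumbing entirely, which is the usability gap I'd like us to sort out
before the uapi is set in stone.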