On Wed, Feb 7, 2024 at 7:35 AM Jiri Olsa <jolsa@xxxxxxxxxx> wrote:
>
> hi,
> adding support to attach both entry and return bpf program on single
> kprobe multi link.
>
> Having entry together with return probe for given function is common
> use case for tetragon, bpftrace and most likely for others.
>
> At the moment if we want both entry and return probe to execute bpf
> program we need to create two (entry and return probe) links. The link
> for return probe creates extra entry probe to setup the return probe.
> The extra entry probe execution could be omitted if we had a way to
> use just single link for both entry and exit probe.
>
> In addition it's possible to control the execution of the return probe
> with the return value of the entry bpf program. If the entry program
> returns 0 the return probe is installed and executed, otherwise it's
> skipped.
>

In general, I think this is a very useful ability to have a combined
entry/return program. But the way you implement it, with an extra flag
and an extra fd parameter, makes it harder to have nice high-level
support in libbpf (and presumably other BPF loader libraries) for this.

When I was thinking about doing something like this, I was considering
adding a new program type, actually. That way it's possible to define
this "let's skip the return probe" protocol without backwards
compatibility concerns, and it's easier to use declaratively in libbpf.
You just declare SEC("kprobe.wrap/...") (or whatever the name, something
to designate that it's both an entry and exit probe) as one program, and
in the code there would be some way to determine whether we are in entry
mode or exit mode (a helper or a field in a custom context type; the
latter would be faster and more usable, but that's probably not
critical).

Another frequently requested feature and a very common use case is
measuring the duration of a function. If we have a custom context type,
we can have a field recording the entry timestamp and let the user
calculate the duration, if necessary. Without this, users have to pay a
bunch of extra overhead to record the timestamp and put it into a
hashmap (keyed by thread id) or into thread local storage (even if they
have no other use for thread local storage). A rough sketch of both
ideas is below.

Also, consider that a similar concept is applicable to uprobes, and we
should do that as well, in a similar fashion. The above approach works
for both the kprobe/kretprobe and uprobe/uretprobe cases, because they
have the same pt_regs data structure as a context (even if for exit
probes most of the values in pt_regs are not that meaningful).

So anyways, great feature, but let's discuss the end-to-end usability
story before we finalize the implementation?

> I'm still working on the tetragon change, so I'll be sending non-RFC
> version once that's ready, meanwhile any ideas and feedback on the
> approach would be great.
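Here is that rough sketch, just to make the idea concrete. To be clear,
SEC("kprobe.wrap/") and struct bpf_wrap_ctx are made-up names for the
proposed program type and its custom context, and the is_exit/entry_ts
fields are assumptions about what the kernel could provide; nothing like
this exists today:

  /* rough sketch only: SEC("kprobe.wrap/") and struct bpf_wrap_ctx are
   * made-up names for the proposed program type and its custom context
   */
  #include "vmlinux.h"
  #include <bpf/bpf_helpers.h>

  char LICENSE[] SEC("license") = "GPL";

  /* hypothetical context the kernel would pass to the combined program */
  struct bpf_wrap_ctx {
          struct pt_regs *regs;   /* same pt_regs as kprobe/kretprobe today */
          __u64 entry_ts;         /* entry timestamp, valid in exit mode */
          bool is_exit;           /* false on function entry, true on return */
  };

  SEC("kprobe.wrap/ksys_write")
  int handle_ksys_write(struct bpf_wrap_ctx *ctx)
  {
          if (!ctx->is_exit) {
                  /* entry mode: returning non-zero here would tell the
                   * kernel to skip the return probe for this invocation
                   */
                  return 0;
          }

          /* exit mode: duration without a tid-keyed hashmap or
           * thread local storage
           */
          __u64 dur_ns = bpf_ktime_get_ns() - ctx->entry_ts;

          bpf_printk("ksys_write took %llu ns", dur_ns);
          return 0;
  }

With something like this the duration use case is a single field read on
the exit side, there is no extra attach step in user space, and libbpf
can auto-attach it like any other declaratively defined program.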
>
> The change in bpftrace [1] using the new interface shows speed increase
> with tracing perf bench messaging:
>
>   # perf bench sched messaging -l 100000
>
> having system wide bpftrace:
>
>   # bpftrace -e 'kprobe:ksys_write { }, kretprobe:ksys_write { }'
>
> without bpftrace:
>
>   # Running 'sched/messaging' benchmark:
>   # 20 sender and receiver processes per group
>   # 10 groups == 400 processes run
>
>        Total time: 119.595 [sec]
>
>   Performance counter stats for 'perf bench sched messaging -l 100000':
>
>         102,419,967,282      cycles:u
>       5,652,444,107,001      cycles:k
>       5,782,645,019,612      cycles
>          22,187,151,206      instructions:u    # 0.22 insn per cycle
>       2,979,040,498,455      instructions:k    # 0.53 insn per cycle
>
>           119.671169829 seconds time elapsed
>
>            94.959198000 seconds user
>          1815.371616000 seconds sys
>
> with current bpftrace:
>
>   # Running 'sched/messaging' benchmark:
>   # 20 sender and receiver processes per group
>   # 10 groups == 400 processes run
>
>        Total time: 221.153 [sec]
>
>   Performance counter stats for 'perf bench sched messaging -l 100000':
>
>         125,292,164,504      cycles:u

btw, why +25% in user space?... this looks weird

>      10,315,020,393,735      cycles:k
>      10,501,379,274,042      cycles
>          22,187,583,545      instructions:u    # 0.18 insn per cycle
>       4,856,893,111,303      instructions:k    # 0.47 insn per cycle
>
>           221.229234283 seconds time elapsed
>
>           103.792498000 seconds user
>          3432.643302000 seconds sys
>
> with bpftrace using the new interface:
>
>   # Running 'sched/messaging' benchmark:
>   # 20 sender and receiver processes per group
>   # 10 groups == 400 processes run
>
>        Total time: 157.825 [sec]
>
>   Performance counter stats for 'perf bench sched messaging -l 100000':
>
>         102,423,112,279      cycles:u
>       7,450,856,354,744      cycles:k
>       7,584,769,726,693      cycles
>          22,187,270,661      instructions:u    # 0.22 insn per cycle
>       3,985,522,383,425      instructions:k    # 0.53 insn per cycle
>
>           157.900787760 seconds time elapsed
>
>            97.953898000 seconds user
>          2425.314753000 seconds sys
>
> thanks,
> jirka
>
>
> [1] https://github.com/bpftrace/bpftrace/pull/2984
> ---
> Jiri Olsa (4):
>       fprobe: Add entry/exit callbacks types
>       bpf: Add return prog to kprobe multi
>       libbpf: Add return_prog_fd to kprobe multi opts
>       selftests/bpf: Add kprobe multi return prog test
>
>  include/linux/fprobe.h                                        |  18 ++++++++++------
>  include/uapi/linux/bpf.h                                      |   4 +++-
>  kernel/trace/bpf_trace.c                                      |  50 ++++++++++++++++++++++++++++++++-----------
>  tools/include/uapi/linux/bpf.h                                |   4 +++-
>  tools/lib/bpf/bpf.c                                           |   1 +
>  tools/lib/bpf/bpf.h                                           |   1 +
>  tools/lib/bpf/libbpf.c                                        |   5 +++++
>  tools/lib/bpf/libbpf.h                                        |   6 +++++-
>  tools/testing/selftests/bpf/prog_tests/kprobe_multi_test.c    |  53 +++++++++++++++++++++++++++++++++++++++++++++
>  tools/testing/selftests/bpf/progs/kprobe_multi_return_prog.c  | 105 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>  10 files changed, 226 insertions(+), 21 deletions(-)
>  create mode 100644 tools/testing/selftests/bpf/progs/kprobe_multi_return_prog.c
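PS, for contrast, my understanding is that with the series as posted the
user space side would look roughly like the sketch below. The
.return_prog_fd opt name is taken from the title of patch 3, so the exact
spelling is an assumption, and entry_prog/return_prog are whatever two
programs the application loaded:

  #include <bpf/libbpf.h>

  /* sketch of attaching entry + return programs over one kprobe multi
   * link with the opt proposed in this series
   */
  static struct bpf_link *attach_entry_and_return(struct bpf_program *entry_prog,
                                                  struct bpf_program *return_prog)
  {
          LIBBPF_OPTS(bpf_kprobe_multi_opts, opts,
                  /* field added by patch 3; not in libbpf today */
                  .return_prog_fd = bpf_program__fd(return_prog),
          );

          /* a single link drives both the entry and the return program */
          return bpf_program__attach_kprobe_multi_opts(entry_prog,
                                                       "ksys_write", &opts);
  }

A single SEC("kprobe.wrap/...") program would avoid this two-program
plumbing entirely, which is the usability gap I'd like us to sort out
before the uapi is set in stone.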