Re: [PATCH RFC bpf-next 0/4] bpf: Add support to attach return prog in kprobe multi

On Thu, Feb 08, 2024 at 11:35:18AM -0800, Andrii Nakryiko wrote:
> On Wed, Feb 7, 2024 at 7:35 AM Jiri Olsa <jolsa@xxxxxxxxxx> wrote:
> >
> > hi,
> > adding support to attach both entry and return bpf programs on a
> > single kprobe multi link.
> >
> > Having an entry probe together with a return probe for a given function
> > is a common use case for tetragon, bpftrace, and most likely others.
> >
> > At the moment, if we want both entry and return probes to execute bpf
> > programs, we need to create two links (entry and return probe). The
> > link for the return probe creates an extra entry probe to set up the
> > return probe. That extra entry probe execution could be omitted if we
> > had a way to use just a single link for both the entry and exit probes.
> >
> > In addition, it's possible to control the execution of the return probe
> > with the return value of the entry bpf program. If the entry program
> > returns 0 the return probe is installed and executed, otherwise it's
> > skipped.
> >
> 
> In general, I think this is a very useful ability to have a combined
> entry/return program.

great, thanks

> 
> But the way you implement it, with an extra flag and an extra fd
> parameter, makes it harder to have nice high-level support in libbpf
> (and presumably other BPF loader libraries) for this.
> 
> When I was thinking about doing something like this, I was considering
> adding a new program type, actually. That way it's possible to define
> this "let's skip return probe" protocol without backwards
> compatibility concerns. It's easier to use it declaratively in libbpf.

ok, that seems cleaner.. but we need to use the current kprobe programs,
so I'm not sure at the moment how that would fit in.. did you mean a new
link type?
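for reference, the two-link setup the series is trying to avoid looks
roughly like this on the libbpf side today (a minimal sketch; error
handling elided and the program/object names are made up):

```c
/* Current approach: two separate kprobe.multi links. The second one
 * (retprobe) internally plants its own extra entry probe just to arm
 * the return probe, which is the overhead the series removes. */
#include <bpf/libbpf.h>

static int attach_entry_and_return(struct bpf_object *obj)
{
	struct bpf_program *entry, *ret;
	LIBBPF_OPTS(bpf_kprobe_multi_opts, ropts, .retprobe = true);

	entry = bpf_object__find_program_by_name(obj, "entry_prog");
	ret = bpf_object__find_program_by_name(obj, "ret_prog");

	if (!bpf_program__attach_kprobe_multi_opts(entry, "ksys_*", NULL))
		return -1;
	if (!bpf_program__attach_kprobe_multi_opts(ret, "ksys_*", &ropts))
		return -1;
	return 0;
}
```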

> You just declare SEC("kprobe.wrap/...") (or whatever the name,
> something to designate that it's both entry and exit probe) as one
> program and in the code there would be some way to determine whether
> we are in entry mode or exit mode (helper or field in the custom
> context type, the latter being faster and more usable, but it's
> probably not critical).

hmm, so the single program would serve both the entry and exit probe,
I'll need to check how bad that'd be for us, but it'd probably mean
just one extra tail call, so it's likely ok
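so IIUC the combined program could look something like this.. to be
clear, the section name, context struct, and its fields below are all
made up for illustration, nothing here is a settled UAPI:

```c
/* Hypothetical sketch of a combined entry/exit program: one program
 * attached once, invoked on both entry and return, with the context
 * telling us which mode we are in. */
#include <linux/types.h>
#include <bpf/bpf_helpers.h>

struct kprobe_wrap_ctx {		/* hypothetical custom context */
	__u64 entry_ts;			/* could carry the entry timestamp */
	__u8 is_return;			/* 0 on entry, 1 on return */
};

SEC("kprobe.wrap/ksys_write")		/* hypothetical section name */
int wrap_ksys_write(struct kprobe_wrap_ctx *ctx)
{
	if (!ctx->is_return) {
		/* entry mode: a non-zero return value would skip
		 * installing the return probe for this invocation */
		return 0;
	}

	/* exit mode: entry timestamp etc. available via ctx */
	return 0;
}

char LICENSE[] SEC("license") = "GPL";
```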

> 
> Another frequently requested feature and a very common use case is to
> measure the duration of a function, so if we have a custom type, we can
> have a field to record the entry timestamp and let the user calculate

you mean the bpf program context, right?

> the duration, if necessary. Without this, users have to pay a bunch of
> extra overhead to record the timestamp and put it into a hashmap (keyed
> by thread id) or thread local storage (even if they have no other use
> for thread local storage).

sounds good
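right, the per-thread hashmap dance users have to do today looks
roughly like this (a minimal sketch; map sizing and what to do with
delta are placeholders):

```c
/* Today's duration measurement: stash a timestamp at entry in a hash
 * map keyed by thread id, look it up and delete it at return. This is
 * the overhead a timestamp field in the context would remove. */
#include <linux/types.h>
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

struct {
	__uint(type, BPF_MAP_TYPE_HASH);
	__uint(max_entries, 10240);	/* placeholder sizing */
	__type(key, __u32);
	__type(value, __u64);
} start SEC(".maps");

SEC("kprobe.multi/ksys_write")
int entry_prog(void *ctx)
{
	__u32 tid = (__u32)bpf_get_current_pid_tgid();
	__u64 ts = bpf_ktime_get_ns();

	bpf_map_update_elem(&start, &tid, &ts, BPF_ANY);
	return 0;
}

SEC("kretprobe.multi/ksys_write")
int exit_prog(void *ctx)
{
	__u32 tid = (__u32)bpf_get_current_pid_tgid();
	__u64 *tsp = bpf_map_lookup_elem(&start, &tid);

	if (tsp) {
		__u64 delta = bpf_ktime_get_ns() - *tsp;

		bpf_map_delete_elem(&start, &tid);
		(void)delta;	/* ... record/aggregate delta ... */
	}
	return 0;
}

char LICENSE[] SEC("license") = "GPL";
```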

> 
> Also, consider that a similar concept is applicable to uprobes, and we
> should do that as well, in a similar fashion. The above approach works
> for both kprobe/kretprobe and uprobe/uretprobe cases, because they have
> the same pt_regs data structure as a context (even if for exit probes
> most of the values in pt_regs are not that meaningful).

ok, that seems useful for uprobes as well

btw I have another unrelated change stashed that lets you choose between
the bpf_prog_active and prog->active re-entry checks before the program
is executed in a kprobe_multi link.. I also added an extra flag for
that, and it still seems like a good idea to me, thoughts? ;-)

> 
> So anyways, great feature, but let's discuss end-to-end usability
> story before we finalize the implementation?

yep, it does say RFC in the subject ;-)

> 
> 
> > I'm still working on the tetragon change, so I'll be sending a non-RFC
> > version once that's ready; meanwhile any ideas and feedback on the
> > approach would be great.
> >
> > The change in bpftrace [1] using the new interface shows a speed
> > increase when tracing perf bench messaging:
> >
> >   # perf bench sched messaging -l 100000
> >
> > having system wide bpftrace:
> >
> >   # bpftrace -e 'kprobe:ksys_write { }, kretprobe:ksys_write { }'
> >
> > without bpftrace:
> >
> >   # Running 'sched/messaging' benchmark:
> >   # 20 sender and receiver processes per group
> >   # 10 groups == 400 processes run
> >
> >        Total time: 119.595 [sec]
> >
> >    Performance counter stats for 'perf bench sched messaging -l 100000':
> >
> >      102,419,967,282      cycles:u
> >    5,652,444,107,001      cycles:k
> >    5,782,645,019,612      cycles
> >       22,187,151,206      instructions:u                   #    0.22  insn per cycle
> >    2,979,040,498,455      instructions:k                   #    0.53  insn per cycle
> >
> >        119.671169829 seconds time elapsed
> >
> >         94.959198000 seconds user
> >       1815.371616000 seconds sys
> >
> > with current bpftrace:
> >
> >   # Running 'sched/messaging' benchmark:
> >   # 20 sender and receiver processes per group
> >   # 10 groups == 400 processes run
> >
> >        Total time: 221.153 [sec]
> >
> >    Performance counter stats for 'perf bench sched messaging -l 100000':
> >
> >      125,292,164,504      cycles:u
> 
> btw, why +25% in user space?... this looks weird

hmm right.. maybe there's some bpftrace timer code that got triggered/executed
more times because it takes longer to finish the workload? I'll check

thanks,
jirka

> 
> 
> >   10,315,020,393,735      cycles:k
> >   10,501,379,274,042      cycles
> >       22,187,583,545      instructions:u                   #    0.18  insn per cycle
> >    4,856,893,111,303      instructions:k                   #    0.47  insn per cycle
> >
> >        221.229234283 seconds time elapsed
> >
> >        103.792498000 seconds user
> >       3432.643302000 seconds sys
> >
> > with bpftrace using the new interface:
> >
> >   # Running 'sched/messaging' benchmark:
> >   # 20 sender and receiver processes per group
> >   # 10 groups == 400 processes run
> >
> >        Total time: 157.825 [sec]
> >
> >    Performance counter stats for 'perf bench sched messaging -l 100000':
> >
> >      102,423,112,279      cycles:u
> >    7,450,856,354,744      cycles:k
> >    7,584,769,726,693      cycles
> >       22,187,270,661      instructions:u                   #    0.22  insn per cycle
> >    3,985,522,383,425      instructions:k                   #    0.53  insn per cycle
> >
> >        157.900787760 seconds time elapsed
> >
> >         97.953898000 seconds user
> >       2425.314753000 seconds sys
> >
> > thanks,
> > jirka
> >
> >
> > [1] https://github.com/bpftrace/bpftrace/pull/2984
> > ---
> > Jiri Olsa (4):
> >       fprobe: Add entry/exit callbacks types
> >       bpf: Add return prog to kprobe multi
> >       libbpf: Add return_prog_fd to kprobe multi opts
> >       selftests/bpf: Add kprobe multi return prog test
> >
> >  include/linux/fprobe.h                                       |  18 ++++++++++------
> >  include/uapi/linux/bpf.h                                     |   4 +++-
> >  kernel/trace/bpf_trace.c                                     |  50 ++++++++++++++++++++++++++++++++-----------
> >  tools/include/uapi/linux/bpf.h                               |   4 +++-
> >  tools/lib/bpf/bpf.c                                          |   1 +
> >  tools/lib/bpf/bpf.h                                          |   1 +
> >  tools/lib/bpf/libbpf.c                                       |   5 +++++
> >  tools/lib/bpf/libbpf.h                                       |   6 +++++-
> >  tools/testing/selftests/bpf/prog_tests/kprobe_multi_test.c   |  53 +++++++++++++++++++++++++++++++++++++++++++++
> >  tools/testing/selftests/bpf/progs/kprobe_multi_return_prog.c | 105 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> >  10 files changed, 226 insertions(+), 21 deletions(-)
> >  create mode 100644 tools/testing/selftests/bpf/progs/kprobe_multi_return_prog.c



