Re: [PATCH RFC bpf-next 0/4] bpf: Add support to attach return prog in kprobe multi

Andrii Nakryiko <andrii.nakryiko@xxxxxxxxx> · Mon, 12 Feb 2024 20:06:06 -0800

On Sat, Feb 10, 2024 at 7:31 AM Jiri Olsa <olsajiri@xxxxxxxxx> wrote:
>
> On Thu, Feb 08, 2024 at 11:35:18AM -0800, Andrii Nakryiko wrote:
> > On Wed, Feb 7, 2024 at 7:35 AM Jiri Olsa <jolsa@xxxxxxxxxx> wrote:
> > >
> > > hi,
> > > adding support to attach both entry and return bpf program on single
> > > kprobe multi link.
> > >
> > > Having entry together with return probe for given function is common
> > > use case for tetragon, bpftrace and most likely for others.
> > >
> > > At the moment if we want both entry and return probe to execute bpf
> > > program we need to create two (entry and return probe) links. The link
> > > for return probe creates extra entry probe to setup the return probe.
> > > The extra entry probe execution could be omitted if we had a way to
> > > use just single link for both entry and exit probe.
> > >
> > > In addition it's possible to control the execution of the return probe
> > > with the return value of the entry bpf program. If the entry program
> > > returns 0 the return probe is installed and executed, otherwise it's
> > > skip.
> > >
> >
> > In general, I think this is a very useful ability to have a combined
> > entry/return program.
>
> great, thanks
>
> >
> > But the way you implement it with extra flag and extra fd parameter
> > makes it harder to have a nice high-level support in libbpf (and
> > presumably other BPF loader libraries) for this.
> >
> > When I was thinking about doing something like this, I was considering
> > adding a new program type, actually. That way it's possible to define
> > this "let's skip return probe" protocol without backwards
> > compatibility concerns. It's easier to use it declaratively in libbpf.
>
> ok, that seems cleaner.. but we need to use current kprobe programs,
> so not sure at the moment how would that fit in.. did you mean new
> link type?

It's kind of a less important detail, actually. New program type would
allow us to have an entirely different context type, but I think we
can make do with the existing kprobe program type. We can have a
separate attach_type and link type, just like multi-kprobe and
multi-uprobe are still kprobe programs.

>
> > You just declare SEC("kprobe.wrap/...") (or whatever the name,
> > something to designate that it's both entry and exit probe) as one
> > program and in the code there would be some way to determine whether
> > we are in entry mode or exit mode (helper or field in the custom
> > context type, the latter being faster and more usable, but it's
> > probably not critical).
>
> hum, so the single program would be for both entry and exit probe,
> I'll need to check how bad it'd be for us, but it'd probably mean
> just one extra tail call, so it's likely ok

I guess, I don't know what you are doing there :) I'd recommend
looking at utilizing BPF global subprogs instead of tail calls, if
your kernel allows for that, as that's a saner way to scale BPF
verification.

>
> >
> > Another frequently requested feature and a very common use case is to
> > measure duration of the function, so if we have a custom type, we can
> > have a field to record entry timestamp and let user calculate
>
> you mean bpf program context right?

Yes, when I was writing that I was imagining a new context type with new fields.

But thinking about this a bit more, I like a slightly different
approach now. Instead of having a new context type, kernel capturing
timestamps, etc, what if we introduce a concept of "a session". This
paired kprobe+kretprobe (and same for uprobe+uretprobe) share a common
session. So, say, if we are tracing kernel function abc() and we have
our kwrapper (let's call it that for now, kwrapper and uwrapper?)
program attached to abc, we have something like this:

1) abc() is called
2) we intercept it at its entry, initialize "session"
3) call our kwrapper program in entry mode
4) abc() proceeds, until return
5) our kwrapper program is called in exit mode
6) at this point the session has ended

If abc() is called again or in parallel on another CPU, each such
abc() invocation (with kwrapper prog) has its own session state.

Now, it all sounds scary, but a session could be really just a 8 byte
value that has to last for the duration of a single kprobe+kretprobe
execution.

Imagine we introduce just one new kfunc that would be usable from
kwrapper and uwrapper programs:

u64 *bpf_get_session_cookie(void);

It returns a pointer to an 8-byte slot which initially is set to zero
by kernel (before kprobe/entry part). And then the program can put
anything into it. Either from entry (kprobe/uprobe) or exit
(kretprobe/uretprobe) executions. Whatever value was stored in entry
mode will still be there in exit mode as well.

This gives a lot of flexibility without requiring the kernel to do
anything expensive.

E.g., it's easy to detect if kwrapper program is an entry call or exit call:

u64 *cookie = bpf_get_session_cookie();
if (*cookie == 0) {
  /* entry mode */
  *cookie = 1;
} else {
  /* exit mode */
}

Or if we wanted to measure function duration/latency:

u64 *cookie = bpf_get_session_cookie();
if (*cookie == 0) {
  *cookie = bpf_get_ktime_ns();
} else {
  u64 duration = bpf_get_ktime_ns() - *cookie;
  /* output duration somehow */
}

Yes, I realize special-casing zero might be a bit inconvenient, but I
think simplicity trumps a potential for zero to be a valid value (and
there are always ways to work around zero as a meaningful value).

Now, in more complicated cases 8 bytes of temporary session state
isn't enough, just like BPF cookie being 8 byte (read-only) value
might not be enough. But the solution is the same as with the BPF
cookie. You just use those 8 bytes as a key into ARRAY/HASHMAP/whatnot
storage. It's simple and fast enough for pretty much any case.

Now, how do we implement this storage session? I recently looked into
uprobe/uretprobe implementation, and I know that for uretprobes there
is a dynamically allocated uretprobe_instance (or whatever the name)
allocated for each runtime invocation of uretprobe. We can use that.
We'll need to allocate this instance a bit earlier, before uprobe part
of uwrapper BPF program has to be executed, initialize session cookie
to zero, and store pointer to this memory in bpf_run_ctx (just like we
do for BPF cookie).

I bet there is something similar in the kretprobe case, where we can
carve out 8 bytes and pass it to both entry and exit parts of kwrapper
program.

So anyways, might need some internal refactoring, but seems very
doable and will give a lot of power and flexibility for tracing use
cases.

It would be cool to also have a similar fentry/fexit combo with "BPF
session cookie", but that's something we can think about separately,
probably.

>
> > duration, if necessary. Without this users have to pay a bunch of
> > extra overhead to record timestamp and put it into hashmap (keyed by
> > thread id) or thread local storage (even if they have no other use for
> > thread local storage).
>
> sounds good
>
> >
> > Also, consider that a similar concept is applicable to uprobes and we
> > should do that as well, in similar fashion. And the above approach
> > works for both kprobe/kretprobe and uprobe/uretprobe cases, because
> > they have the same pt_regs data structure as a context (even if for
> > exit probes most of the values of pt_regs are not that meaningful).
>
> ok, that seems useful for uprobes as well

yes, absolutely

>
> btw I have another unrelated change in stash that that let you choose
> the bpf_prog_active or prog->active re-entry check before program is
> executed in krobe_multi link.. I also added extra flag, and it still
> seems like good idea to me, thoughts? ;-)
>

no specific opinion on this from my side, I guess

> >
> > So anyways, great feature, but let's discuss end-to-end usability
> > story before we finalize the implementation?
>
> yep, it does say RFC in the subject ;-)
>
> >
> >

[...]