Re: [PATCH RFC bpf-next 0/4] bpf: Add support to attach return prog in kprobe multi

Jiri Olsa <olsajiri@xxxxxxxxx> · Tue, 13 Feb 2024 13:09:54 +0100

On Mon, Feb 12, 2024 at 08:06:06PM -0800, Andrii Nakryiko wrote:

SNIP

> > > But the way you implement it with extra flag and extra fd parameter
> > > makes it harder to have a nice high-level support in libbpf (and
> > > presumably other BPF loader libraries) for this.
> > >
> > > When I was thinking about doing something like this, I was considering
> > > adding a new program type, actually. That way it's possible to define
> > > this "let's skip return probe" protocol without backwards
> > > compatibility concerns. It's easier to use it declaratively in libbpf.
> >
> > ok, that seems cleaner.. but we need to use current kprobe programs,
> > so not sure at the moment how would that fit in.. did you mean new
> > link type?
> 
> It's kind of a less important detail, actually. New program type would
> allow us to have an entirely different context type, but I think we
> can make do with the existing kprobe program type. We can have a
> separate attach_type and link type, just like multi-kprobe and
> multi-uprobe are still kprobe programs.

ok, having new attach type on top of kprobe_multi link makes sense

> 
> >
> > > You just declare SEC("kprobe.wrap/...") (or whatever the name,
> > > something to designate that it's both entry and exit probe) as one
> > > program and in the code there would be some way to determine whether
> > > we are in entry mode or exit mode (helper or field in the custom
> > > context type, the latter being faster and more usable, but it's
> > > probably not critical).
> >
> > hum, so the single program would be for both entry and exit probe,
> > I'll need to check how bad it'd be for us, but it'd probably mean
> > just one extra tail call, so it's likely ok
> 
> I guess, I don't know what you are doing there :) I'd recommend
> looking at utilizing BPF global subprogs instead of tail calls, if
> your kernel allows for that, as that's a saner way to scale BPF
> verification.

ok, we should probably do that.. given this enhancement will be
available on latest kernel anyway, we could use global subprogs
as well

the related bpftrace might be bit more challenging.. will have to
generate program calling entry or return program now, but seems
doable of course

> 
> >
> > >
> > > Another frequently requested feature and a very common use case is to
> > > measure duration of the function, so if we have a custom type, we can
> > > have a field to record entry timestamp and let user calculate
> >
> > you mean bpf program context right?
> 
> Yes, when I was writing that I was imagining a new context type with new fields.
> 
> But thinking about this a bit more, I like a slightly different
> approach now. Instead of having a new context type, kernel capturing
> timestamps, etc, what if we introduce a concept of "a session". This
> paired kprobe+kretprobe (and same for uprobe+uretprobe) share a common
> session. So, say, if we are tracing kernel function abc() and we have
> our kwrapper (let's call it that for now, kwrapper and uwrapper?)
> program attached to abc, we have something like this:
> 
> 1) abc() is called
> 2) we intercept it at its entry, initialize "session"
> 3) call our kwrapper program in entry mode
> 4) abc() proceeds, until return
> 5) our kwrapper program is called in exit mode
> 6) at this point the session has ended
> 
> If abc() is called again or in parallel on another CPU, each such
> abc() invocation (with kwrapper prog) has its own session state.
> 
> Now, it all sounds scary, but a session could be really just a 8 byte
> value that has to last for the duration of a single kprobe+kretprobe
> execution.
> 
> Imagine we introduce just one new kfunc that would be usable from
> kwrapper and uwrapper programs:
> 
> u64 *bpf_get_session_cookie(void);
> 
> It returns a pointer to an 8-byte slot which initially is set to zero
> by kernel (before kprobe/entry part). And then the program can put
> anything into it. Either from entry (kprobe/uprobe) or exit
> (kretprobe/uretprobe) executions. Whatever value was stored in entry
> mode will still be there in exit mode as well.
> 
> This gives a lot of flexibility without requiring the kernel to do
> anything expensive.
> 
> E.g., it's easy to detect if kwrapper program is an entry call or exit call:
> 
> u64 *cookie = bpf_get_session_cookie();
> if (*cookie == 0) {
>   /* entry mode */
>   *cookie = 1;
> } else {
>   /* exit mode */
> }
> 
> Or if we wanted to measure function duration/latency:
> 
> u64 *cookie = bpf_get_session_cookie();
> if (*cookie == 0) {
>   *cookie = bpf_get_ktime_ns();
> } else {
>   u64 duration = bpf_get_ktime_ns() - *cookie;
>   /* output duration somehow */
> }
> 
> Yes, I realize special-casing zero might be a bit inconvenient, but I
> think simplicity trumps a potential for zero to be a valid value (and
> there are always ways to work around zero as a meaningful value).
> 
> Now, in more complicated cases 8 bytes of temporary session state
> isn't enough, just like BPF cookie being 8 byte (read-only) value
> might not be enough. But the solution is the same as with the BPF
> cookie. You just use those 8 bytes as a key into ARRAY/HASHMAP/whatnot
> storage. It's simple and fast enough for pretty much any case.

I was recently asked for a way to have function arguments available
in the return kprobe as it is in fexit programs (which was not an
option to use, because we don't have fast multi attach for it)

using the hash map to store arguments and storing its key in the
session data might be solution for this

> 
> 
> Now, how do we implement this storage session? I recently looked into
> uprobe/uretprobe implementation, and I know that for uretprobes there
> is a dynamically allocated uretprobe_instance (or whatever the name)
> allocated for each runtime invocation of uretprobe. We can use that.
> We'll need to allocate this instance a bit earlier, before uprobe part
> of uwrapper BPF program has to be executed, initialize session cookie
> to zero, and store pointer to this memory in bpf_run_ctx (just like we
> do for BPF cookie).

I think you refer to the return_instance and it seems like we could use it,
I'll check on that

> 
> I bet there is something similar in the kretprobe case, where we can
> carve out 8 bytes and pass it to both entry and exit parts of kwrapper
> program.

for kprobes.. both kprobe and kprobe_multi/fprobe use rethook to invoke
return probes, so I guess we could use it and store that shared data
in there

btw Masami is in process of removing rethook from kprobe_multi/fprobe,
as part of migrating fprobe on top of ftrace [0]

but instead the rethook I think there'll be some sort of shadow stack/data
area accessible from both entry and return probes, that we could use

jirka

[0] https://lore.kernel.org/bpf/170723204881.502590.11906735097521170661.stgit@devnote2/

> 
> So anyways, might need some internal refactoring, but seems very
> doable and will give a lot of power and flexibility for tracing use
> cases.
> 
> It would be cool to also have a similar fentry/fexit combo with "BPF
> session cookie", but that's something we can think about separately,
> probably.
> 
> 
> >
> > > duration, if necessary. Without this users have to pay a bunch of
> > > extra overhead to record timestamp and put it into hashmap (keyed by
> > > thread id) or thread local storage (even if they have no other use for
> > > thread local storage).
> >
> > sounds good
> >
> > >
> > > Also, consider that a similar concept is applicable to uprobes and we
> > > should do that as well, in similar fashion. And the above approach
> > > works for both kprobe/kretprobe and uprobe/uretprobe cases, because
> > > they have the same pt_regs data structure as a context (even if for
> > > exit probes most of the values of pt_regs are not that meaningful).
> >
> > ok, that seems useful for uprobes as well
> 
> yes, absolutely
> 
> >
> > btw I have another unrelated change in stash that that let you choose
> > the bpf_prog_active or prog->active re-entry check before program is
> > executed in krobe_multi link.. I also added extra flag, and it still
> > seems like good idea to me, thoughts? ;-)
> >
> 
> no specific opinion on this from my side, I guess
> 
> > >
> > > So anyways, great feature, but let's discuss end-to-end usability
> > > story before we finalize the implementation?
> >
> > yep, it does say RFC in the subject ;-)
> >
> > >
> > >
> 
> [...]