Re: [RFCv3 00/19] x86/ftrace/bpf: Add batch support for direct/tracing attach

Yonghong Song <yhs@xxxxxx> · Sat, 19 Jun 2021 09:19:57 -0700

On 6/19/21 1:33 AM, Jiri Olsa wrote:
On Thu, Jun 17, 2021 at 01:29:45PM -0700, Andrii Nakryiko wrote:
On Sat, Jun 5, 2021 at 4:12 AM Jiri Olsa <jolsa@xxxxxxxxxx> wrote:

hi,
saga continues.. ;-) previous post is in here [1]

After another discussion with Steven, he mentioned that if we fix
the ftrace graph problem with direct functions, he'd be open to
adding a batch interface for direct ftrace functions.

He already had a proof-of-concept fix for that, which I took and broke
up into several changes. I added the ftrace direct batch interface
and the new bpf interface on top of that.

It's not so many patches after all, so I thought having them all
together will help the review, because they are all connected.
However I can break this up into separate patchsets if necessary.

This patchset contains:

   1) patches (1-4) that fix the ftrace graph tracing over the function
      with direct trampolines attached
   2) patches (5-8) that add batch interface for ftrace direct function
      register/unregister/modify
   3) patches (9-19) that add support to attach BPF program to multiple
      functions

In a nutshell:

Ad 1) moves the graph tracing setup before the direct trampoline
prepares the stack, so they don't clash

Ad 2) uses ftrace_ops interface to register direct function with
all functions in ftrace_ops filter.

Ad 3) creates special program and trampoline type to allow attachment
of multiple functions to single program.
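
To make the idea in Ad 2) concrete, here is a toy userspace model, not
kernel code. It only illustrates the difference between registering a
direct trampoline one function at a time and registering it for a whole
filter in one call. All names (toy_table, toy_register_one,
toy_register_batch, sync_passes) are hypothetical, and counting one
"pass" per call is an assumption of the model, not a measurement of ftrace.

```c
#include <stddef.h>

#define MAX_FUNCS 1024

/* Toy stand-in for the set of patchable function entries. */
struct toy_table {
	unsigned long direct[MAX_FUNCS]; /* trampoline per function, 0 = none */
	unsigned int sync_passes;        /* simulated update passes so far */
};

/* Per-function registration: every call costs one update pass. */
int toy_register_one(struct toy_table *t, size_t func, unsigned long tramp)
{
	if (func >= MAX_FUNCS || t->direct[func])
		return -1;
	t->direct[func] = tramp;
	t->sync_passes++;
	return 0;
}

/*
 * Batch registration: check the whole filter first, then attach the
 * same trampoline to every function in it within a single pass.
 */
int toy_register_batch(struct toy_table *t, const size_t *filter,
		       size_t n, unsigned long tramp)
{
	size_t i;

	for (i = 0; i < n; i++)
		if (filter[i] >= MAX_FUNCS || t->direct[filter[i]])
			return -1;
	for (i = 0; i < n; i++)
		t->direct[filter[i]] = tramp;
	t->sync_passes++;
	return 0;
}
```

In this model, attaching to N functions costs N passes through
toy_register_one() but a single pass through toy_register_batch().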

There are more detailed descriptions in the related changelogs.

I have working bpftrace multi attachment code on top of this. I briefly
checked retsnoop and I think it could use the new API as well.

Ok, so I had a bit of time and enthusiasm to try that with retsnoop.
The ugly code is at [0] if you'd like to see what kind of changes I
needed to make to use this (it won't work if you check it out because
it needs your libbpf changes synced into submodule, which I only did
locally). But here are some learnings from that experiment both to
emphasize how important it is to make this work and how restrictive
are some of the current limitations.

First, good news. Using this mass-attach API to attach to almost 1000
kernel functions goes from

Plain fentry/fexit:
===================
real    0m27.321s
user    0m0.352s
sys     0m20.919s

to

Mass-attach fentry/fexit:
=========================
real    0m2.728s
user    0m0.329s
sys     0m2.380s

I did not measure the bpftrace speedup, because the new code
attached instantly ;-)

It's a 10x speed up. And a good chunk of those 2.7 seconds is in some
preparatory steps not related to fentry/fexit stuff.

It's not exactly apples-to-apples, though, because the limitations you
have right now prevent attaching both fentry and fexit programs to
the same set of kernel functions. This makes it pretty useless for a

hum, you could do link_update with fexit program on the link fd,
like in the selftest, right?

lot of cases, in particular for retsnoop. So I haven't really tested
retsnoop end-to-end, I only verified that I do see fentries triggered,
but can't have matching fexits. So the speed-up might be smaller due
to additional fexit mass-attach (once that is allowed), but it's still
a massive difference. So we absolutely need to get this optimization
in.

Few more thoughts, if you'd like to plan some more work ahead ;)

1. We need similar mass-attach functionality for kprobe/kretprobe, as
there are use cases where kprobes are more useful than fentry (e.g., >6
args funcs, or funcs with input arguments that are not supported by
BPF verifier, like struct-by-value). It's not clear how to best
represent this, given currently we attach kprobe through perf_event,
but we'll need to think about this for sure.

I'm fighting with the '2 trampolines concept' at the moment, but the
mass attach for kprobes seems interesting ;-) will check

2. To make mass-attach fentry/fexit useful for practical purposes, it
would be really great to have an ability to fetch traced function's
IP. I.e., if we fentry/fexit func kern_func_abc, bpf_get_func_ip()
would return the IP of that function, matching the one in
/proc/kallsyms. Right now I do very brittle hacks to do that.
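
For illustration, this is roughly what the userspace side of that
matching could look like once a program reports a function IP: look the
address up in text formatted like /proc/kallsyms ("<hex addr> <type>
<name>"). This is only a sketch; ksym_lookup is a hypothetical helper,
and it parses an in-memory string rather than the real file, which may
show zeroed addresses to unprivileged readers.

```c
#include <stdio.h>
#include <string.h>

/*
 * Find addr in kallsyms-formatted text. On a match, copy the symbol
 * name into out and return 0; return -1 if the address is not listed.
 */
int ksym_lookup(const char *text, unsigned long long addr,
		char *out, size_t out_sz)
{
	const char *line = text;

	while (line && *line) {
		unsigned long long a;
		char type;
		char name[128];

		/* Each line is "<hex addr> <type char> <symbol name> ..." */
		if (sscanf(line, "%llx %c %127s", &a, &type, name) == 3 &&
		    a == addr) {
			snprintf(out, out_sz, "%s", name);
			return 0;
		}
		/* Advance to the start of the next line, if any. */
		line = strchr(line, '\n');
		if (line)
			line++;
	}
	return -1;
}
```

A linear scan is fine for a sketch; a real tool attaching to ~1000
functions would more likely load the symbols once and sort them for
lookup.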

so I hoped that we could always store the ip in ctx-8 and have
the bpf_get_func_ip helper access that, but the BPF_PROG
macro does not pass the ctx value to the program, just the args

ctx is passed to the bpf program. You can check the BPF_PROG
macro definition.

we could perhaps somehow store the ctx in BPF_PROG before calling
the bpf program, but I did not get to try that yet
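
A userspace sketch of the two points above: a BPF_PROG-style wrapper
still hands ctx to the program body, and an IP stored at ctx-8 can then
be read as ctx[-1]. SKETCH_PROG2 is a simplified, fixed-arity stand-in
for libbpf's macro (hypothetical name), and the [ip][arg0][arg1] frame
layout is the assumption discussed in this thread, not an existing ABI.

```c
/*
 * Simplified stand-in for libbpf's BPF_PROG, fixed at two arguments.
 * As in the real macro, the outer function receives only ctx, and the
 * inner body receives ctx plus the unpacked, typed arguments.
 */
#define SKETCH_PROG2(name, t0, a0, t1, a1)                          \
static int ____##name(unsigned long long *ctx, t0 a0, t1 a1);       \
int name(unsigned long long *ctx)                                   \
{                                                                   \
	return ____##name(ctx, (t0)ctx[0], (t1)ctx[1]);             \
}                                                                   \
static int ____##name(unsigned long long *ctx, t0 a0, t1 a1)

/* Last IP seen by the sketch program, kept for inspection. */
unsigned long long last_ip;

SKETCH_PROG2(on_entry, int, fd, long, flags)
{
	/* Assumption from the thread: traced IP sits 8 bytes below ctx. */
	last_ip = ctx[-1];
	return fd + (int)flags;
}

/* Build a fake frame laid out as [ip][arg0][arg1] and run the program. */
int run_sketch(unsigned long long ip, int fd, long flags)
{
	unsigned long long frame[3] = {
		ip, (unsigned long long)fd, (unsigned long long)flags
	};

	return on_entry(&frame[1]);
}
```

The body of on_entry can use both its typed arguments and ctx, so a
helper taking ctx would have what it needs without changes to the macro.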

So all-in-all, super excited about this, but I hope all those issues
are addressed to make retsnoop possible and fast.

   [0] https://github.com/anakryiko/retsnoop/commit/8a07bc4d8c47d025f755c108f92f0583e3fda6d8

thanks for checking on this,
jirka