On Wed, Dec 09, 2020 at 11:21:43PM +0000, Alan Maguire wrote:
> Right, that's exactly it. A pair of generic tracing BPF programs are
> used, and they attach to kprobe/kretprobes, and when they run they
> use the arguments plus the map details about BTF ids of those
> arguments to run bpf_snprintf_btf(), and send perf events to
> userspace containing the results.
...
> That would be fantastic! We could do that from the context passed
> into a kprobe program as the IP in struct pt_regs points at the
> function. kretprobes seems a bit trickier as in that case the IP in
> struct pt_regs is actually set to kretprobe_trampoline rather than
> the function we're returning from due to how kretprobes work; maybe
> there's another way to get it in that case though...

Yeah. kprobe's IP doesn't match kretprobe's IP, which makes such tracing
use cases more complicated. Also kretprobe is quite slow. See
selftests/bpf/prog_tests/test_overhead and selftests/bpf/bench.

I think the key realization is that user space knows all the IPs it
will attach to. It has to know all of them, otherwise
hashmap{key=ip, value=btf_data} is not possible.
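
Something along these lines (the value layout below is made up purely
for illustration, it's not necessarily what Alan's tool stores):

/* illustration only: hashmap keyed by the traced function's IP,
 * value carrying the BTF type ids of that function's arguments
 */
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

struct btf_data {
        __u32 nr_args;
        __u32 arg_type_id[6];   /* BTF id of each argument's type */
};

struct {
        __uint(type, BPF_MAP_TYPE_HASH);
        __uint(max_entries, 16384);
        __type(key, __u64);             /* function IP */
        __type(value, struct btf_data);
} ip_to_btf SEC(".maps");
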
Obvious, right? What it means is that we can use this key observation
to build better interfaces at all layers.

kprobes are slow to set up one by one, and they are also slow to
execute. fentry/fexit is slow to set up, but fast to execute. Jiri
proposed a batching api for fentry, but it doesn't quite make sense
from an api perspective, since user space has to give a different bpf
prog for every fentry: the bpf trampoline is unique for every target
fentry kernel function. A batched attach would make sense for kprobe,
because one prog can be attached everywhere, but kprobe is slow.

This thought process justifies the addition of a new program type
where one program can attach to multiple fentry hooks. Since the
fentry ctx is no longer fixed, the verifier won't be able to track
the btf_id-s of arguments, but btf based pointer walking is fast and
powerful, so if the btf is passed into the program there could be a
helper that does a dynamic cast from a long to PTR_TO_BTF_ID. Such a
new fentry prog will have btf in the context, so there will be no
need for user space to populate a hashmap and mess with IPs. And the
best part is that batched attach will not only be desirable, but a
mandatory part of the api.

So I'm proposing to extend BPF_PROG_LOAD cmd with an array of
pairs (attach_obj_fd, attach_btf_id).
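
Roughly (the names below are hypothetical, just to show the shape of
the api; none of them exist in the uapi today):

/* hypothetical sketch: an array of these would be passed via a new
 * pointer + count pair in the BPF_PROG_LOAD part of union bpf_attr
 */
struct bpf_attach_target {
        __u32 attach_obj_fd;    /* fd of vmlinux or module BTF object */
        __u32 attach_btf_id;    /* BTF id of the target kernel function */
};
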
The fentry prog in the .c file might even have a regex in its attach pattern:
SEC("fentry/sys_*")
int BPF_PROG(test, struct btf *btf_obj, __u32 btf_id, __u64 arg1,
__u64 arg2, ...__u64 arg6)
{
struct btf_ptr ptr1 = {
.ptr = arg1,
.type_id = bpf_core_type_id_kernel(struct foo),
.btf_obj = btf_obj,
},
ptr2 = {
.ptr = arg2,
.type_id = bpf_core_type_id_kernel(struct bar),
.btf_obj = btf_obj,
};
bpf_snprintf_btf(,, &ptr1, sizeof(ptr1), );
bpf_snprintf_btf(,, &ptr1, sizeof(ptr2), );
}
libbpf will process the attach regex and find all matching functions
in the kernel and in the kernel modules. Then it will pass this list
of (fd, btf_id) pairs to the kernel. The kernel will find the IP
addresses and BTFs of all those functions and generate a single bpf
trampoline to handle all of them. Whether it's one trampoline or
multiple trampolines is an implementation detail: it could be one
trampoline that does a lookup based on IP to find the btf_obj, btf_id
to pass into the program, or multiple trampolines that share most of
the code, with N unique trampoline prefixes that have btf_obj, btf_id
hardcoded. The argument save/restore code can be the same for all
fentries. The same way we can support a single fexit prog attaching
to multiple kernel functions, and even a single fmod_ret prog
attaching to multiple. The batching part will make attaching to
thousands of functions efficient. We can use batched text_poke_bp,
etc.
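
The libbpf side of the pattern matching can already be done with
existing APIs. A rough sketch (glob matching via fnmatch() instead of
a real regex, and only vmlinux BTF shown; module BTFs would be walked
the same way):

#include <fnmatch.h>
#include <stdio.h>
#include <bpf/btf.h>

static int collect_attach_ids(const char *pattern)
{
        struct btf *vmlinux_btf = btf__parse("/sys/kernel/btf/vmlinux", NULL);
        __u32 id, nr;

        if (!vmlinux_btf)
                return -1;

        nr = btf__get_nr_types(vmlinux_btf);
        for (id = 1; id <= nr; id++) {
                const struct btf_type *t = btf__type_by_id(vmlinux_btf, id);
                const char *name;

                if (!btf_is_func(t))
                        continue;
                name = btf__name_by_offset(vmlinux_btf, t->name_off);
                if (fnmatch(pattern, name, 0))
                        continue;
                /* this btf_id would go into the (fd, btf_id) array
                 * handed to the kernel at prog load time
                 */
                printf("match: %s btf_id %u\n", name, id);
        }
        btf__free(vmlinux_btf);
        return 0;
}
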
As for the dynamic btf casting helper, we could do something like this:
SEC("fentry/sys_*")
int BPF_PROG(test, struct btf *btf_obj, __u32 btf_id, __u64 arg1, __u64
arg2, ...__u64 arg6)
{
struct sk_buff *skb;
struct task_struct *task;
skb = bpf_dynamic_cast(btf_obj, btf_id, 1, arg1,
bpf_core_type_id_kernel(skb));
task = bpf_dynamic_cast(btf_obj, btf_id, 2, arg2,
bpf_core_type_id_kernel(task));
skb->len + task->status;
}
The dynamic part of the helper will walk the btf of the func_proto
pointed to by the 'btf_id' argument. It will find the Nth argument,
and if that argument's btf_id matches the last u32 passed into
bpf_dynamic_cast(), it will return a ptr_to_btf_id. The verifier
needs the 5th u32 arg to be a known constant, so it sees the btf_id
during verification.
The execution time of this casting helper will be pretty fast.
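
On the kernel side that walk could be as simple as the sketch below
(bpf_dynamic_cast() and the surrounding plumbing don't exist; only the
btf_type_by_id()/btf_params() style accessors are existing kernel
helpers):

#include <linux/btf.h>

/* does the Nth argument of the function identified by func_btf_id
 * have the type id the program expects?
 */
static bool arg_matches_type(const struct btf *btf, u32 func_btf_id,
                             u32 arg_no, u32 expected_type_id)
{
        const struct btf_type *func, *proto;
        const struct btf_param *args;

        func = btf_type_by_id(btf, func_btf_id);        /* BTF_KIND_FUNC */
        if (!func)
                return false;
        proto = btf_type_by_id(btf, func->type);        /* its FUNC_PROTO */
        if (!proto || !btf_type_is_func_proto(proto))
                return false;
        if (!arg_no || arg_no > btf_type_vlen(proto))
                return false;
        args = btf_params(proto);
        /* whether to compare against the pointer type or the pointee's
         * struct type id is glossed over here
         */
        return args[arg_no - 1].type == expected_type_id;
}
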
Thoughts?