On Thu, May 04, 2023 at 11:50:07AM -0700, Andrii Nakryiko wrote:
> On Thu, May 4, 2023 at 10:52 AM Arnaldo Carvalho de Melo <acme@xxxxxxxxxx> wrote:
> > Andrii, can you add some more information about the usage of vmlinux.h
> > instead of using kernel headers?
>
> I'll just say that vmlinux.h is not a hard requirement to build BPF
> programs, it's more of a convenience allowing easy access to definitions
> of both UAPI and kernel-internal structures for tracing needs, and
> marking them relocatable using BPF CO-RE machinery. Lots of real-world
> applications just check in a pregenerated vmlinux.h to avoid a
> build-time dependency on an up-to-date host kernel and such.
>
> If vmlinux.h generation and usage is causing issues, though, given
> that perf's BPF programs don't seem to be using many different kernel
> types, it might be a better option to just use UAPI headers for public
> kernel type definitions, and just define minimal CO-RE-relocatable
> definitions locally in perf's BPF code for the other types necessary.
> E.g., if perf needs only pid and tgid from task_struct, this would
> suffice:
>
> struct task_struct {
>         int pid;
>         int tgid;
> } __attribute__((preserve_access_index));

Yeah, that seems like a way better approach: no vmlinux involved, libbpf
CO-RE notices that task_struct changed from this two-integer version (of
course) and relocates the accesses to where the fields are in the
running kernel by using /sys/kernel/btf/vmlinux.

I looked, and the generation of vmlinux.h was introduced in:

  commit 944138f048f7d7591ec7568c94b21de8df2724d4
  Author: Namhyung Kim <namhyung@xxxxxxxxxx>
  Date:   Thu Jul 1 14:12:27 2021 -0700

    perf stat: Enable BPF counter with --for-each-cgroup

    Recently bperf was added to use BPF to count perf events for various
    purposes. This is an extension of that approach, targeting cgroup
    usage.

    Unlike the other bperf, it doesn't share the events with other
    processes, but it reduces unnecessary events (and the overhead of
    multiplexing) for each monitored cgroup within the perf session.

    When --for-each-cgroup is used with --bpf-counters, it will open a
    cgroup-switches event per cpu internally and attach the new BPF
    program to read the given perf_events and aggregate the results for
    cgroups. It's only called when a task is switched to a task in a
    different cgroup.

    Signed-off-by: Namhyung Kim <namhyung@xxxxxxxxxx>
    Acked-by: Song Liu <songliubraving@xxxxxx>
    Cc: Andi Kleen <ak@xxxxxxxxxxxxxxx>
    Cc: Ian Rogers <irogers@xxxxxxxxxx>
    Cc: Jiri Olsa <jolsa@xxxxxxxxxx>
    Cc: Peter Zijlstra <peterz@xxxxxxxxxxxxx>
    Cc: Stephane Eranian <eranian@xxxxxxxxxx>
    Link: http://lore.kernel.org/lkml/20210701211227.1403788-1-namhyung@xxxxxxxxxx
    Signed-off-by: Arnaldo Carvalho de Melo <acme@xxxxxxxxxx>

Which I think was the first BPF skel to access a kernel data structure,
yeah:

  tools/perf/util/bpf_skel/bperf_cgroup.bpf.c

For things like:

+static inline int get_cgroup_v1_idx(__u32 *cgrps, int size)
+{
+	struct task_struct *p = (void *)bpf_get_current_task();
+	struct cgroup *cgrp;
+	register int i = 0;
+	__u32 *elem;
+	int level;
+	int cnt;
+
+	cgrp = BPF_CORE_READ(p, cgroups, subsys[perf_event_cgrp_id], cgroup);
+	level = BPF_CORE_READ(cgrp, level);

So we can completely remove touching vmlinux from the perf build
process.

If we can get the revert of the patches making BPF skels build by
default into v6.4, then we would do this work, test it thoroughly and
have it available for v6.5.

Linus, would that be a way forward?
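To make that concrete, here is a rough, untested sketch of what a
vmlinux.h-free perf skel could look like; the program name and the
attach point below are made up for the example, not taken from any
existing skel:

#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_core_read.h>

/* Local minimal definition instead of vmlinux.h: only the two fields
 * we need, marked relocatable for BPF CO-RE. */
struct task_struct {
	int pid;
	int tgid;
} __attribute__((preserve_access_index));

SEC("tracepoint/sched/sched_switch")	/* made-up attach point */
int on_switch(void *ctx)
{
	struct task_struct *t = (void *)bpf_get_current_task();

	/* Field offsets get relocated at load time against the running
	 * kernel's BTF in /sys/kernel/btf/vmlinux. */
	int pid  = BPF_CORE_READ(t, pid);
	int tgid = BPF_CORE_READ(t, tgid);

	bpf_printk("pid=%d tgid=%d", pid, tgid);
	return 0;
}

char LICENSE[] SEC("license") = "GPL";

The two-field definition is all the skel would need to carry; libbpf
rewrites the offsets at load time, so no vmlinux or vmlinux.h is
involved anywhere in the build.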
- Arnaldo

For reference, here is the definition of BPF_CORE_READ(), from
tools/lib/bpf/bpf_core_read.h:

/*
 * BPF_CORE_READ() is used to simplify BPF CO-RE relocatable read, especially
 * when there are few pointer chasing steps.
 * E.g., what in non-BPF world (or in BPF w/ BCC) would be something like:
 *	int x = s->a.b.c->d.e->f->g;
 * can be succinctly achieved using BPF_CORE_READ as:
 *	int x = BPF_CORE_READ(s, a.b.c, d.e, f, g);
 *
 * BPF_CORE_READ will decompose above statement into 4 bpf_core_read (BPF
 * CO-RE relocatable bpf_probe_read_kernel() wrapper) calls, logically
 * equivalent to:
 * 1. const void *__t = s->a.b.c;
 * 2. __t = __t->d.e;
 * 3. __t = __t->f;
 * 4. return __t->g;
 *
 * Equivalence is logical, because there is a heavy type casting/preservation
 * involved, as well as all the reads are happening through
 * bpf_probe_read_kernel() calls using __builtin_preserve_access_index() to
 * emit CO-RE relocations.
 *
 * N.B. Only up to 9 "field accessors" are supported, which should be more
 * than enough for any practical purpose.
 */
#define BPF_CORE_READ(src, a, ...) ({					    \
	___type((src), a, ##__VA_ARGS__) __r;				    \
	BPF_CORE_READ_INTO(&__r, (src), a, ##__VA_ARGS__);		    \
	__r;								    \
})
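And a usage sketch (again untested, with a made-up attach point) showing
the pointer chasing the comment above describes, on top of a minimal
local definition:

#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_core_read.h>

/* Minimal relocatable definition: only the fields we actually chase. */
struct task_struct {
	int pid;
	struct task_struct *real_parent;
} __attribute__((preserve_access_index));

SEC("tracepoint/sched/sched_switch")	/* made-up attach point */
int log_ppid(void *ctx)
{
	struct task_struct *t = (void *)bpf_get_current_task();

	/* Decomposes into two CO-RE relocated bpf_probe_read_kernel()
	 * calls: first t->real_parent, then ->pid on the result. */
	int ppid = BPF_CORE_READ(t, real_parent, pid);

	bpf_printk("ppid=%d", ppid);
	return 0;
}

char LICENSE[] SEC("license") = "GPL";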