On Mon, Nov 22, 2021 at 7:44 PM Andrii Nakryiko <andrii.nakryiko@xxxxxxxxx> wrote:
>
> On Mon, Nov 22, 2021 at 7:15 PM Alexei Starovoitov
> <alexei.starovoitov@xxxxxxxxx> wrote:
> >
> > On Mon, Nov 22, 2021 at 5:44 PM Andrii Nakryiko
> > <andrii.nakryiko@xxxxxxxxx> wrote:
> > >
> > > On Mon, Nov 22, 2021 at 4:43 PM Alexei Starovoitov
> > > <alexei.starovoitov@xxxxxxxxx> wrote:
> > > >
> > > > On Mon, Nov 22, 2021 at 3:47 PM Andrii Nakryiko
> > > > <andrii.nakryiko@xxxxxxxxx> wrote:
> > > > >
> > > > > On Fri, Nov 19, 2021 at 7:33 PM Alexei Starovoitov
> > > > > <alexei.starovoitov@xxxxxxxxx> wrote:
> > > > > >
> > > > > > From: Alexei Starovoitov <ast@xxxxxxxxxx>
> > > > > >
> > > > > > Given a BPF program's BTF, perform a linear search through kernel BTFs
> > > > > > for a possible candidate.
> > > > > > Then wire the result into bpf_core_apply_relo_insn().
> > > > > >
> > > > > > Signed-off-by: Alexei Starovoitov <ast@xxxxxxxxxx>
> > > > > > ---
> > > > > >  kernel/bpf/btf.c | 136 ++++++++++++++++++++++++++++++++++++++++++++++-
> > > > > >  1 file changed, 135 insertions(+), 1 deletion(-)
> > > > >
> > > > > [...]
> > > > >
> > > > > >  int bpf_core_apply(struct bpf_core_ctx *ctx, const struct bpf_core_relo *relo,
> > > > > >                     int relo_idx, void *insn)
> > > > > >  {
> > > > > > -        return -EOPNOTSUPP;
> > > > > > +        if (relo->kind != BPF_CORE_TYPE_ID_LOCAL) {
> > > > > > +                struct bpf_core_cand_list *cands;
> > > > > > +
> > > > > > +                cands = bpf_core_find_cands(ctx, relo->type_id);
> > > > >
> > > > > this is wrong for many reasons:
> > > > >
> > > > > 1. you will overwrite the previous ctx->cands if it was already set,
> > > > > which leaks memory
> > > > > 2. this list of candidates should be keyed by relo->type_id ("root
> > > > > type"). Different root types get their own independent lists, so it
> > > > > has to be some sort of lookup table from type_id to a list of
> > > > > candidates.
> > > > >
> > > > > 2) means that if you had a bunch of relos against struct task_struct,
> > > > > you'll create a list of candidates when processing the first relo that
> > > > > starts at task_struct. All the subsequent relos that have task_struct
> > > > > as root type will reuse that list and potentially trim it down. If
> > > > > there are some other relos against, say, struct mm_struct, they will
> > > > > have their own independent list of candidates.
> > > >
> > > > right.
> > > > Your prior comment confused me. I didn't do this reuse of cands,
> > > > to avoid introducing a hashtable here like libbpf does,
> > > > since it does too little to actually help.
> > >
> > > Since when is avoiding millions of iterations for each relocation "too
> > > little help"?
> >
> > because it is a premature optimization for a case that
> > may or may not be relevant.
> > If 180 sk_buff relocations somehow make the loading too slow,
> > 180 relocations of 180 different types would produce exactly
> > the same slowdown, and a hashtable cache won't help.
>
> The likelihood of using 180 different root types in a real application is
> very small. Using 180 relocations (either with explicit BPF_CORE_READ,
> field accesses in fentry, or just through always_inline helpers doing
> either and being inlined in multiple places) is very real in
> real-world non-trivial applications. And the cost of optimizing that
> in the kernel later is very high: you'll be waiting for a new kernel
> release to roll out everywhere before you can rely on this optimization.
> The cost of
> further optimizing this in libbpf is much smaller (and libbpf still
> did the optimization from the get-go, and I stand by that decision).
>
> If you think I'm making this up, we have one security-related BPF app
> with 1076 CO-RE relocations across 11 BPF programs. It's using 22
> different root types.

I suspect you're talking about the selftests/bpf/profiler* tests.

bpftool prog load -L profile1.o
[ 873.449749] nr_core_relo 106
[ 873.614186] nr_core_relo 297
[ 874.107470] nr_core_relo 19
[ 874.144146] nr_core_relo 102
[ 874.306462] nr_core_relo 258
[ 874.692219] nr_core_relo 410
[ 875.329652] nr_core_relo 238
[ 875.689900] nr_core_relo 9

8 different progs with a bunch of core relos.
On a debug kernel with lockdep and KASAN it takes 2.3 seconds to load,
with the kernel's bpf_core_add_cands() doing that loop more than a
thousand times. libbpf takes 1.7 seconds.
So it's an extra 0.5 second due to the loop.

I haven't found the bug in the lskel with core_kern.c + balancer_ingress yet.
But just doing balancer_ingress (test_verif_scale2) as an lskel I get
0m0.827s, while verif_scale2 is 6 seconds!
That turned out to be due to attr.prog_flags = BPF_F_TEST_RND_HI32;
without it, it's 0m0.574s.
So 0.25 sec is spent in the add_cands loop.

> > > BTW, I've tried to measure how noticeable this will be and added
> > > test_verif_scale2 to core_kern with the only change being to use
> > > vmlinux.h (so that __sk_buff accesses are CO-RE relocated). I didn't
> > > manage to get it loaded, so something else is going on there. So
> > > please take a look, I don't really know how to debug lskel stuff.
> > > Here are the changes:
> >
> > Looking. Thanks for the test.
> >
> > > > If we actually need to optimize this part of loading,
> > > > we'd better do a persistent cache of
> > > > name -> kernel btf_type-s
> > > > and reuse it across different programs.
> > >
> > > You can't reuse it, because the same type name in a BPF object's BTF can
> > > be resolved to different kernel types (e.g., if we still had the
> > > ring_buffer name collision),
> >
> > well, then the candidate list will have two kernel types with the same name,
> > both kept in a cache.
> > So cache_lookup("ring_buffer") will return two kernel types.
> > That would be the case for all progs being loaded.
> > What am I missing?
>
> If there are two matching types with the same matching field but the field
> offsets are different (and thus there is ambiguity), that's an error.
> So a correct (by the current definition, at least) program has to resolve
> to one, and only one, of such two incompatible ring_buffer types. If
> there are multiple duplicates, though (like task_struct and
> task_struct___2), they will have identical sets of fields at the same
> offsets, so both will remain possible candidates and that's not an
> error. But for two actually different types, there can be only one
> winner.

I'm not proposing to cache the refined bpf_core_cand_list after
bpf_core_apply_relo_insn() has done its job.
I'm saying the kernel can cache the result of the add_cands loop
across vmlinux BTFs for all progs.
bpf_core_apply_relo_insn() will be called with a copy of the
bpf_core_cand_list from the cache.
I'm thinking to keep this cache in 'struct btf'.
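
To make that last idea a bit more concrete, here is a rough, untested
sketch of the shape it could take (the struct, field, and helper names
below are made up purely for illustration and are not actual btf.c code):

/* Hypothetical: a per-'struct btf' cache of add_cands results, keyed by
 * the root type's name and shared by all programs loaded against that BTF.
 */
struct btf_cand_cache_entry {
	const char *name;			/* e.g. "task_struct" */
	struct bpf_core_cand_list cands;	/* all kernel types with that name */
};

struct btf_cand_cache {
	struct btf_cand_cache_entry *ents;
	u32 cnt;
};

static struct bpf_core_cand_list *
btf_cand_cache_get(struct bpf_core_ctx *ctx, const char *name)
{
	struct btf_cand_cache *cc = ctx->btf->cand_cache;	/* made-up field */
	u32 i;

	for (i = 0; i < cc->cnt; i++)
		if (!strcmp(cc->ents[i].name, name))
			return &cc->ents[i].cands;

	/* cache miss: run the linear add_cands scan once, remember the result */
	return btf_cand_cache_add(cc, ctx, name);		/* made-up helper */
}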
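
bpf_core_apply() could then keep its own per-program lists keyed by the
relocation's root type_id, as Andrii describes above, and hand
bpf_core_apply_relo_insn() a private copy, so the per-relo trimming never
touches the shared cache and an earlier list is neither leaked nor
overwritten. Again only an illustrative sketch: the cand_cnt/cands fields
on bpf_core_ctx and the copy_cand_list()/local_type_name() helpers are
invented for this example.

struct prog_cands {
	u32 local_type_id;			/* root type in prog BTF */
	struct bpf_core_cand_list *cands;	/* private, trimmable copy */
};

static struct bpf_core_cand_list *
prog_cands_get(struct bpf_core_ctx *ctx, u32 local_type_id)
{
	struct prog_cands *pc;
	u32 i;

	/* all relos rooted at the same type (e.g. task_struct) reuse one list */
	for (i = 0; i < ctx->cand_cnt; i++)
		if (ctx->cands[i].local_type_id == local_type_id)
			return ctx->cands[i].cands;

	/* first relo rooted at this type: copy from the shared cache
	 * (allocation and error handling omitted in this sketch)
	 */
	pc = &ctx->cands[ctx->cand_cnt++];
	pc->local_type_id = local_type_id;
	pc->cands = copy_cand_list(btf_cand_cache_get(ctx,
				local_type_name(ctx, local_type_id)));
	return pc->cands;
}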