On Mon, Nov 22, 2021 at 5:44 PM Andrii Nakryiko <andrii.nakryiko@xxxxxxxxx> wrote:
>
> On Mon, Nov 22, 2021 at 4:43 PM Alexei Starovoitov
> <alexei.starovoitov@xxxxxxxxx> wrote:
> >
> > On Mon, Nov 22, 2021 at 3:47 PM Andrii Nakryiko
> > <andrii.nakryiko@xxxxxxxxx> wrote:
> > >
> > > On Fri, Nov 19, 2021 at 7:33 PM Alexei Starovoitov
> > > <alexei.starovoitov@xxxxxxxxx> wrote:
> > > >
> > > > From: Alexei Starovoitov <ast@xxxxxxxxxx>
> > > >
> > > > Given the BPF program's BTF, perform a linear search through kernel BTFs for
> > > > a possible candidate.
> > > > Then wire the result into bpf_core_apply_relo_insn().
> > > >
> > > > Signed-off-by: Alexei Starovoitov <ast@xxxxxxxxxx>
> > > > ---
> > > >  kernel/bpf/btf.c | 136 ++++++++++++++++++++++++++++++++++++++++++++++-
> > > >  1 file changed, 135 insertions(+), 1 deletion(-)
> > > >
> > >
> > > [...]
> > >
> > > >  int bpf_core_apply(struct bpf_core_ctx *ctx, const struct bpf_core_relo *relo,
> > > >  		   int relo_idx, void *insn)
> > > >  {
> > > > -	return -EOPNOTSUPP;
> > > > +	if (relo->kind != BPF_CORE_TYPE_ID_LOCAL) {
> > > > +		struct bpf_core_cand_list *cands;
> > > > +
> > > > +		cands = bpf_core_find_cands(ctx, relo->type_id);
> > >
> > > this is wrong for many reasons:
> > >
> > > 1. you will overwrite the previous ctx->cands if it was already set,
> > > which leaks memory
> > > 2. this list of candidates should be keyed by relo->type_id ("root
> > > type"). Different root types get their own independent lists, so it
> > > has to be some sort of lookup table from type_id to a list of
> > > candidates.
> > >
> > > 2) means that if you had a bunch of relos against struct task_struct,
> > > you'll create a list of candidates when processing the first relo that
> > > starts at task_struct. All the subsequent relos that have task_struct
> > > as root type will reuse that list and potentially trim it down. If
> > > there are some other relos against, say, struct mm_struct, they will
> > > have their own independent list of candidates.
> >
> > right.
> > Your prior comment confused me. I didn't do this reuse of cands
> > to avoid introducing a hashtable here like libbpf does,
> > since it does too little to actually help.
>
> Since when is avoiding millions of iterations for each relocation "too
> little help"?
>
> BTW, I've tried to measure how noticeable this will be and added
> test_verif_scale2 to core_kern with the only change being to use vmlinux.h (so
> that __sk_buff accesses are CO-RE relocated). I didn't manage to get
> it loaded, so something else is going on there. So please take a look, I
> don't really know how to debug lskel stuff.
> Here are the changes:
>
> diff --git a/tools/testing/selftests/bpf/progs/core_kern.c
> b/tools/testing/selftests/bpf/progs/core_kern.c
> index 3b4571d68369..9916cf059883 100644
> --- a/tools/testing/selftests/bpf/progs/core_kern.c
> +++ b/tools/testing/selftests/bpf/progs/core_kern.c
> @@ -4,6 +4,10 @@
>
>  #include <bpf/bpf_helpers.h>
>  #include <bpf/bpf_tracing.h>
> +#include <bpf/bpf_core_read.h>
> +
> +#define ATTR __always_inline
> +#include "test_jhash.h"
>
>  struct {
>  	__uint(type, BPF_MAP_TYPE_ARRAY);
> @@ -57,4 +61,27 @@ int BPF_PROG(fexit_eth_type_trans, struct sk_buff *skb,
>  	return randmap(dev->ifindex + skb->len);
>  }
>
> +SEC("tc")
> +int balancer_ingress(struct __sk_buff *ctx)
> +{
> +	void *data_end = (void *)(long)ctx->data_end;
> +	void *data = (void *)(long)ctx->data;
> +	void *ptr;
> +	int ret = 0, nh_off, i = 0;
> +
> +	nh_off = 14;
> +
> +	/* pragma unroll doesn't work on large loops */
> +
> +#define C do { \
> +	ptr = data + i; \
> +	if (ptr + nh_off > data_end) \
> +		break; \
> +	ctx->tc_index = jhash(ptr, nh_off, ctx->cb[0] + i++); \
> +	} while (0);
> +#define C30 C;C;C;C;C;C;C;C;C;C;C;C;C;C;C;C;C;C;C;C;C;C;C;C;C;C;C;C;C;C;
> +	C30;C30;C30; /* 90 calls */
> +	return 0;
> +}
> +
>  char LICENSE[] SEC("license") = "GPL";
> diff --git a/tools/testing/selftests/bpf/progs/test_verif_scale2.c
> b/tools/testing/selftests/bpf/progs/test_verif_scale2.c
> index f024154c7be7..9e2c2a6954cb 100644
> --- a/tools/testing/selftests/bpf/progs/test_verif_scale2.c
> +++ b/tools/testing/selftests/bpf/progs/test_verif_scale2.c
> @@ -1,6 +1,6 @@
>  // SPDX-License-Identifier: GPL-2.0
>  // Copyright (c) 2019 Facebook
> -#include <linux/bpf.h>
> +#include "vmlinux.h"
>  #include <bpf/bpf_helpers.h>
>  #define ATTR __always_inline
>  #include "test_jhash.h"
>
>
> The version with libbpf does load successfully, while the lskel one gives:
>
> #35 core_kern_lskel:FAIL
> test_core_kern_lskel:FAIL:open_and_load unexpected error: -22
>
> The same program with the libbpf skeleton loaded successfully.
>
> If you manage to get it working, you'll have a BPF program with 181 CO-RE
> relocations, let's see how noticeable it will be to do 181 iterations
> over ~2mln vmlinux BTF types.

Just realized that I was off by one order of magnitude, it's about 100K
types, not 2 million. But the point stands, it is quite a lot.

>
> > I think I will go back to the prior version: linear search for every relo.
>
> I don't understand the resistance, the kernel has both rbtrees and
> hashmaps, what's the problem with just using those?
>
> > If we actually need to optimize this part of loading
> > we'd better do a persistent cache of
> > name -> kernel btf_type-s
> > and reuse it across different programs.
>
> You can't reuse it because the same type name in a BPF object's BTF can
> be resolved to different kernel types (e.g., if we still had the
> ring_buffer name collision), so this is all per BPF object (or at
> least per BPF program). We still have duplicated types in some
> fuller kernel configurations, and it can always happen down the road.
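
To make it concrete, this is roughly the kind of lookup table I have in
mind (an untested sketch only: names like cand_cache_entry and
cand_cache_lookup_or_find are made up, locking and freeing on program
teardown are omitted, the table would hang off bpf_core_ctx or the
bpf_prog rather than being a global, and I'm assuming here that
bpf_core_find_cands() returns an ERR_PTR on failure):

#include <linux/err.h>
#include <linux/hashtable.h>
#include <linux/slab.h>

/* one entry per "root type" of a CO-RE relocation */
struct cand_cache_entry {
	u32 local_type_id;		/* relo->type_id */
	struct bpf_core_cand_list *cands;
	struct hlist_node node;
};

/* sketch only: in reality this would live in bpf_core_ctx, not be global */
DEFINE_HASHTABLE(cand_cache, 4);

static struct bpf_core_cand_list *
cand_cache_lookup_or_find(struct bpf_core_ctx *ctx, u32 type_id)
{
	struct cand_cache_entry *e;

	/* all relos rooted at the same local type share one candidate list */
	hash_for_each_possible(cand_cache, e, node, type_id)
		if (e->local_type_id == type_id)
			return e->cands;

	e = kzalloc(sizeof(*e), GFP_KERNEL);
	if (!e)
		return ERR_PTR(-ENOMEM);

	/* only the first relo for this root type pays for the linear BTF scan */
	e->cands = bpf_core_find_cands(ctx, type_id);
	if (IS_ERR(e->cands)) {
		struct bpf_core_cand_list *err = e->cands;

		kfree(e);
		return err;
	}

	e->local_type_id = type_id;
	hash_add(cand_cache, &e->node, type_id);
	return e->cands;
}

With something along these lines each root type does the linear scan over
vmlinux BTF once, and every other relocation against the same root type
becomes a hash lookup.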