On Thu, Jul 21, 2022 at 3:30 PM Yonghong Song <yhs@xxxxxx> wrote:
>
> On 7/20/22 3:33 PM, Stanislav Fomichev wrote:
> > On Wed, Jul 20, 2022 at 1:53 PM Martin KaFai Lau <kafai@xxxxxx> wrote:
> >>
> >> On Wed, Jul 20, 2022 at 11:04:59AM -0700, sdf@xxxxxxxxxx wrote:
> >>> On 07/19, sdf@xxxxxxxxxx wrote:
> >>>> On 07/19, Alexei Starovoitov wrote:
> >>>>> On Tue, Jul 19, 2022 at 2:41 PM Stanislav Fomichev <sdf@xxxxxxxxxx> wrote:
> >>>>>>
> >>>>>> On Tue, Jul 19, 2022 at 1:33 PM Stanislav Fomichev <sdf@xxxxxxxxxx> wrote:
> >>>>>>>
> >>>>>>> On Tue, Jul 19, 2022 at 1:21 PM Alexei Starovoitov
> >>>>>>> <alexei.starovoitov@xxxxxxxxx> wrote:
> >>>>>>>>
> >>>>>>>> On Mon, Jul 18, 2022 at 12:07 PM Stanislav Fomichev <sdf@xxxxxxxxxx> wrote:
> >>>>>>>>>
> >>>>>>>>> Motivation:
> >>>>>>>>>
> >>>>>>>>> Our bpf programs have a bunch of options which are set at the loading
> >>>>>>>>> time. After loading, they don't change. We currently use array map
> >>>>>>>>> to store them and bpf program does the following:
> >>>>>>>>>
> >>>>>>>>> val = bpf_map_lookup_elem(&config_map, &key);
> >>>>>>>>> if (likely(val && *val)) {
> >>>>>>>>>         // do some optional feature
> >>>>>>>>> }
> >>>>>>>>>
> >>>>>>>>> Since the configuration is static and we have a lot of those features,
> >>>>>>>>> I feel like we're wasting precious cycles doing dynamic lookups
> >>>>>>>>> (and stalling on memory loads).
> >>>>>>>>>
> >>>>>>>>> I was assuming that converting those to some fake kconfig options
> >>>>>>>>> would solve it, but it still seems like kconfig is stored in the
> >>>>>>>>> global map and kconfig entries are resolved dynamically.
> >>>>>>>>>
> >>>>>>>>> Proposal:
> >>>>>>>>>
> >>>>>>>>> Resolve kconfig options statically upon loading. Basically rewrite
> >>>>>>>>> ld+ldx to two nops and 'mov val, x'.
> >>>>>>>>>
> >>>>>>>>> I'm also trying to rewrite conditional jump when the condition is
> >>>>>>>>> !imm. This seems to be catching all the cases in my program, but
> >>>>>>>>> it's probably too hacky.
> >>>>>>>>>
> >>>>>>>>> I've attached very raw RFC patch to demonstrate the idea. Anything
> >>>>>>>>> I'm missing? Any potential problems with this approach?
> >>>>>>>>
> >>>>>>>> Have you considered using global variables for that?
> >>>>>>>> With skeleton the user space has a natural way to set
> >>>>>>>> all of these knobs after doing skel_open and before skel_load.
> >>>>>>>> Then the verifier sees them as readonly vars and
> >>>>>>>> automatically converts LDX into fixed constants and if the code
> >>>>>>>> looks like if (my_config_var) then the verifier will remove
> >>>>>>>> all the dead code too.
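
As a reference for the skeleton flow described here: the knobs become
plain members of the generated skeleton's rodata struct, so the
user-space side is roughly the following (made-up object/knob names):

#include "my_prog.skel.h"

int setup(void)
{
        struct my_prog *skel;

        skel = my_prog__open();
        if (!skel)
                return -1;

        /* set the knobs after open and before load */
        skel->rodata->my_config_var = 1;

        if (my_prog__load(skel)) {
                my_prog__destroy(skel);
                return -1;
        }
        return 0;
}

libbpf freezes .rodata at load time, which is what lets the verifier
treat these vars as constants.
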
> >>>>>>>
> >>>>>>> Hm, that's a good alternative, let me try it out. Thanks!
> >>>>>>
> >>>>>> Turns out we already freeze kconfig map in libbpf:
> >>>>>> if (map_type == LIBBPF_MAP_RODATA || map_type == LIBBPF_MAP_KCONFIG) {
> >>>>>>         err = bpf_map_freeze(map->fd);
> >>>>>>
> >>>>>> And I've verified that I do hit bpf_map_direct_read in the verifier.
> >>>>>>
> >>>>>> But the code still stays the same (bpftool dump xlated):
> >>>>>>   72: (18) r1 = map[id:24][0]+20
> >>>>>>   74: (61) r1 = *(u32 *)(r1 +0)
> >>>>>>   75: (bf) r2 = r9
> >>>>>>   76: (b7) r0 = 0
> >>>>>>   77: (15) if r1 == 0x0 goto pc+9
> >>>>>>
> >>>>>> I guess there is nothing for sanitize_dead_code to do because my
> >>>>>> conditional is "if (likely(some_condition)) { do something }" and the
> >>>>>> branch instruction itself is '.seen' by the verifier.
> >>>>>
> >>>>> I bet your variable is not 'const'.
> >>>>> Please see any of the progs in selftests that do:
> >>>>> const volatile int var = 123;
> >>>>> to express configs.
> >>>>
> >>>> Yeah, I was testing against the following:
> >>>>
> >>>> extern int CONFIG_XYZ __kconfig __weak;
> >>>>
> >>>> But ended up writing this small reproducer:
> >>>>
> >>>> struct __sk_buff;
> >>>>
> >>>> const volatile int CONFIG_DROP = 1; // volatile so it's not
> >>>>                                     // clang-optimized
> >>>>
> >>>> __attribute__((section("tc"), used))
> >>>> int my_config(struct __sk_buff *skb)
> >>>> {
> >>>>         int ret = 0; /*TC_ACT_OK*/
> >>>>
> >>>>         if (CONFIG_DROP)
> >>>>                 ret = 2 /*TC_ACT_SHOT*/;
> >>>>
> >>>>         return ret;
> >>>> }
> >>>>
> >>>> $ bpftool map dump name my_confi.rodata
> >>>>
> >>>> [{
> >>>>         "value": {
> >>>>             ".rodata": [{
> >>>>                     "CONFIG_DROP": 1
> >>>>                 }
> >>>>             ]
> >>>>         }
> >>>>     }
> >>>> ]
> >>>>
> >>>> $ bpftool prog dump xlated name my_config
> >>>>
> >>>> int my_config(struct __sk_buff * skb):
> >>>> ; if (CONFIG_DROP)
> >>>>    0: (18) r1 = map[id:3][0]+0
> >>>>    2: (61) r1 = *(u32 *)(r1 +0)
> >>>>    3: (b4) w0 = 1
> >>>> ; if (CONFIG_DROP)
> >>>>    4: (64) w0 <<= 1
> >>>> ; return ret;
> >>>>    5: (95) exit
> >>>>
> >>>> The branch is gone, but the map lookup is still there :-(
> >>>
> >>> Attached another RFC below which is doing the same but from the verifier
> >>> side. It seems we should be able to resolve LD+LDX if their dst_reg
> >>> is the same? If they are different, we should be able to pre-lookup
> >>> LDX value at least. Would something like this work (haven't run full
> >>> verifier/test_progs yet)?
> >>>
> >>> (note, in this case, with kconfig, I still see the branch)
> >>>
> >>> test_fold_ro_ldx:PASS:open 0 nsec
> >>> test_fold_ro_ldx:PASS:load 0 nsec
> >>> test_fold_ro_ldx:PASS:bpf_obj_get_info_by_fd 0 nsec
> >>> int fold_ro_ldx(struct __sk_buff * skb):
> >>> ; if (CONFIG_DROP)
> >>>    0: (b7) r1 = 1
> >>>    1: (b4) w0 = 1
> >>> ; if (CONFIG_DROP)
> >>>    2: (16) if w1 == 0x0 goto pc+1
> >>>    3: (b4) w0 = 2
> >>> ; return ret;
> >>>    4: (95) exit
> >>> test_fold_ro_ldx:PASS:found BPF_LD 0 nsec
> >>> test_fold_ro_ldx:PASS:found BPF_LDX 0 nsec
> >>> test_fold_ro_ldx:PASS:found BPF_LD 0 nsec
> >>> test_fold_ro_ldx:PASS:found BPF_LDX 0 nsec
> >>> test_fold_ro_ldx:PASS:found BPF_LD 0 nsec
> >>> test_fold_ro_ldx:PASS:found BPF_LDX 0 nsec
> >>> test_fold_ro_ldx:PASS:found BPF_LD 0 nsec
> >>> test_fold_ro_ldx:PASS:found BPF_LDX 0 nsec
> >>> test_fold_ro_ldx:PASS:found BPF_LD 0 nsec
> >>> test_fold_ro_ldx:PASS:found BPF_LDX 0 nsec
> >>> #66 fold_ro_ldx:OK
> >>>
> >>> Signed-off-by: Stanislav Fomichev <sdf@xxxxxxxxxx>
> >>> ---
> >>>  kernel/bpf/verifier.c                         | 74 ++++++++++++++++++-
> >>>  .../selftests/bpf/prog_tests/fold_ro_ldx.c   | 52 +++++++++++++
> >>>  .../testing/selftests/bpf/progs/fold_ro_ldx.c | 20 +++++
> >>>  3 files changed, 144 insertions(+), 2 deletions(-)
> >>>  create mode 100644 tools/testing/selftests/bpf/prog_tests/fold_ro_ldx.c
> >>>  create mode 100644 tools/testing/selftests/bpf/progs/fold_ro_ldx.c
> >>>
> >>> diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
> >>> index c59c3df0fea6..ffedd8234288 100644
> >>> --- a/kernel/bpf/verifier.c
> >>> +++ b/kernel/bpf/verifier.c
> >>> @@ -12695,6 +12695,69 @@ static bool bpf_map_is_cgroup_storage(struct bpf_map *map)
> >>>             map->map_type == BPF_MAP_TYPE_PERCPU_CGROUP_STORAGE);
> >>>  }
> >>>
> >>> +/* if the map is read-only, we can try to fully resolve the load */
> >>> +static bool fold_ro_pseudo_ldimm64(struct bpf_verifier_env *env,
> >>> +                                  struct bpf_map *map,
> >>> +                                  struct bpf_insn *insn)
> >>> +{
> >>> +       struct bpf_insn *ldx_insn = insn + 2;
> >>> +       int dst_reg = ldx_insn->dst_reg;
> >>> +       u64 val = 0;
> >>> +       int size;
> >>> +       int err;
> >>> +
> >>> +       if (!bpf_map_is_rdonly(map) || !map->ops->map_direct_value_addr)
> >>> +               return false;
> >>> +
> >>> +       /* 0: BPF_LD r=MAP
> >>> +        * 1: BPF_LD r=MAP
> >>> +        * 2: BPF_LDX r=MAP->VAL
> >>> +        */
> >>> +
> >>> +       if (BPF_CLASS((insn+0)->code) != BPF_LD ||
> >>> +           BPF_CLASS((insn+1)->code) != BPF_LD ||
> >>> +           BPF_CLASS((insn+2)->code) != BPF_LDX)
> >>> +               return false;
> >>> +
> >>> +       if (BPF_MODE((insn+0)->code) != BPF_IMM ||
> >>> +           BPF_MODE((insn+1)->code) != BPF_IMM ||
> >>> +           BPF_MODE((insn+2)->code) != BPF_MEM)
> >>> +               return false;
> >>> +
> >>> +       if (insn->src_reg != BPF_PSEUDO_MAP_VALUE &&
> >>> +           insn->src_reg != BPF_PSEUDO_MAP_IDX_VALUE)
> >>> +               return false;
> >>> +
> >>> +       if (insn->dst_reg != ldx_insn->src_reg)
> >>> +               return false;
> >>> +
> >>> +       if (ldx_insn->off != 0)
> >>> +               return false;
> >>> +
> >>> +       size = bpf_size_to_bytes(BPF_SIZE(ldx_insn->code));
> >>> +       if (size < 0 || size > 4)
> >>> +               return false;
> >>> +
> >>> +       err = bpf_map_direct_read(map, (insn+1)->imm, size, &val);
> >>> +       if (err)
> >>> +               return false;
> >>> +
> >>> +       if (insn->dst_reg == ldx_insn->dst_reg) {
> >>> +               /* LDX is using the same destination register as LD.
> >>> +                * This means we are not interested in the map
> >>> +                * pointer itself and can remove it.
> >>> +                */
> >>> +               *(insn + 0) = BPF_JMP_A(0);
> >>> +               *(insn + 1) = BPF_JMP_A(0);
> >>> +               *(insn + 2) = BPF_ALU64_IMM(BPF_MOV, dst_reg, val);
> >> Have you figured out why the branch is not removed
> >> with BPF_ALU64_IMM(BPF_MOV) ?
> >
> > I do have an idea, yes, but I'm not 100% certain. The rewrite has
> > nothing to do with it.
> > If I change the default ret from 1 to 0, I get a different bytecode
> > where this branch is eventually removed by the verifier.
> >
> > I think it comes down to the following snippet:
> > r1 = 0 ll
> > r1 = *(u32 *)(r1 + 0) <<< rodata, verifier is able to resolve to r1 = 1
> > w0 = 1
> > if w1 == 0 goto +1
> > w0 = 2
> > exit
> > Here, 'if w1 == 0' is never taken, but it has been 'seen' by the
> > verifier. There is no 'dead' code, it just jumps around 'w0 = 2' which
> > has seen=true.
> >
> > VS this one:
> > r1 = 0 ll
> > r1 = *(u32 *)(r1 + 0) <<< rodata, verifier is able to resolve to r1 = 1
> > w0 = 1
> > if w1 != 0 goto +1
> > w0 = 0
> > w0 <<= 1
> > exit
> > Here, 'if w1 != 0' is seen, but the next insn has 'seen=false', so
> > this whole thing is ripped out.
> >
> > So basically, from my quick look, it seems like there should be some
> > real dead code to trigger the removal. If you simply have 'if
> > condition_that_never_happens goto +1' which doesn't lead to some dead
> > code, this single jump-over-some-seen-code is never gonna be removed.
> > So, IMO, that's something that should be addressed independently.
> >
> >> Can it also support 8 bytes (BPF_DW) ? Is it because there
> >> is not enough space for ld_imm64? so wonder if this
> >> patching can be done in do_misc_fixups() instead.
> >
> > Yeah, I've limited it to 4 bytes because of sizeof(imm) for now.
> > I think one complication might be that at do_misc_fixups point, the
> > immediate args of bpf_ld have been rewritten which might make it
> > harder to get the data offset.
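
A side note on the BPF_DW question: since the pseudo ld_imm64 already
occupies two instruction slots, the same-dst_reg case could probably
materialize the full 64-bit constant in place of the ld instead of
using a single MOV, which would lift the size > 4 limit. Rough,
untested sketch on top of the hunk above:

        if (insn->dst_reg == ldx_insn->dst_reg) {
                /* Reuse the two ld_imm64 slots to load the folded
                 * 64-bit value directly and turn the ldx into a nop.
                 */
                struct bpf_insn ld64[2] = { BPF_LD_IMM64(dst_reg, val) };

                *(insn + 0) = ld64[0];
                *(insn + 1) = ld64[1];
                *(insn + 2) = BPF_JMP_A(0);
                return true;
        }

The different-dst_reg case would still be limited by sizeof(imm).
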
> >
> > But I guess I'm still at the point where I'm trying to understand
> > whether what I'm doing makes sense or not :-) And whether we should do
> > it at the verifier level, in libbpf or if at all..
>
> I think the approach above is a little bit fragile.
> There is no guarantee that ldx immediately after ld.
> Also, after rewrite, some redundant instructions still
> left (e.g., map pointer load) although it can be rewritten
> as a nop.

Yeah, it's possible there is no ldx, or it's possible that map's ptr in
ld is used by some other instruction later on. I agree that all this
rewriting is still best case and might be a bit ugly :-(

> In general, 'const volatile int var' should be good enough
> although it still have one extra load instruction. How much
> did you see performance hit with this extra load?

I haven't tested anything yet. I think I'll try to convert my prog to
kconfig and then have:

#ifdef STATIC_CONFIG
const int CONFIG_XYZ = 1;
#else
extern const int CONFIG_XYZ __kconfig __weak;
#endif

And maybe roll that STATIC_CONFIG=true version somewhere for a week or
two to see whether there is any effect.

> If we truely want to generate best code (no libbpf/verifier
> rewrite with nop's) robustly, we can implement this in llvm as
> a CO-RE relocation. In such cases, compiler will be
> able to generate
>    r = <patchable_value>
>    ... using r ...
>
> So in the above code, the CO-RE based approach will generate:
>    r1 = <patchable value> // we can make it always 64bit value
>    w0 = 1
>    if w1 != 0 goto +1
>    w0 = 0
>    w0 <<= 1
>    exit
>
> This will only work for upto 64bit int types.
> The kconfig itself supports array as well for which
> CO-RE won't work.

What is the criterion for emitting a relocation in this case? Something
like a __kconfig tag? In my initial attempt [1], I've been doing these
rewrites on the libbpf side, so maybe that's the way to go if we can
emit proper relocations for ld+ldx?

[1] https://lore.kernel.org/bpf/7b2e6553-5bcc-1c70-2c4b-78e95593755b@xxxxxx/T/#m5efdc6672ff3da98f806375381d5e055060cbe54

> >
> >>> +               return true;
> >>> +       }
> >>> +
> >>> +       *(insn + 2) = BPF_ALU64_IMM(BPF_MOV, dst_reg, val);
> >>> +       /* Only LDX can be resolved, we still have to resolve LD address. */
> >>> +       return false;
> >>> +}
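
P.S. To make the STATIC_CONFIG experiment above a bit more concrete,
the program side I have in mind is roughly the following (CONFIG_XYZ
and the tc program body are placeholders):

#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

#ifdef STATIC_CONFIG
/* baked in at compile time, clang can fold the branch away */
const int CONFIG_XYZ = 1;
#else
/* left for libbpf to resolve at load time */
extern const int CONFIG_XYZ __kconfig __weak;
#endif

SEC("tc")
int check_config(struct __sk_buff *skb)
{
        int ret = 0; /* TC_ACT_OK */

        if (CONFIG_XYZ)
                ret = 2; /* TC_ACT_SHOT */

        return ret;
}

char LICENSE[] SEC("license") = "GPL";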