On Sun, 24 Nov 2019 at 02:55, Alexei Starovoitov <alexei.starovoitov@xxxxxxxxx> wrote:
>
> On Sat, Nov 23, 2019 at 08:12:20AM +0100, Björn Töpel wrote:
> > +
> > +	err = emit_jump(&prog,		/* jmp thunk */
> > +			__x86_indirect_thunk_rdx, prog);
>
> could you please add a comment that this is gcc specific and gate it
> by build_bug_on ?
> I think even if compiler stays the change of flags:
> RETPOLINE_CFLAGS_GCC := -mindirect-branch=thunk-extern -mindirect-branch-register
> may change the name of this helper?
> I wonder whether it's possible to make it compiler independent.
>
> > diff --git a/kernel/bpf/dispatcher.c b/kernel/bpf/dispatcher.c
> > new file mode 100644
> > index 000000000000..385dd76ab6d2
> > --- /dev/null
> > +++ b/kernel/bpf/dispatcher.c
> > @@ -0,0 +1,208 @@
> > +// SPDX-License-Identifier: GPL-2.0-only
> > +/* Copyright(c) 2019 Intel Corporation. */
> > +
> > +#ifdef CONFIG_RETPOLINE
>
> I'm worried that such strong gating will make the code rot. Especially it's not
> covered by selftests.
> Could you please add xdp_call_run() to generic xdp and add a selftest ?
> Also could you please benchmark it without retpoline?
> iirc direct call is often faster than indirect, so I suspect this optimization
> may benefit non-mitigated kernels.
>
> > +#define DISPATCHER_HASH_BITS 10
> > +#define DISPATCHER_TABLE_SIZE (1 << DISPATCHER_HASH_BITS)
> > +
> > +static struct hlist_head dispatcher_table[DISPATCHER_TABLE_SIZE];
>
> there is one DEFINE_XDP_CALL per driver, so total number of such
> dispatch routines is pretty small. 1<<10 hash table is overkill.
> The hash table itself is overkill :)
>
> How about adding below:
>
> > +#define BPF_DISPATCHER_MAX 16
> > +
> > +struct bpf_dispatcher {
> > +	struct hlist_node hlist;
> > +	void *func;
> > +	struct bpf_prog *progs[BPF_DISPATCHER_MAX];
> > +	int num_progs;
> > +	void *image;
> > +	u64 selector;
> > +};
>
> without hlist and without func to DEFINE_XDP_CALL() macro?
> Then bpf_dispatcher_lookup() will become bpf_dispatcher_init()
> and the rest will become a bit simpler?
>
> > +
> > +	set_vm_flush_reset_perms(image);
> > +	set_memory_x((long)image, 1);
> > +	d->image = image;
>
> Can you add a common helper for this bit to share between
> bpf dispatch and bpf trampoline?
>
> > +static void bpf_dispatcher_update(struct bpf_dispatcher *d)
> > +{
> > +	void *old_image = d->image + ((d->selector + 1) & 1) * PAGE_SIZE / 2;
> > +	void *new_image = d->image + (d->selector & 1) * PAGE_SIZE / 2;
> > +	s64 ips[BPF_DISPATCHER_MAX] = {}, *ipsp = &ips[0];
> > +	int i, err;
> > +
> > +	if (!d->num_progs) {
> > +		bpf_arch_text_poke(d->func, BPF_MOD_JUMP_TO_NOP,
> > +				   old_image, NULL);
> > +		return;
>
> how does it work? Without doing d->selector = 0; the next addition
> will try to do JUMP_TO_JUMP and will fail...
>
> > +	}
> > +
> > +	for (i = 0; i < BPF_DISPATCHER_MAX; i++) {
> > +		if (d->progs[i])
> > +			*ipsp++ = (s64)(uintptr_t)d->progs[i]->bpf_func;
> > +	}
> > +	err = arch_prepare_bpf_dispatcher(new_image, &ips[0], d->num_progs);
> > +	if (err)
> > +		return;
> > +
> > +	if (d->selector) {
> > +		/* progs already running at this address */
> > +		err = bpf_arch_text_poke(d->func, BPF_MOD_JUMP_TO_JUMP,
> > +					 old_image, new_image);
> > +	} else {
> > +		/* first time registering */
> > +		err = bpf_arch_text_poke(d->func, BPF_MOD_NOP_TO_JUMP,
> > +					 NULL, new_image);
> > +	}
> > +	if (err)
> > +		return;
> > +	d->selector++;
> > +}
>
> Not sure how to share selector logic between dispatch and trampoline.
> But above selector=0; weirdness is a sign that sharing is probably necessary?
>
> > +
> > +void bpf_dispatcher_change_prog(void *func, struct bpf_prog *from,
> > +				struct bpf_prog *to)
> > +{
> > +	struct bpf_dispatcher *d;
> > +	bool changed = false;
> > +
> > +	if (from == to)
> > +		return;
> > +
> > +	mutex_lock(&dispatcher_mutex);
> > +	d = bpf_dispatcher_lookup(func);
> > +	if (!d)
> > +		goto out;
> > +
> > +	changed |= bpf_dispatcher_remove_prog(d, from);
> > +	changed |= bpf_dispatcher_add_prog(d, to);
> > +
> > +	if (!changed)
> > +		goto out;
> > +
> > +	bpf_dispatcher_update(d);
> > +	if (!d->num_progs)
> > +		bpf_dispatcher_free(d);
>
> I think I got it why it works.
> Every time the prog cnt goes to zero you free the trampoline right away
> and next time it will be allocated again and kzalloc() will zero selector.
> That's hard to spot.
> Also if user space does for(;;) attach/detach;
> it will keep stressing bpf_jit_alloc_exec.
> In case of bpf trampoline attach/detach won't be stressing it.
> Only load/unload which are much slower due to verification.
> I guess such difference is ok.
>

Alexei, thanks for all feedback (on the weekend)! I agree with all of
the above, especially the missing selftests and the code duplication.

I'll do a respin, but that'll be in the next window, given that Linus
will (probably) tag the release today.


Björn