On Sun, 24 Nov 2019 at 02:55, Alexei Starovoitov <alexei.starovoitov@xxxxxxxxx> wrote:
>
> On Sat, Nov 23, 2019 at 08:12:20AM +0100, Björn Töpel wrote:
> > +
> > +	err = emit_jump(&prog,		/* jmp thunk */
> > +			__x86_indirect_thunk_rdx, prog);
>
> could you please add a comment that this is gcc specific and gate it
> by build_bug_on ?
> I think even if compiler stays the change of flags:
> RETPOLINE_CFLAGS_GCC := -mindirect-branch=thunk-extern -mindirect-branch-register
> may change the name of this helper?
> I wonder whether it's possible to make it compiler independent.
>
> > diff --git a/kernel/bpf/dispatcher.c b/kernel/bpf/dispatcher.c
> > new file mode 100644
> > index 000000000000..385dd76ab6d2
> > --- /dev/null
> > +++ b/kernel/bpf/dispatcher.c
> > @@ -0,0 +1,208 @@
> > +// SPDX-License-Identifier: GPL-2.0-only
> > +/* Copyright(c) 2019 Intel Corporation. */
> > +
> > +#ifdef CONFIG_RETPOLINE
>
> I'm worried that such strong gating will make the code rot. Especially it's not
> covered by selftests.
> Could you please add xdp_call_run() to generic xdp and add a selftest ?
> Also could you please benchmark it without retpoline?
> iirc direct call is often faster than indirect, so I suspect this optimization
> may benefit non-mitigated kernels.
>
> > +#define DISPATCHER_HASH_BITS 10
> > +#define DISPATCHER_TABLE_SIZE (1 << DISPATCHER_HASH_BITS)
> > +
> > +static struct hlist_head dispatcher_table[DISPATCHER_TABLE_SIZE];
>
> there is one DEFINE_XDP_CALL per driver, so total number of such
> dispatch routines is pretty small. 1<<10 hash table is overkill.
> The hash table itself is overkill :)
>
> How about adding below:
>
> > +#define BPF_DISPATCHER_MAX 16
> > +
> > +struct bpf_dispatcher {
> > +	struct hlist_node hlist;
> > +	void *func;
> > +	struct bpf_prog *progs[BPF_DISPATCHER_MAX];
> > +	int num_progs;
> > +	void *image;
> > +	u64 selector;
> > +};
>
> without hlist and without func to DEFINE_XDP_CALL() macro?
> Then bpf_dispatcher_lookup() will become bpf_dispatcher_init()
> and the rest will become a bit simpler?
>
> > +
> > +	set_vm_flush_reset_perms(image);
> > +	set_memory_x((long)image, 1);
> > +	d->image = image;
>
> Can you add a common helper for this bit to share between
> bpf dispatch and bpf trampoline?
>
> > +static void bpf_dispatcher_update(struct bpf_dispatcher *d)
> > +{
> > +	void *old_image = d->image + ((d->selector + 1) & 1) * PAGE_SIZE / 2;
> > +	void *new_image = d->image + (d->selector & 1) * PAGE_SIZE / 2;
> > +	s64 ips[BPF_DISPATCHER_MAX] = {}, *ipsp = &ips[0];
> > +	int i, err;
> > +
> > +	if (!d->num_progs) {
> > +		bpf_arch_text_poke(d->func, BPF_MOD_JUMP_TO_NOP,
> > +				   old_image, NULL);
> > +		return;
>
> how does it work? Without doing d->selector = 0; the next addition
> will try to do JUMP_TO_JUMP and will fail...
>
> > +	}
> > +
> > +	for (i = 0; i < BPF_DISPATCHER_MAX; i++) {
> > +		if (d->progs[i])
> > +			*ipsp++ = (s64)(uintptr_t)d->progs[i]->bpf_func;
> > +	}
> > +	err = arch_prepare_bpf_dispatcher(new_image, &ips[0], d->num_progs);
> > +	if (err)
> > +		return;
> > +
> > +	if (d->selector) {
> > +		/* progs already running at this address */
> > +		err = bpf_arch_text_poke(d->func, BPF_MOD_JUMP_TO_JUMP,
> > +					 old_image, new_image);
> > +	} else {
> > +		/* first time registering */
> > +		err = bpf_arch_text_poke(d->func, BPF_MOD_NOP_TO_JUMP,
> > +					 NULL, new_image);
> > +	}
> > +	if (err)
> > +		return;
> > +	d->selector++;
> > +}
>
> Not sure how to share selector logic between dispatch and trampoline.
> But above selector=0; weirdness is a sign that sharing is probably necessary?
>
> > +
> > +void bpf_dispatcher_change_prog(void *func, struct bpf_prog *from,
> > +				struct bpf_prog *to)
> > +{
> > +	struct bpf_dispatcher *d;
> > +	bool changed = false;
> > +
> > +	if (from == to)
> > +		return;
> > +
> > +	mutex_lock(&dispatcher_mutex);
> > +	d = bpf_dispatcher_lookup(func);
> > +	if (!d)
> > +		goto out;
> > +
> > +	changed |= bpf_dispatcher_remove_prog(d, from);
> > +	changed |= bpf_dispatcher_add_prog(d, to);
> > +
> > +	if (!changed)
> > +		goto out;
> > +
> > +	bpf_dispatcher_update(d);
> > +	if (!d->num_progs)
> > +		bpf_dispatcher_free(d);
>
> I think I got it why it works.
> Every time the prog cnt goes to zero you free the trampoline right away
> and next time it will be allocated again and kzalloc() will zero selector.
> That's hard to spot.
> Also if user space does for(;;) attach/detach;
> it will keep stressing bpf_jit_alloc_exec.
> In case of bpf trampoline attach/detach won't be stressing it.
> Only load/unload which are much slower due to verification.
> I guess such difference is ok.
>

Alexei, thanks for all feedback (on the weekend)! I agree with all of
the above, especially the missing selftests and the code duplication.

I'll do a respin, but that'll be in the next window, given that Linus
will (probably) tag the release today.


Björn