On Wed, Nov 13, 2019 at 09:47:35PM +0100, Björn Töpel wrote:
> From: Björn Töpel <bjorn.topel@xxxxxxxxx>
>
> The BPF dispatcher builds on top of the BPF trampoline ideas;
> introduce bpf_arch_text_poke() and (re-)use the BPF JIT generated
> code. The dispatcher builds a dispatch table for XDP programs, for
> retpoline avoidance. The table is a simple binary search model, so
> lookup is O(log n). Here, the dispatch table is limited to four
> entries (for laziness reasons -- only 1B relative jumps :-P). If the
> dispatch table is full, it will fall back to the retpoline path.
>
> An example: A module/driver allocates a dispatcher. The dispatcher
> is shared for all netdevs. Each netdev allocates a slot in the
> dispatcher and a BPF program. The netdev then uses the dispatcher to
> call the correct program with a direct call (actually a tail-call).
>
> Signed-off-by: Björn Töpel <bjorn.topel@xxxxxxxxx>
> ---
>  arch/x86/net/bpf_jit_comp.c |  96 ++++++++++++++++++
>  kernel/bpf/Makefile         |   1 +
>  kernel/bpf/dispatcher.c     | 197 ++++++++++++++++++++++++++++++++++++
>  3 files changed, 294 insertions(+)
>  create mode 100644 kernel/bpf/dispatcher.c
>
> diff --git a/arch/x86/net/bpf_jit_comp.c b/arch/x86/net/bpf_jit_comp.c
> index 28782a1c386e..d75aebf508b8 100644
> --- a/arch/x86/net/bpf_jit_comp.c
> +++ b/arch/x86/net/bpf_jit_comp.c
> @@ -10,10 +10,12 @@
>  #include <linux/if_vlan.h>
>  #include <linux/bpf.h>
>  #include <linux/memory.h>
> +#include <linux/sort.h>
>  #include <asm/extable.h>
>  #include <asm/set_memory.h>
>  #include <asm/nospec-branch.h>
>  #include <asm/text-patching.h>
> +#include <asm/asm-prototypes.h>
>
>  static u8 *emit_code(u8 *ptr, u32 bytes, unsigned int len)
>  {
> @@ -1471,6 +1473,100 @@ int arch_prepare_bpf_trampoline(void *image, struct btf_func_model *m, u32 flags
>  	return 0;
>  }
>
> +#if defined(CONFIG_BPF_JIT) && defined(CONFIG_RETPOLINE)
> +
> +/* Emits the dispatcher. Id lookup is limited to BPF_DISPATCHER_MAX,
> + * so it'll fit into PAGE_SIZE/2. The lookup is binary search: O(log
> + * n).
> + */
> +static int emit_bpf_dispatcher(u8 **pprog, int a, int b, u64 *progs,
> +			       u8 *fb)
> +{
> +	u8 *prog = *pprog, *jg_reloc;
> +	int pivot, err, cnt = 0;
> +	s64 jmp_offset;
> +
> +	if (a == b) {
> +		emit_mov_imm64(&prog, BPF_REG_0, /* movabs func,%rax */
> +			       progs[a] >> 32,
> +			       (progs[a] << 32) >> 32);

Could you try optimizing emit_mov_imm64() to recognize s32? iirc
there is a single x86 insn that can move and sign-extend. That should
cut down on the bytecode size and probably make things a bit faster.

Another alternative is to compare the lower 32 bits only, since on
x86-64 the upper 32 bits should be ~0 anyway for bpf prog pointers.

Looking at the bookkeeping code, I think I should be able to
generalize the bpf trampoline a bit and share the code with the bpf
dispatcher.

Could you also try aligning the jmp targets a bit by inserting nops?
Some x86 cpus are sensitive to jmp target alignment. Even without
considering the JCC bug it could be helpful, especially since we're
talking about XDP/AF_XDP here, which will be pushing millions of
calls through the bpf dispatcher.
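
To make the s32 idea concrete: 'mov $imm32, %reg' (REX.W + C7 /0)
sign-extends a 32-bit immediate to 64 bits, so it's 7 bytes instead
of the 10-byte movabs. Rough, untested sketch of emit_mov_imm64(),
reusing the EMIT*()/add_1mod()/add_1reg()/is_simm32() helpers already
in bpf_jit_comp.c:

static void emit_mov_imm64(u8 **pprog, u32 dst_reg,
			   const u32 imm32_hi, const u32 imm32_lo)
{
	u64 imm64 = ((u64)imm32_hi << 32) | (u32)imm32_lo;
	u8 *prog = *pprog;
	int cnt = 0;

	if (is_simm32(imm64)) {
		/* imm64 fits in a sign-extended s32:
		 * mov $imm32, %reg (REX.W + C7 /0), 7 bytes; the CPU
		 * sign-extends the immediate to 64 bits.
		 */
		EMIT3_off32(add_1mod(0x48, dst_reg), 0xC7,
			    add_1reg(0xC0, dst_reg), imm32_lo);
	} else {
		/* movabs $imm64, %reg, 10 bytes */
		EMIT2(add_1mod(0x48, dst_reg), add_1reg(0xB8, dst_reg));
		EMIT(imm32_lo, 4);
		EMIT(imm32_hi, 4);
	}

	*pprog = prog;
}

Since JIT images live in the module area, i.e. within the top 2G,
the prog pointers should take the short form pretty much always.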
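
And for the lower-32-bit compare, the leaf of the binary search could
look roughly like below. I'm assuming the prog pointer being
dispatched on sits in %rdx (BPF_REG_3) -- adjust to whatever register
your calling convention actually uses -- and that prog points into
the final executable image, so the je target fits in a rel32:

	if (a == b) {
		/* cmp $imm32, %edx: 32-bit operand size (no REX.W),
		 * so only the low half of the pointer is compared;
		 * the upper 32 bits of a kernel pointer are ~0
		 * anyway.
		 */
		EMIT2_off32(0x81, add_1reg(0xF8, BPF_REG_3),
			    (u32)progs[a]);
		/* je <prog>: 0F 84 rel32, 6 bytes, a direct
		 * (non-retpoline) jump straight to the program;
		 * fall through to the retpoline path otherwise.
		 */
		jmp_offset = progs[a] - (u64)(prog + 6);
		if (!is_simm32(jmp_offset))
			return -EINVAL;
		EMIT2_off32(0x0F, X86_JE + 0x10, jmp_offset);
		return 0;
	}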
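
For the alignment, something along these lines ahead of each jump
target should do. emit_nops() here stands in for a (hypothetical)
helper that emits 'len' bytes worth of x86 nops -- use whatever nop
emission we already have handy in the JIT:

static void emit_align(u8 **pprog, u32 align)
{
	u8 *target, *prog = *pprog;

	/* Pad out to the next 'align'-byte boundary with nops, so
	 * the binary search jump targets land on an aligned fetch
	 * line.
	 */
	target = PTR_ALIGN(prog, align);
	if (target != prog)
		emit_nops(&prog, target - prog);	/* hypothetical */

	*pprog = prog;
}

then call emit_align(&prog, 16) before each comparison block in
emit_bpf_dispatcher().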