On Mon, 25 Nov 2019 at 11:54, Daniel Borkmann <daniel@xxxxxxxxxxxxx> wrote:
>
> On Sat, Nov 23, 2019 at 08:12:20AM +0100, Björn Töpel wrote:
> > From: Björn Töpel <bjorn.topel@xxxxxxxxx>
> >
> > The BPF dispatcher is a multiway branch code generator, mainly
> > targeted for XDP programs. When an XDP program is executed via
> > bpf_prog_run_xdp(), it is invoked via an indirect call. With
> > retpolines enabled, the indirect call has a substantial performance
> > impact. The dispatcher is a mechanism that transforms multiple
> > indirect calls to direct calls, and therefore avoids the retpoline.
> > The dispatcher is generated using the BPF JIT, and relies on text
> > poking provided by bpf_arch_text_poke().
> >
> > The dispatcher hijacks a trampoline function via the __fentry__ nop
> > of the trampoline. One dispatcher instance currently supports up to
> > 16 dispatch points. This can be extended in the future.
> >
> > An example: A module/driver allocates a dispatcher. The dispatcher
> > is shared for all netdevs. Each unique XDP program has a slot in the
> > dispatcher, registered by a netdev. The netdev then uses the
> > dispatcher to call the correct program with a direct call.
> >
> > Signed-off-by: Björn Töpel <bjorn.topel@xxxxxxxxx>
> [...]
> > +static int emit_bpf_dispatcher(u8 **pprog, int a, int b, s64 *progs)
> > +{
> > +	u8 *jg_reloc, *jg_target, *prog = *pprog;
> > +	int pivot, err, jg_bytes = 1, cnt = 0;
> > +	s64 jg_offset;
> > +
> > +	if (a == b) {
> > +		/* Leaf node of recursion, i.e. not a range of indices
> > +		 * anymore.
> > +		 */
> > +		EMIT1(add_1mod(0x48, BPF_REG_3));	/* cmp rdx,func */
> > +		if (!is_simm32(progs[a]))
> > +			return -1;
> > +		EMIT2_off32(0x81, add_1reg(0xF8, BPF_REG_3),
> > +			    progs[a]);
> > +		err = emit_cond_near_jump(&prog,	/* je func */
> > +					  (void *)progs[a], prog,
> > +					  X86_JE);
> > +		if (err)
> > +			return err;
> > +
> > +		err = emit_jump(&prog,	/* jmp thunk */
> > +				__x86_indirect_thunk_rdx, prog);
> > +		if (err)
> > +			return err;
> > +
> > +		*pprog = prog;
> > +		return 0;
> > +	}
> > +
> > +	/* Not a leaf node, so we pivot, and recursively descend into
> > +	 * the lower and upper ranges.
> > +	 */
> > +	pivot = (b - a) / 2;
> > +	EMIT1(add_1mod(0x48, BPF_REG_3));	/* cmp rdx,func */
> > +	if (!is_simm32(progs[a + pivot]))
> > +		return -1;
> > +	EMIT2_off32(0x81, add_1reg(0xF8, BPF_REG_3), progs[a + pivot]);
> > +
> > +	if (pivot > 2) {	/* jg upper_part */
> > +		/* Require near jump. */
> > +		jg_bytes = 4;
> > +		EMIT2_off32(0x0F, X86_JG + 0x10, 0);
> > +	} else {
> > +		EMIT2(X86_JG, 0);
> > +	}
> > +	jg_reloc = prog;
> > +
> > +	err = emit_bpf_dispatcher(&prog, a, a + pivot,	/* emit lower_part */
> > +				  progs);
> > +	if (err)
> > +		return err;
> > +
> > +	/* Intel 64 and IA-32 Architectures Optimization Reference
> > +	 * Manual, 3.4.1.5 Code Alignment Assembly/Compiler Coding
> > +	 * Rule 12. (M impact, H generality) All branch targets should
> > +	 * be 16-byte aligned.
>
> Isn't this section 3.4.1.4, rule 11 or are you reading a newer manual
> than on the website [0]? :)
>

Argh, no, you're right. Typo. Thanks!

> Just wondering, in your IXIA tests, did you see any noticeable
> slowdowns if you don't do the 16-byte alignments as in the rest of
> the kernel [1,2]?
>
> [0] https://software.intel.com/sites/default/files/managed/9e/bc/64-ia-32-architectures-optimization-manual.pdf
> [1] be6cb02779ca ("x86: Align jump targets to 1-byte boundaries")
> [2] https://lore.kernel.org/patchwork/patch/560050/
>

Interesting! Thanks for the pointers.
I'll do more benchmarking for the next rev, and hopefully we can leave
the nops out. Let's see.


Björn

> > +	 */
> > +	jg_target = PTR_ALIGN(prog, 16);
> > +	if (jg_target != prog)
> > +		emit_nops(&prog, jg_target - prog);
> > +	jg_offset = prog - jg_reloc;
> > +	emit_code(jg_reloc - jg_bytes, jg_offset, jg_bytes);
> > +
> > +	err = emit_bpf_dispatcher(&prog, a + pivot + 1,	/* emit upper_part */
> > +				  b, progs);
> > +	if (err)
> > +		return err;
> > +
> > +	*pprog = prog;
> > +	return 0;
> > +}