Re: [tip:x86/urgent] bpf: Fix ORC unwinding in non-JIT BPF code

Alexei Starovoitov <alexei.starovoitov@xxxxxxxxx> · Mon, 8 Jul 2019 15:49:33 -0700

On Mon, Jul 8, 2019 at 3:38 PM Josh Poimboeuf <jpoimboe@xxxxxxxxxx> wrote:
>
> On Mon, Jul 08, 2019 at 03:15:37PM -0700, Alexei Starovoitov wrote:
> > > 2)
> > >
> > >   After doing the first optimization, GCC then does another one which is
> > >   a little trickier.  It replaces:
> > >
> > >         select_insn:
> > >                 jmp *jumptable(, %rax, 8)
> > >                 ...
> > >         ALU64_ADD_X:
> > >                 ...
> > >                 jmp *jumptable(, %rax, 8)
> > >         ALU_ADD_X:
> > >                 ...
> > >                 jmp *jumptable(, %rax, 8)
> > >
> > >   with
> > >
> > >         select_insn:
> > >                 mov jumptable, %r12
> > >                 jmp *(%r12, %rax, 8)
> > >                 ...
> > >         ALU64_ADD_X:
> > >                 ...
> > >                 jmp *(%r12, %rax, 8)
> > >         ALU_ADD_X:
> > >                 ...
> > >                 jmp *(%r12, %rax, 8)
> > >
> > >   The problem is that it only moves the jumptable address into %r12
> > >   once, for the entire function, then it goes through multiple recursive
> > >   indirect jumps which rely on that %r12 value.  But objtool isn't yet
> > >   smart enough to be able to track the value across multiple recursive
> > >   indirect jumps through the jump table.
> > >
> > >   After some digging I found that the quick and easy fix is to disable
> > >   -fgcse.  In fact, this seems to be recommended by the GCC manual, for
> > >   code like this:
> > >
> > >     -fgcse
> > >         Perform a global common subexpression elimination pass.  This
> > >         pass also performs global constant and copy propagation.
> > >
> > >         Note: When compiling a program using computed gotos, a GCC
> > >         extension, you may get better run-time performance if you
> > >         disable the global common subexpression elimination pass by
> > >         adding -fno-gcse to the command line.
> > >
> > >         Enabled at levels -O2, -O3, -Os.
> > >
> > >   This code indeed relies extensively on computed gotos.  I don't know
> > >   *why* disabling this optimization would improve performance.  In fact
> > >   I really don't see how it could make much of a difference either way.
> > >
> > >   Anyway, using -fno-gcse makes optimization #2 go away and makes
> > >   objtool happy, with only a fix for #1 needed.
> > >
> > >   If -fno-gcse isn't an option, we might be able to fix objtool by using
> > >   the "first_jump_src" thing which Peter added, improving it such that
> > >   it also takes table jumps into account.
> >
> > Sorry for delay. I'm mostly offgrid until next week.
> > As far as -fno-gcse.. I don't mind as long as it doesn't hurt performance.
> > Which I suspect it will :(
> > All these indirect gotos are there for performance.
> > Single indirect goto and a bunch of jmp select_insn
> > are way slower, since there is only one instruction
> > for cpu branch predictor to work with.
> > When every insn is followed by "jmp *jumptable"
> > there is more room for cpu to speculate.
> > It's been long time, but when I wrote it the difference
> > between all indirect goto vs single indirect goto was almost 2x.
>
> Just to clarify, -fno-gcse doesn't get rid of any of the indirect jumps.
> It still has 166 indirect jumps.  It just gets rid of the second
> optimization, where the jumptable address is placed in a register.

what about other functions in core.c ?
May be it's easier to teach objtool to recognize that pattern?

> If you have a benchmark which is relatively easy to use, I could try to
> run some tests.

modprobe test_bpf
selftests/bpf/test_progs
both print runtime.
Some of test_progs have high run-to-run variations though.