Re: [PATCH bpf-next v4 2/3] bpf,x64: use shrx/sarx/shlx when available

KP Singh <kpsingh@xxxxxxxxxx> · Fri, 7 Oct 2022 11:11:39 -0700

On Thu, Oct 6, 2022 at 8:14 PM Jie Meng <jmeng@xxxxxx> wrote:
>
> On Wed, Oct 05, 2022 at 09:11:01PM -0700, KP Singh wrote:
> > On Sat, Oct 1, 2022 at 10:12 PM Jie Meng <jmeng@xxxxxx> wrote:
> > >
> > > Instead of shr/sar/shl that implicitly use %cl, emit their more flexible
> > > alternatives provided in BMI2 when advantageous; keep using the non BMI2
> > > instructions when shift count is already in BPF_REG_4/rcx as non BMI2
> > > instructions are shorter.
> >
> > This is confusing, you mention %CL in the first sentence and then RCX in the
> > second sentence. Can you clarify this more here?
>
> %cl is the lowest 8 bit of %rcx. In assembly, non BMI2 shifts with shift
> count in register is written as
>
>  SHR eax, cl
>
> Although the use of CL is mandatory and if the shift count is in another
> register it has to be moved into RCX first, unless of course when the
> shift count is already in BPF_REG_4, which is mapped to RCX in x86-64.
>
> It is indeed awkward but exactly what one would see in assembly: a MOV
> to RCX and a shift that uses CL as the source register.
>
> >
> > Also, It would be good to have some explanations about the
> > performance benefits here as well.
> >
> > i.e. a load + store + non vector instruction v/s a single vector instruction
> > omitting the load. How many cycles do we expect in each case, I do expect the
> > latter to be lesser, but mentioning it in the commit removes any ambiguity.
>
> Although it uses similar encoding as AVX instructions BMI2 actually
> operates on general purpose registers and no vector register is ever
> involved [1]. Inside a CPU all shifts instructions (both baseline and BMI2
> flavors) are almost always handled by the same units and have the same
> latency and throughput [2].
>
> [1] https://en.wikipedia.org/wiki/X86_Bit_manipulation_instruction_set
> [2] https://www.agner.org/optimize/instruction_tables.pdf

Cool, please add this to the commit description.

> >
> > >
> > > To summarize, when BMI2 is available:
> > > -------------------------------------------------
> > >             |   arbitrary dst
> > > =================================================
> > > src == ecx  |   shl dst, cl
> > > -------------------------------------------------
> > > src != ecx  |   shlx dst, dst, src
> > > -------------------------------------------------
> > >
> > > A concrete example between non BMI2 and BMI2 codegen.  To shift %rsi by
> > > %rdi:
> > >
> > > Without BMI2:
> > >
> > >  ef3:   push   %rcx
> > >         51
> > >  ef4:   mov    %rdi,%rcx
> > >         48 89 f9
> > >  ef7:   shl    %cl,%rsi
> > >         48 d3 e6
> > >  efa:   pop    %rcx
> > >         59
> > >
> > > With BMI2:
> > >
> > >  f0b:   shlx   %rdi,%rsi,%rsi
> > >         c4 e2 c1 f7 f6
> > >
> > > Signed-off-by: Jie Meng <jmeng@xxxxxx>
> > > ---
> > >  arch/x86/net/bpf_jit_comp.c | 64 +++++++++++++++++++++++++++++++++++++
> > >  1 file changed, 64 insertions(+)
> > >
> > > diff --git a/arch/x86/net/bpf_jit_comp.c b/arch/x86/net/bpf_jit_comp.c
> > > index d9ba997c5891..d09c54f3d2e0 100644
> > > --- a/arch/x86/net/bpf_jit_comp.c
> > > +++ b/arch/x86/net/bpf_jit_comp.c
> > > @@ -889,6 +889,48 @@ static void emit_nops(u8 **pprog, int len)
> > >         *pprog = prog;
> > >  }
> > >
> > > +/* emit the 3-byte VEX prefix */
> > > +static void emit_3vex(u8 **pprog, bool r, bool x, bool b, u8 m,
> > > +                     bool w, u8 src_reg2, bool l, u8 p)
> >
> > Can you please use somewhat more descriptive variable names here?
> >
> > or add more information about what x, b, m, w, l and p mean?
>
> Apart from src_reg2, the rest is the same as what Intel has chosen to
> name the various fields in the VEX prefix. Would rather keep them
> consistent so that people won't get confused when comparing with other
> documents across the Internet.

Sure, but it would be nice to have a comment about what they mean.

These bits allow indexing various kinds of registers. e.g.

"VEX.~R allows is an extra bit for indexing the ModRM register" or something
similar, if it's preferred not to change the variable names.

> >
> > > +{
> > > +       u8 *prog = *pprog;
> > > +       u8 b0 = 0xc4, b1, b2;
> > > +       u8 src2 = reg2hex[src_reg2];
> > > +
> > > +       if (is_ereg(src_reg2))
> > > +               src2 |= 1 << 3;
> > > +
> > > +       /*
> > > +        *    7                           0
> > > +        *  +---+---+---+---+---+---+---+---+
> > > +        *  |~R |~X |~B |         m         |
> > > +        *  +---+---+---+---+---+---+---+---+
> > > +        */
> > > +       b1 = (!r << 7) | (!x << 6) | (!b << 5) | (m & 0x1f);
> >
> > Some explanation here would help, not everyone is aware of x86 vex encoding :)
>
> The comment is the exact rule how different pieces of information is
> encoded into the 3-byte VEX prefix i.e. their position and length, and
> whether a field needs to be bit inverted. Combined with code the comment
> should give one clear idea what the intent is here.

I am not sure this gives a clear picture, It assumes that the reader knows
about VEX encoding, which not everyone does. Now they can pull up a
manual and start reading but that doesn't help so the comments need to
explain what's going on here.

> >
> > > +       /*
> > > +        *    7                           0
> > > +        *  +---+---+---+---+---+---+---+---+
> > > +        *  | W |     ~vvvv     | L |   pp  |
> > > +        *  +---+---+---+---+---+---+---+---+
> > > +        */
> > > +       b2 = (w << 7) | ((~src2 & 0xf) << 3) | (l << 2) | (p & 3);

By reading the code one should be able to understand what
b0, b1 and b2 are.

> > > +
> > > +       EMIT3(b0, b1, b2);
> > > +       *pprog = prog;
> > > +}
> > > +
> > > +/* emit BMI2 shift instruction */
> > > +static void emit_shiftx(u8 **pprog, u32 dst_reg, u8 src_reg, bool is64, u8 op)
> > > +{
> > > +       u8 *prog = *pprog;
> > > +       bool r = is_ereg(dst_reg);
> > > +       u8 m = 2; /* escape code 0f38 */
> > > +
> > > +       emit_3vex(&prog, r, false, r, m, is64, src_reg, false, op);
> > > +       EMIT2(0xf7, add_2reg(0xC0, dst_reg, dst_reg));
> > > +       *pprog = prog;
> > > +}
> > > +
> > >  #define INSN_SZ_DIFF (((addrs[i] - addrs[i - 1]) - (prog - temp)))
> > >
> > >  static int do_jit(struct bpf_prog *bpf_prog, int *addrs, u8 *image, u8 *rw_image,
> > > @@ -1135,6 +1177,28 @@ static int do_jit(struct bpf_prog *bpf_prog, int *addrs, u8 *image, u8 *rw_image
> > >                 case BPF_ALU64 | BPF_LSH | BPF_X:
> > >                 case BPF_ALU64 | BPF_RSH | BPF_X:
> > >                 case BPF_ALU64 | BPF_ARSH | BPF_X:
> > > +                       /* BMI2 shifts aren't better when shift count is already in rcx */
> > > +                       if (boot_cpu_has(X86_FEATURE_BMI2) && src_reg != BPF_REG_4) {
> > > +                               /* shrx/sarx/shlx dst_reg, dst_reg, src_reg */
> > > +                               bool w = (BPF_CLASS(insn->code) == BPF_ALU64);
> > > +                               u8 op;
> > > +
> > > +                               switch (BPF_OP(insn->code)) {
> > > +                               case BPF_LSH:
> > > +                                       op = 1; /* prefix 0x66 */
> > > +                                       break;
> > > +                               case BPF_RSH:
> > > +                                       op = 3; /* prefix 0xf2 */
> > > +                                       break;
> > > +                               case BPF_ARSH:
> > > +                                       op = 2; /* prefix 0xf3 */
> > > +                                       break;
> > > +                               }
> > > +
> > > +                               emit_shiftx(&prog, dst_reg, src_reg, w, op);
> > > +
> > > +                               break;
> > > +                       }
> > >
> > >                         if (src_reg != BPF_REG_4) { /* common case */
> > >                                 /* Check for bad case when dst_reg == rcx */
> > > --
> > > 2.30.2
> > >