On Thu, Mar 21, 2024 at 4:49 PM Alexei Starovoitov
<alexei.starovoitov@xxxxxxxxx> wrote:
>
> On Thu, Mar 21, 2024 at 11:05 AM Andrii Nakryiko <andrii@xxxxxxxxxx> wrote:
> >
> > Add arch-specific inlining of bpf_get_smp_processor_id() using x86-64's
> > gs segment-based addressing.
> >
> > Just to be on the safer side, both forms are supported: rip-relative
> > addressing (providing a shorter instruction, but limiting the offset to
> > signed 32 bits) and the more universal absolute memory offset
> > addressing, used as a fallback in the (unlikely) scenario that the
> > given offset doesn't fit into s32. The absolute form is 5 bytes longer,
> > and it seems compilers prefer rip-relative instructions when compiling
> > kernel code.
> >
> > Both instructions were tested and confirmed using gdb. We also already
> > have a BPF selftest (raw_tp_test_run) that validates the correctness of
> > bpf_get_smp_processor_id() while running the target BPF program on each
> > online CPU.
> >
> > Here's a disassembly of the bpf_get_smp_processor_id() helper:
> >
> > $ gdb -batch -ex 'file vmlinux' -ex 'set disassembly-flavor intel' -ex 'disassemble/r bpf_get_smp_processor_id'
> > Dump of assembler code for function bpf_get_smp_processor_id:
> >    0xffffffff810fa890 <+0>:  0f 1f 44 00 00        nop    DWORD PTR [rax+rax*1+0x0]
> >    0xffffffff810fa895 <+5>:  65 8b 05 70 62 f3 7e  mov    eax,DWORD PTR gs:[rip+0x7ef36270]  # 0x30b0c <pcpu_hot+12>
> >    0xffffffff810fa89c <+12>: 48 98                 cdqe
> >    0xffffffff810fa89e <+14>: c3                    ret
> > End of assembler dump.
> >
> > And here's a GDB disassembly of a piece of a BPF program calling
> > bpf_get_smp_processor_id():
> >
> > $ sudo cat /proc/kallsyms | rg 'pcpu_hot|bpf_prog_2b455b4f8a8d48c5_kexit'
> > 000000000002d840 A pcpu_hot
> > ffffffffa000f8a8 t bpf_prog_2b455b4f8a8d48c5_kexit [bpf]
> >
> > Then, attaching GDB to the running kernel in QEMU and breaking inside
> > the BPF program:
> >
> > (gdb) b *0xffffffffa000f8e2
> > Breakpoint 1 at 0xffffffffa000f8e2
> >
> > When the RIP-relative instruction is used:
> >
> >    0xffffffffa000f8e2  mov    %gs:0x6001df63(%rip),%eax  # 0x2d84c <pcpu_hot+12>
> >    0xffffffffa000f8e9  cltq
> >
> > You can see that the final address is resolved to <pcpu_hot+12>, as
> > expected.
> >
> > When absolute addressing is used:
> >
> >    0xffffffffa000f8e2  movabs %gs:0x2d84c,%eax
> >    0xffffffffa000f8ed  cltq
> >
> > And here 0x2d84c matches the pcpu_hot address from kallsyms (0x2d840)
> > plus the 12-byte (0xc) offset of the cpu_number field.
> >
> > This inlining eliminates the entire function call for this (rather
> > trivial in terms of instructions executed) helper, saving a bit of
> > performance, but, foremost, saving LBR records (1 in
> > PERF_SAMPLE_BRANCH_ANY_RETURN mode and 2 in PERF_SAMPLE_BRANCH_ANY
> > mode), which is what motivated this work in the first place.
> >
> > Signed-off-by: Andrii Nakryiko <andrii@xxxxxxxxxx>
> > ---
> >  arch/x86/net/bpf_jit_comp.c | 26 +++++++++++++++++++++++++-
> >  1 file changed, 25 insertions(+), 1 deletion(-)
> >
> > diff --git a/arch/x86/net/bpf_jit_comp.c b/arch/x86/net/bpf_jit_comp.c
> > index 4900b1ee019f..5b7fdc24b5b8 100644
> > --- a/arch/x86/net/bpf_jit_comp.c
> > +++ b/arch/x86/net/bpf_jit_comp.c
> > @@ -457,6 +457,9 @@ static void emit_prologue(u8 **pprog, u32 stack_depth, bool ebpf_from_cbpf,
> >  	*pprog = prog;
> >  }
> >
> > +/* reference to bpf_get_smp_processor_id() helper implementation to detect it for inlining */
> > +extern u64 bpf_get_smp_processor_id(u64, u64, u64, u64, u64);
> > +
> >  static int emit_patch(u8 **pprog, void *func, void *ip, u8 opcode)
> >  {
> >  	u8 *prog = *pprog;
> > @@ -467,7 +470,28 @@ static int emit_patch(u8 **pprog, void *func, void *ip, u8 opcode)
> >  		pr_err("Target call %p is out of range\n", func);
> >  		return -ERANGE;
> >  	}
> > -	EMIT1_off32(opcode, offset);
> > +
> > +	/* inline bpf_get_smp_processor_id() to avoid calls */
> > +	if (opcode == 0xE8 && func == &bpf_get_smp_processor_id) {
> > +		/* 7 to account for the mov instruction itself,
> > +		 * as rip value *after* mov instruction is used
> > +		 */
> > +		offset = (void *)&pcpu_hot.cpu_number - ip - 7;
> > +		if (is_simm32(offset)) {
> > +			/* mov eax,DWORD PTR gs:[rip+<offset>] ; <pcpu_hot+12> */
> > +			EMIT3_off32(0x65, 0x8b, 0x05, (u32)offset);
> > +		} else {
> > +			/* mov eax,DWORD PTR gs:<offset> ; <pcpu_hot+12> */
> > +			offset = (s64)(void *)&pcpu_hot.cpu_number;
> > +			EMIT2(0x65, 0xa1);
> > +			EMIT((u32)offset, 4);
> > +			EMIT((u64)offset >> 32, 4);
> > +		}
> > +		EMIT2(0x48, 0x98); /* cdqe, sign-extend eax to rax */
>
> Please introduce new pseudo insn that can access per-cpu vars
> in a generic way instead of hacking a specific case.

Sure, but do they have to be mutually exclusive? One doesn't prevent the
other. Having bpf_get_smp_processor_id() inlined benefits tons of
existing BPF applications transparently, which sounds like a win to me.

Designing and adding a new instruction also sounds fine, but it's a
separate feature, is more involved, and will require careful
consideration. E.g., we'll need to think through whether all JITs will
be able to implement it in native code (i.e., whether they will have
enough free registers to do this efficiently; x86-64 is lucky not to
need any extra registers, I believe ARM64 needs one extra register, and
for other architectures I have no idea). There are safety considerations
as well (do we accept any random offset, or only ones coming from
per-CPU BTF variables, etc.). I don't know if there are any differences,
but per-CPU access for module per-CPU variables is also something to
look into (I have no clue). And even once we have this instruction and
the corresponding compiler support, it will still take a while until
applications can assume its availability, so that adds to the logistics.

Also, even with this instruction, getting the CPU ID efficiently is
still important for cases when a BPF application needs its own per-CPU
"storage" solution. I'm not sure it's a good experience to require every
user to figure out the pcpu_hot.cpu_number thing by themselves through a
few layers of kernel macros.

As an interim compromise and solution, would you like me to implement a
similar inlining for bpf_this_cpu_ptr() as well?
It's just as trivial to do as for bpf_get_smp_processor_id(), would
benefit all existing users of bpf_this_cpu_ptr() as well, and would rely
on existing BTF information and the surrounding infrastructure. (A
rough, untested sketch of what I have in mind is at the bottom of this
email.)

What's the worst outcome? If the kernel changes how the CPU number is
defined (i.e., it's no longer pcpu_hot.cpu_number) and we can't easily
adapt the JIT logic, we just stop doing the inlining and lose a bit of
performance and the function call avoidance. Bad, but not API-breaking
or anything like that. And we will detect when things change: we have a
test that checks this logic on each CPU, making sure we get the right
CPU number.

> Then we can use it in get_lookup in percpu array and hash maps
> improving their performance and in lots of other places.
>
> pw-bot: cr
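
Here's the rough sketch of the bpf_this_cpu_ptr() inlining I mentioned
above, just to make the proposal concrete. It's completely untested and
only illustrative: it mirrors the bpf_get_smp_processor_id() case in
emit_patch() in the patch above, assumes this_cpu_off is reachable from
the JIT the same way pcpu_hot is, relies on BPF r1 (the per-CPU pointer
argument) arriving in rdi for a helper call, and simply skips the
inlining (falling back to a regular call) when the rip-relative offset
doesn't fit into s32. Exact instruction encodings would need the same
gdb verification as above.

/* reference to bpf_this_cpu_ptr() helper implementation to detect it for inlining */
extern u64 bpf_this_cpu_ptr(u64, u64, u64, u64, u64);

	/* inline bpf_this_cpu_ptr(ptr) as "rax = rdi + gs:[this_cpu_off]" */
	if (opcode == 0xE8 && func == &bpf_this_cpu_ptr) {
		/* 11 accounts for the 3-byte mov below plus the 8-byte add,
		 * as the rip value *after* the add instruction is used
		 */
		s64 off = (void *)&this_cpu_off - ip - 11;

		if (is_simm32(off)) {
			/* mov rax,rdi ; r1 (per-CPU pointer) is passed in rdi */
			EMIT3(0x48, 0x89, 0xf8);
			/* add rax,QWORD PTR gs:[rip+<off>] ; <this_cpu_off> */
			EMIT2(0x65, 0x48);
			EMIT2(0x03, 0x05);
			EMIT((u32)off, 4);
			*pprog = prog;
			return 0;
		}
		/* otherwise fall through and emit the normal helper call
		 * (or add an absolute-address form like in the patch above)
		 */
	}

If this direction sounds reasonable, I can send it as a proper follow-up
patch with the same kind of gdb verification and a selftest.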