Re: [PATCH v2 bpf-next 0/4] Add internal-only BPF per-CPU instruction

Puranjay Mohan <puranjay12@xxxxxxxxx> · Thu, 4 Apr 2024 18:16:41 +0200

On Thu, Apr 4, 2024 at 6:12 PM Andrii Nakryiko
<andrii.nakryiko@xxxxxxxxx> wrote:
>
> On Mon, Apr 1, 2024 at 7:13 PM Andrii Nakryiko <andrii@xxxxxxxxxx> wrote:
> >
> > Add a new BPF instruction for resolving per-CPU memory addresses.
> >
> > New instruction is a special form of BPF_ALU64 | BPF_MOV | BPF_DW, with
> > insns->off set to BPF_ADDR_PERCPU (== -1). It resolves provided per-CPU offset
> > to an absolute address where per-CPU data resides for "this" CPU.
> >
> > This patch set implements support for it in x86-64 BPF JIT only.
> >
> > Using the new instruction, we also implement inlining for three cases:
> >   - bpf_get_smp_processor_id(), which allows to avoid unnecessary trivial
> >     function call, saving a bit of performance and also not polluting LBR
> >     records with unnecessary function call/return records;
> >   - PERCPU_ARRAY's bpf_map_lookup_elem() is completely inlined, bringing its
> >     performance to implementing per-CPU data structures using global variables
> >     in BPF (which is an awesome improvement, see benchmarks below);
> >   - PERCPU_HASH's bpf_map_lookup_elem() is partially inlined, just like the
> >     same for non-PERCPU HASH map; this still saves a bit of overhead.
> >
> > To validate performance benefits, I hacked together a tiny benchmark doing
> > only bpf_map_lookup_elem() and incrementing the value by 1 for PERCPU_ARRAY
> > (arr-inc benchmark below) and PERCPU_HASH (hash-inc benchmark below) maps. To
> > establish a baseline, I also implemented logic similar to PERCPU_ARRAY based
> > on global variable array using bpf_get_smp_processor_id() to index array for
> > current CPU (glob-arr-inc benchmark below).
> >
> > BEFORE
> > ======
> > glob-arr-inc   :  163.685 ± 0.092M/s
> > arr-inc        :  138.096 ± 0.160M/s
> > hash-inc       :   66.855 ± 0.123M/s
> >
> > AFTER
> > =====
> > glob-arr-inc   :  173.921 ± 0.039M/s (+6%)
> > arr-inc        :  170.729 ± 0.210M/s (+23.7%)
> > hash-inc       :   68.673 ± 0.070M/s (+2.7%)
> >
> > As can be seen, PERCPU_HASH gets a modest +2.7% improvement, while global
> > array-based gets a nice +6% due to inlining of bpf_get_smp_processor_id().
> >
> > But what's really important is that arr-inc benchmark basically catches up
> > with glob-arr-inc, resulting in +23.7% improvement. This means that in
> > practice it won't be necessary to avoid PERCPU_ARRAY anymore if performance is
> > critical (e.g., high-frequent stats collection, which is often a practical use
> > for PERCPU_ARRAY today).
> >
> > v1->v2:
> >   - use BPF_ALU64 | BPF_MOV instruction instead of LDX (Alexei);
> >   - dropped the direct per-CPU memory read instruction, it can always be added
> >     back, if necessary;
> >   - guarded bpf_get_smp_processor_id() behind x86-64 check (Alexei);
> >   - switched all per-cpu addr casts to (unsigned long) to avoid sparse
> >     warnings.
> >
> > Andrii Nakryiko (4):
> >   bpf: add special internal-only MOV instruction to resolve per-CPU
> >     addrs
> >   bpf: inline bpf_get_smp_processor_id() helper
> >   bpf: inline bpf_map_lookup_elem() for PERCPU_ARRAY maps
> >   bpf: inline bpf_map_lookup_elem() helper for PERCPU_HASH map
> >
> >  arch/x86/net/bpf_jit_comp.c | 16 ++++++++++++++++
> >  include/linux/filter.h      | 20 ++++++++++++++++++++
> >  kernel/bpf/arraymap.c       | 33 +++++++++++++++++++++++++++++++++
> >  kernel/bpf/core.c           |  5 +++++
> >  kernel/bpf/disasm.c         | 14 ++++++++++++++
> >  kernel/bpf/hashtab.c        | 21 +++++++++++++++++++++
> >  kernel/bpf/verifier.c       | 24 ++++++++++++++++++++++++
> >  7 files changed, 133 insertions(+)
> >
> > --
> > 2.43.0
> >
>
> Puranjay, Pu,
>
> Is this something you guys can help implement for ARM64 and RISC-V
> architectures as well? I'm lacking the knowledge of the specifics of
> this architecture to be able to do this by myself, but it is a nice
> improvement for BPF applications, so it would be nice to get wide
> architecture support.
>
> Thank you!

I have already started implementing this on ARM64,

Pu, can you do it for RISC-V? If you don't have cycles then I could do it after
finishing ARM64.

Thanks,
Puranjay