Re: [PATCH v2 bpf-next 0/4] Add internal-only BPF per-CPU instruction

Andrii Nakryiko <andrii.nakryiko@xxxxxxxxx> · Tue, 2 Apr 2024 09:20:16 -0700

On Mon, Apr 1, 2024 at 9:57 PM John Fastabend <john.fastabend@xxxxxxxxx> wrote:
>
> Andrii Nakryiko wrote:
> > Add a new BPF instruction for resolving per-CPU memory addresses.
> >
> > New instruction is a special form of BPF_ALU64 | BPF_MOV | BPF_DW, with
> > insns->off set to BPF_ADDR_PERCPU (== -1). It resolves provided per-CPU offset
> > to an absolute address where per-CPU data resides for "this" CPU.
> >
> > This patch set implements support for it in x86-64 BPF JIT only.
> >
> > Using the new instruction, we also implement inlining for three cases:
> >   - bpf_get_smp_processor_id(), which allows to avoid unnecessary trivial
> >     function call, saving a bit of performance and also not polluting LBR
> >     records with unnecessary function call/return records;
> >   - PERCPU_ARRAY's bpf_map_lookup_elem() is completely inlined, bringing its
> >     performance to implementing per-CPU data structures using global variables
> >     in BPF (which is an awesome improvement, see benchmarks below);
> >   - PERCPU_HASH's bpf_map_lookup_elem() is partially inlined, just like the
> >     same for non-PERCPU HASH map; this still saves a bit of overhead.
> >
> > To validate performance benefits, I hacked together a tiny benchmark doing
> > only bpf_map_lookup_elem() and incrementing the value by 1 for PERCPU_ARRAY
> > (arr-inc benchmark below) and PERCPU_HASH (hash-inc benchmark below) maps. To
> > establish a baseline, I also implemented logic similar to PERCPU_ARRAY based
> > on global variable array using bpf_get_smp_processor_id() to index array for
> > current CPU (glob-arr-inc benchmark below).
> >
> > BEFORE
> > ======
> > glob-arr-inc   :  163.685 ± 0.092M/s
> > arr-inc        :  138.096 ± 0.160M/s
> > hash-inc       :   66.855 ± 0.123M/s
> >
> > AFTER
> > =====
> > glob-arr-inc   :  173.921 ± 0.039M/s (+6%)
> > arr-inc        :  170.729 ± 0.210M/s (+23.7%)
> > hash-inc       :   68.673 ± 0.070M/s (+2.7%)
> >
> > As can be seen, PERCPU_HASH gets a modest +2.7% improvement, while global
> > array-based gets a nice +6% due to inlining of bpf_get_smp_processor_id().
> >
> > But what's really important is that arr-inc benchmark basically catches up
> > with glob-arr-inc, resulting in +23.7% improvement. This means that in
> > practice it won't be necessary to avoid PERCPU_ARRAY anymore if performance is
> > critical (e.g., high-frequent stats collection, which is often a practical use
> > for PERCPU_ARRAY today).
>
> Out of curiousity did we consider exposing this instruction outside internal
> inlining? It seems it would help compiler some to not believe its doing a
> function call.

We decided to start as internal-only to try it out and get inlining
benefits, without being stuck in a longer discussion to define the
exact user-visible semantics, guarantees, etc. Given this instruction
calculates memory addresses that are meant to be dereferenced
directly, we'd need to be careful to make sure all the safety aspects
are carefully considered.

Though you reminded me that we should probably also implement inlining
of bpf_this_cpu_ptr() and maybe even bpf_per_cpu_ptr() (I'll need to
double-check how arbitrary CPU address is calculated). Maybe as a
follow up.

>
> We could do some runtime rewrites to find the address for global vars for
> example.
>
> FWIW I don't think one should block this necessarily perhaps as follow up?
> Or at least worth considering if I didn't miss some reason its not
> plausible.

No blocker in principle, but certainly we'd need to be more careful if
we expose the instruction to users.

>
> >
> > v1->v2:
> >   - use BPF_ALU64 | BPF_MOV instruction instead of LDX (Alexei);
> >   - dropped the direct per-CPU memory read instruction, it can always be added
> >     back, if necessary;
> >   - guarded bpf_get_smp_processor_id() behind x86-64 check (Alexei);
> >   - switched all per-cpu addr casts to (unsigned long) to avoid sparse
> >     warnings.
> >
> > Andrii Nakryiko (4):
> >   bpf: add special internal-only MOV instruction to resolve per-CPU
> >     addrs
> >   bpf: inline bpf_get_smp_processor_id() helper
> >   bpf: inline bpf_map_lookup_elem() for PERCPU_ARRAY maps
> >   bpf: inline bpf_map_lookup_elem() helper for PERCPU_HASH map
> >
> >  arch/x86/net/bpf_jit_comp.c | 16 ++++++++++++++++
> >  include/linux/filter.h      | 20 ++++++++++++++++++++
> >  kernel/bpf/arraymap.c       | 33 +++++++++++++++++++++++++++++++++
> >  kernel/bpf/core.c           |  5 +++++
> >  kernel/bpf/disasm.c         | 14 ++++++++++++++
> >  kernel/bpf/hashtab.c        | 21 +++++++++++++++++++++
> >  kernel/bpf/verifier.c       | 24 ++++++++++++++++++++++++
> >  7 files changed, 133 insertions(+)
> >
> > --
> > 2.43.0
> >
> >
>
>