Re: [PATCH v2 bpf-next 0/4] Add internal-only BPF per-CPU instruction

Andrii Nakryiko <andrii.nakryiko@xxxxxxxxx> · Thu, 4 Apr 2024 14:21:33 -0700

On Thu, Apr 4, 2024 at 2:03 PM Puranjay Mohan <puranjay12@xxxxxxxxx> wrote:
>
> Hi Andrii,
>
> On Thu, Apr 4, 2024 at 6:12 PM Andrii Nakryiko
> <andrii.nakryiko@xxxxxxxxx> wrote:
> >
> > On Mon, Apr 1, 2024 at 7:13 PM Andrii Nakryiko <andrii@xxxxxxxxxx> wrote:
> > >
> > > Add a new BPF instruction for resolving per-CPU memory addresses.
> > >
> > > New instruction is a special form of BPF_ALU64 | BPF_MOV | BPF_X, with
> > > insns->off set to BPF_ADDR_PERCPU (== -1). It resolves provided per-CPU offset
> > > to an absolute address where per-CPU data resides for "this" CPU.
> > >
> > > This patch set implements support for it in x86-64 BPF JIT only.
> > >
> > > Using the new instruction, we also implement inlining for three cases:
> > >   - bpf_get_smp_processor_id(), which avoids an unnecessary trivial
> > >     function call, saving a bit of performance and also not polluting LBR
> > >     records with unnecessary function call/return records;
> > >   - PERCPU_ARRAY's bpf_map_lookup_elem() is completely inlined, bringing its
> > >     performance close to that of per-CPU data structures implemented using
> > >     global variables in BPF (an awesome improvement, see benchmarks below);
> > >   - PERCPU_HASH's bpf_map_lookup_elem() is partially inlined, just like the
> > >     same for non-PERCPU HASH map; this still saves a bit of overhead.
> > >
> > > To validate performance benefits, I hacked together a tiny benchmark doing
> > > only bpf_map_lookup_elem() and incrementing the value by 1 for PERCPU_ARRAY
> > > (arr-inc benchmark below) and PERCPU_HASH (hash-inc benchmark below) maps. To
> > > establish a baseline, I also implemented logic similar to PERCPU_ARRAY based
> > > on global variable array using bpf_get_smp_processor_id() to index array for
> > > current CPU (glob-arr-inc benchmark below).
>
> Can you share the code for these benchmarks? I want to use the same to
> compare the performance
> on ARM64.
>

Sure, see [0]. You can run:

$ ./benchs/run_bench_trigger.sh glob-arr-inc arr-inc hash-inc

from the tools/testing/selftests/bpf directory to get something like:

$ ./benchs/run_bench_trigger.sh glob-arr-inc arr-inc hash-inc
glob-arr-inc   :  243.196 ± 7.879M/s
arr-inc        :  218.139 ± 4.407M/s
hash-inc       :   97.727 ± 3.643M/s

  [0] https://github.com/anakryiko/linux/commit/8dec900975ef1a0308a7862154735549d6b66f64
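
For anyone reading along without the tree handy, here is a rough
user-space model of what the quoted cover letter describes: resolving a
per-CPU offset to an absolute address, then doing the arr-inc style
"look up and increment by 1" step on it. This is only a sketch under
stated assumptions, not the benchmark code from [0] and not what the JIT
emits. All names (model_percpu_addr, model_arr_inc, NR_CPUS_MODEL,
PERCPU_SLOTS) are hypothetical, and the CPU is passed in explicitly,
whereas the real instruction uses the base of whichever CPU is running.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NR_CPUS_MODEL 4  /* hypothetical CPU count for this model */
#define PERCPU_SLOTS  8  /* hypothetical number of u64 slots per CPU */

/* One private area per modeled CPU; stands in for real per-CPU memory. */
static uint64_t percpu_area[NR_CPUS_MODEL][PERCPU_SLOTS];

/*
 * Model of the address resolution: take a byte offset into the per-CPU
 * area and return the absolute address of that data for the given CPU.
 */
static void *model_percpu_addr(size_t off, unsigned int cpu)
{
	assert(cpu < NR_CPUS_MODEL);
	assert(off < sizeof(percpu_area[0]));
	assert(off % sizeof(uint64_t) == 0); /* keep u64 accesses aligned */
	return (unsigned char *)percpu_area[cpu] + off;
}

/* arr-inc style step: resolve this CPU's slot and increment it by 1. */
static uint64_t model_arr_inc(size_t off, unsigned int cpu)
{
	uint64_t *val = model_percpu_addr(off, cpu);

	return ++(*val);
}
```

The same offset yields a different address for each CPU, so increments
done on behalf of one CPU never touch another CPU's copy; that is the
property the inlined PERCPU_ARRAY lookup relies on.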

> Thanks,
> Puranjay