Hi Andrii,

On Thu, Apr 4, 2024 at 6:12 PM Andrii Nakryiko <andrii.nakryiko@xxxxxxxxx> wrote:
>
> On Mon, Apr 1, 2024 at 7:13 PM Andrii Nakryiko <andrii@xxxxxxxxxx> wrote:
> >
> > Add a new BPF instruction for resolving per-CPU memory addresses.
> >
> > New instruction is a special form of BPF_ALU64 | BPF_MOV | BPF_DW, with
> > insns->off set to BPF_ADDR_PERCPU (== -1). It resolves provided per-CPU offset
> > to an absolute address where per-CPU data resides for "this" CPU.
> >
> > This patch set implements support for it in x86-64 BPF JIT only.
> >
> > Using the new instruction, we also implement inlining for three cases:
> >   - bpf_get_smp_processor_id(), which allows to avoid unnecessary trivial
> >     function call, saving a bit of performance and also not polluting LBR
> >     records with unnecessary function call/return records;
> >   - PERCPU_ARRAY's bpf_map_lookup_elem() is completely inlined, bringing its
> >     performance to implementing per-CPU data structures using global variables
> >     in BPF (which is an awesome improvement, see benchmarks below);
> >   - PERCPU_HASH's bpf_map_lookup_elem() is partially inlined, just like the
> >     same for non-PERCPU HASH map; this still saves a bit of overhead.
> >
> > To validate performance benefits, I hacked together a tiny benchmark doing
> > only bpf_map_lookup_elem() and incrementing the value by 1 for PERCPU_ARRAY
> > (arr-inc benchmark below) and PERCPU_HASH (hash-inc benchmark below) maps. To
> > establish a baseline, I also implemented logic similar to PERCPU_ARRAY based
> > on global variable array using bpf_get_smp_processor_id() to index array for
> > current CPU (glob-arr-inc benchmark below).

Can you share the code for these benchmarks? I want to use the same to
compare the performance on ARM64.

Thanks,
Puranjay