Andrii Nakryiko wrote: > On Mon, Apr 1, 2024 at 9:57 PM John Fastabend <john.fastabend@xxxxxxxxx> wrote: > > > > Andrii Nakryiko wrote: > > > Add a new BPF instruction for resolving per-CPU memory addresses. > > > > > > New instruction is a special form of BPF_ALU64 | BPF_MOV | BPF_DW, with > > > insns->off set to BPF_ADDR_PERCPU (== -1). It resolves provided per-CPU offset > > > to an absolute address where per-CPU data resides for "this" CPU. > > > > > > This patch set implements support for it in x86-64 BPF JIT only. > > > > > > Using the new instruction, we also implement inlining for three cases: > > > - bpf_get_smp_processor_id(), which allows to avoid unnecessary trivial > > > function call, saving a bit of performance and also not polluting LBR > > > records with unnecessary function call/return records; > > > - PERCPU_ARRAY's bpf_map_lookup_elem() is completely inlined, bringing its > > > performance to implementing per-CPU data structures using global variables > > > in BPF (which is an awesome improvement, see benchmarks below); > > > - PERCPU_HASH's bpf_map_lookup_elem() is partially inlined, just like the > > > same for non-PERCPU HASH map; this still saves a bit of overhead. > > > > > > To validate performance benefits, I hacked together a tiny benchmark doing > > > only bpf_map_lookup_elem() and incrementing the value by 1 for PERCPU_ARRAY > > > (arr-inc benchmark below) and PERCPU_HASH (hash-inc benchmark below) maps. To > > > establish a baseline, I also implemented logic similar to PERCPU_ARRAY based > > > on global variable array using bpf_get_smp_processor_id() to index array for > > > current CPU (glob-arr-inc benchmark below). > > > > > > BEFORE > > > ====== > > > glob-arr-inc : 163.685 ± 0.092M/s > > > arr-inc : 138.096 ± 0.160M/s > > > hash-inc : 66.855 ± 0.123M/s > > > > > > AFTER > > > ===== > > > glob-arr-inc : 173.921 ± 0.039M/s (+6%) > > > arr-inc : 170.729 ± 0.210M/s (+23.7%) > > > hash-inc : 68.673 ± 0.070M/s (+2.7%) > > > > > > As can be seen, PERCPU_HASH gets a modest +2.7% improvement, while global > > > array-based gets a nice +6% due to inlining of bpf_get_smp_processor_id(). > > > > > > But what's really important is that arr-inc benchmark basically catches up > > > with glob-arr-inc, resulting in +23.7% improvement. This means that in > > > practice it won't be necessary to avoid PERCPU_ARRAY anymore if performance is > > > critical (e.g., high-frequent stats collection, which is often a practical use > > > for PERCPU_ARRAY today). > > > > Out of curiousity did we consider exposing this instruction outside internal > > inlining? It seems it would help compiler some to not believe its doing a > > function call. > > We decided to start as internal-only to try it out and get inlining > benefits, without being stuck in a longer discussion to define the > exact user-visible semantics, guarantees, etc. Given this instruction > calculates memory addresses that are meant to be dereferenced > directly, we'd need to be careful to make sure all the safety aspects > are carefully considered. > > Though you reminded me that we should probably also implement inlining > of bpf_this_cpu_ptr() and maybe even bpf_per_cpu_ptr() (I'll need to > double-check how arbitrary CPU address is calculated). Maybe as a > follow up. > > > > > We could do some runtime rewrites to find the address for global vars for > > example. > > > > FWIW I don't think one should block this necessarily perhaps as follow up? > > Or at least worth considering if I didn't miss some reason its not > > plausible. > > No blocker in principle, but certainly we'd need to be more careful if > we expose the instruction to users. All sounds good that was my understanding as well. I see it got merged but for what its worth I went through and ran some of this and rest of patches lgtm, Reviewed-by: John Fastabend <john.fastabend@xxxxxxxxx> Tested-by: John Fastabend <john.fastabend@xxxxxxxxx> > > > > > > > > > v1->v2: > > > - use BPF_ALU64 | BPF_MOV instruction instead of LDX (Alexei); > > > - dropped the direct per-CPU memory read instruction, it can always be added > > > back, if necessary; > > > - guarded bpf_get_smp_processor_id() behind x86-64 check (Alexei); > > > - switched all per-cpu addr casts to (unsigned long) to avoid sparse > > > warnings. > > > > > > Andrii Nakryiko (4): > > > bpf: add special internal-only MOV instruction to resolve per-CPU > > > addrs > > > bpf: inline bpf_get_smp_processor_id() helper > > > bpf: inline bpf_map_lookup_elem() for PERCPU_ARRAY maps > > > bpf: inline bpf_map_lookup_elem() helper for PERCPU_HASH map > > > > > > arch/x86/net/bpf_jit_comp.c | 16 ++++++++++++++++ > > > include/linux/filter.h | 20 ++++++++++++++++++++ > > > kernel/bpf/arraymap.c | 33 +++++++++++++++++++++++++++++++++ > > > kernel/bpf/core.c | 5 +++++ > > > kernel/bpf/disasm.c | 14 ++++++++++++++ > > > kernel/bpf/hashtab.c | 21 +++++++++++++++++++++ > > > kernel/bpf/verifier.c | 24 ++++++++++++++++++++++++ > > > 7 files changed, 133 insertions(+) > > > > > > -- > > > 2.43.0 > > > > > > > > > >