Add two new BPF instructions for dealing with per-CPU memory.

One, BPF_LDX | BPF_ADDR_PERCPU | BPF_DW (where BPF_ADDR_PERCPU is the
unused 0xe0 opcode), resolves the provided per-CPU address (offset) to
an absolute address where per-CPU data resides for "this" CPU. This is
the most universal and, strictly speaking, the only per-CPU BPF
instruction necessary.

I also added BPF_LDX | BPF_MEM_PERCPU | BPF_{B,H,W,DW} (BPF_MEM_PERCPU
using another unused 0xc0 opcode), which can be considered an
optimization instruction: it allows *reading* up to 8 bytes of per-CPU
data in one instruction, without having to first resolve the address
and then dereference the memory. It is used in the inlining of
bpf_get_smp_processor_id(), but the latter could just as well be
implemented with BPF_ADDR_PERCPU followed by a normal BPF_LDX |
BPF_MEM load, so I'm fine with dropping it, if requested. (An
interpreter-style sketch of both instructions' semantics is appended
after the diffstat below.)

These instructions are currently supported only by the x86-64 BPF JIT,
but it would be great if support for other arches was added ASAP, of
course.

In either case, we also implement inlining for three cases (sketches
of these rewrites are likewise appended after the diffstat):
  - bpf_get_smp_processor_id(), which avoids an unnecessary trivial
    function call, saving a bit of performance and also not polluting
    LBR records with unnecessary function call/return records;
  - PERCPU_ARRAY's bpf_map_lookup_elem() is completely inlined,
    bringing its performance on par with per-CPU data structures
    implemented using global variables in BPF (which is an awesome
    improvement, see benchmarks below);
  - PERCPU_HASH's bpf_map_lookup_elem() is partially inlined, just
    like for the non-PERCPU HASH map; this still saves a bit of
    overhead.

To validate performance benefits, I hacked together a tiny benchmark
doing only bpf_map_lookup_elem() and incrementing the value by 1 for
PERCPU_ARRAY (arr-inc benchmark below) and PERCPU_HASH (hash-inc
benchmark below) maps. To establish a baseline, I also implemented
logic similar to PERCPU_ARRAY based on a global variable array, using
bpf_get_smp_processor_id() to index the array for the current CPU
(glob-arr-inc benchmark below).

BEFORE
======
glob-arr-inc   :  163.685 ± 0.092M/s
arr-inc        :  138.096 ± 0.160M/s
hash-inc       :   66.855 ± 0.123M/s

AFTER
=====
glob-arr-inc   :  173.921 ± 0.039M/s (+6%)
arr-inc        :  170.729 ± 0.210M/s (+23.7%)
hash-inc       :   68.673 ± 0.070M/s (+2.7%)

As can be seen, PERCPU_HASH gets a modest +2.7% improvement, while the
global array-based variant gets a nice +6% thanks to the inlining of
bpf_get_smp_processor_id(). But what's really important is that the
arr-inc benchmark basically catches up with glob-arr-inc, resulting in
a +23.7% improvement. This means that in practice it won't be
necessary to avoid PERCPU_ARRAY anymore if performance is critical
(e.g., high-frequency stats collection, which is often a practical use
case for PERCPU_ARRAY today).

Andrii Nakryiko (4):
  bpf: add internal-only per-CPU LDX instructions
  bpf: inline bpf_get_smp_processor_id() helper
  bpf: inline bpf_map_lookup_elem() for PERCPU_ARRAY maps
  bpf: inline bpf_map_lookup_elem() helper for PERCPU_HASH map

 arch/x86/net/bpf_jit_comp.c | 29 +++++++++++++++++++++++++++++
 include/linux/filter.h      | 27 +++++++++++++++++++++++++++
 kernel/bpf/arraymap.c       | 33 +++++++++++++++++++++++++++++++++
 kernel/bpf/core.c           |  5 +++++
 kernel/bpf/disasm.c         | 33 ++++++++++++++++++++++++++-------
 kernel/bpf/hashtab.c        | 21 +++++++++++++++++++++
 kernel/bpf/verifier.c       | 17 +++++++++++++++++
 7 files changed, 158 insertions(+), 7 deletions(-)

--
2.43.0
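For reference, here is a minimal interpreter-style sketch of the two
new loads' semantics, in the spirit of ___bpf_prog_run(); the label
names and exact casts are illustrative, not the patch's literal code:

	/* BPF_LDX | BPF_ADDR_PERCPU | BPF_DW: resolve the per-CPU
	 * offset in SRC + insn->off to an absolute address where
	 * "this" CPU's copy of the data lives
	 */
	LDX_ADDR_PERCPU_DW:
		DST = (u64)(unsigned long)this_cpu_ptr(
			(void __percpu *)(unsigned long)(SRC + insn->off));
		CONT;
	/* BPF_LDX | BPF_MEM_PERCPU | BPF_W (B/H/DW are analogous):
	 * resolve and dereference in one go, reading this CPU's copy
	 */
	LDX_MEM_PERCPU_W:
		DST = *(u32 *)this_cpu_ptr(
			(void __percpu *)(unsigned long)(SRC + insn->off));
		CONT;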
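With BPF_MEM_PERCPU available, the bpf_get_smp_processor_id() inlining
boils down to a two-instruction rewrite during verifier fixup, roughly
as below; pcpu_hot.cpu_number is the actual x86-64 per-CPU variable
holding the current CPU number, while the exact encoding here is just
a sketch:

	struct bpf_insn insn_buf[2];

	/* r0 = &pcpu_hot.cpu_number; per-CPU offsets fit into 32 bits */
	insn_buf[0] = BPF_MOV32_IMM(BPF_REG_0,
				    (u32)(unsigned long)&pcpu_hot.cpu_number);
	/* r0 = *(u32 *)this_cpu_ptr(r0), a single BPF_MEM_PERCPU load */
	insn_buf[1] = BPF_RAW_INSN(BPF_LDX | BPF_MEM_PERCPU | BPF_W,
				   BPF_REG_0, BPF_REG_0, 0, 0);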
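The fully-inlined PERCPU_ARRAY lookup is expressed through the
map_gen_lookup() callback; the sequence below sketches the idea (r1 =
map pointer, r2 = key pointer on entry) and isn't necessarily the
patch's exact output:

	static int percpu_array_map_gen_lookup(struct bpf_map *map,
					       struct bpf_insn *insn_buf)
	{
		struct bpf_insn *insn = insn_buf;

		/* r0 = *(u32 *)key; out-of-bounds index produces NULL */
		*insn++ = BPF_LDX_MEM(BPF_W, BPF_REG_0, BPF_REG_2, 0);
		*insn++ = BPF_JMP_IMM(BPF_JGE, BPF_REG_0, map->max_entries, 5);
		/* r0 = array->pptrs[idx], the element's per-CPU offset */
		*insn++ = BPF_ALU64_IMM(BPF_LSH, BPF_REG_0, 3);
		*insn++ = BPF_ALU64_REG(BPF_ADD, BPF_REG_0, BPF_REG_1);
		*insn++ = BPF_LDX_MEM(BPF_DW, BPF_REG_0, BPF_REG_0,
				      offsetof(struct bpf_array, pptrs));
		/* r0 = this CPU's address for that offset (BPF_ADDR_PERCPU) */
		*insn++ = BPF_RAW_INSN(BPF_LDX | BPF_ADDR_PERCPU | BPF_DW,
				       BPF_REG_0, BPF_REG_0, 0, 0);
		*insn++ = BPF_JMP_A(1);
		*insn++ = BPF_MOV64_IMM(BPF_REG_0, 0);
		return insn - insn_buf;
	}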
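Lastly, for context on the benchmark, the BPF side of the glob-arr-inc
baseline is conceptually just the following (the section name, CPU
bound, and identifiers are illustrative, not the actual bench code):

	#include <linux/bpf.h>
	#include <bpf/bpf_helpers.h>

	#define MAX_CPUS 256	/* sketch assumption */

	__u64 counters[MAX_CPUS];	/* global array, one slot per CPU */

	SEC("raw_tp/sys_enter")
	int glob_arr_inc(void *ctx)
	{
		__u32 cpu = bpf_get_smp_processor_id();

		/* explicit bound check keeps the verifier happy */
		if (cpu < MAX_CPUS)
			counters[cpu] += 1;
		return 0;
	}

	char LICENSE[] SEC("license") = "GPL";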