On Fri, Apr 26, 2024 at 5:14 AM Puranjay Mohan <puranjay@xxxxxxxxxx> wrote:
>
> From: Puranjay Mohan <puranjay12@xxxxxxxxx>
>
> Support an instruction for resolving absolute addresses of per-CPU
> data from their per-CPU offsets. This instruction is internal-only and
> users are not allowed to use them directly. They will only be used for
> internal inlining optimizations for now between BPF verifier and BPF
> JITs.
>
> Since commit 7158627686f0 ("arm64: percpu: implement optimised pcpu
> access using tpidr_el1"), the per-cpu offset for the CPU is stored in
> the tpidr_el1/2 register of that CPU.
>
> To support this BPF instruction in the ARM64 JIT, the following ARM64
> instructions are emitted:
>
>   mov dst, src            // Move src to dst, if src != dst
>   mrs tmp, tpidr_el1/2    // Move per-cpu offset of the current cpu in tmp.
>   add dst, dst, tmp       // Add the per cpu offset to the dst.
>
> To measure the performance improvement provided by this change, the
> benchmark in [1] was used:
>
> Before:
> glob-arr-inc   :   23.597 ± 0.012M/s
> arr-inc        :   23.173 ± 0.019M/s
> hash-inc       :   12.186 ± 0.028M/s
>
> After:
> glob-arr-inc   :   23.819 ± 0.034M/s
> arr-inc        :   23.285 ± 0.017M/s

I still expected a better improvement (global-arr-inc's results
improved more than arr-inc, which is completely different from
x86-64), but it's still a good thing to support this for arm64, of
course.

ack for generic parts I can understand:

Acked-by: Andrii Nakryiko <andrii@xxxxxxxxxx>

> hash-inc       :   12.419 ± 0.011M/s
>
> [1] https://github.com/anakryiko/linux/commit/8dec900975ef
>
> Signed-off-by: Puranjay Mohan <puranjay12@xxxxxxxxx>
> ---
>  arch/arm64/include/asm/insn.h |  7 +++++++
>  arch/arm64/lib/insn.c         | 11 +++++++++++
>  arch/arm64/net/bpf_jit.h      |  6 ++++++
>  arch/arm64/net/bpf_jit_comp.c | 14 ++++++++++++++
>  4 files changed, 38 insertions(+)
>

[...]
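
For readers following the quoted mov/mrs/add sequence, here is a rough user-space model in C of what those three instructions compute: the absolute address is the per-CPU offset passed in src plus the current CPU's base offset. This is only a sketch under stated assumptions and not the code from the patch. NR_CPUS, the percpu_offset[] table, and both function names are hypothetical stand-ins; on real hardware the base offset comes from tpidr_el1/2 rather than from an array lookup.

```c
#include <assert.h>
#include <stdint.h>

#define NR_CPUS 4 /* hypothetical CPU count, for this model only */

/* Hypothetical stand-in for the value each CPU holds in tpidr_el1/2. */
static const uint64_t percpu_offset[NR_CPUS] = {
    0x0000, 0x1000, 0x2000, 0x3000
};

/* Models "mrs tmp, tpidr_el1/2": fetch the given CPU's per-CPU offset. */
static uint64_t read_percpu_offset(unsigned int cpu)
{
    assert(cpu < NR_CPUS);
    return percpu_offset[cpu];
}

/* Models the whole emitted sequence: dst = src + per-CPU offset. */
static uint64_t resolve_percpu_addr(uint64_t src, unsigned int cpu)
{
    uint64_t dst = src;                     /* mov dst, src          */
    uint64_t tmp = read_percpu_offset(cpu); /* mrs tmp, tpidr_el1/2  */

    dst += tmp;                             /* add dst, dst, tmp     */
    return dst;
}
```

With the made-up table above, resolve_percpu_addr(0x40, 2) adds CPU 2's base of 0x2000 to the offset 0x40. The real JIT emits the sequence inline, so no call or table lookup is involved.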