This patch-set seeks to allow using no_caller_saved_registers gcc/clang attribute with some BPF helper functions (and kfuncs in the future). As documented in [1], this attribute means that function scratches only some of the caller saved registers defined by ABI. For BPF the set of such registers could be defined as follows: - R0 is scratched only if function is non-void; - R1-R5 are scratched only if corresponding parameter type is defined in the function prototype. The goal of the patch-set is to implement no_caller_saved_registers (nocsr for short) in a backwards compatible manner: - for kernels that support the feature, gain some performance boost from better register allocation; - for kernels that don't support the feature, allow programs execution with minor performance losses. To achieve this, use a scheme suggested by Alexei Starovoitov: - for nocsr calls clang allocates registers as-if relevant r0-r5 registers are not scratched by the call; - as a post-processing step, clang visits each nocsr call and adds spill/fill for every live r0-r5; - stack offsets used for spills/fills are allocated as lowest stack offsets in whole function and are not used for any other purpose; - when kernel loads a program, it looks for such patterns (nocsr function surrounded by spills/fills) and checks if spill/fill stack offsets are used exclusively in nocsr patterns; - if so, and if current JIT inlines the call to the nocsr function (e.g. a helper call), kernel removes unnecessary spill/fill pairs; - when old kernel loads a program, presence of spill/fill pairs keeps BPF program valid, albeit slightly less efficient. Corresponding clang/llvm changes are available in [2]. The patch-set uses bpf_get_smp_processor_id() function as a canary, making it the first helper with nocsr attribute. For example, consider the following program: #define __no_csr __attribute__((no_caller_saved_registers)) #define SEC(name) __attribute__((section(name), used)) #define bpf_printk(fmt, ...) bpf_trace_printk((fmt), sizeof(fmt), __VA_ARGS__) typedef unsigned int __u32; static long (* const bpf_trace_printk)(const char *fmt, __u32 fmt_size, ...) = (void *) 6; static __u32 (*const bpf_get_smp_processor_id)(void) __no_csr = (void *)8; SEC("raw_tp") int test(void *ctx) { __u32 task = bpf_get_smp_processor_id(); bpf_printk("ctx=%p, smp=%d", ctx, task); return 0; } char _license[] SEC("license") = "GPL"; Compiled (using [2]) as follows: $ clang --target=bpf -O2 -g -c -o nocsr.bpf.o nocsr.bpf.c $ llvm-objdump --no-show-raw-insn -Sd nocsr.bpf.o ... 3rd parameter for printk call removable spill/fill pair .--- 0: r3 = r1 | ; | __u32 task = bpf_get_smp_processor_id(); | | 1: *(u64 *)(r10 - 0x8) = r3 <----------| | 2: call 0x8 | | 3: r3 = *(u64 *)(r10 - 0x8) <----------' ; | bpf_printk("ctx=%p, smp=%d", ctx, task); | 4: r1 = 0x0 ll | 6: r2 = 0xf | 7: r4 = r0 '--> 8: call 0x6 ; return 0; 9: r0 = 0x0 10: exit Here is how the program looks after verifier processing: # bpftool prog load ./nocsr.bpf.o /sys/fs/bpf/nocsr-test # bpftool prog dump xlated pinned /sys/fs/bpf/nocsr-test int test(void * ctx): 0: (bf) r3 = r1 <--- 3rd printk parameter ; __u32 task = bpf_get_smp_processor_id(); 1: (b4) w0 = 197324 <--. inlined helper call, 2: (bf) r0 = &(void __percpu *)(r0) <--- spill/fill 3: (61) r0 = *(u32 *)(r0 +0) <--' pair removed ; bpf_printk("ctx=%p, smp=%d", ctx, task); 4: (18) r1 = map[id:5][0]+0 6: (b7) r2 = 15 7: (bf) r4 = r0 8: (85) call bpf_trace_printk#-125920 ; return 0; 9: (b7) r0 = 0 10: (95) exit [1] https://clang.llvm.org/docs/AttributeReference.html#no-caller-saved-registers [2] https://github.com/eddyz87/llvm-project/tree/bpf-no-caller-saved-registers Change list: - v2 -> v3: - As suggested by Andrii, 'nocsr_stack_off' is no longer checked at rewrite time, instead mark_nocsr_patterns() now does two passes over BPF program: - on a first pass it computes the lowest stack spill offset for the subprogram; - on a second pass this offset is used to recognize nocsr pattern. - As suggested by Alexei, a new mechanic is added to work around a situation mentioned by Andrii, when more helper functions are marked as nocsr at compile time than current kernel supports: - all {spill*,helper call,fill*} patterns are now marked as insn_aux_data[*].nocsr_pattern, thus relaxing failure condition for check_nocsr_stack_contract(); - spill/fill pairs are not removed for patterns where helper can't be inlined; - see mark_nocsr_pattern_for_call() for details an example. - As suggested by Alexei, subprogram stack depth is now adjusted if all spill/fill pairs could be removed. This adjustment has to take place before optimize_bpf_loop(), hence the rewrite is moved from do_misc_fixups() to remove_nocsr_spills_fills() (again). - As suggested by Andrii, special measures are taken to work around bpf_probe_read_kernel() access to BPF stack, see patches 11, 12. Patch #11 is very simplistic, a more comprehensive solution would be to change the type of the third parameter of the bpf_probe_read_kernel() from ARG_ANYTHING to something else and not only check nocsr contract, but also propagate stack slot liveness information. However, such change would require update in struct bpf_call_arg_meta processing, which currently implies that every memory parameter is followed by a size parameter. I can work on these changes, please comment. - Stylistic changes suggested by Andrii. - Added acks from Andrii. - Dropped RFC tag. - v1 -> v2: - assume that functions inlined by either jit or verifier conform to no_caller_saved_registers contract (Andrii, Puranjay); - allow nocsr rewrite for bpf_get_smp_processor_id() on arm64 and riscv64 architectures (Puranjay); - __arch_{x86_64,arm64,riscv64} macro for test_loader; - moved remove_nocsr_spills_fills() inside do_misc_fixups() (Andrii); - moved nocsr pattern detection from check_cfg() to a separate pass (Andrii); - various stylistic/correctness changes according to Andrii's comments. Revisions: - v1 https://lore.kernel.org/bpf/20240629094733.3863850-1-eddyz87@xxxxxxxxx/ - v2 https://lore.kernel.org/bpf/20240704102402.1644916-1-eddyz87@xxxxxxxxx/ Eduard Zingerman (12): bpf: add a get_helper_proto() utility function bpf: no_caller_saved_registers attribute for helper calls bpf, x86, riscv, arm: no_caller_saved_registers for bpf_get_smp_processor_id() selftests/bpf: extract utility function for BPF disassembly selftests/bpf: print correct offset for pseudo calls in disasm_insn() selftests/bpf: no need to track next_match_pos in struct test_loader selftests/bpf: extract test_loader->expect_msgs as a data structure selftests/bpf: allow checking xlated programs in verifier_* tests selftests/bpf: __arch_* macro to limit test cases to specific archs selftests/bpf: test no_caller_saved_registers spill/fill removal bpf: do check_nocsr_stack_contract() for ARG_ANYTHING helper params selftests/bpf: check nocsr contract for bpf_probe_read_kernel() include/linux/bpf.h | 6 + include/linux/bpf_verifier.h | 14 + kernel/bpf/helpers.c | 1 + kernel/bpf/verifier.c | 333 +++++++- tools/testing/selftests/bpf/Makefile | 1 + tools/testing/selftests/bpf/disasm_helpers.c | 69 ++ tools/testing/selftests/bpf/disasm_helpers.h | 12 + .../selftests/bpf/prog_tests/ctx_rewrite.c | 74 +- .../selftests/bpf/prog_tests/verifier.c | 2 + tools/testing/selftests/bpf/progs/bpf_misc.h | 13 + .../selftests/bpf/progs/verifier_nocsr.c | 794 ++++++++++++++++++ tools/testing/selftests/bpf/test_loader.c | 217 +++-- tools/testing/selftests/bpf/test_progs.h | 1 - tools/testing/selftests/bpf/testing_helpers.c | 1 + 14 files changed, 1414 insertions(+), 124 deletions(-) create mode 100644 tools/testing/selftests/bpf/disasm_helpers.c create mode 100644 tools/testing/selftests/bpf/disasm_helpers.h create mode 100644 tools/testing/selftests/bpf/progs/verifier_nocsr.c -- 2.45.2