v6->v7:
- limit the caller's stack depth to 256 so that in the worst case of a
  tailcall/bpf2bpf combination the stack size will not exceed 8k (Alexei)

v5->v6:
- propagate only those poke descriptors that an individual subprogram is
  actually using (Daniel)
- drop the cumbersome check whether the poke descriptor got filled in
  map_poke_run()
- move the poke->ip renaming in bpf_jit_add_poke_descriptor() from patch 4
  to patch 3 to preserve bisectability (Daniel)

v4->v5:
- simplify the fix from v3/v4 (Daniel)

v3->v4:
- be more careful around the fix from v3

v2->v3:
- call map_poke_untrack() on each subprog's aux struct previously
  registered to the prog array if adding a poke descriptor or tracking the
  aux struct failed (Daniel)

v1->v2:
- include the rax->rcx conversion in the first patch; the target prog needs
  to be placed in rcx in the tailcall indirect routine (Daniel)
- add error checks to the routines that add poke descriptors to subprograms
  (Daniel)
- don't allow this optimization when the arch is different than x64 or when
  the JIT is disabled (Daniel)
- pull out the rename of the poke descriptor members into a separate patch
  (Daniel)
- add a new poke member to store the bypass address so that it does not
  have to be calculated
- avoid the special casing when old and new are both NULL in map_poke_run()
  (Daniel)
- do not sync RCU when the bypass target was not patched (Daniel)
- do not introduce the nop2 instruction into the prologue for cBPF programs
  (Daniel)

RFC->v1:
- rename the poke->ip/poke->ip_aux pair to
  poke->tailcall_target/poke->tailcall_bypass (Alexei)
- get rid of x86-specific code in prog_array_map_poke_run (Alexei)
- use synchronize_rcu in prog_array_map_poke_run so that CPUs in the middle
  of execution will finish running the program and will not stumble upon an
  incorrect state (Alexei)
- update performance reports
- rebase

Hello,

Today bpf2bpf calls and tailcalls exclude each other. This set makes them
work together.

To give an overview of how this work started: I previously posted an RFC
that was targeted at getting rid of the push/pop instructions for callee
saved registers in the x86-64 JIT that are not used by the BPF program.
Alexei saw the potential that this work could be lifted a bit so that
tailcalls could work with BPF subprograms. More on that in [1], [2]. For
previous discussions on the RFC version, see [3]. For v1, see [4]. v2 is in
[5]; v3 can be ignored.

On v6 [6], Daniel spotted that this work can expose us to a kernel stack
overflow, since we now allow bpf2bpf calls and tailcalls in concert. In the
worst case we could hit a 512 (MAX_BPF_STACK) * 32 (MAX_TAIL_CALL_CNT) =
16k stack size. To address that, Alexei suggested reducing the caller's max
stack size to 256 if we detect that the BPF program being loaded has
tailcalls in its subprograms.

In [1], Alexei says:

"The prologue will look like:
nop5
xor eax,eax  // two new bytes if bpf_tail_call() is used in this function
push rbp
mov rbp, rsp
sub rsp, rounded_stack_depth
push rax // zero init tail_call counter
variable number of push rbx,r13,r14,r15

Then bpf_tail_call will pop variable number rbx,..
and final 'pop rax'
Then 'add rsp, size_of_current_stack_frame'
jmp to next function and skip over 'nop5; xor eax,eax; push rpb; mov rbp, rsp'

This way new function will set its own stack size and will init tail call
counter with whatever value the parent had.

If next function doesn't use bpf_tail_call it won't have 'xor eax,eax'.
Instead it would need to have 'nop2' in there."

So basically I gave that suggestion a shot. Patch 4 describes the
implementation.
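Put together, the layout that the quote describes looks roughly like the
listing below. This is only an illustration distilled from the description
above, not the exact instruction sequence the JIT emits; '<skip>' stands
for the size of the skipped prologue bytes of the target program.

    ; prologue
    nop5                          ; reserved patchable bytes
    xor eax, eax                  ; zero-init tail call counter; a prog that
                                  ; does not use bpf_tail_call() gets nop2 here
    push rbp
    mov rbp, rsp
    sub rsp, rounded_stack_depth
    push rax                      ; spill tail call counter
    push rbx, ...                 ; only the callee saved regs the prog uses

    ; ... program body ...

    ; bpf_tail_call() to the target program:
    pop ..., rbx                  ; restore the used callee saved regs
    pop rax                       ; tail call counter back in rax
    add rsp, rounded_stack_depth  ; unwind the current stack frame
    jmp target + <skip>           ; land right after the target's
                                  ; 'nop5; xor eax,eax; push rbp; mov rbp, rsp',
                                  ; so it sets up its own stack size and
                                  ; inherits the counter value in rax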
Quick overview of the patches:

Patch 1 changes the BPF retpoline to use %rcx instead of %rax for storing
the address of the BPF tailcall target program.
Patch 2 propagates poke descriptors from the main program to each
subprogram.
Patch 3 renames poke->ip to poke->tailcall_target.
Patch 4 is the main dish in this set. It implements the new prologue layout
suggested by Alexei and reworks the tailcall handling.
Patch 5 limits the max stack depth to 8k if tailcalls are combined with BPF
subprograms.
Patch 6 relaxes the verifier's restriction on tailcalls being used together
with BPF subprograms for the x64 JIT.
Patch 7 is a new selftest that proves tailcalls can be used from within a
BPF subprogram.

-------------------------------------------------------------------

prog_array_map_poke_run changes:

Before the tailcall, and with the new prologue layout, the stack needs to
be unwound and the callee saved registers need to be popped. The
instructions responsible for that are generated, but they must not be
executed if the target program is not present. To address that, a new poke
target 'tailcall_bypass' is introduced to the poke descriptor; it is used
to skip over these instructions. This means there are two poke targets for
handling a direct tailcall.

The simplified flow can be presented as three sections:

1. skip call or nop (poke->tailcall_bypass)
2. stack unwind
3. call tail or nop (poke->tailcall_target)

Under specific circumstances one CPU might be at point 2 while point 3 is
not yet updated (nop), which would lead to the problems mentioned in the
patch 4 commit message; IOW, the unwind section must not be executed if
there is no target program.

We can define the following state matrix for that (courtesy of Bjorn):

A: nop,  unwind, nop
B: nop,  unwind, tail
C: skip, unwind, nop
D: skip, unwind, tail

A is forbidden (it leads to incorrectness). The state transitions between
tailcall install/update/remove work as follows (a simplified sketch of
this logic follows below):

First install of tail call f: C -> D -> B(f)
 * poke the tailcall target, and only after that get rid of the skip

Update of tail call f to f': B(f) -> B(f')
 * poke the tailcall target (poke->tailcall_target) and do NOT touch
   poke->tailcall_bypass

Removal of tail call: B(f') -> C(f')
 * poke->tailcall_bypass is poked back to a jump, then we wait one RCU
   grace period so that other programs finish their execution, and after
   that it is safe to remove poke->tailcall_target

Install of a new tail call (f''): C(f') -> D(f'') -> B(f'')
 * same as the first step

This way a CPU can never be exposed to the forbidden state A, i.e. it never
unwinds the stack when there is no tail call target to jump to.
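To make the ordering above easier to follow, here is a minimal,
self-contained C sketch of the install/update/remove logic. It is not the
kernel implementation (the real code lives in prog_array_map_poke_run() and
goes through the arch text-poking infrastructure); poke_text() and
wait_rcu_grace_period() are stand-ins for the actual primitives.

#include <stdio.h>
#include <stddef.h>

struct poke_desc {
	const char *tailcall_target;	/* section 3: "call <prog>" or "nop" */
	const char *tailcall_bypass;	/* section 1: "jmp out" (skip) or "nop" */
};

/* stand-in for the arch text-poking primitive */
static void poke_text(const char **site, const char *insn)
{
	*site = insn;
	printf("  poke -> %s\n", insn);
}

/* stand-in for synchronize_rcu() */
static void wait_rcu_grace_period(void)
{
	printf("  wait for RCU grace period\n");
}

static void poke_run(struct poke_desc *poke, const char *old_prog,
		     const char *new_prog)
{
	if (!old_prog && new_prog) {
		/* C -> D -> B: patch the call target first, only then drop
		 * the bypass, so the unwind is never reachable without a
		 * target in place (forbidden state A) */
		poke_text(&poke->tailcall_target, new_prog);
		poke_text(&poke->tailcall_bypass, "nop");
	} else if (old_prog && new_prog) {
		/* B(f) -> B(f'): only the target changes, bypass untouched */
		poke_text(&poke->tailcall_target, new_prog);
	} else if (old_prog && !new_prog) {
		/* B -> C: re-enable the bypass, let in-flight programs
		 * finish, and only then nop out the call target */
		poke_text(&poke->tailcall_bypass, "jmp out");
		wait_rcu_grace_period();
		poke_text(&poke->tailcall_target, "nop");
	}
}

int main(void)
{
	struct poke_desc poke = { "nop", "jmp out" };	/* start in state C */

	printf("install f:\n");
	poke_run(&poke, NULL, "call f");
	printf("update to f':\n");
	poke_run(&poke, "call f", "call f'");
	printf("remove:\n");
	poke_run(&poke, "call f'", NULL);
	return 0;
}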
-------------------------------------------------------------------

Performance impact:

All of this work, as stated in [2], started as a way to speed up AF-XDP by
dropping the push/pop of unused callee saved registers in the prologue and
epilogue. That impact is positive, a 15% performance gain. It is obvious
that this set will have a negative impact on BPF programs that utilize
tailcalls, but we think the cost is acceptable for the feature it provides.

Below are the numbers from 'perf stat' for two scenarios.

The first scenario is the output of the command:

$ sudo perf stat -ddd -r 1024 ./test_progs -t tailcalls

The tailcalls kselftest was modified in the following way:
- only the tailcall1 subtest is enabled
- each of the bpf_prog_test_run() calls got its 'repeat' argument set to
  1000000

Numbers without this set:

 Performance counter stats for './test_progs -t tailcalls' (1024 runs):

          261.68 msec task-clock                #    0.998 CPUs utilized            ( +-  0.12% )
               5      context-switches          #    0.017 K/sec                    ( +-  0.54% )
               0      cpu-migrations            #    0.000 K/sec                    ( +- 23.37% )
             113      page-faults               #    0.433 K/sec                    ( +-  0.03% )
     877,156,850      cycles                    #    3.352 GHz                      ( +-  0.11% )  (30.31%)
   1,379,322,515      instructions              #    1.57  insn per cycle           ( +-  0.02% )  (38.17%)
     218,869,567      branches                  #  836.395 M/sec                    ( +-  0.01% )  (38.46%)
      11,954,183      branch-misses             #    5.46% of all branches          ( +-  0.01% )  (38.74%)
     283,350,418      L1-dcache-loads           # 1082.805 M/sec                    ( +-  0.01% )  (39.00%)
         156,323      L1-dcache-load-misses     #    0.06% of all L1-dcache hits    ( +-  0.74% )  (39.05%)
          37,309      LLC-loads                 #    0.143 M/sec                    ( +-  1.02% )  (31.08%)
          15,263      LLC-load-misses           #   40.91% of all LL-cache hits     ( +-  0.90% )  (30.95%)
 <not supported>      L1-icache-loads
         130,427      L1-icache-load-misses                                         ( +-  0.45% )  (30.80%)
     285,369,370      dTLB-loads                # 1090.520 M/sec                    ( +-  0.01% )  (30.64%)
           1,154      dTLB-load-misses          #    0.00% of all dTLB cache hits   ( +-  1.26% )  (30.46%)
           2,015      iTLB-loads                #    0.008 M/sec                    ( +-  1.12% )  (30.31%)
             551      iTLB-load-misses          #   27.34% of all iTLB cache hits   ( +-  1.29% )  (30.20%)
 <not supported>      L1-dcache-prefetches
 <not supported>      L1-dcache-prefetch-misses

        0.262276 +- 0.000316 seconds time elapsed  ( +-  0.12% )

With:

 Performance counter stats for './test_progs -t tailcalls' (1024 runs):

          362.37 msec task-clock                #    0.671 CPUs utilized            ( +-  0.11% )
              28      context-switches          #    0.077 K/sec                    ( +-  0.15% )
               0      cpu-migrations            #    0.001 K/sec                    ( +-  4.46% )
             113      page-faults               #    0.313 K/sec                    ( +-  0.03% )
     895,804,416      cycles                    #    2.472 GHz                      ( +-  0.08% )  (30.50%)
   1,339,401,398      instructions              #    1.50  insn per cycle           ( +-  0.04% )  (38.29%)
     302,718,849      branches                  #  835.385 M/sec                    ( +-  0.04% )  (38.39%)
      11,962,089      branch-misses             #    3.95% of all branches          ( +-  0.05% )  (38.56%)
     248,044,443      L1-dcache-loads           #  684.505 M/sec                    ( +-  0.03% )  (38.70%)
         239,882      L1-dcache-load-misses     #    0.10% of all L1-dcache hits    ( +-  0.49% )  (38.69%)
          76,904      LLC-loads                 #    0.212 M/sec                    ( +-  0.96% )  (30.88%)
          23,472      LLC-load-misses           #   30.52% of all LL-cache hits     ( +-  0.98% )  (30.85%)
 <not supported>      L1-icache-loads
         193,803      L1-icache-load-misses                                         ( +-  0.53% )  (30.81%)
     249,775,412      dTLB-loads                #  689.282 M/sec                    ( +-  0.04% )  (30.81%)
           2,176      dTLB-load-misses          #    0.00% of all dTLB cache hits   ( +-  1.53% )  (30.73%)
           2,914      iTLB-loads                #    0.008 M/sec                    ( +-  1.23% )  (30.59%)
             978      iTLB-load-misses          #   33.57% of all iTLB cache hits   ( +-  1.29% )  (30.48%)
 <not supported>      L1-dcache-prefetches
 <not supported>      L1-dcache-prefetch-misses

        0.540236 +- 0.000454 seconds time elapsed  ( +-  0.08% )

The second measurement was done on the BPF kselftest flow_dissector, which
uses progs/bpf_flow.c, with the 'repeat' argument of
bpf_prog_test_run_xattr also set to 1000000.
Without:

 Performance counter stats for './test_progs -t flow_dissector' (1024 runs):

        1,355.18 msec task-clock                #    0.989 CPUs utilized            ( +-  0.11% )
              28      context-switches          #    0.021 K/sec                    ( +-  0.49% )
               0      cpu-migrations            #    0.000 K/sec                    ( +-  7.86% )
             125      page-faults               #    0.093 K/sec                    ( +-  0.03% )
   4,609,228,676      cycles                    #    3.401 GHz                      ( +-  0.03% )  (30.70%)
   6,735,946,489      instructions              #    1.46  insn per cycle           ( +-  0.01% )  (38.42%)
   1,130,187,926      branches                  #  833.979 M/sec                    ( +-  0.01% )  (38.47%)
      29,150,986      branch-misses             #    2.58% of all branches          ( +-  0.01% )  (38.51%)
   1,737,548,851      L1-dcache-loads           # 1282.158 M/sec                    ( +-  0.01% )  (38.56%)
         659,851      L1-dcache-load-misses     #    0.04% of all L1-dcache hits    ( +-  0.78% )  (38.56%)
          71,196      LLC-loads                 #    0.053 M/sec                    ( +-  0.97% )  (30.81%)
          22,218      LLC-load-misses           #   31.21% of all LL-cache hits     ( +-  0.83% )  (30.79%)
 <not supported>      L1-icache-loads
         770,586      L1-icache-load-misses                                         ( +-  0.67% )  (30.77%)
   1,742,104,224      dTLB-loads                # 1285.520 M/sec                    ( +-  0.01% )  (30.74%)
           7,060      dTLB-load-misses          #    0.00% of all dTLB cache hits   ( +-  2.08% )  (30.72%)
           4,282      iTLB-loads                #    0.003 M/sec                    ( +- 16.98% )  (30.70%)
           1,261      iTLB-load-misses          #   29.46% of all iTLB cache hits   ( +-  7.14% )  (30.68%)
 <not supported>      L1-dcache-prefetches
 <not supported>      L1-dcache-prefetch-misses

         1.37087 +- 0.00145 seconds time elapsed  ( +-  0.11% )

With:

 Performance counter stats for './test_progs -t flow_dissector' (1024 runs):

        1,385.56 msec task-clock                #    0.989 CPUs utilized            ( +-  0.06% )
              28      context-switches          #    0.020 K/sec                    ( +-  0.48% )
               0      cpu-migrations            #    0.000 K/sec                    ( +-  7.20% )
             125      page-faults               #    0.091 K/sec                    ( +-  0.03% )
   4,642,599,630      cycles                    #    3.351 GHz                      ( +-  0.03% )  (30.69%)
   6,901,261,616      instructions              #    1.49  insn per cycle           ( +-  0.01% )  (38.41%)
   1,130,623,950      branches                  #  816.006 M/sec                    ( +-  0.01% )  (38.45%)
      29,161,215      branch-misses             #    2.58% of all branches          ( +-  0.01% )  (38.50%)
   1,796,850,740      L1-dcache-loads           # 1296.842 M/sec                    ( +-  0.01% )  (38.55%)
         673,908      L1-dcache-load-misses     #    0.04% of all L1-dcache hits    ( +-  0.89% )  (38.56%)
          70,394      LLC-loads                 #    0.051 M/sec                    ( +-  1.08% )  (30.82%)
          24,575      LLC-load-misses           #   34.91% of all LL-cache hits     ( +-  0.66% )  (30.80%)
 <not supported>      L1-icache-loads
         729,421      L1-icache-load-misses                                         ( +-  0.85% )  (30.77%)
   1,800,871,042      dTLB-loads                # 1299.743 M/sec                    ( +-  0.01% )  (30.75%)
           6,133      dTLB-load-misses          #    0.00% of all dTLB cache hits   ( +-  2.55% )  (30.73%)
           1,998      iTLB-loads                #    0.001 M/sec                    ( +-  9.36% )  (30.70%)
           1,152      iTLB-load-misses          #   57.66% of all iTLB cache hits   ( +-  3.02% )  (30.68%)
 <not supported>      L1-dcache-prefetches
 <not supported>      L1-dcache-prefetch-misses

        1.400577 +- 0.000780 seconds time elapsed  ( +-  0.06% )

Interestingly, I observed the huge iTLB-load-misses counts on a clean
kernel as well:

flow_dissector test:
           2,613      iTLB-loads                #    0.002 M/sec                    ( +- 21.90% )  (30.70%)
          16,483      iTLB-load-misses          #  630.78% of all iTLB cache hits   ( +- 79.63% )  (30.68%)

tailcalls test:
           1,996      iTLB-loads                #    0.008 M/sec                    ( +-  1.08% )  (30.33%)
           7,272      iTLB-load-misses          #  364.24% of all iTLB cache hits   ( +- 92.01% )  (30.21%)

So probably Alexei's suspicion about get_random_int() doing strange things
was right.
-------------------------------------------------------------------

Thank you,
Maciej

[1]: https://lore.kernel.org/bpf/20200517043227.2gpq22ifoq37ogst@xxxxxxxxxxxxxxxxxxxxxxxxxxxx/
[2]: https://lore.kernel.org/bpf/20200511143912.34086-1-maciej.fijalkowski@xxxxxxxxx/
[3]: https://lore.kernel.org/bpf/20200702134930.4717-1-maciej.fijalkowski@xxxxxxxxx/
[4]: https://lore.kernel.org/bpf/20200715233634.3868-1-maciej.fijalkowski@xxxxxxxxx/
[5]: https://lore.kernel.org/netdev/20200721115321.3099-1-maciej.fijalkowski@xxxxxxxxx/
[6]: https://lore.kernel.org/bpf/20200731000324.2253-1-maciej.fijalkowski@xxxxxxxxx/

Maciej Fijalkowski (7):
  bpf, x64: use %rcx instead of %rax for tail call retpolines
  bpf: propagate poke descriptors to subprograms
  bpf: rename poke descriptor's 'ip' member to 'tailcall_target'
  bpf, x64: rework pro/epilogue and tailcall handling in JIT
  bpf: limit caller's stack depth 256 for subprogs with tailcalls
  bpf: allow for tailcalls in BPF subprograms for x64 JIT
  selftests: bpf: add dummy prog for bpf2bpf with tailcall

 arch/x86/include/asm/nospec-branch.h          |  16 +-
 arch/x86/net/bpf_jit_comp.c                   | 249 +++++++++++++-----
 include/linux/bpf.h                           |   7 +-
 include/linux/bpf_verifier.h                  |   1 +
 kernel/bpf/arraymap.c                         |  55 +++-
 kernel/bpf/core.c                             |   3 +-
 kernel/bpf/verifier.c                         |  85 +++++-
 .../selftests/bpf/prog_tests/tailcalls.c      |  85 ++++++
 tools/testing/selftests/bpf/progs/tailcall6.c |  38 +++
 9 files changed, 453 insertions(+), 86 deletions(-)
 create mode 100644 tools/testing/selftests/bpf/progs/tailcall6.c

-- 
2.20.1