On Wed, Dec 06, 2023 at 02:51:28PM +0800, Leon Hwang wrote: > > > On 6/12/23 07:03, Maciej Fijalkowski wrote: > > On Wed, Oct 11, 2023 at 11:27:23PM +0800, Leon Hwang wrote: > >> From commit ebf7d1f508a73871 ("bpf, x64: rework pro/epilogue and tailcall > >> handling in JIT"), the tailcall on x64 works better than before. > >> > >> From commit e411901c0b775a3a ("bpf: allow for tailcalls in BPF subprograms > >> for x64 JIT"), tailcall is able to run in BPF subprograms on x64. > >> > >> How about: > >> > >> 1. More than 1 subprograms are called in a bpf program. > >> 2. The tailcalls in the subprograms call the bpf program. > >> > >> Because of missing tail_call_cnt back-propagation, a tailcall hierarchy > >> comes up. And MAX_TAIL_CALL_CNT limit does not work for this case. > >> > >> As we know, in tail call context, the tail_call_cnt propagates by stack > >> and rax register between BPF subprograms. So, propagating tail_call_cnt > >> pointer by stack and rax register makes tail_call_cnt as like a global > >> variable, in order to make MAX_TAIL_CALL_CNT limit works for tailcall > >> hierarchy cases. > >> > >> Before jumping to other bpf prog, load tail_call_cnt from the pointer > >> and then compare with MAX_TAIL_CALL_CNT. Finally, increment > >> tail_call_cnt by its pointer. > >> > >> But, where does tail_call_cnt store? > >> > >> It stores on the stack of bpf prog's caller, like > >> > >> | STACK | > >> | | > >> | rip | > >> +->| tcc | > >> | | rip | > >> | | rbp | > >> | +---------+ RBP > >> | | | > >> | | | > >> | | | > >> +--| tcc_ptr | > >> | rbx | > >> +---------+ RSP > >> > >> And tcc_ptr is unnecessary to be popped from stack at the epilogue of bpf > >> prog, like the way of commit d207929d97ea028f ("bpf, x64: Drop "pop %rcx" > >> instruction on BPF JIT epilogue"). > >> > >> Why not back-propagate tail_call_cnt? > >> > >> It's because it's vulnerable to back-propagate it. It's unable to work > >> well with the following case. > >> > >> int prog1(); > >> int prog2(); > >> > >> prog1 is tail caller, and prog2 is tail callee. If we do back-propagate > >> tail_call_cnt at the epilogue of prog2, can prog2 run standalone at the > >> same time? The answer is NO. Otherwise, there will be a register to be > >> polluted, which will make kernel crash. > > > > Sorry but I keep on reading this explanation and I'm lost what is being > > fixed here> > > You want to limit the total amount of tail calls that entry prog can do to > > MAX_TAIL_CALL_CNT. Although I was working on that, my knowledge here is > > rusty, therefore my view might be distorted :) to me MAX_TAIL_CALL_CNT is > > to protect us from overflowing kernel stack and endless loops. As long a > > single call chain doesn't go over 8kB program is fine. Verifier has a > > limit of 256 subprogs from what I see. > > I try to resolve the following-like cases here. > > +--------- tailcall --+ > | | > V --> subprog1 -+ > entry bpf prog < > A --> subprog2 -+ > | | > +--------- tailcall --+ > > Without this fixing, the CPU will be stalled because of too many tailcalls. You still didn't explain why CPU might stall :/ if you stumble upon any kernel splats it's good habit to include them. So, for future readers of this thread, I reduced one of your examples down to: #include <linux/bpf.h> #include <bpf/bpf_helpers.h> #include "bpf_legacy.h" struct { __uint(type, BPF_MAP_TYPE_PROG_ARRAY); __uint(max_entries, 1); __uint(key_size, sizeof(__u32)); __uint(value_size, sizeof(__u32)); } jmp_table SEC(".maps"); static __noinline int subprog_tail(struct __sk_buff *skb) { bpf_tail_call_static(skb, &jmp_table, 0); return 0; } SEC("tc") int entry(struct __sk_buff *skb) { subprog_tail(skb); return 0; } char __license[] SEC("license") = "GPL"; and indeed CPU got stuck: [10623.892978] RIP: 0010:bpf_prog_6abcef0aca85f2fd_subprog_tail+0x49/0x59 [10623.899606] Code: 39 56 24 76 30 8b 85 fc ff ff ff 83 f8 21 73 25 83 c0 01 89 85 fc ff ff ff 48 8b 8c d6 10 01 00 00 48 85 c9 74 0f 41 5d 5b 58 <48> 8b 49 30 48 83 c1 0b ff e1 cc 41 5d 5b c9 c3 cc cc cc cc cc cc [10623.918646] RSP: 0018:ffffc900202b78e0 EFLAGS: 00000286 [10623.923952] RAX: 0000002100000000 RBX: ffff8881088aa100 RCX: ffffc9000d319000 [10623.931194] RDX: 0000000000000000 RSI: ffff88811e6d2e00 RDI: ffff8881088aa100 [10623.938439] RBP: ffffc900202b78e0 R08: ffffc900202b7dd4 R09: 0000000000000000 [10623.945686] R10: 736391e000000000 R11: 0000000000000000 R12: 0000000000000001 [10623.952927] R13: ffffc9000d38d048 R14: ffffc900202b7dd4 R15: ffffffff81ae756f [10623.960174] ? bpf_test_run+0xef/0x320 [10623.963988] bpf_prog_6d15fffd34d84bbf_classifier_0+0x56/0x77 [10623.969826] bpf_prog_6d15fffd34d84bbf_classifier_0+0x56/0x77 [10623.975662] bpf_prog_6d15fffd34d84bbf_classifier_0+0x56/0x77 [10623.981500] bpf_prog_6d15fffd34d84bbf_classifier_0+0x6f/0x77 [10623.987334] bpf_prog_6d15fffd34d84bbf_classifier_0+0x6f/0x77 [10623.993174] bpf_prog_6d15fffd34d84bbf_classifier_0+0x56/0x77 [10623.999016] bpf_prog_6d15fffd34d84bbf_classifier_0+0x56/0x77 [10624.004854] bpf_prog_6d15fffd34d84bbf_classifier_0+0x56/0x77 [10624.010688] bpf_prog_6d15fffd34d84bbf_classifier_0+0x6f/0x77 [10624.016529] bpf_prog_6d15fffd34d84bbf_classifier_0+0x6f/0x77 [10624.022371] bpf_prog_6d15fffd34d84bbf_classifier_0+0x56/0x77 [10624.028209] bpf_prog_6d15fffd34d84bbf_classifier_0+0x56/0x77 [10624.034043] bpf_prog_6d15fffd34d84bbf_classifier_0+0x6f/0x77 [10624.043211] bpf_prog_6d15fffd34d84bbf_classifier_0+0x6f/0x77 [10624.052344] bpf_prog_6d15fffd34d84bbf_classifier_0+0x56/0x77 [10624.061453] bpf_prog_6d15fffd34d84bbf_classifier_0+0x56/0x77 [10624.070404] bpf_prog_6d15fffd34d84bbf_classifier_0+0x56/0x77 [10624.079200] bpf_prog_6d15fffd34d84bbf_classifier_0+0x56/0x77 [10624.087841] bpf_prog_6d15fffd34d84bbf_classifier_0+0x6f/0x77 [10624.096333] bpf_prog_6d15fffd34d84bbf_classifier_0+0x6f/0x77 [10624.104680] bpf_prog_6d15fffd34d84bbf_classifier_0+0x6f/0x77 [10624.112886] bpf_prog_6d15fffd34d84bbf_classifier_0+0x56/0x77 [10624.121014] bpf_prog_6d15fffd34d84bbf_classifier_0+0x6f/0x77 [10624.129048] bpf_prog_6d15fffd34d84bbf_classifier_0+0x56/0x77 [10624.136949] bpf_prog_6d15fffd34d84bbf_classifier_0+0x6f/0x77 [10624.144716] bpf_prog_6d15fffd34d84bbf_classifier_0+0x56/0x77 [10624.152479] bpf_prog_6d15fffd34d84bbf_classifier_0+0x56/0x77 [10624.160209] bpf_prog_6d15fffd34d84bbf_classifier_0+0x6f/0x77 [10624.167905] bpf_prog_6d15fffd34d84bbf_classifier_0+0x6f/0x77 [10624.175565] bpf_prog_6d15fffd34d84bbf_classifier_0+0x6f/0x77 [10624.183173] bpf_prog_6d15fffd34d84bbf_classifier_0+0x56/0x77 [10624.190738] bpf_prog_6d15fffd34d84bbf_classifier_0+0x56/0x77 [10624.198262] bpf_test_run+0x186/0x320 [10624.203677] ? kmem_cache_alloc+0x14d/0x2c0 [10624.209615] bpf_prog_test_run_skb+0x35c/0x710 [10624.215810] __sys_bpf+0xf3a/0x2d80 [10624.221068] ? do_mmap+0x409/0x5e0 [10624.226217] __x64_sys_bpf+0x1a/0x30 [10624.231693] do_syscall_64+0x2e/0xa0 [10624.236988] entry_SYSCALL_64_after_hwframe+0x6e/0x76 [10624.243550] RIP: 0033:0x7fd76171e69d [10624.248576] Code: 5b 41 5c c3 66 0f 1f 84 00 00 00 00 00 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 63 a7 0f 00 f7 d8 64 89 01 48 [10624.270613] RSP: 002b:00007ffe694f8d68 EFLAGS: 00000202 ORIG_RAX: 0000000000000141 [10624.279894] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007fd76171e69d [10624.288769] RDX: 0000000000000050 RSI: 00007ffe694f8db0 RDI: 000000000000000a [10624.297634] RBP: 00007ffe694f8d80 R08: 00007ffe694f8d37 R09: 00007ffe694f8db0 [10624.306455] R10: 0000000000000000 R11: 0000000000000202 R12: 00007ffe694f9348 [10624.315260] R13: 000055bb8423f6d9 R14: 000055bb8496ac90 R15: 00007fd761925040 Looking at assembly code: entry: ffffffffc0012c60 <load3+0x12c60>: ffffffffc0012c60: 0f 1f 44 00 00 nopl 0x0(%rax,%rax,1) ffffffffc0012c65: 31 c0 xor %eax,%eax ffffffffc0012c67: 55 push %rbp ffffffffc0012c68: 48 89 e5 mov %rsp,%rbp ffffffffc0012c6b: 50 push %rax ffffffffc0012c6c: 48 8b 85 f8 ff ff ff mov -0x8(%rbp),%rax ffffffffc0012c73: e8 30 00 00 00 call 0xffffffffc0012ca8 ffffffffc0012c78: 31 c0 xor %eax,%eax ffffffffc0012c7a: c9 leave ffffffffc0012c7b: c3 ret subprog_tail: ffffffffc0012ca8 <load3+0x12ca8>: ffffffffc0012ca8: 0f 1f 44 00 00 nopl 0x0(%rax,%rax,1) ffffffffc0012cad: 66 90 xchg %ax,%ax ffffffffc0012caf: 55 push %rbp ffffffffc0012cb0: 48 89 e5 mov %rsp,%rbp ffffffffc0012cb3: 50 push %rax ffffffffc0012cb4: 53 push %rbx ffffffffc0012cb5: 41 55 push %r13 ffffffffc0012cb7: 48 89 fb mov %rdi,%rbx ffffffffc0012cba: 49 bd 00 7c 73 0c 81 movabs $0xffff88810c737c00,%r13 ffffffffc0012cc1: 88 ff ff ffffffffc0012cc4: 48 89 df mov %rbx,%rdi ffffffffc0012cc7: 4c 89 ee mov %r13,%rsi ffffffffc0012cca: 31 d2 xor %edx,%edx ffffffffc0012ccc: 8b 85 fc ff ff ff mov -0x4(%rbp),%eax ffffffffc0012cd2: 83 f8 21 cmp $0x21,%eax ffffffffc0012cd5: 73 17 jae 0xffffffffc0012cee ffffffffc0012cd7: 83 c0 01 add $0x1,%eax ffffffffc0012cda: 89 85 fc ff ff ff mov %eax,-0x4(%rbp) ffffffffc0012ce0: 0f 1f 44 00 00 nopl 0x0(%rax,%rax,1) ffffffffc0012ce5: 41 5d pop %r13 ffffffffc0012ce7: 5b pop %rbx ffffffffc0012ce8: 58 pop %rax ffffffffc0012ce9: e9 7d ff ff ff jmp 0xffffffffc0012c6b ffffffffc0012cee: 41 5d pop %r13 ffffffffc0012cf0: 5b pop %rbx ffffffffc0012cf1: c9 leave ffffffffc0012cf2: c3 ret I am trying to understand what are writing to %rax when entry is target of tailcall. rbp comes from subprog_tail and we popped all of the previously pushed regs. So what is exactly causing this hang? (I thought I'd rather respond just to let you know I'm on it) > > > > > Can you elaborate a bit more about the kernel crash you mention in the > > last paragraph? > > We have progs, prog1, prog2, prog3 and prog4, and the scenario: > > case1: kprobe1 --> prog1 --> subprog1 --tailcall-> prog2 --> subprog2 --tailcall-> prog3 > case2: kprobe2 --> prog2 --> subprog2 --tailcall-> prog3 > case3: kprobe3 --> prog4 --> subprog3 --tailcall-> prog3 > --> subprog4 --tailcall-> prog4 > > How does prog2 back-propagate tail_call_cnt to prog1? > > Possible way 1: > When prog2 and prog3 are added to PROG_ARRAY, poke their epilogues to > back-propagate tail_call_cnt by RCX register. It seems OK because kprobes do > not handle the value in RCX register, like case2. > > Possible way 2: > Can back-propagate tail_call_cnt with RCX register by checking tail_call_cnt != 0 > at epilogue when current prog has tailcall? > No. As for case1, prog2 handles the value in RCX register, which is not tail_call_cnt, > because prog3 has no tailcall and won't populate RCX register with tail_call_cnt. I don't get it. Let's start with small example and do the baby steps. > > However, I don't like the back-propagating way. Then, I "burn" my brain to figure > out pointer propagating ways. I know that code can damage programmer's brain, that's why we need to strive for elaborative and easy to understand commit messages so that later on we will be able to pick up what is going on here. > > RFC PATCH v1 way: > Propagate tail_call_cnt and its pointer together. Then, the pointer is used to > check MAX_TAIL_CALL_CNT and increment tail_call_cnt. > > | STACK | > +---------+ RBP > | | > | | > | | > +--| tcc_ptr | > +->| tcc | > | rbx | > +---------+ RSP > > RFC PATCH v2 way (current patchset): > Propagate tail_call_cnt pointer only. Then, the pointer is used to check > MAX_TAIL_CALL_CNT and increment tail_call_cnt. > > | STACK | > | | > | rip | > +->| tcc | > | | rip | > | | rbp | > | +---------+ RBP > | | | > | | | > | | | > +--| tcc_ptr | > | rbx | > +---------+ RSP > > > > > I also realized that verifier assumes MAX_TAIL_CALL_CNT as 32 which has > > changed in the meantime to 33 and we should adjust the max allowed stack > > depth of subprogs? I believe this was brought up at LPC? > > There's following code snippet in verifier.c: > > /* protect against potential stack overflow that might happen when > * bpf2bpf calls get combined with tailcalls. Limit the caller's stack > * depth for such case down to 256 so that the worst case scenario > * would result in 8k stack size (32 which is tailcall limit * 256 = > * 8k). > > But, MAX_TAIL_CALL_CNT is 33. > > This was not brought up at LPC 2022&2023. I don't know whether this was > brought up at previous LPCs. Was referring to this: https://lpc.events/event/17/contributions/1595/attachments/1230/2506/LPC2023_slides.pdf but let us focus on issue you're bringing up here in this patch. I brought this up thinking JIT code was fine, now it's clear that things go south when entry prog is tailcall target, which we didn't test. > > Thanks, > Leon >