On 12/12/23 02:02, Maciej Fijalkowski wrote:
> On Wed, Dec 06, 2023 at 02:51:28PM +0800, Leon Hwang wrote:
>>
>>
>> On 6/12/23 07:03, Maciej Fijalkowski wrote:
>>> On Wed, Oct 11, 2023 at 11:27:23PM +0800, Leon Hwang wrote:
>>>> Since commit ebf7d1f508a73871 ("bpf, x64: rework pro/epilogue and tailcall
>>>> handling in JIT"), the tailcall on x64 works better than before.
>>>>
>>>> Since commit e411901c0b775a3a ("bpf: allow for tailcalls in BPF subprograms
>>>> for x64 JIT"), tailcalls are able to run in BPF subprograms on x64.
>>>>
>>>> But how about the following case?
>>>>
>>>> 1. More than one subprogram is called in a bpf program.
>>>> 2. The tailcalls in those subprograms call the bpf program itself.
>>>>
>>>> Because tail_call_cnt is not back-propagated, a tailcall hierarchy comes
>>>> up, and the MAX_TAIL_CALL_CNT limit does not work for this case.
>>>>
>>>> As we know, in tail call context, tail_call_cnt propagates by stack and
>>>> the rax register between BPF subprograms. So, propagating a tail_call_cnt
>>>> pointer by stack and rax register instead makes tail_call_cnt act like a
>>>> global variable, so that the MAX_TAIL_CALL_CNT limit works for tailcall
>>>> hierarchy cases.
>>>>
>>>> Before jumping to another bpf prog, load tail_call_cnt through the
>>>> pointer and compare it with MAX_TAIL_CALL_CNT; finally, increment
>>>> tail_call_cnt through its pointer.
>>>>
>>>> But where is tail_call_cnt stored?
>>>>
>>>> It is stored on the stack of the bpf prog's caller, like
>>>>
>>>>    |  STACK  |
>>>>    |         |
>>>>    |   rip   |
>>>> +->|   tcc   |
>>>> |  |   rip   |
>>>> |  |   rbp   |
>>>> |  +---------+ RBP
>>>> |  |         |
>>>> |  |         |
>>>> |  |         |
>>>> +--| tcc_ptr |
>>>>    |   rbx   |
>>>>    +---------+ RSP
>>>>
>>>> And tcc_ptr does not need to be popped from the stack at the epilogue of
>>>> the bpf prog, following the way of commit d207929d97ea028f ("bpf, x64:
>>>> Drop "pop %rcx" instruction on BPF JIT epilogue").
>>>>
>>>> Why not back-propagate tail_call_cnt?
>>>>
>>>> Because back-propagating it is fragile.
>>>> It cannot work well with the following case:
>>>>
>>>> int prog1();
>>>> int prog2();
>>>>
>>>> prog1 is the tail caller, and prog2 is the tail callee. If we back-
>>>> propagate tail_call_cnt at the epilogue of prog2, can prog2 run
>>>> standalone at the same time? The answer is NO. Otherwise, a register
>>>> would be polluted, which would make the kernel crash.
>>>
>>> Sorry but I keep on reading this explanation and I'm lost what is being
>>> fixed here.
>>> You want to limit the total amount of tail calls that entry prog can do to
>>> MAX_TAIL_CALL_CNT. Although I was working on that, my knowledge here is
>>> rusty, therefore my view might be distorted :) to me MAX_TAIL_CALL_CNT is
>>> to protect us from overflowing kernel stack and endless loops. As long as
>>> a single call chain doesn't go over 8kB the program is fine. Verifier has
>>> a limit of 256 subprogs from what I see.
>>
>> I try to resolve the following-like cases here:
>>
>>    +--------- tailcall --+
>>    |                     |
>>    V      --> subprog1 -+
>>  entry bpf prog         <
>>    A      --> subprog2 -+
>>    |                     |
>>    +--------- tailcall --+
>>
>> Without this fix, the CPU will be stalled because of too many tailcalls.
>
> You still didn't explain why CPU might stall :/ if you stumble upon any
> kernel splats it's a good habit to include them. So, for future readers of
> this thread, I reduced one of your examples down to:
>

Thanks for your help.

Here is a GitHub CI history of this patchset:
https://github.com/kernel-patches/bpf/pull/5807/checks

In these CI results, test_progs failed to run on aarch64 and s390x because
of "rcu: INFO: rcu_sched self-detected stall on CPU".

But why does the CPU stall? Because there are too many tailcalls to run in
the bpf prog.

How many tailcalls? And why does the MAX_TAIL_CALL_CNT limit not work for
those cases?
Let's have a look at a simple case:

#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>
#include "bpf_legacy.h"

struct {
	__uint(type, BPF_MAP_TYPE_PROG_ARRAY);
	__uint(max_entries, 1);
	__uint(key_size, sizeof(__u32));
	__uint(value_size, sizeof(__u32));
} jmp_table SEC(".maps");

int count = 0;

static __noinline
int subprog_tail(struct __sk_buff *skb)
{
	bpf_tail_call_static(skb, &jmp_table, 0);
	return 0;
}

SEC("tc")
int entry(struct __sk_buff *skb)
{
	volatile int ret = 1;

	count++;
	subprog_tail(skb); /* subprog call1 */
	subprog_tail(skb); /* subprog call2 */

	return ret;
}

char __license[] SEC("license") = "GPL";

The entry bpf prog is populated into the 0th slot of jmp_table. Then, what
happens when the entry bpf prog runs?

The very first time subprog_tail() is called, it tailcalls the entry bpf
prog. Then subprog_tail() is called a second time at the position of
subprog call1, and it tailcalls the entry bpf prog again. And so on, again
and again.

The first time the MAX_TAIL_CALL_CNT limit kicks in, subprog_tail() has
been called 34 times at the position of subprog call1, and at this point
tail_call_cnt is 33 in subprog_tail(). Next, the 34th subprog_tail()
returns to entry() because of the MAX_TAIL_CALL_CNT limit.

In the 34th entry(), after the 34th subprog_tail() at the position of
subprog call1 finishes and before the 1st subprog_tail() at the position of
subprog call2 is called, what is the value of tail_call_cnt in entry()?
It's 33. As we know, tail_call_cnt is pushed onto the stack of entry() and
propagates to subprog_tail() through %rax from the stack. So, when
subprog_tail() at the position of subprog call2 is called for the first
time, tail_call_cnt 33 propagates to it through %rax, and the tailcall in
subprog_tail() is aborted because of tail_call_cnt >= MAX_TAIL_CALL_CNT,
too.

Then subprog_tail() at the position of subprog call2 ends, and the 34th
entry() ends.
And it returns to the 33rd subprog_tail() called from the position of
subprog call1. But wait, at this time, what's the value of tail_call_cnt on
the stack of subprog_tail()? It's 33.

Then, in the 33rd entry(), after the 33rd subprog_tail() at the position of
subprog call1 finishes and before the 2nd subprog_tail() at the position of
subprog call2 is called, what's the value of tail_call_cnt in the current
entry()? It's **32**.

Why not 33? Before stepping into subprog_tail() at the position of subprog
call2 in the 33rd entry(), as if stopping time, let's have a look at the
stack memory:

|  STACK  |
+---------+ RBP <-- current rbp
|   ret   | STACK of 33rd entry()
|   tcc   | its value is 32
+---------+ RSP <-- current rsp
|   rip   | STACK of 34th entry()
|   rbp   | reuse the STACK of 33rd subprog_tail() at the position
|   ret   |   subprog call1
|   tcc   | its value is 33
+---------+ rsp
|   rip   | STACK of 1st subprog_tail() at the position subprog call2
|   rbp   |
|   tcc   | its value is 33
+---------+ rsp

It is not 33 because tail_call_cnt does not back-propagate from
subprog_tail() to entry().

Then, while stepping into subprog_tail() at the position of subprog call2
in the 33rd entry():

|  STACK  |
+---------+
|   ret   | STACK of 33rd entry()
|   tcc   | its value is 32
|   rip   |
|   rbp   |
+---------+ RBP <-- current rbp
|   tcc   | its value is 32; STACK of subprog_tail() at the position
+---------+ RSP <-- current rsp   subprog call2

Then, while pausing after the tailcall in the 2nd subprog_tail() at the
position of subprog call2:

|  STACK  |
+---------+
|   ret   | STACK of 33rd entry()
|   tcc   | its value is 32
|   rip   |
|   rbp   |
+---------+ RBP <-- current rbp
|   tcc   | its value is 33; STACK of subprog_tail() at the position
+---------+ RSP <-- current rsp   subprog call2

Note what happens to tail_call_cnt:

	/*
	 * if (tail_call_cnt++ >= MAX_TAIL_CALL_CNT)
	 *	goto out;
	 */

It checks tail_call_cnt >= MAX_TAIL_CALL_CNT first and then increments
tail_call_cnt. So the current tailcall is allowed to run.

Then, entry() is tailcalled.
And the stack memory status is:

|  STACK  |
+---------+
|   ret   | STACK of 33rd entry()
|   tcc   | its value is 32
|   rip   |
|   rbp   |
+---------+ RBP <-- current rbp
|   ret   | STACK of 35th entry(); reuse STACK of subprog_tail() at the
|   tcc   | its value is 33         position subprog call2
+---------+ RSP <-- current rsp

So, the tailcalls in the 35th entry() will be aborted. And then, again and
again. :(

I hope it is now clear why the MAX_TAIL_CALL_CNT limit does not work for
those cases.

And how many tailcalls are there if the CPU does not stall? Let's count
them with a simple model. From a top-down view, the tailcalls form a
hierarchy, layer by layer, with 2+4+8+...+2**33 tailcalls. As a result, if
the CPU did not stall, there would be 2**34 - 2 = 17,179,869,182 tailcalls.
That's the guy that stalls the CPU.

And what if there are N subprog_tail() calls in entry()? If the CPU did not
stall because of too many tailcalls, there would be
N + N**2 + ... + N**33 = (N**34 - N)/(N - 1) tailcalls.
> #include <linux/bpf.h>
> #include <bpf/bpf_helpers.h>
> #include "bpf_legacy.h"
>
> struct {
> 	__uint(type, BPF_MAP_TYPE_PROG_ARRAY);
> 	__uint(max_entries, 1);
> 	__uint(key_size, sizeof(__u32));
> 	__uint(value_size, sizeof(__u32));
> } jmp_table SEC(".maps");
>
> static __noinline
> int subprog_tail(struct __sk_buff *skb)
> {
> 	bpf_tail_call_static(skb, &jmp_table, 0);
> 	return 0;
> }
>
> SEC("tc")
> int entry(struct __sk_buff *skb)
> {
> 	subprog_tail(skb);
> 	return 0;
> }
>
> char __license[] SEC("license") = "GPL";
>
> and indeed CPU got stuck:
>
> [10623.892978] RIP: 0010:bpf_prog_6abcef0aca85f2fd_subprog_tail+0x49/0x59
> [10623.899606] Code: 39 56 24 76 30 8b 85 fc ff ff ff 83 f8 21 73 25 83 c0 01 89 85 fc ff ff ff 48 8b 8c d6 10 01 00 00 48 85 c9 74 0f 41 5d 5b 58 <48> 8b 49 30 48 83 c1 0b ff e1 cc 41 5d 5b c9 c3 cc cc cc cc cc cc
> [10623.918646] RSP: 0018:ffffc900202b78e0 EFLAGS: 00000286
> [10623.923952] RAX: 0000002100000000 RBX: ffff8881088aa100 RCX: ffffc9000d319000
> [10623.931194] RDX: 0000000000000000 RSI: ffff88811e6d2e00 RDI: ffff8881088aa100
> [10623.938439] RBP: ffffc900202b78e0 R08: ffffc900202b7dd4 R09: 0000000000000000
> [10623.945686] R10: 736391e000000000 R11: 0000000000000000 R12: 0000000000000001
> [10623.952927] R13: ffffc9000d38d048 R14: ffffc900202b7dd4 R15: ffffffff81ae756f
> [10623.960174] ?
> bpf_test_run+0xef/0x320
> [10623.963988] bpf_prog_6d15fffd34d84bbf_classifier_0+0x56/0x77
> [10623.969826] bpf_prog_6d15fffd34d84bbf_classifier_0+0x56/0x77
> [10623.975662] bpf_prog_6d15fffd34d84bbf_classifier_0+0x56/0x77
> [10623.981500] bpf_prog_6d15fffd34d84bbf_classifier_0+0x6f/0x77
> [10623.987334] bpf_prog_6d15fffd34d84bbf_classifier_0+0x6f/0x77
> [10623.993174] bpf_prog_6d15fffd34d84bbf_classifier_0+0x56/0x77
> [10623.999016] bpf_prog_6d15fffd34d84bbf_classifier_0+0x56/0x77
> [10624.004854] bpf_prog_6d15fffd34d84bbf_classifier_0+0x56/0x77
> [10624.010688] bpf_prog_6d15fffd34d84bbf_classifier_0+0x6f/0x77
> [10624.016529] bpf_prog_6d15fffd34d84bbf_classifier_0+0x6f/0x77
> [10624.022371] bpf_prog_6d15fffd34d84bbf_classifier_0+0x56/0x77
> [10624.028209] bpf_prog_6d15fffd34d84bbf_classifier_0+0x56/0x77
> [10624.034043] bpf_prog_6d15fffd34d84bbf_classifier_0+0x6f/0x77
> [10624.043211] bpf_prog_6d15fffd34d84bbf_classifier_0+0x6f/0x77
> [10624.052344] bpf_prog_6d15fffd34d84bbf_classifier_0+0x56/0x77
> [10624.061453] bpf_prog_6d15fffd34d84bbf_classifier_0+0x56/0x77
> [10624.070404] bpf_prog_6d15fffd34d84bbf_classifier_0+0x56/0x77
> [10624.079200] bpf_prog_6d15fffd34d84bbf_classifier_0+0x56/0x77
> [10624.087841] bpf_prog_6d15fffd34d84bbf_classifier_0+0x6f/0x77
> [10624.096333] bpf_prog_6d15fffd34d84bbf_classifier_0+0x6f/0x77
> [10624.104680] bpf_prog_6d15fffd34d84bbf_classifier_0+0x6f/0x77
> [10624.112886] bpf_prog_6d15fffd34d84bbf_classifier_0+0x56/0x77
> [10624.121014] bpf_prog_6d15fffd34d84bbf_classifier_0+0x6f/0x77
> [10624.129048] bpf_prog_6d15fffd34d84bbf_classifier_0+0x56/0x77
> [10624.136949] bpf_prog_6d15fffd34d84bbf_classifier_0+0x6f/0x77
> [10624.144716] bpf_prog_6d15fffd34d84bbf_classifier_0+0x56/0x77
> [10624.152479] bpf_prog_6d15fffd34d84bbf_classifier_0+0x56/0x77
> [10624.160209] bpf_prog_6d15fffd34d84bbf_classifier_0+0x6f/0x77
> [10624.167905] bpf_prog_6d15fffd34d84bbf_classifier_0+0x6f/0x77
> [10624.175565]
> bpf_prog_6d15fffd34d84bbf_classifier_0+0x6f/0x77
> [10624.183173] bpf_prog_6d15fffd34d84bbf_classifier_0+0x56/0x77
> [10624.190738] bpf_prog_6d15fffd34d84bbf_classifier_0+0x56/0x77
> [10624.198262] bpf_test_run+0x186/0x320
> [10624.203677] ? kmem_cache_alloc+0x14d/0x2c0
> [10624.209615] bpf_prog_test_run_skb+0x35c/0x710
> [10624.215810] __sys_bpf+0xf3a/0x2d80
> [10624.221068] ? do_mmap+0x409/0x5e0
> [10624.226217] __x64_sys_bpf+0x1a/0x30
> [10624.231693] do_syscall_64+0x2e/0xa0
> [10624.236988] entry_SYSCALL_64_after_hwframe+0x6e/0x76
> [10624.243550] RIP: 0033:0x7fd76171e69d
> [10624.248576] Code: 5b 41 5c c3 66 0f 1f 84 00 00 00 00 00 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 63 a7 0f 00 f7 d8 64 89 01 48
> [10624.270613] RSP: 002b:00007ffe694f8d68 EFLAGS: 00000202 ORIG_RAX: 0000000000000141
> [10624.279894] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007fd76171e69d
> [10624.288769] RDX: 0000000000000050 RSI: 00007ffe694f8db0 RDI: 000000000000000a
> [10624.297634] RBP: 00007ffe694f8d80 R08: 00007ffe694f8d37 R09: 00007ffe694f8db0
> [10624.306455] R10: 0000000000000000 R11: 0000000000000202 R12: 00007ffe694f9348
> [10624.315260] R13: 000055bb8423f6d9 R14: 000055bb8496ac90 R15: 00007fd761925040
>
> Looking at assembly code:
>
> entry:
> ffffffffc0012c60 <load3+0x12c60>:
> ffffffffc0012c60: 0f 1f 44 00 00        nopl   0x0(%rax,%rax,1)
> ffffffffc0012c65: 31 c0                 xor    %eax,%eax
> ffffffffc0012c67: 55                    push   %rbp
> ffffffffc0012c68: 48 89 e5              mov    %rsp,%rbp
> ffffffffc0012c6b: 50                    push   %rax
> ffffffffc0012c6c: 48 8b 85 f8 ff ff ff  mov    -0x8(%rbp),%rax
> ffffffffc0012c73: e8 30 00 00 00        call   0xffffffffc0012ca8
> ffffffffc0012c78: 31 c0                 xor    %eax,%eax
> ffffffffc0012c7a: c9                    leave
> ffffffffc0012c7b: c3                    ret
>
> subprog_tail:
> ffffffffc0012ca8 <load3+0x12ca8>:
> ffffffffc0012ca8: 0f 1f 44 00 00        nopl   0x0(%rax,%rax,1)
> ffffffffc0012cad: 66 90                 xchg   %ax,%ax
> ffffffffc0012caf: 55                    push   %rbp
> ffffffffc0012cb0: 48 89 e5              mov    %rsp,%rbp
> ffffffffc0012cb3: 50                    push   %rax
> ffffffffc0012cb4: 53                    push   %rbx
> ffffffffc0012cb5: 41 55                 push   %r13
> ffffffffc0012cb7: 48 89 fb              mov    %rdi,%rbx
> ffffffffc0012cba: 49 bd 00 7c 73 0c 81  movabs $0xffff88810c737c00,%r13
> ffffffffc0012cc1: 88 ff ff
> ffffffffc0012cc4: 48 89 df              mov    %rbx,%rdi
> ffffffffc0012cc7: 4c 89 ee              mov    %r13,%rsi
> ffffffffc0012cca: 31 d2                 xor    %edx,%edx
> ffffffffc0012ccc: 8b 85 fc ff ff ff     mov    -0x4(%rbp),%eax
> ffffffffc0012cd2: 83 f8 21              cmp    $0x21,%eax
> ffffffffc0012cd5: 73 17                 jae    0xffffffffc0012cee
> ffffffffc0012cd7: 83 c0 01              add    $0x1,%eax
> ffffffffc0012cda: 89 85 fc ff ff ff     mov    %eax,-0x4(%rbp)
> ffffffffc0012ce0: 0f 1f 44 00 00        nopl   0x0(%rax,%rax,1)
> ffffffffc0012ce5: 41 5d                 pop    %r13
> ffffffffc0012ce7: 5b                    pop    %rbx
> ffffffffc0012ce8: 58                    pop    %rax
> ffffffffc0012ce9: e9 7d ff ff ff        jmp    0xffffffffc0012c6b
> ffffffffc0012cee: 41 5d                 pop    %r13
> ffffffffc0012cf0: 5b                    pop    %rbx
> ffffffffc0012cf1: c9                    leave
> ffffffffc0012cf2: c3                    ret
>
> I am trying to understand what we are writing to %rax when entry is the
> target of a tailcall. rbp comes from subprog_tail and we popped all of the
> previously pushed regs. So what is exactly causing this hang?
>

Hmm, it is weird that your example gets the CPU stuck. I don't know why,
either.

> (I thought I'd rather respond just to let you know I'm on it)
>

Thanks for the reminder.

>>
>>>
>>> Can you elaborate a bit more about the kernel crash you mention in the
>>> last paragraph?
>>
>> We have progs prog1, prog2, prog3 and prog4, and the scenarios:
>>
>> case1: kprobe1 --> prog1 --> subprog1 --tailcall-> prog2 --> subprog2 --tailcall-> prog3
>> case2: kprobe2 --> prog2 --> subprog2 --tailcall-> prog3
>> case3: kprobe3 --> prog4 --> subprog3 --tailcall-> prog3
>>                          --> subprog4 --tailcall-> prog4
>>
>> How does prog2 back-propagate tail_call_cnt to prog1?
>>
>> Possible way 1:
>> When prog2 and prog3 are added to the PROG_ARRAY, poke their epilogues to
>> back-propagate tail_call_cnt by the RCX register. It seems OK because
>> kprobes do not handle the value in the RCX register, like case2.
>>
>> Possible way 2:
>> Can we back-propagate tail_call_cnt with the RCX register by checking
>> tail_call_cnt != 0 at the epilogue when the current prog has a tailcall?
>> No. As for case1, prog2 would handle the value in the RCX register, which
>> is not tail_call_cnt, because prog3 has no tailcall and won't populate the
>> RCX register with tail_call_cnt.
>
> I don't get it. Let's start with small example and do the baby steps.
>

I have lost the details of the back-propagating solutions, which I tried
hard to convince myself were not good enough to meet my expectations. Sorry
about that.

>>
>> However, I don't like the back-propagating way. So I "burned" my brain to
>> figure out pointer-propagating ways.
>
> I know that code can damage programmer's brain, that's why we need to
> strive for elaborative and easy to understand commit messages so that
> later on we will be able to pick up what is going on here.
>

Yeah, it's not easy to understand the code of tailcalls on an arch like
x86. I'm trying hard to explain this patch.

>>
>> RFC PATCH v1 way:
>> Propagate tail_call_cnt and its pointer together. Then, the pointer is
>> used to check MAX_TAIL_CALL_CNT and to increment tail_call_cnt.
>>
>>    |  STACK  |
>>    +---------+ RBP
>>    |         |
>>    |         |
>>    |         |
>> +--| tcc_ptr |
>> +->|   tcc   |
>>    |   rbx   |
>>    +---------+ RSP
>>
>> RFC PATCH v2 way (current patchset):
>> Propagate the tail_call_cnt pointer only. Then, the pointer is used to
>> check MAX_TAIL_CALL_CNT and to increment tail_call_cnt.
>>
>>    |  STACK  |
>>    |         |
>>    |   rip   |
>> +->|   tcc   |
>> |  |   rip   |
>> |  |   rbp   |
>> |  +---------+ RBP
>> |  |         |
>> |  |         |
>> |  |         |
>> +--| tcc_ptr |
>>    |   rbx   |
>>    +---------+ RSP

It's time to describe the details of this patch, and to take a look at how
it handles the above simple case.
This patch keeps tail_call_cnt on the stack of the entry bpf prog's caller,
and then propagates the tail_call_cnt pointer the same way the previous
code propagated tail_call_cnt itself.

So, in the prologue of the entry bpf prog, before pushing %rbp, it has to
initialise tail_call_cnt and push it onto the stack using %rax. Then it
makes %rax the pointer that points to tail_call_cnt, by "mov rax, rsp".
Then it calls the main part of the entry bpf prog by "call 2". **This is
the exceptional point.** With this "call", %rip is pushed onto the stack;
and at the end of the entry bpf prog's runtime, that %rip is popped from
the stack, tail_call_cnt is then popped from the stack too, and finally the
prog "ret"s again. The pop of tail_call_cnt and this final "ret" are the 2
bytes skipped by "call 2".

So, at "call 2", %rax holds the tail_call_cnt pointer, and the pointer is
then pushed onto the stack. %rax is pushed the same way when the entry bpf
prog is tailcalled, too.

So, when a subprog is called, the tail_call_cnt pointer is popped from the
stack into %rax. And when a tailcall happens, the tail_call_cnt pointer is
loaded from the stack into %rax by "mov rax, qword ptr [rbp - tcc_ptr_off]";
tail_call_cnt is compared with MAX_TAIL_CALL_CNT by "cmp dword ptr [rax],
MAX_TAIL_CALL_CNT", and then incremented by "add dword ptr [rax], 1".
Finally, popping %rax pops the tail_call_cnt pointer from the stack into
%rax.
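Putting the pieces above together, the emitted entry prog would look
roughly like the following pseudo-assembly sketch (the labels are invented,
and the scratch register used to discard tcc is illustrative, not taken
from the patch):

```
entry:
	xor eax, eax		; tail_call_cnt = 0
	push rax		; tcc now lives on the caller's stack
	mov rax, rsp		; rax = tcc_ptr = &tcc
	call main		; pushes %rip; skips the 2 bytes below
	pop rcx			; (1 byte) discard tcc from the stack
	ret			; (1 byte) return to the real caller
main:				; also the target when entry is tailcalled
	push rbp
	mov rbp, rsp
	push rax		; save tcc_ptr in this frame
	...			; prog body; before a subprog call, tcc_ptr
				; is popped back into %rax
	leave
	ret			; returns to the "pop; ret" pair above
```

The sketch only illustrates the control flow described above; the actual
JITed instruction sequence is in the patch itself.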
When the prologue of entry() has run, the stack of entry() should be like:

   |  STACK  | STACK of entry()'s caller
   |         |
   |   rip   |
+->|   tcc   | its value is 0
|  |   rip   |
|  |   rbp   |
|  +---------+ RBP
|  |   ret   | STACK of entry()
+--| tcc_ptr |
   |   rbx   | saved regs
   +---------+ RSP

Then, when subprog_tail() is called for the very first time, its stack
should be like:

   |  STACK  | STACK of entry()'s caller
   |         |
   |   rip   |
+->|   tcc   | its value is 0
|  |   rip   |
|  |   rbp   |
|  +---------+ rbp
|  |   ret   | STACK of entry()
+--| tcc_ptr |
|  |   rbx   | saved regs
|  |   rip   |
|  |   rbp   |
|  +---------+ RBP
+--| tcc_ptr | STACK of subprog_tail()
   +---------+ RSP

Then, when subprog_tail() tailcalls entry():

   |  STACK  | STACK of entry()'s caller
   |         |
   |   rip   |
+->|   tcc   | its value is 1
|  |   rip   |
|  |   rbp   |
|  +---------+ rbp
|  |   ret   | STACK of entry()
+--| tcc_ptr |
|  |   rbx   | saved regs
|  |   rip   |
|  |   rbp   |
|  +---------+ RBP
|  |   ret   | STACK of entry(), reuse STACK of subprog_tail()
+--| tcc_ptr |
   +---------+ RSP

Then, when entry() calls subprog_tail():

   |  STACK  | STACK of entry()'s caller
   |         |
   |   rip   |
+->|   tcc   | its value is 1
|  |   rip   |
|  |   rbp   |
|  +---------+ rbp
|  |   ret   | STACK of entry()
+--| tcc_ptr |
|  |   rbx   | saved regs
|  |   rip   |
|  |   rbp   |
|  +---------+ rbp
|  |   ret   | STACK of entry(), reuse STACK of subprog_tail()
+--| tcc_ptr |
|  |   rip   |
|  |   rbp   |
|  +---------+ RBP
+--| tcc_ptr | STACK of subprog_tail()
   +---------+ RSP

Then, when subprog_tail() tailcalls entry():

   |  STACK  | STACK of entry()'s caller
   |         |
   |   rip   |
+->|   tcc   | its value is 2
|  |   rip   |
|  |   rbp   |
|  +---------+ rbp
|  |   ret   | STACK of entry()
+--| tcc_ptr |
|  |   rbx   | saved regs
|  |   rip   |
|  |   rbp   |
|  +---------+ rbp
|  |   ret   | STACK of entry(), reuse STACK of subprog_tail()
+--| tcc_ptr |
|  |   rip   |
|  |   rbp   |
|  +---------+ RBP
|  |   ret   | STACK of entry(), reuse STACK of subprog_tail()
+--| tcc_ptr |
   +---------+ RSP

And so on, again and again, until the MAX_TAIL_CALL_CNT limit kicks in for
the first time, when subprog_tail() has been called 34 times at the
position of subprog call1.
And at this time, the stack should be like:

   |  STACK  | STACK of entry()'s caller
   |         |
   |   rip   |
+->|   tcc   | its value is 33
|  |   rip   |
|  |   rbp   |
|  +---------+ rbp
|  |   ret   | STACK of entry()
+--| tcc_ptr |
|  |   rbx   | saved regs
|  |   rip   |
|  |   rbp   |
|  +---------+ rbp
|  |   ret   | STACK of entry(), reuse STACK of subprog_tail()
+--| tcc_ptr |
|  |   rip   |
|  |   rbp   |
|  +---------+ rbp
|  |   ret   | STACK of entry(), reuse STACK of subprog_tail()
+--| tcc_ptr |
|  |   rip   |
|  |   rbp   |
|  +---------+ rbp
|  |    *    |
|  |    *    |
|  |    *    |
|  +---------+ RBP
+--| tcc_ptr | STACK of subprog_tail()
   +---------+ RSP

After this point, all future tailcalls will be aborted, because
tail_call_cnt has reached 33, its MAX_TAIL_CALL_CNT limit.

That's the way this patch works. I hope the replies above give a clear idea
of how to understand this patch.

>>>
>>> I also realized that verifier assumes MAX_TAIL_CALL_CNT as 32 which has
>>> changed in the meantime to 33 and we should adjust the max allowed stack
>>> depth of subprogs? I believe this was brought up at LPC?
>>
>> There's the following code snippet in verifier.c:
>>
>> /* protect against potential stack overflow that might happen when
>>  * bpf2bpf calls get combined with tailcalls. Limit the caller's stack
>>  * depth for such case down to 256 so that the worst case scenario
>>  * would result in 8k stack size (32 which is tailcall limit * 256 =
>>  * 8k).
>>
>> But MAX_TAIL_CALL_CNT is 33.
>>
>> This was not brought up at LPC 2022 or 2023. I don't know whether it was
>> brought up at previous LPCs.
>
> Was referring to this:
> https://lpc.events/event/17/contributions/1595/attachments/1230/2506/LPC2023_slides.pdf
>

That's interesting. I tried to overflow the stack of one bpf prog, whose
max stack size is limited to 8k, with fexit. I failed, because a thread's
stack is not that small.
However, it was possible to compose a tailcall infinite loop with fexit,
because it broke the tail_call_cnt propagating chain: the chain does not
pass through the fexit trampoline.

Thanks,
Leon

> but let us focus on issue you're bringing up here in this patch. I brought
> this up thinking JIT code was fine, now it's clear that things go south
> when entry prog is tailcall target, which we didn't test.
>

>>
>> Thanks,
>> Leon
>>