Pu Lehui <pulehui@xxxxxxxxxx> writes:

> On 2024/1/30 21:28, Björn Töpel wrote:
>> Pu Lehui <pulehui@xxxxxxxxxx> writes:
>>
>>> On 2024/1/30 16:29, Björn Töpel wrote:
>>>> Pu Lehui <pulehui@xxxxxxxxxxxxxxx> writes:
>>>>
>>>>> On 2023/9/28 17:59, Björn Töpel wrote:
>>>>>> Pu Lehui <pulehui@xxxxxxxxxxxxxxx> writes:
>>>>>>
>>>>>>> From: Pu Lehui <pulehui@xxxxxxxxxx>
>>>>>>>
>>>>>>> In the current RV64 JIT, if we just don't initialize the TCC in subprog,
>>>>>>> the TCC can be propagated from the parent process to the subprocess, but
>>>>>>> the TCC of the parent process cannot be restored when the subprocess
>>>>>>> exits. Since the RV64 TCC is initialized before saving the callee saved
>>>>>>> registers into the stack, we cannot use the callee saved register to
>>>>>>> pass the TCC, otherwise the original value of the callee saved register
>>>>>>> will be destroyed. So we implemented mixing bpf2bpf and tailcalls
>>>>>>> similar to x86_64, i.e. using a non-callee saved register to transfer
>>>>>>> the TCC between functions, and saving that register to the stack to
>>>>>>> protect the TCC value. At the same time, we also consider the scenario
>>>>>>> of mixing trampoline.
>>>>>>
>>>>>> Hi!
>>>>>>
>>>>>> The RISC-V JIT tries to minimize the stack usage, e.g. it doesn't have a
>>>>>> fixed pro/epilogue like some of the other JITs. I think we can do better
>>>>>> here, so that the pass-TCC-via-register can be used, and the additional
>>>>>> stack access can be avoided.
>>>>>>
>>>>>> Today, the TCC is passed via a register (a6) and can be viewed as a
>>>>>> "state" variable/transparent argument/return value. As you point out, we
>>>>>> lose this when we do a call. On (any) calls we move the TCC to a
>>>>>> callee-saved register.
>>>>>>
>>>>>> WDYT about the following scheme:
>>>>>>
>>>>>> 1 Pick up the arm64 bpf2bpf/tailmix mechanism of just clearing the TCC
>>>>>>   for the main program.
>>>>>> 2 For BPF helper calls, move TCC to s6, perform the call, and restore
>>>>>>   a6.
>>>>>>   Ditto for kfunc calls (BPF_PSEUDO_KFUNC_CALL).
>>>>>> 3 For all other calls, a6 is passed transparently.
>>>>>>
>>>>>> For 2, bpf_jit_get_func_addr() can be used to determine if the callee is
>>>>>> a BPF helper or not.
>>>>>>
>>>>>> In summary: determine in the JIT if we're leaving BPF-land, and need to
>>>>>> move the TCC to a callee-saved reg, or not, and save us a bunch of stack
>>>>>> store/loads.
>>>>>>
>>>>>
>>>>> Valuable scheme. But we need to consider TCC back propagation. Let me
>>>>> show an example of calling subprog with TCC stored in A6:
>>>>>
>>>>> prog1(TCC==1){
>>>>>     subprog1(TCC==1)
>>>>>         -> tailcall1(TCC==0)
>>>>>             -> subprog2(TCC==0)
>>>>>     subprog3(TCC==0) <--- should be TCC==1
>>>>>         -\-> tailcall2 <--- can't be called
>>>>> }
>>>
>>> Let's go back to this example again. Imagine that the tailcall chain is a
>>> list limited to 33 elements. When the list has 32 elements, we call
>>> subprog1 and then tailcall1. At this time, the list element count
>>> becomes 33. Then we call subprog2 and return to prog1. At this time, the
>>> list removes 1 element and becomes 32 elements. At this time, we can
>>> still perform 1 tailcall.
>>>
>>> I've attached a diagram that shows mixing tailcall and subprogs is
>>> nearly a "call". It can return to the caller function.
>>
>> Hmm. Let me put my Q in another way.
>>
>> The kernel calls into BPF_PROG_RUN() (~a BPF context). Would it ever be
>> OK to do more than 33 tail calls, regardless of subprogs or not?
>>
>> In your example, TCC is 1. You are allowed to perform one tail call. In
>> your example prog1 performs two.
>>
>> My view of TCC has always been ~a counter of the number of tailcalls~.
>>
>> With your example expanded:
>> prog1(TCC==33){
>>     subprog1(TCC==33)
>>         -> tailcall1(TCC==33) -> tailcall1(TCC==32) -> tailcall1(TCC==31) -> ... // 33 times
>>     // Lehui says TCC should be 33 again.
>>     // Björn says "it's the number of tailcalls", and subprog3 cannot perform a tail call
>>     subprog3(TCC==?)
>
> Yes, my view is to treat this as something like a stack, while you treat
> it as a fixed global value.
>
> prog1(TCC==33){
>     subprog1(TCC==33)
>         -> tailcall1(TCC==33) -> tailcall1(TCC==32) -> tailcall1(TCC==31) -> ... // 33 times -> subprog2(TCC==0)
>     subprog3(TCC==33)
>         -> tailcall1(TCC==33) -> tailcall1(TCC==32) -> tailcall1(TCC==31) -> ... // 33 times
>
>>
>> My view has, again, been that TCC is a run-time count of the number of
>> tailcalls (fentry/fexit patch bpf-programs included).
>>
>> What do x86 and arm64 do?
>
> When a subprog returns to the caller bpf program, they both restore TCC to
> the value it had on entry to the subprog. ARM64 uses a callee saved
> register to store the TCC. When the ARM64 subprog exits, the TCC is
> restored to the value it had on entry, while x86 uses the stack to do the
> same thing.

Ok! Thanks for clarifying. I'll continue reviewing the v2 of your series!

BTW, I wonder if we can trigger this [1] on RV64 -- i.e. calling the main
prog, will reset the tcc count.

[1] https://lore.kernel.org/bpf/20240104142226.87869-1-hffilwlqm@xxxxxxxxx/
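
For reference, the two views of the TCC discussed above (restored on
subprog return vs. a single run-time counter) can be compared with a small
toy model. This is only a sketch and not JIT code: the names
TailCallCounter, call_subprog, and run_prog1 are made up for illustration,
the TCC is modelled as a remaining budget that counts down as in the
prog1(TCC==1) example, and a tail call is simplified to "consume one unit
and continue" rather than replacing the running program.

```python
class TailCallCounter:
    """Toy model of the TCC as a remaining tail-call budget (hypothetical)."""

    def __init__(self, budget, restore_on_return):
        self.remaining = budget
        # True  -> caller's TCC is restored when a subprog returns
        #          (the behaviour described for x86 and arm64 in the thread)
        # False -> TCC is one run-time counter that is never restored
        self.restore_on_return = restore_on_return

    def tail_call(self):
        """Consume one unit of budget; return False if the tail call is refused."""
        if self.remaining == 0:
            return False
        self.remaining -= 1
        return True

    def call_subprog(self, body):
        """Run body as a bpf2bpf call, optionally restoring the TCC afterwards."""
        saved = self.remaining
        result = body(self)
        if self.restore_on_return:
            self.remaining = saved
        return result


def run_prog1(budget, restore_on_return):
    """Replay the prog1 example: subprog1 -> tailcall1, then subprog3 -> tailcall2."""
    tcc = TailCallCounter(budget, restore_on_return)
    first = tcc.call_subprog(lambda t: t.tail_call())   # subprog1 -> tailcall1
    second = tcc.call_subprog(lambda t: t.tail_call())  # subprog3 -> tailcall2
    return first, second
```

With a budget of 1, run_prog1(1, True) lets both tail calls through, while
run_prog1(1, False) refuses tailcall2, which is the "can't be called" case
in the example.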