Re: [PATCH bpf-next 4/4] riscv, bpf: Mixing bpf2bpf and tailcalls

Björn Töpel <bjorn@xxxxxxxxxx> · Tue, 30 Jan 2024 17:03:49 +0100

Pu Lehui <pulehui@xxxxxxxxxx> writes:

> On 2024/1/30 21:28, Björn Töpel wrote:
>> Pu Lehui <pulehui@xxxxxxxxxx> writes:
>> 
>>> On 2024/1/30 16:29, Björn Töpel wrote:
>>>> Pu Lehui <pulehui@xxxxxxxxxxxxxxx> writes:
>>>>
>>>>> On 2023/9/28 17:59, Björn Töpel wrote:
>>>>>> Pu Lehui <pulehui@xxxxxxxxxxxxxxx> writes:
>>>>>>
>>>>>>> From: Pu Lehui <pulehui@xxxxxxxxxx>
>>>>>>>
>>>>>>> In the current RV64 JIT, if we just don't initialize the TCC in subprog,
>>>>>>> the TCC can be propagated from the parent process to the subprocess, but
>>>>>>> the TCC of the parent process cannot be restored when the subprocess
>>>>>>> exits. Since the RV64 TCC is initialized before saving the callee saved
>>>>>>> registers into the stack, we cannot use the callee saved register to
>>>>>>> pass the TCC, otherwise the original value of the callee saved register
>>>>>>> will be destroyed. So we implemented mixing bpf2bpf and tailcalls
>>>>>>> similar to x86_64, i.e. using a non-callee saved register to transfer
>>>>>>> the TCC between functions, and saving that register to the stack to
>>>>>>> protect the TCC value. At the same time, we also consider the scenario
>>>>>>> of mixing trampoline.
>>>>>>
>>>>>> Hi!
>>>>>>
>>>>>> The RISC-V JIT tries to minimize the stack usage, e.g. it doesn't have a
>>>>>> fixed pro/epilogue like some of the other JITs. I think we can do better
>>>>>> here, so that the pass-TCC-via-register can be used, and the additional
>>>>>> stack access can be avoided.
>>>>>>
>>>>>> Today, the TCC is passed via a register (a6) and can be viewed as a
>>>>>> "state" variable/transparent argument/return value. As you point out, we
>>>>>> loose this when we do a call. On (any) calls we move the TCC to a
>>>>>> callee-saved register.
>>>>>>
>>>>>> WDYT about the following scheme:
>>>>>>
>>>>>> 1 Pickup the arm64 bpf2bpf/tailmix mechanism of just clearing the TCC
>>>>>>      for the main program.
>>>>>> 2 For BPF helper calls, move TCC to s6, perform the call, and restore
>>>>>>      a6. Dito for kfunc calls (BPF_PSEUDO_KFUNC_CALL).
>>>>>> 3 For all other calls, a6 is passed transparently.
>>>>>>
>>>>>> For 2 bpf_jit_get_func_addr() can be used to determine if the callee is
>>>>>> a BPF helper or not.
>>>>>>
>>>>>> In summary; Determine in the JIT if we're leaving BPF-land, and need to
>>>>>> move the TCC to a callee-saved reg, or not, and save us a bunch of stack
>>>>>> store/loads.
>>>>>>
>>>>>
>>>>> Valuable scheme. But we need to consider TCC back propagation. Let me
>>>>> show an example of calling subprog with TCC stored in A6:
>>>>>
>>>>> prog1(TCC==1){
>>>>>        subprog1(TCC==1)
>>>>>            -> tailcall1(TCC==0)
>>>>>                -> subprog2(TCC==0)
>>>>>        subprog3(TCC==0) <--- should be TCC==1
>>>>>            -\-> tailcall2 <--- can't be called
>>>>> }
>>>
>>> Let's back with this example again. Imagine that the tailcall chain is a
>>> list limited to 33 elements. When the list has 32 elements, we call
>>> subprog1 and then tailcall1. At this time, the list elements count
>>> becomes 33. Then we call subprog2 and return prog1. At this time, the
>>> list removes 1 element and becomes 32 elements. At this time, there
>>> still can perform 1 tailcall.
>>>
>>> I've attached a diagram that shows mixing tailcall and subprogs is
>>> nearly a "call". It can return to caller function.
>> 
>> Hmm. Let me put my Q in another way.
>> 
>> The kernel calls into BPF_PROG_RUN() (~a BPF context). Would it ever be
>> OK to do more than 33 tail calls, regardless of subprogs or not?
>> 
>> In your example, TCC is 1. You are allowed to perform one tail call. In
>> your example prog1 performs two.
>> 
>> My view of TCC has always been ~a counter of the number of tailcalls~.
>> 
>> With your example expanded:
>> prog1(TCC==33){
>>        subprog1(TCC==33)
>>            -> tailcall1(TCC==33) -> tailcall1(TCC==32) -> tailcall1(TCC==31) -> ... // 33 times
>>        // Lehui says TCC should be 33 again.
>>        // Björn says "it's the number of tailcalls", and subprog3 cannot perform a tail call
>>        subprog3(TCC==?)
>
> Yes, my view is take this something like a stack，while you take this as 
> a fixed global value.
>
> prog1(TCC==33){
>      subprog1(TCC==33)
>          -> tailcall1(TCC==33) -> tailcall1(TCC==32) -> 
> tailcall1(TCC==31) -> ... // 33 times -> subprog2(TCC==0)
>      subprog3(TCC==33)
> 	-> tailcall1(TCC==33) -> tailcall1(TCC==32) -> tailcall1(TCC==31) -> 
> ... // 33 times
>
>>            
>> My view has, again, been than TCC is a run-time count of the number
>> tailcalls (fentry/fexit patch bpf-programs included).
>> 
>> What does x86 and arm64 do?
>
> When subprog return back to caller bpf program, they both restore TCC to 
> the value when enter into subprog. The ARM64 uses the callee saved 
> register to store the TCC. When the ARM64 exits, the TCC is restored to 
> the value when it enter. The while x86 uses the stack to do the same thing.

Ok! Thanks for clarifying. I'll continue reviewing the v2 of your
series!

BTW, I wonder if we can trigger this [1] on RV64 -- i.e. calling the
main prog, will reset the tcc count.

[1] https://lore.kernel.org/bpf/20240104142226.87869-1-hffilwlqm@xxxxxxxxx/