Re: [PATCH bpf-next 4/4] riscv, bpf: Mixing bpf2bpf and tailcalls

Pu Lehui <pulehui@xxxxxxxxxx> · Tue, 30 Jan 2024 22:10:12 +0800

On 2024/1/30 21:28, Björn Töpel wrote:
Pu Lehui <pulehui@xxxxxxxxxx> writes:

On 2024/1/30 16:29, Björn Töpel wrote:
Pu Lehui <pulehui@xxxxxxxxxxxxxxx> writes:

On 2023/9/28 17:59, Björn Töpel wrote:
Pu Lehui <pulehui@xxxxxxxxxxxxxxx> writes:

From: Pu Lehui <pulehui@xxxxxxxxxx>

In the current RV64 JIT, if we just don't initialize the TCC in subprog,
the TCC can be propagated from the parent process to the subprocess, but
the TCC of the parent process cannot be restored when the subprocess
exits. Since the RV64 TCC is initialized before saving the callee saved
registers into the stack, we cannot use the callee saved register to
pass the TCC, otherwise the original value of the callee saved register
will be destroyed. So we implemented mixing bpf2bpf and tailcalls
similar to x86_64, i.e. using a non-callee saved register to transfer
the TCC between functions, and saving that register to the stack to
protect the TCC value. At the same time, we also consider the scenario
of mixing trampoline.

Hi!

The RISC-V JIT tries to minimize the stack usage, e.g. it doesn't have a
fixed pro/epilogue like some of the other JITs. I think we can do better
here, so that the pass-TCC-via-register can be used, and the additional
stack access can be avoided.

Today, the TCC is passed via a register (a6) and can be viewed as a
"state" variable/transparent argument/return value. As you point out, we
loose this when we do a call. On (any) calls we move the TCC to a
callee-saved register.

WDYT about the following scheme:

1 Pickup the arm64 bpf2bpf/tailmix mechanism of just clearing the TCC
     for the main program.
2 For BPF helper calls, move TCC to s6, perform the call, and restore
     a6. Dito for kfunc calls (BPF_PSEUDO_KFUNC_CALL).
3 For all other calls, a6 is passed transparently.

For 2 bpf_jit_get_func_addr() can be used to determine if the callee is
a BPF helper or not.

In summary; Determine in the JIT if we're leaving BPF-land, and need to
move the TCC to a callee-saved reg, or not, and save us a bunch of stack
store/loads.

Valuable scheme. But we need to consider TCC back propagation. Let me
show an example of calling subprog with TCC stored in A6:

prog1(TCC==1){
       subprog1(TCC==1)
           -> tailcall1(TCC==0)
               -> subprog2(TCC==0)
       subprog3(TCC==0) <--- should be TCC==1
           -\-> tailcall2 <--- can't be called
}

Let's back with this example again. Imagine that the tailcall chain is a
list limited to 33 elements. When the list has 32 elements, we call
subprog1 and then tailcall1. At this time, the list elements count
becomes 33. Then we call subprog2 and return prog1. At this time, the
list removes 1 element and becomes 32 elements. At this time, there
still can perform 1 tailcall.

I've attached a diagram that shows mixing tailcall and subprogs is
nearly a "call". It can return to caller function.

Hmm. Let me put my Q in another way.

The kernel calls into BPF_PROG_RUN() (~a BPF context). Would it ever be
OK to do more than 33 tail calls, regardless of subprogs or not?

In your example, TCC is 1. You are allowed to perform one tail call. In
your example prog1 performs two.

My view of TCC has always been ~a counter of the number of tailcalls~.

With your example expanded:
prog1(TCC==33){
       subprog1(TCC==33)
           -> tailcall1(TCC==33) -> tailcall1(TCC==32) -> tailcall1(TCC==31) -> ... // 33 times
       // Lehui says TCC should be 33 again.
       // Björn says "it's the number of tailcalls", and subprog3 cannot perform a tail call
       subprog3(TCC==?)

Yes, my view is take this something like a stack，while you take this as 
a fixed global value.

prog1(TCC==33){
    subprog1(TCC==33)
        -> tailcall1(TCC==33) -> tailcall1(TCC==32) -> 
tailcall1(TCC==31) -> ... // 33 times -> subprog2(TCC==0)
    subprog3(TCC==33)
	-> tailcall1(TCC==33) -> tailcall1(TCC==32) -> tailcall1(TCC==31) -> 
... // 33 times

My view has, again, been than TCC is a run-time count of the number
tailcalls (fentry/fexit patch bpf-programs included).

What does x86 and arm64 do?

When subprog return back to caller bpf program, they both restore TCC to 
the value when enter into subprog. The ARM64 uses the callee saved 
register to store the TCC. When the ARM64 exits, the TCC is restored to 
the value when it enter. The while x86 uses the stack to do the same thing.

Björn