Re: [PATCH bpf-next v2 1/2] bpf: Support private stack for bpf progs

Yonghong Song <yonghong.song@xxxxxxxxx> · Wed, 24 Jul 2024 10:56:07 -0700

On 7/24/24 9:54 AM, Alexei Starovoitov wrote:
On Tue, Jul 23, 2024 at 10:09 PM Yonghong Song <yonghong.song@xxxxxxxxx> wrote:

Discussed with Andrii. I think the following approach should work.
For each non-static prog, the private stack is allocated including
that non-static prog and the called static progs. For example,
      main_prog
         static_prog_1
           static_prog_11
           global_prog
              static_prog_12
         static_prog_2

So in verifier we calculate stack size for
      main_prog
         static_prog_1
            static_prog_11
         static_prog_2
   and
      global_prog
        static_prog_12

Let us say the stack sizes under main_prog are as below for each (sub)prog
      main_prog // stack size 100
         static_prog_1 // stack size 100
           static_prog_11 // stack size 100
         static_prog_2 // stack size 100
The deepest call chain (main_prog -> static_prog_1 -> static_prog_11)
uses 300 bytes, so the private stack size will be 300.
So R9 is calculated like below
      main_prog
        R9 = ... // for tailcall reachable, R9 may be original R9 + offset
                 // for non-tailcall reachable, R9 equals the original R9 (based on jit-time allocation).
        ...  R9 ...
        R9 += 100
        static_prog_1
           ... R9 ...
           R9 += 100
           static_prog_11
             ... R9 ...
           R9 -= 100
        R9 -= 100
        ... R9 ...
        R9 += 100
        static_prog_2
           ... R9 ...
        R9 -= 100

Similarly, we can calculate R9 offset for
      global_prog
        static_prog_12
as well.
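
The accounting described above can be sketched in plain user-space C.
This is only an illustration, not verifier code: struct subprog,
private_stack_size() and the 100-byte depths assumed for global_prog
and static_prog_12 are hypothetical (the example above only gives sizes
for the main_prog group). The sketch takes the deepest call chain from
a non-static root and stops at any global callee, which is sized as its
own root.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_CALLEES 4

/* Hypothetical model of a (sub)prog; not a kernel structure. */
struct subprog {
	const char *name;
	int stack_depth;	/* bytes used by this frame */
	int is_global;		/* non-static subprog */
	const struct subprog *callees[MAX_CALLEES];
	size_t nr_callees;
};

/*
 * Deepest stack use along any call chain starting at @p.  A global
 * callee ends the accounting: it is sized separately as its own root.
 */
int private_stack_size(const struct subprog *p)
{
	int deepest = 0;

	assert(p->nr_callees <= MAX_CALLEES);
	for (size_t i = 0; i < p->nr_callees; i++) {
		const struct subprog *c = p->callees[i];
		int d;

		if (c->is_global)
			continue;
		d = private_stack_size(c);
		if (d > deepest)
			deepest = d;
	}
	return p->stack_depth + deepest;
}

/* The call tree from the example; global-side depths are assumed. */
const struct subprog static_prog_12 = {
	"static_prog_12", 100, 0, { NULL }, 0
};
const struct subprog global_prog = {
	"global_prog", 100, 1, { &static_prog_12 }, 1
};
const struct subprog static_prog_11 = {
	"static_prog_11", 100, 0, { NULL }, 0
};
const struct subprog static_prog_1 = {
	"static_prog_1", 100, 0, { &static_prog_11, &global_prog }, 2
};
const struct subprog static_prog_2 = {
	"static_prog_2", 100, 0, { NULL }, 0
};
const struct subprog main_prog = {
	"main_prog", 100, 1, { &static_prog_1, &static_prog_2 }, 2
};
```

With these inputs private_stack_size(&main_prog) gives 300, matching
the number above, and private_stack_size(&global_prog) gives 200 under
the assumed depths.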
I don't understand why we differentiate static and global subprogs.

Specially handling global subprog is for potential BPF_PROG_TYPE_EXT
prog which may replace global subprog.

Therefore, for private stack, a global subprog will terminate
stack accounting to minimize stack usage. If we treat
static/global subprogs the same, and if freplace does happen,
we might allocate more-than-necessary private stack.

freplace is probably not a common use case. If it does happen,
the original global subprog may be a stub func which does
not have any stack usage and the freplace prog is the one
implementing the business logic. So from that perspective,
we do not need to differentiate static and global subprogs.

But, mainly, += and -= around the call is suboptimal.
Can we do it as a normal stack does?
Each prog knows how much stack it needs,
so it can do:
r9 += stack_depth in the prologue
and all accesses are done as r9 - off.
Then to do a call nothing extra is needed.
The callee will do r9 += its own stack depth.

I thought the += and -= at call site are easier to understand.
But certainly, doing r9 += stack_depth and
r9 -= stack_depth inside the subprog works too.
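
To make the prologue/epilogue variant concrete, here is a small
user-space simulation. It is a sketch only: priv_stack, enter(),
leave() and r9_offset() are made-up names, the 512-byte buffer is
arbitrary, and the stack grows up here simply because that is how
"r9 += stack_depth, access at r9 - off" reads; the growth direction is
still open. Each function adjusts R9 itself, so call sites need no
extra instructions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-ins for the private stack area and the R9 register. */
static uint8_t priv_stack[512];
static uint8_t *r9 = priv_stack;
static long high_water;		/* largest R9 offset seen so far */

/* Prologue: claim this frame's bytes. */
static void enter(int stack_depth)
{
	r9 += stack_depth;
	assert(r9 <= priv_stack + sizeof(priv_stack));
	if (r9 - priv_stack > high_water)
		high_water = r9 - priv_stack;
}

/* Epilogue: release this frame's bytes. */
static void leave(int stack_depth)
{
	r9 -= stack_depth;
}

static void static_prog_11(void)
{
	enter(100);
	r9[-8] = 0x11;		/* frame-local access at r9 - off */
	leave(100);
}

static void static_prog_1(void)
{
	enter(100);
	r9[-8] = 0x01;
	static_prog_11();	/* no R9 adjustment at the call site */
	assert(r9[-8] == 0x01);	/* callee did not touch our frame */
	leave(100);
}

static void static_prog_2(void)
{
	enter(100);
	r9[-8] = 0x02;
	leave(100);
}

/* Runs the example call tree; returns the peak private stack use. */
long main_prog(void)
{
	enter(100);
	static_prog_1();
	static_prog_2();
	leave(100);
	return high_water;
}

/* Current R9 offset from the base of the private stack. */
long r9_offset(void)
{
	return r9 - priv_stack;
}
```

Calling main_prog() reports a peak of 300 bytes and leaves R9 back at
the base, the same result as the += / -= at the call sites but with
the adjustment moved into each (sub)prog.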

Whether private stack grows up or down is tbd.

My current approach is that private stack grows down
similar to normal stack. But we have flexibility
to grow up at frame level.

The challenge is how to supply proper r9 on entry
into the main prog. Potentially a task for bpf trampoline,
and kprobe/tp/etc will need a special hack in bpf_dispatcher_nop_func.

I have an early hack for bpf trampoline and
bpf_dispatcher_nop_func to pass private stack pointer
as the third argument to the bpf program.
In this particular case, we can just pass private
stack pointer in R9. I will pick this up.