Re: [PATCH bpf-next v2 1/2] bpf: Support private stack for bpf progs

Yonghong Song <yonghong.song@xxxxxxxxx> · Mon, 22 Jul 2024 09:43:04 -0700

On 7/19/24 8:28 PM, Andrii Nakryiko wrote:
On Thu, Jul 18, 2024 at 1:52 PM Yonghong Song <yonghong.song@xxxxxxxxx> wrote:
The main motivation for private stack comes from nested
scheduler in sched-ext from Tejun. The basic idea is that
  - each cgroup will its own associated bpf program,
  - bpf program with parent cgroup will call bpf programs
    in immediate child cgroups.

Let us say we have the following cgroup hierarchy:
   root_cg (prog0):
     cg1 (prog1):
       cg11 (prog11):
         cg111 (prog111)
         cg112 (prog112)
       cg12 (prog12):
         cg121 (prog121)
         cg122 (prog122)
     cg2 (prog2):
       cg21 (prog21)
       cg22 (prog22)
       cg23 (prog23)

In the above example, prog0 will call a kfunc which will
call prog1 and prog2 to get sched info for cg1 and cg2 and
then the information is summarized and sent back to prog0.
Similarly, prog11 and prog12 will be invoked in the kfunc
and the result will be summarized and sent back to prog1, etc.

Currently, for each thread, the x86 kernel allocate 8KB stack.
The each bpf program (including its subprograms) has maximum
512B stack size to avoid potential stack overflow.
And nested bpf programs increase the risk of stack overflow.
To avoid potential stack overflow caused by bpf programs,
this patch implemented a private stack so bpf program stack
space is allocated dynamically when the program is jited.
Such private stack is applied to tracing programs like
kprobe/uprobe, perf_event, tracepoint, raw tracepoint and
tracing.

But more than one instance of the same bpf program may
run in the system. To make things simple, percpu private
stack is allocated for each program, so if the same program
is running on different cpus concurrently, we won't have
any issue. Note that the kernel already have logic to prevent
the recursion for the same bpf program on the same cpu
(kprobe, fentry, etc.).

The patch implemented a percpu private stack based approach
for x86 arch.
   - The stack size will be 0 and any stack access is from
     jit-time allocated percpu storage.
   - In the beginning of jit, r9 is used to save percpu
     private stack pointer.
   - Each rbp in the bpf asm insn is replaced by r9.
   - For each call, push r9 before the call and pop r9
     after the call to preserve r9 value.

Compared to previous RFC patch [1], this patch added
some conditions to enable private stack, e.g., verifier
calculated stack size, prog type, etc. The new patch
also added a performance test to compare private stack
vs. no private stack.

The following are some code example to illustrate the idea
for selftest cgroup_skb_sk_lookup:

    the existing code                        the private-stack approach code
    endbr64                                  endbr64
    nop    DWORD PTR [rax+rax*1+0x0]         nop    DWORD PTR [rax+rax*1+0x0]
    xchg   ax,ax                             xchg   ax,ax
    push   rbp                               push   rbp
    mov    rbp,rsp                           mov    rbp,rsp
    endbr64                                  endbr64
    sub    rsp,0x68
    push   rbx                               push   rbx
    ...                                      ...
    ...                                      mov    r9d,0x8c1c860
    ...                                      add    r9,QWORD PTR gs:0x21a00
    ...                                      ...
    mov    rdx,rbp                           mov    rdx, r9
    add    rdx,0xffffffffffffffb4            rdx,0xffffffffffffffb4
    ...                                      ...
    mov    ecx,0x28                          mov    ecx,0x28
                                             push   r9
    call   0xffffffffe305e474                call   0xffffffffe305e524
                                             pop    r9
    mov    rdi,rax                           mov    rdi,rax
    ...                                      ...
    movzx  rdi,BYTE PTR [rbp-0x46]           movzx  rdi,BYTE PTR [r9-0x46]
    ...                                      ...

Eduard nerd-sniped me today with this a bit... :)

I have a few questions and suggestions.

So it seems like each *subprogram* (not the entire BPF program) gets
its own per-CPU private stack allocation. Is that intentional? That

Currently yes. The reason is the same prog could be run on different
cpus at the same time.

seems a bit unnecessary. It also prevents any sort of actual
recursion. Not sure if it's possible to write recursive BPF subprogram
today, verifier seems to reject obvious limited recursion cases, but
still, eventually we might need/want to support that, and this will be
just another hurdle to overcome (so it's best to avoid adding it in
the first place).

I'm sure Eduard is going to try something like below and it will
probably break badly (I haven't tried, sorry):

int entry(void *ctx);

struct {
         __uint(type, BPF_MAP_TYPE_PROG_ARRAY);
         __uint(max_entries, 1);
         __uint(key_size, sizeof(__u32));
         __array(values, int (void *));
} prog_array_init SEC(".maps") = {
         .values = {
                 [0] = (void *)&entry,
         },
};

static __noinline int subprog1(void)
{
     <some state on the stack>

     /* here entry will replace subprog1, and so we'll have
      * entry -> entry -> entry -> ..... <tail call limit> -> subprog1
      */
     bpf_tail_call(ctx, &prog_array_init, 0);

     return 0;
}

SEC("raw_tp/sys_enter")
int entry(void *ctx)
{
      <some state on the stack>

      subprog1();
}

And we effectively have limited recursion where the entry's stack
state is clobbered, no?

So it seems like we need to support recursion.

So, the question I have is. Why not do the following:
a) only setup r9 *once* in entry program's prologue (before tail call
jump target)
b) before each call we can adjust r9 with current prog/subprog's
maximum *own* stack, something like:

push r9;
r9 += 128; // 128 is subprog's stack usage
call <some-subprog>
pop r9;

The idea being that on tail call or in subprog call we assume r9 is
already pointing to the right place. We can probably also figure out
how to avoid push/pop r9 if we make sure that subprogram always
restores r9 (taking tail calls into account and all that, of course)?

Is this feasible?

This is possible. I actually hacked such an idea easily. The basic
idea is push frame pointer as an additional argument to the bpf
static sub-prog. This is a little bit complicated. It will probably
save some stack size but I am not sure how much it is.

Another question I have is whether it would be possible to just plain
set rbp to private stack and keep using rbp in such a way that stack
traces are preserved? I.e., save return address on private stack to
unwinders can correctly jump back to kernel's stack?

I also tried this approach earlier. But it is very trickly we need to
modify rbp/rsp and additional jit code will be added
If interrupts happens, we will not be able to get reliable stack trace.

How stupid is what I propose above?

So the number of insns is increased by 1 + num_of_calls * 2.
Here the number of calls are those calls in the final jited binary.
Comparing function call itself, the push/pop overhead should be
minimum in most common cases.

Our original use case is for sched-ext nested scheduler. This will be done
in the future.

   [1] https://lore.kernel.org/bpf/707970c5-6bba-450a-be08-adf24d8b9276@xxxxxxxxx/T/

Signed-off-by: Yonghong Song <yonghong.song@xxxxxxxxx>
---
  arch/x86/net/bpf_jit_comp.c | 63 ++++++++++++++++++++++++++++++++++---
  include/linux/bpf.h         |  2 ++
  kernel/bpf/core.c           | 20 ++++++++++++
  kernel/bpf/syscall.c        |  1 +
  4 files changed, 82 insertions(+), 4 deletions(-)

[...]