Re: [PATCH bpf-next v2 1/2] bpf: Support private stack for bpf progs

Alexei Starovoitov <alexei.starovoitov@xxxxxxxxx> · Mon, 22 Jul 2024 18:05:51 -0700

On Mon, Jul 22, 2024 at 1:58 PM Andrii Nakryiko
<andrii.nakryiko@xxxxxxxxx> wrote:
>
> On Fri, Jul 19, 2024 at 8:28 PM Andrii Nakryiko
> <andrii.nakryiko@xxxxxxxxx> wrote:
> >
> > On Thu, Jul 18, 2024 at 1:52 PM Yonghong Song <yonghong.song@xxxxxxxxx> wrote:
> > >
> > > The main motivation for private stack comes from nested
> > > scheduler in sched-ext from Tejun. The basic idea is that
> > >  - each cgroup will its own associated bpf program,
> > >  - bpf program with parent cgroup will call bpf programs
> > >    in immediate child cgroups.
> > >
> > > Let us say we have the following cgroup hierarchy:
> > >   root_cg (prog0):
> > >     cg1 (prog1):
> > >       cg11 (prog11):
> > >         cg111 (prog111)
> > >         cg112 (prog112)
> > >       cg12 (prog12):
> > >         cg121 (prog121)
> > >         cg122 (prog122)
> > >     cg2 (prog2):
> > >       cg21 (prog21)
> > >       cg22 (prog22)
> > >       cg23 (prog23)
> > >
> > > In the above example, prog0 will call a kfunc which will
> > > call prog1 and prog2 to get sched info for cg1 and cg2 and
> > > then the information is summarized and sent back to prog0.
> > > Similarly, prog11 and prog12 will be invoked in the kfunc
> > > and the result will be summarized and sent back to prog1, etc.
> > >
> > > Currently, for each thread, the x86 kernel allocate 8KB stack.
> > > The each bpf program (including its subprograms) has maximum
> > > 512B stack size to avoid potential stack overflow.
> > > And nested bpf programs increase the risk of stack overflow.
> > > To avoid potential stack overflow caused by bpf programs,
> > > this patch implemented a private stack so bpf program stack
> > > space is allocated dynamically when the program is jited.
> > > Such private stack is applied to tracing programs like
> > > kprobe/uprobe, perf_event, tracepoint, raw tracepoint and
> > > tracing.
> > >
> > > But more than one instance of the same bpf program may
> > > run in the system. To make things simple, percpu private
> > > stack is allocated for each program, so if the same program
> > > is running on different cpus concurrently, we won't have
> > > any issue. Note that the kernel already have logic to prevent
> > > the recursion for the same bpf program on the same cpu
> > > (kprobe, fentry, etc.).
> > >
> > > The patch implemented a percpu private stack based approach
> > > for x86 arch.
> > >   - The stack size will be 0 and any stack access is from
> > >     jit-time allocated percpu storage.
> > >   - In the beginning of jit, r9 is used to save percpu
> > >     private stack pointer.
> > >   - Each rbp in the bpf asm insn is replaced by r9.
> > >   - For each call, push r9 before the call and pop r9
> > >     after the call to preserve r9 value.
> > >
> > > Compared to previous RFC patch [1], this patch added
> > > some conditions to enable private stack, e.g., verifier
> > > calculated stack size, prog type, etc. The new patch
> > > also added a performance test to compare private stack
> > > vs. no private stack.
> > >
> > > The following are some code example to illustrate the idea
> > > for selftest cgroup_skb_sk_lookup:
> > >
> > >    the existing code                        the private-stack approach code
> > >    endbr64                                  endbr64
> > >    nop    DWORD PTR [rax+rax*1+0x0]         nop    DWORD PTR [rax+rax*1+0x0]
> > >    xchg   ax,ax                             xchg   ax,ax
> > >    push   rbp                               push   rbp
> > >    mov    rbp,rsp                           mov    rbp,rsp
> > >    endbr64                                  endbr64
> > >    sub    rsp,0x68
> > >    push   rbx                               push   rbx
> > >    ...                                      ...
> > >    ...                                      mov    r9d,0x8c1c860
> > >    ...                                      add    r9,QWORD PTR gs:0x21a00
> > >    ...                                      ...
> > >    mov    rdx,rbp                           mov    rdx, r9
> > >    add    rdx,0xffffffffffffffb4            rdx,0xffffffffffffffb4
> > >    ...                                      ...
> > >    mov    ecx,0x28                          mov    ecx,0x28
> > >                                             push   r9
> > >    call   0xffffffffe305e474                call   0xffffffffe305e524
> > >                                             pop    r9
> > >    mov    rdi,rax                           mov    rdi,rax
> > >    ...                                      ...
> > >    movzx  rdi,BYTE PTR [rbp-0x46]           movzx  rdi,BYTE PTR [r9-0x46]
> > >    ...                                      ...
> > >
> >
> > Eduard nerd-sniped me today with this a bit... :)
> >
> > I have a few questions and suggestions.
> >
> > So it seems like each *subprogram* (not the entire BPF program) gets
> > its own per-CPU private stack allocation. Is that intentional? That
> > seems a bit unnecessary. It also prevents any sort of actual
> > recursion. Not sure if it's possible to write recursive BPF subprogram
> > today, verifier seems to reject obvious limited recursion cases, but
> > still, eventually we might need/want to support that, and this will be
> > just another hurdle to overcome (so it's best to avoid adding it in
> > the first place).
> >
> > I'm sure Eduard is going to try something like below and it will
> > probably break badly (I haven't tried, sorry):
> >
> > int entry(void *ctx);
> >
> > struct {
> >         __uint(type, BPF_MAP_TYPE_PROG_ARRAY);
> >         __uint(max_entries, 1);
> >         __uint(key_size, sizeof(__u32));
> >         __array(values, int (void *));
> > } prog_array_init SEC(".maps") = {
> >         .values = {
> >                 [0] = (void *)&entry,
> >         },
> > };
> >
> > static __noinline int subprog1(void)
> > {
> >     <some state on the stack>
> >
> >     /* here entry will replace subprog1, and so we'll have
> >      * entry -> entry -> entry -> ..... <tail call limit> -> subprog1
> >      */
> >     bpf_tail_call(ctx, &prog_array_init, 0);
> >
> >     return 0;
> > }
> >
> >
> > SEC("raw_tp/sys_enter")
> > int entry(void *ctx)
> > {
> >      <some state on the stack>
> >
> >      subprog1();
> > }
> >
> > And we effectively have limited recursion where the entry's stack
> > state is clobbered, no?
> >
> > So it seems like we need to support recursion.
> >
>
> How come everyone just completely ignored the main point of my entire
> email and a real problem that has to be solved?...
>
> Anyways, I did write a below program:
>
> $ cat minimal.bpf.c
> // SPDX-License-Identifier: GPL-2.0 OR BSD-3-Clause
> /* Copyright (c) 2020 Facebook */
> #include <linux/bpf.h>
> #include <bpf/bpf_helpers.h>
>
> char LICENSE[] SEC("license") = "Dual BSD/GPL";
>
> int my_pid = 0;
>
> int handle_tp(void *ctx);
>
> struct {
>         __uint(type, BPF_MAP_TYPE_PROG_ARRAY);
>         __uint(max_entries, 1);
>         __uint(key_size, sizeof(__u32));
>         __array(values, int (void *));
> } prog_array_init SEC(".maps") = {
>         .values = {
>                 [0] = (void *)&handle_tp,
>         },
> };
>
> static __noinline int subprog(void *ctx)
> {
>         static int cnt;
>
>         cnt++;
>
>         bpf_printk("SUBPROG - BEFORE %d", cnt);
>
>         bpf_tail_call(ctx, &prog_array_init, 0);
>
>         bpf_printk("SUBPROG - AFTER %d", cnt);
>
>     return 0;
> }
>
> SEC("tp/syscalls/sys_enter_write")
> int handle_tp(void *ctx)
> {
>         static int cnt;
>         int pid = bpf_get_current_pid_tgid() >> 32;
>
>         if (pid != my_pid)
>                 return 0;
>
>         cnt++;
>
>         bpf_printk("ENTRY - BEFORE %d", cnt);
>
>         subprog(ctx);
>
>         bpf_printk("ENTRY - AFTER %d", cnt);
>
>         return 0;
> }
>
>
> And triggered one write syscall, getting the log above. You can see
> that only subprogs are replaced (we only get "SUBPROG - AFTER 34" due
> to the tail call limit). And we do indeed get lots of entry program
> recurrence.
>
> We *need to support recursion* is my main point.

Not quite.
It's not a recursion. The stack collapsed/gone/wiped out before tail_call.
static int cnt counts stuff because it's static.

So we don't need to support recursion with private stack,
but tail_calls and private stack are buggy indeed.

emit_bpf_tail_call*() shouldn't be adjusting 'rsp' when the private
stack is used.