Re: [PATCH bpf-next v2 1/2] bpf: Support private stack for bpf progs

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Mon, Jul 22, 2024 at 6:06 PM Alexei Starovoitov
<alexei.starovoitov@xxxxxxxxx> wrote:
>
> On Mon, Jul 22, 2024 at 1:58 PM Andrii Nakryiko
> <andrii.nakryiko@xxxxxxxxx> wrote:
> >
> > On Fri, Jul 19, 2024 at 8:28 PM Andrii Nakryiko
> > <andrii.nakryiko@xxxxxxxxx> wrote:
> > >
> > > On Thu, Jul 18, 2024 at 1:52 PM Yonghong Song <yonghong.song@xxxxxxxxx> wrote:
> > > >
> > > > The main motivation for private stack comes from nested
> > > > scheduler in sched-ext from Tejun. The basic idea is that
> > > >  - each cgroup will its own associated bpf program,
> > > >  - bpf program with parent cgroup will call bpf programs
> > > >    in immediate child cgroups.
> > > >
> > > > Let us say we have the following cgroup hierarchy:
> > > >   root_cg (prog0):
> > > >     cg1 (prog1):
> > > >       cg11 (prog11):
> > > >         cg111 (prog111)
> > > >         cg112 (prog112)
> > > >       cg12 (prog12):
> > > >         cg121 (prog121)
> > > >         cg122 (prog122)
> > > >     cg2 (prog2):
> > > >       cg21 (prog21)
> > > >       cg22 (prog22)
> > > >       cg23 (prog23)
> > > >
> > > > In the above example, prog0 will call a kfunc which will
> > > > call prog1 and prog2 to get sched info for cg1 and cg2 and
> > > > then the information is summarized and sent back to prog0.
> > > > Similarly, prog11 and prog12 will be invoked in the kfunc
> > > > and the result will be summarized and sent back to prog1, etc.
> > > >
> > > > Currently, for each thread, the x86 kernel allocate 8KB stack.
> > > > The each bpf program (including its subprograms) has maximum
> > > > 512B stack size to avoid potential stack overflow.
> > > > And nested bpf programs increase the risk of stack overflow.
> > > > To avoid potential stack overflow caused by bpf programs,
> > > > this patch implemented a private stack so bpf program stack
> > > > space is allocated dynamically when the program is jited.
> > > > Such private stack is applied to tracing programs like
> > > > kprobe/uprobe, perf_event, tracepoint, raw tracepoint and
> > > > tracing.
> > > >
> > > > But more than one instance of the same bpf program may
> > > > run in the system. To make things simple, percpu private
> > > > stack is allocated for each program, so if the same program
> > > > is running on different cpus concurrently, we won't have
> > > > any issue. Note that the kernel already have logic to prevent
> > > > the recursion for the same bpf program on the same cpu
> > > > (kprobe, fentry, etc.).
> > > >
> > > > The patch implemented a percpu private stack based approach
> > > > for x86 arch.
> > > >   - The stack size will be 0 and any stack access is from
> > > >     jit-time allocated percpu storage.
> > > >   - In the beginning of jit, r9 is used to save percpu
> > > >     private stack pointer.
> > > >   - Each rbp in the bpf asm insn is replaced by r9.
> > > >   - For each call, push r9 before the call and pop r9
> > > >     after the call to preserve r9 value.
> > > >
> > > > Compared to previous RFC patch [1], this patch added
> > > > some conditions to enable private stack, e.g., verifier
> > > > calculated stack size, prog type, etc. The new patch
> > > > also added a performance test to compare private stack
> > > > vs. no private stack.
> > > >
> > > > The following are some code example to illustrate the idea
> > > > for selftest cgroup_skb_sk_lookup:
> > > >
> > > >    the existing code                        the private-stack approach code
> > > >    endbr64                                  endbr64
> > > >    nop    DWORD PTR [rax+rax*1+0x0]         nop    DWORD PTR [rax+rax*1+0x0]
> > > >    xchg   ax,ax                             xchg   ax,ax
> > > >    push   rbp                               push   rbp
> > > >    mov    rbp,rsp                           mov    rbp,rsp
> > > >    endbr64                                  endbr64
> > > >    sub    rsp,0x68
> > > >    push   rbx                               push   rbx
> > > >    ...                                      ...
> > > >    ...                                      mov    r9d,0x8c1c860
> > > >    ...                                      add    r9,QWORD PTR gs:0x21a00
> > > >    ...                                      ...
> > > >    mov    rdx,rbp                           mov    rdx, r9
> > > >    add    rdx,0xffffffffffffffb4            rdx,0xffffffffffffffb4
> > > >    ...                                      ...
> > > >    mov    ecx,0x28                          mov    ecx,0x28
> > > >                                             push   r9
> > > >    call   0xffffffffe305e474                call   0xffffffffe305e524
> > > >                                             pop    r9
> > > >    mov    rdi,rax                           mov    rdi,rax
> > > >    ...                                      ...
> > > >    movzx  rdi,BYTE PTR [rbp-0x46]           movzx  rdi,BYTE PTR [r9-0x46]
> > > >    ...                                      ...
> > > >
> > >
> > > Eduard nerd-sniped me today with this a bit... :)
> > >
> > > I have a few questions and suggestions.
> > >
> > > So it seems like each *subprogram* (not the entire BPF program) gets
> > > its own per-CPU private stack allocation. Is that intentional? That
> > > seems a bit unnecessary. It also prevents any sort of actual
> > > recursion. Not sure if it's possible to write recursive BPF subprogram
> > > today, verifier seems to reject obvious limited recursion cases, but
> > > still, eventually we might need/want to support that, and this will be
> > > just another hurdle to overcome (so it's best to avoid adding it in
> > > the first place).
> > >
> > > I'm sure Eduard is going to try something like below and it will
> > > probably break badly (I haven't tried, sorry):
> > >
> > > int entry(void *ctx);
> > >
> > > struct {
> > >         __uint(type, BPF_MAP_TYPE_PROG_ARRAY);
> > >         __uint(max_entries, 1);
> > >         __uint(key_size, sizeof(__u32));
> > >         __array(values, int (void *));
> > > } prog_array_init SEC(".maps") = {
> > >         .values = {
> > >                 [0] = (void *)&entry,
> > >         },
> > > };
> > >
> > > static __noinline int subprog1(void)
> > > {
> > >     <some state on the stack>
> > >
> > >     /* here entry will replace subprog1, and so we'll have
> > >      * entry -> entry -> entry -> ..... <tail call limit> -> subprog1
> > >      */
> > >     bpf_tail_call(ctx, &prog_array_init, 0);
> > >
> > >     return 0;
> > > }
> > >
> > >
> > > SEC("raw_tp/sys_enter")
> > > int entry(void *ctx)
> > > {
> > >      <some state on the stack>
> > >
> > >      subprog1();
> > > }
> > >
> > > And we effectively have limited recursion where the entry's stack
> > > state is clobbered, no?
> > >
> > > So it seems like we need to support recursion.
> > >
> >
> > How come everyone just completely ignored the main point of my entire
> > email and a real problem that has to be solved?...
> >
> > Anyways, I did write a below program:
> >
> > $ cat minimal.bpf.c
> > // SPDX-License-Identifier: GPL-2.0 OR BSD-3-Clause
> > /* Copyright (c) 2020 Facebook */
> > #include <linux/bpf.h>
> > #include <bpf/bpf_helpers.h>
> >
> > char LICENSE[] SEC("license") = "Dual BSD/GPL";
> >
> > int my_pid = 0;
> >
> > int handle_tp(void *ctx);
> >
> > struct {
> >         __uint(type, BPF_MAP_TYPE_PROG_ARRAY);
> >         __uint(max_entries, 1);
> >         __uint(key_size, sizeof(__u32));
> >         __array(values, int (void *));
> > } prog_array_init SEC(".maps") = {
> >         .values = {
> >                 [0] = (void *)&handle_tp,
> >         },
> > };
> >
> > static __noinline int subprog(void *ctx)
> > {
> >         static int cnt;
> >
> >         cnt++;
> >
> >         bpf_printk("SUBPROG - BEFORE %d", cnt);
> >
> >         bpf_tail_call(ctx, &prog_array_init, 0);
> >
> >         bpf_printk("SUBPROG - AFTER %d", cnt);
> >
> >     return 0;
> > }
> >
> > SEC("tp/syscalls/sys_enter_write")
> > int handle_tp(void *ctx)
> > {
> >         static int cnt;
> >         int pid = bpf_get_current_pid_tgid() >> 32;
> >
> >         if (pid != my_pid)
> >                 return 0;
> >
> >         cnt++;
> >
> >         bpf_printk("ENTRY - BEFORE %d", cnt);
> >
> >         subprog(ctx);
> >
> >         bpf_printk("ENTRY - AFTER %d", cnt);
> >
> >         return 0;
> > }
> >
> >
> > And triggered one write syscall, getting the log above. You can see
> > that only subprogs are replaced (we only get "SUBPROG - AFTER 34" due
> > to the tail call limit). And we do indeed get lots of entry program
> > recurrence.
> >
> > We *need to support recursion* is my main point.
>
> Not quite.
> It's not a recursion. The stack collapsed/gone/wiped out before tail_call.

Only of subprog(), not of handle_tp(). See all those "ENTRY - AFTER"
messages. We do return to all the nested handle_tp() calls and
continue just fine.

I put the log into [0] for a bit easier visual inspection.

  [0] https://gist.github.com/anakryiko/6ccdfc62188f8ad4991641fb637d954c

> static int cnt counts stuff because it's static.
>
> So we don't need to support recursion with private stack,
> but tail_calls and private stack are buggy indeed.
>
> emit_bpf_tail_call*() shouldn't be adjusting 'rsp' when the private
> stack is used.





[Index of Archives]     [Linux Samsung SoC]     [Linux Rockchip SoC]     [Linux Actions SoC]     [Linux for Synopsys ARC Processors]     [Linux NFS]     [Linux NILFS]     [Linux USB Devel]     [Video for Linux]     [Linux Audio Users]     [Yosemite News]     [Linux Kernel]     [Linux SCSI]


  Powered by Linux