On Mon, Jul 22, 2024 at 1:58 PM Andrii Nakryiko <andrii.nakryiko@xxxxxxxxx> wrote: > > On Fri, Jul 19, 2024 at 8:28 PM Andrii Nakryiko > <andrii.nakryiko@xxxxxxxxx> wrote: > > > > On Thu, Jul 18, 2024 at 1:52 PM Yonghong Song <yonghong.song@xxxxxxxxx> wrote: > > > > > > The main motivation for private stack comes from nested > > > scheduler in sched-ext from Tejun. The basic idea is that > > > - each cgroup will its own associated bpf program, > > > - bpf program with parent cgroup will call bpf programs > > > in immediate child cgroups. > > > > > > Let us say we have the following cgroup hierarchy: > > > root_cg (prog0): > > > cg1 (prog1): > > > cg11 (prog11): > > > cg111 (prog111) > > > cg112 (prog112) > > > cg12 (prog12): > > > cg121 (prog121) > > > cg122 (prog122) > > > cg2 (prog2): > > > cg21 (prog21) > > > cg22 (prog22) > > > cg23 (prog23) > > > > > > In the above example, prog0 will call a kfunc which will > > > call prog1 and prog2 to get sched info for cg1 and cg2 and > > > then the information is summarized and sent back to prog0. > > > Similarly, prog11 and prog12 will be invoked in the kfunc > > > and the result will be summarized and sent back to prog1, etc. > > > > > > Currently, for each thread, the x86 kernel allocate 8KB stack. > > > The each bpf program (including its subprograms) has maximum > > > 512B stack size to avoid potential stack overflow. > > > And nested bpf programs increase the risk of stack overflow. > > > To avoid potential stack overflow caused by bpf programs, > > > this patch implemented a private stack so bpf program stack > > > space is allocated dynamically when the program is jited. > > > Such private stack is applied to tracing programs like > > > kprobe/uprobe, perf_event, tracepoint, raw tracepoint and > > > tracing. > > > > > > But more than one instance of the same bpf program may > > > run in the system. To make things simple, percpu private > > > stack is allocated for each program, so if the same program > > > is running on different cpus concurrently, we won't have > > > any issue. Note that the kernel already have logic to prevent > > > the recursion for the same bpf program on the same cpu > > > (kprobe, fentry, etc.). > > > > > > The patch implemented a percpu private stack based approach > > > for x86 arch. > > > - The stack size will be 0 and any stack access is from > > > jit-time allocated percpu storage. > > > - In the beginning of jit, r9 is used to save percpu > > > private stack pointer. > > > - Each rbp in the bpf asm insn is replaced by r9. > > > - For each call, push r9 before the call and pop r9 > > > after the call to preserve r9 value. > > > > > > Compared to previous RFC patch [1], this patch added > > > some conditions to enable private stack, e.g., verifier > > > calculated stack size, prog type, etc. The new patch > > > also added a performance test to compare private stack > > > vs. no private stack. > > > > > > The following are some code example to illustrate the idea > > > for selftest cgroup_skb_sk_lookup: > > > > > > the existing code the private-stack approach code > > > endbr64 endbr64 > > > nop DWORD PTR [rax+rax*1+0x0] nop DWORD PTR [rax+rax*1+0x0] > > > xchg ax,ax xchg ax,ax > > > push rbp push rbp > > > mov rbp,rsp mov rbp,rsp > > > endbr64 endbr64 > > > sub rsp,0x68 > > > push rbx push rbx > > > ... ... > > > ... mov r9d,0x8c1c860 > > > ... add r9,QWORD PTR gs:0x21a00 > > > ... ... > > > mov rdx,rbp mov rdx, r9 > > > add rdx,0xffffffffffffffb4 rdx,0xffffffffffffffb4 > > > ... ... > > > mov ecx,0x28 mov ecx,0x28 > > > push r9 > > > call 0xffffffffe305e474 call 0xffffffffe305e524 > > > pop r9 > > > mov rdi,rax mov rdi,rax > > > ... ... > > > movzx rdi,BYTE PTR [rbp-0x46] movzx rdi,BYTE PTR [r9-0x46] > > > ... ... > > > > > > > Eduard nerd-sniped me today with this a bit... :) > > > > I have a few questions and suggestions. > > > > So it seems like each *subprogram* (not the entire BPF program) gets > > its own per-CPU private stack allocation. Is that intentional? That > > seems a bit unnecessary. It also prevents any sort of actual > > recursion. Not sure if it's possible to write recursive BPF subprogram > > today, verifier seems to reject obvious limited recursion cases, but > > still, eventually we might need/want to support that, and this will be > > just another hurdle to overcome (so it's best to avoid adding it in > > the first place). > > > > I'm sure Eduard is going to try something like below and it will > > probably break badly (I haven't tried, sorry): > > > > int entry(void *ctx); > > > > struct { > > __uint(type, BPF_MAP_TYPE_PROG_ARRAY); > > __uint(max_entries, 1); > > __uint(key_size, sizeof(__u32)); > > __array(values, int (void *)); > > } prog_array_init SEC(".maps") = { > > .values = { > > [0] = (void *)&entry, > > }, > > }; > > > > static __noinline int subprog1(void) > > { > > <some state on the stack> > > > > /* here entry will replace subprog1, and so we'll have > > * entry -> entry -> entry -> ..... <tail call limit> -> subprog1 > > */ > > bpf_tail_call(ctx, &prog_array_init, 0); > > > > return 0; > > } > > > > > > SEC("raw_tp/sys_enter") > > int entry(void *ctx) > > { > > <some state on the stack> > > > > subprog1(); > > } > > > > And we effectively have limited recursion where the entry's stack > > state is clobbered, no? > > > > So it seems like we need to support recursion. > > > > How come everyone just completely ignored the main point of my entire > email and a real problem that has to be solved?... > > Anyways, I did write a below program: > > $ cat minimal.bpf.c > // SPDX-License-Identifier: GPL-2.0 OR BSD-3-Clause > /* Copyright (c) 2020 Facebook */ > #include <linux/bpf.h> > #include <bpf/bpf_helpers.h> > > char LICENSE[] SEC("license") = "Dual BSD/GPL"; > > int my_pid = 0; > > int handle_tp(void *ctx); > > struct { > __uint(type, BPF_MAP_TYPE_PROG_ARRAY); > __uint(max_entries, 1); > __uint(key_size, sizeof(__u32)); > __array(values, int (void *)); > } prog_array_init SEC(".maps") = { > .values = { > [0] = (void *)&handle_tp, > }, > }; > > static __noinline int subprog(void *ctx) > { > static int cnt; > > cnt++; > > bpf_printk("SUBPROG - BEFORE %d", cnt); > > bpf_tail_call(ctx, &prog_array_init, 0); > > bpf_printk("SUBPROG - AFTER %d", cnt); > > return 0; > } > > SEC("tp/syscalls/sys_enter_write") > int handle_tp(void *ctx) > { > static int cnt; > int pid = bpf_get_current_pid_tgid() >> 32; > > if (pid != my_pid) > return 0; > > cnt++; > > bpf_printk("ENTRY - BEFORE %d", cnt); > > subprog(ctx); > > bpf_printk("ENTRY - AFTER %d", cnt); > > return 0; > } > > > And triggered one write syscall, getting the log above. You can see > that only subprogs are replaced (we only get "SUBPROG - AFTER 34" due > to the tail call limit). And we do indeed get lots of entry program > recurrence. > > We *need to support recursion* is my main point. Not quite. It's not a recursion. The stack collapsed/gone/wiped out before tail_call. static int cnt counts stuff because it's static. So we don't need to support recursion with private stack, but tail_calls and private stack are buggy indeed. emit_bpf_tail_call*() shouldn't be adjusting 'rsp' when the private stack is used.