On Wed, Feb 14, 2024 at 11:53 AM Yonghong Song <yonghong.song@xxxxxxxxx> wrote:
>
> For each active kernel thread, the thread stack size is 2*PAGE_SIZE ([1]).
> Each bpf program has a maximum stack size 512 bytes to avoid
> overflowing the thread stack. But nested bpf programs may post
> a challenge to avoid stack overflow.
>
> For example, currently we already allow nested bpf
> programs esp in tracing, i.e.,
>   Prog_A
>     -> Call Helper_B
>       -> Call Func_C
>         -> fentry program is called due to Func_C.
>           -> Call Helper_D and then Func_E
>             -> fentry due to Func_E
>               -> ...
> If we have too many bpf programs in the chain and each bpf program
> has close to 512 byte stack size, it could overflow the kernel thread
> stack.
>
> Another more practical potential use case is from a discussion between
> Alexei and Tejun. It is possible for a complex scheduler like sched-ext,
> we could have BPF prog hierarchy like below:
>   Prog_1 (at system level)
>     Prog_Numa_1    Prog_Numa_2 ...  Prog_Numa_4
>     Prog_LLC_1     Prog_LLC_2 ...
>     Prog_CPU_1 ...
>
> Basically, the top bpf program (Prog_1) will call Prog_Numa_* programs
> through a kfunc to collect information from programs in each numa node.
> Each Prog_Numa_* program will call Prog_LLC_* programs to collect
> information from programs in each llc domain in that particular
> numa node, etc. The same for Prog_LLC_* vs. Prog_CPU_*.
> Now we have four level nested bpf programs.
>
> The proposed approach is to allocate stack from heap for
> each bpf program. That way, we do not need to worry about
> kernel stack overflow. Such an approach is called
> segmented stacks ([2]) in clang/gcc/go etc.
>
> Obviously there are some drawbacks for segmented stack approach:
>   - some performance degradation, so this approach may not for everyone.
>   - stack backtracking, kernel changes are necessary.

I suspect segmented stacks the way compilers do them are not suitable
for bpf progs, since they break backtraces, and backtrace is a crucial
feature that must work even when there are kernel bugs.

How about we keep call/ret and the save/restore of callee-saved regs
on the normal stack, but use a parallel memory area (per-cpu or some
other) for what the bpf prog needs. What the bpf prog thinks of as its
stack will live in that memory, while the call chain will remain correct.

From the bpf prog pov the stack is wherever bpf_reg_r10 points. It
doesn't have to be in the kernel stack. Shadow memory will work.

Let's also call it something other than "segmented stack" to avoid
confusion.
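
To make the idea concrete, here is a rough userspace sketch, not kernel
code and not a proposed implementation. Calls and returns stay on the
normal C stack, while each "prog" keeps its locals in a frame carved out
of a separate shadow area and reaches them through an r10-style frame
pointer. Everything named here (shadow_mem, shadow_push, shadow_pop,
run_prog, MAX_NESTING) is hypothetical, and a plain static array stands
in for per-cpu memory. Only the 512-byte per-prog limit comes from the
thread above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PROG_STACK_SIZE 512  /* per-prog stack limit quoted above */
#define MAX_NESTING     8    /* hypothetical cap on nested progs */

/* Hypothetical shadow area; a static array stands in for per-cpu memory. */
static uint8_t shadow_mem[MAX_NESTING * PROG_STACK_SIZE];
static int shadow_depth;

/*
 * Reserve one frame in the shadow area and return its top, which plays
 * the role of r10 (the frame grows down from it). NULL if exhausted.
 */
static uint8_t *shadow_push(void)
{
	if (shadow_depth >= MAX_NESTING)
		return NULL;
	shadow_depth++;
	return shadow_mem + (size_t)shadow_depth * PROG_STACK_SIZE;
}

/* Release the most recently reserved frame. */
static void shadow_pop(void)
{
	assert(shadow_depth > 0);
	shadow_depth--;
}

/*
 * Stand-in for a bpf prog that triggers another prog 'nest - 1' more
 * times. The recursion itself uses the ordinary C stack, so the call
 * chain is untouched. Returns how many levels actually ran.
 */
static int run_prog(int nest)
{
	uint8_t *r10 = shadow_push();
	int reached;

	if (!r10)
		return 0;              /* no shadow frame left: refuse to run */

	r10[-1] = (uint8_t)nest;       /* a prog local at "r10 - 1" */
	reached = 1;
	if (nest > 1)
		reached += run_prog(nest - 1);

	assert(r10[-1] == (uint8_t)nest); /* local survived the nested progs */
	shadow_pop();
	return reached;
}
```

Calling run_prog(4) mirrors the four-level sched-ext hierarchy from the
quoted mail. Asking for more levels than MAX_NESTING stops at the cap
instead of overflowing anything. How a real implementation would size,
allocate and protect the shadow area is left open here.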