Re: [LSF/MM/BPF TOPIC] Segmented Stacks for BPF Programs

Alexei Starovoitov <alexei.starovoitov@xxxxxxxxx> · Wed, 14 Feb 2024 18:20:16 -0800

On Wed, Feb 14, 2024 at 11:53 AM Yonghong Song <yonghong.song@xxxxxxxxx> wrote:
>
> For each active kernel thread, the thread stack size is 2*PAGE_SIZE ([1]).
> Each bpf program has a maximum stack size 512 bytes to avoid
> overflowing the thread stack. But nested bpf programs may pose
> a challenge in avoiding stack overflow.
>
> For example, currently we already allow nested bpf
> programs, especially in tracing, i.e.,
>    Prog_A
>      -> Call Helper_B
>        -> Call Func_C
>          -> fentry program is called due to Func_C.
>            -> Call Helper_D and then Func_E
>              -> fentry due to Func_E
>                -> ...
> If we have too many bpf programs in the chain and each bpf program
> has close to 512 byte stack size, it could overflow the kernel thread
> stack.
>
> Another more practical potential use case is from a discussion between
> Alexei and Tejun. For a complex scheduler like sched-ext,
> we could have a BPF prog hierarchy like below:
>                         Prog_1 (at system level)
>            Prog_Numa_1    Prog_Numa_2 ...  Prog_Numa_4
>         Prog_LLC_1 Prog_LLC_2 ...
>       Prog_CPU_1 ...
>
> Basically, the top bpf program (Prog_1) will call Prog_Numa_* programs
> through a kfunc to collect information from programs in each numa node.
> Each Prog_Numa_* program will call Prog_LLC_* programs to collect
> information from programs in each llc domain in that particular
> numa node, etc. The same for Prog_LLC_* vs. Prog_CPU_*.
> Now we have four levels of nested bpf programs.
>
> The proposed approach is to allocate stack from heap for
> each bpf program. That way, we do not need to worry about
> kernel stack overflow. Such an approach is called
> segmented stacks ([2]) in clang/gcc/go etc.
>
> Obviously there are some drawbacks to the segmented stack approach:
>   - some performance degradation, so this approach may not be for everyone.
>   - stack backtracking: kernel changes are necessary.

I suspect segmented stacks the way compilers do them are not suitable
for bpf progs, since they break backtraces and backtrace is a crucial
feature that must work even when there are kernel bugs.
How about we keep call/ret, save/restore of callee saved regs
in the normal stack, but use a parallel memory (per-cpu or some other)
for bpf prog needs. What bpf prog thinks of stack will be in that memory
while the call chain will remain correct.
From bpf prog pov the stack is where bpf_reg_r10 points to.
It doesn't have to be in the kernel stack. Shadow memory will work.
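
To make the idea concrete, here is a rough userspace sketch in C, not
kernel code. Everything in it is hypothetical: the names shadow_area,
shadow_push, shadow_pop and run_nested, the SHADOW_DEPTH limit, and the
static array standing in for per-cpu memory. It only illustrates that
the native call chain stays on the normal stack while each prog's
"stack" is carved out of a parallel area, with a frame pointer playing
the role of r10.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define BPF_MAX_STACK 512  /* per-prog stack limit mentioned in this thread */
#define SHADOW_DEPTH  8    /* hypothetical cap on nesting, for illustration */

/* Hypothetical shadow area; a kernel version might be per-cpu memory. */
struct shadow_area {
	uint8_t mem[SHADOW_DEPTH * BPF_MAX_STACK];
	size_t  used;          /* bytes currently handed out */
};

static struct shadow_area area;  /* stand-in for one CPU's shadow area */

/*
 * Reserve 'depth' bytes and return the frame pointer for the new frame.
 * The prog addresses its stack as fp[-1] .. fp[-depth], so the pointer
 * returned is the top of the frame. Returns NULL when the area is full.
 */
static uint8_t *shadow_push(struct shadow_area *a, size_t depth)
{
	if (depth > BPF_MAX_STACK || a->used + depth > sizeof(a->mem))
		return NULL;
	a->used += depth;
	return a->mem + a->used;
}

/* Release the most recently reserved frame of 'depth' bytes. */
static void shadow_pop(struct shadow_area *a, size_t depth)
{
	a->used -= depth;
}

/*
 * Simulate nested progs. The C call/ret chain lives on the native stack
 * (so a backtrace would still be walkable), while prog-local data lives
 * in the shadow area. Returns 0 on success, -1 if the area ran out.
 */
static int run_nested(int level, int max_level)
{
	uint8_t *fp = shadow_push(&area, BPF_MAX_STACK);
	int rc = 0;

	if (!fp)
		return -1;

	fp[-1] = (uint8_t)level;           /* prog writes to "its stack" */
	if (level < max_level)
		rc = run_nested(level + 1, max_level);
	assert(fp[-1] == (uint8_t)level);  /* nested progs did not clobber it */

	shadow_pop(&area, BPF_MAX_STACK);
	return rc;
}
```

Calling run_nested(1, 4) mirrors the four-level hierarchy described
above. With more levels than SHADOW_DEPTH the push fails cleanly
instead of overrunning anything, and the area is fully released on the
way back out.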

Let's also call it something other than "segmented stack" to avoid
confusion.