For each active kernel thread, the thread stack size is 2*PAGE_SIZE ([1]).
Each bpf program has a maximum stack size of 512 bytes to avoid
overflowing the thread stack. But nested bpf programs may still pose a
challenge for avoiding stack overflow.

For example, we already allow nested bpf programs, especially in
tracing, i.e.,
  Prog_A
    -> Call Helper_B
      -> Call Func_C
        -> fentry program is called due to Func_C.
          -> Call Helper_D and then Func_E
            -> fentry due to Func_E
              -> ...
If there are too many bpf programs in the chain and each bpf program
uses close to 512 bytes of stack, the chain could overflow the kernel
thread stack.

Another, more practical, potential use case comes from a discussion
between Alexei and Tejun. With a complex scheduler like sched-ext, we
could have a BPF prog hierarchy like below:
                       Prog_1 (at system level)
          Prog_Numa_1    Prog_Numa_2 ...  Prog_Numa_4
        Prog_LLC_1 Prog_LLC_2 ...
    Prog_CPU_1 ...

Basically, the top bpf program (Prog_1) will call Prog_Numa_* programs
through a kfunc to collect information from programs in each numa node.
Each Prog_Numa_* program will call Prog_LLC_* programs to collect
information from programs in each llc domain in that particular numa
node, etc. The same applies to Prog_LLC_* vs. Prog_CPU_*. Now we have
four levels of nested bpf programs.

The proposed approach is to allocate the stack from the heap for each
bpf program. That way, we do not need to worry about kernel stack
overflow. Such an approach is called segmented stacks ([2]) in
clang/gcc/go etc.

Obviously there are some drawbacks to the segmented stack approach:
  - some performance degradation, so this approach may not be for
    everyone.
  - stack backtracing: kernel changes are necessary.

  [1] https://www.kernel.org/doc/html/next/x86/kernel-stacks.html
  [2] https://releases.llvm.org/3.0/docs/SegmentedStacks.html
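
As a rough illustration of the overflow concern described above, the sketch below works out how much thread stack a chain of nested bpf programs could consume if each one uses the full 512 bytes. It is a user-space back-of-the-envelope calculation, not kernel code. The 4 KiB page size and the names ASSUMED_PAGE_SIZE, bpf_stack_bytes() and max_nesting_upper_bound() are assumptions made for illustration only.

```c
#include <assert.h>

/* Assumption for illustration: 4 KiB pages, so the 2*PAGE_SIZE thread
 * stack cited in [1] comes to 8 KiB. Real values depend on arch/config. */
#define ASSUMED_PAGE_SIZE	4096u
#define THREAD_STACK_SIZE	(2u * ASSUMED_PAGE_SIZE)
#define BPF_MAX_STACK_SIZE	512u	/* per-program limit stated above */

/* Bytes of stack used by `depth` nested bpf programs when every program
 * uses its full 512-byte allowance. Frames for helpers, kfuncs and
 * trampolines between the programs are deliberately ignored. */
static unsigned int bpf_stack_bytes(unsigned int depth)
{
	return depth * BPF_MAX_STACK_SIZE;
}

/* Loose upper bound on nesting depth: the number of full-size bpf stacks
 * that would fill the entire thread stack with nothing else on it. */
static unsigned int max_nesting_upper_bound(void)
{
	return THREAD_STACK_SIZE / BPF_MAX_STACK_SIZE;
}

/* Under the assumptions above, 16 full-size program stacks already equal
 * the whole thread stack. */
static_assert(THREAD_STACK_SIZE / BPF_MAX_STACK_SIZE == 16,
	      "16 * 512 bytes == 2 * 4096 bytes");
```

Under these assumptions, the four-level hierarchy above would use up to 2048 bytes for bpf program stacks alone, a quarter of the thread stack. The practical limit on nesting is lower than the bound computed here, because the kernel call path that led to the first program and the frames between programs also live on the same stack.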