Another idea...

Currently the prologue looks like:

  push rbp
  mov rbp, rsp
  sub rsp, stack_depth

How about in the main prog we keep the first two insns, but then set rsp with a single insn to point to the top of our private stack, which should have enough room for stack_of_main_prog + stacks_of_all_subprogs + extra 8k for kfuncs/helpers?

The prologue of all subprogs will stay as-is with the above 3 insns. The epilogue is the same in main prog and subprogs: leave + ret.

Such a stack will look like a typical split stack used in compilers. The obvious advantage is that we don't need to touch r9 or do push/pop, and stack unwind will work just fine.

In the past we discussed something like this, but then we did all 3 insns in the private stack and it was problematic due to IRQs. In this approach the main prog will use up to 512 bytes of kernel stack, but everything that it calls will be in the private stack, and since it doesn't migrate there is no per-cpu memory reuse issue.

Thoughts?
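
To make the sizing concrete, here is a rough sketch of how the private stack size and the value loaded into rsp could be computed. This is not existing JIT code: priv_stack_size(), priv_stack_top(), EXTRA_KFUNC_STACK and PROG_STACK_LIMIT are hypothetical names, and the 512-byte per-prog cap is taken from the "up to 512 bytes" figure above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: the "extra 8k for kfuncs/helpers" from the proposal. */
#define EXTRA_KFUNC_STACK (8 * 1024)
/* Assumption: per-prog stack cap, matching the 512 bytes mentioned above. */
#define PROG_STACK_LIMIT 512

/*
 * Hypothetical helper: total bytes to reserve for the private stack.
 * stack_depth[0] is the main prog, the rest are subprogs.
 */
static size_t priv_stack_size(const uint32_t *stack_depth, size_t nr_progs)
{
	size_t total = EXTRA_KFUNC_STACK;

	for (size_t i = 0; i < nr_progs; i++) {
		assert(stack_depth[i] <= PROG_STACK_LIMIT);
		total += stack_depth[i];
	}
	return total;
}

/*
 * Hypothetical helper: the stack grows down on x86-64, so the value
 * the main prog would load into rsp is the highest address of the
 * private stack area, i.e. base + size.
 */
static uintptr_t priv_stack_top(uintptr_t base, size_t size)
{
	return base + size;
}
```

With a main prog using 512 bytes and two subprogs using 64 and 128 bytes, this reserves 512 + 64 + 128 + 8192 = 8896 bytes, and rsp would be set to base + 8896 in place of the usual "sub rsp, stack_depth".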