On Wed, 2 Oct 2024 at 03:26, Alexei Starovoitov <alexei.starovoitov@xxxxxxxxx> wrote: > > On Tue, Oct 1, 2024 at 5:23 PM Kumar Kartikeya Dwivedi <memxor@xxxxxxxxx> wrote: > > > > Makes sense, though will we have cases where hierarchical scheduling > > attaches the same prog at different points of the hierarchy? > > I'm not sure anyone was asking for such a use case. I wondered because why would you then need a limit of 4 (say instead of disallowing it)? > > > Then the > > limit of 4 may not be enough (e.g. say with cgroup nested levels > 4). > > Well, 4 was the number from TJ. > Ok, then let's assume 4 would be enough. > Anyway the proposed pseudo code: > > __bpf_prog_enter_recur_limited() > { > cnt = this_cpu_inc_return(*(prog->active)); > if (cnt > 4) { > inc_miss > return 0; > } > // pass cnt into bpf prog somehow, like %rdx ? > // or re-read prog->active from prog > } > > > then in the prologue emit: > > push rbp > mov rbp, rsp > if %rdx == 1 > // main prog is called for the first time > mov rsp, pcpu_priv_stack_top > else > // 2+nd time main prog is called or 1+ time subprog > sub rsp, stack_size > if rsp < pcpu_priv_stack_bottom > goto exit // stack is too small, exit > fi I think we need just the second part for subprogs, right? Since rdx is R3 (arg into subprog). I guess that's what you meant in the pseudocode. But otherwise sounds good. The benefit with stack probing is we don't exactly limit to 4 cases. Another option instead of the branch in main prog is to divide in 4 slots (as you said before) and choose the slot based on cnt. But then we're stuck with a max limit of 4. Since we're allocating stack size of bpf + extra (which I guess is 8K?). rdx can be used to pass in the priv_stack address of the right slot. So I think the probing version seems better. We can probably pass in rdx = priv_stack and then test and cmov instead for main prog. > > Since stack bottom/top are known at JIT time we can > generate reliable stack overflow checks. > Much better than guard pages and -fstack-protector. > The prog can alloc percpu > (stack size of main prog + subprogs + extra) * 4 extra will be 8K, I guess (same as kernel stack size)? Just confirming. > and it likely will be enough. > If not, the stack protection will gently exit the prog > when the stack is too deep. I like this stack probing version, since there's no hard limit on the number of recursions, and it's safe against stack overflow as well. > kfunc won't have such a check, so we need a buffer zone. > Can have a guard page too, but feels like overkill. I was leaning toward saying yes for a guard page, since we'll atleast have a hard error instead of random corruption if the kfunc goes beyond the bottom after probing succeeds. But the better way might be doing if rsp < pcpu_priv_stack_bottom + 8K, so we leave max headroom we reserve for kernel stuff (or say add 4K instead, which should be good enough), and then skip execution.