Re: yet another approach Was: [PATCH bpf-next v3 4/5] bpf, x86: Add jit support for private stack

On 10/1/24 11:28 PM, Yonghong Song wrote:

On 10/1/24 7:16 PM, Kumar Kartikeya Dwivedi wrote:
On Wed, 2 Oct 2024 at 03:26, Alexei Starovoitov
<alexei.starovoitov@xxxxxxxxx> wrote:
On Tue, Oct 1, 2024 at 5:23 PM Kumar Kartikeya Dwivedi <memxor@xxxxxxxxx> wrote:
Makes sense, though will we have cases where hierarchical scheduling
attaches the same prog at different points of the hierarchy?
I'm not sure anyone was asking for such a use case.
I wondered because why would you then need a limit of 4 (say instead
of disallowing it)?

Then the
limit of 4 may not be enough (e.g. say with cgroup nested levels > 4).
Well, 4 was the number from TJ.

Ok, then let's assume 4 would be enough.

Anyway the proposed pseudo code:

__bpf_prog_enter_recur_limited()
{
   cnt = this_cpu_inc_return(*(prog->active));
   if (cnt > 4) {
      inc_miss
      return 0;
   }
  // pass cnt into bpf prog somehow, like %rdx ?
  // or re-read prog->active from prog
}
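
As a rough illustration of the pseudo code above, here is a user-space C model of the recursion-limited enter path. It is only a sketch: struct prog_model, model_enter_recur_limited() and model_exit_recur_limited() are hypothetical names, a plain int stands in for the per-CPU prog->active counter, and the exit helper that always undoes the increment is an assumption rather than something stated in the thread.

```c
#include <assert.h>
#include <stdint.h>

#define MAX_RECURSION 4   /* the "4" discussed in the thread */

/* Hypothetical stand-in for a BPF prog; 'active' models one CPU's counter. */
struct prog_model {
    int active;
    uint64_t misses;
};

/* Returns the nesting level (1..MAX_RECURSION), or 0 if the run is skipped. */
static int model_enter_recur_limited(struct prog_model *p)
{
    int cnt = ++p->active;          /* models this_cpu_inc_return() */

    if (cnt > MAX_RECURSION) {
        p->misses++;                /* models inc_miss */
        return 0;
    }
    return cnt;                     /* would be handed to the prog, e.g. in rdx */
}

/* Assumed counterpart: undoes the increment, even after a skipped run. */
static void model_exit_recur_limited(struct prog_model *p)
{
    assert(p->active > 0);
    p->active--;
}
```

In this model the fifth nested enter returns 0 and bumps the miss counter, while the first four return their nesting level, which is the value the prologue would then test.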


then in the prologue emit:

push rbp
mov rbp, rsp
if %rdx == 1
    // main prog is called for the first time
    mov rsp, pcpu_priv_stack_top

This sounds good at a high level. I still need to figure out
the 'if %rdx == 1' part and how to implement it.

Okay, looks like trampoline could supply rdx == 1.


else
    // main prog called for the 2nd+ time, or any subprog call
   sub rsp, stack_size
   if rsp < pcpu_priv_stack_bottom
     goto exit  // stack is too small, exit
fi
I think we need just the second part for subprogs, right?
Since rdx is R3 (arg into subprog).
I guess that's what you meant in the pseudocode.
But otherwise sounds good.
The benefit with stack probing is we don't exactly limit to 4 cases.
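
The prologue logic discussed above can be modelled in plain C to make the probing rule concrete. This is a sketch under stated assumptions, not the JIT output: struct priv_stack and model_prologue() are hypothetical names, the stack grows downward, and the first entry is assumed to reserve its frame just below the top, which the pseudo code leaves implicit. A subprog call is modelled by passing any cnt other than 1.

```c
#include <assert.h>
#include <stdint.h>

struct priv_stack {
    uintptr_t top;      /* highest address, models pcpu_priv_stack_top */
    uintptr_t bottom;   /* lowest address, models pcpu_priv_stack_bottom */
};

/* Returns the new stack pointer, or 0 when the prog should exit. */
static uintptr_t model_prologue(const struct priv_stack *ps, int cnt,
                                uintptr_t rsp, uintptr_t stack_size)
{
    assert(ps->bottom <= ps->top);

    if (cnt == 1)
        rsp = ps->top;          /* first entry: switch to the private stack */

    /* wraparound-safe form of: rsp - stack_size < bottom */
    if (rsp < ps->bottom || rsp - ps->bottom < stack_size)
        return 0;               /* stack is too small, exit */

    return rsp - stack_size;
}
```

Because the only limit is the remaining space, the number of nested runs that fit depends on stack_size rather than on a fixed count of 4.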

Another option instead of the branch in main prog is to divide in 4
slots (as you said before) and choose the slot based on cnt.
But then we're stuck with a max limit of 4. Since we're allocating
stack size of bpf + extra (which I guess is 8K?). rdx can be used to
pass in the priv_stack address of the right slot.

So I think the probing version seems better. We can probably pass in
rdx = priv_stack and then test and cmov instead for main prog.

Yes, we do not need to limit to 4, checking rsp < pcpu_priv_stack_bottom
should be okay.


Since stack bottom/top are known at JIT time we can
generate reliable stack overflow checks.
Much better than guard pages and -fstack-protector.
The prog can alloc percpu
(stack size of main prog + subprogs + extra) * 4
extra will be 8K, I guess (same as kernel stack size)?
Just confirming.

and it likely will be enough.
If not, the stack protection will gently exit the prog
when the stack is too deep.
I like this stack probing version, since there's no hard limit on the
number of recursions, and it's safe against stack overflow as well.
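
For reference, the per-CPU allocation size quoted above works out as follows. This is only the arithmetic from the thread written as C; priv_stack_alloc_size() and the two macros are hypothetical names, and the 8K value for "extra" is the guess made in the discussion, not a confirmed number.

```c
#include <assert.h>
#include <stddef.h>

#define PRIV_STACK_LEVELS 4u            /* nesting levels assumed in the thread */
#define PRIV_STACK_EXTRA  (8u * 1024u)  /* "extra", guessed as 8K in the thread */

/* (stack size of main prog + subprogs + extra) * 4, in bytes. */
static size_t priv_stack_alloc_size(size_t main_depth, size_t subprogs_depth)
{
    return (main_depth + subprogs_depth + PRIV_STACK_EXTRA) * PRIV_STACK_LEVELS;
}
```

For example, a main prog using 512 bytes with 256 bytes of subprog stack would need (512 + 256 + 8192) * 4 = 35840 bytes per CPU under these assumptions.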

kfunc won't have such a check, so we need a buffer zone.
Can have a guard page too, but feels like overkill.
I was leaning toward saying yes to a guard page, since we'll at least
have a hard error instead of random corruption if the kfunc goes
beyond the bottom after probing succeeds.

But the better way might be to check rsp < pcpu_priv_stack_bottom +
8K, so we keep the max headroom we reserve for kernel stuff (or say add
4K instead, which should be good enough), and skip execution otherwise.
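
The headroom variant of the check can be sketched in C as below. probe_with_headroom() is a hypothetical helper, and the headroom value (8K or 4K) is the open question from the discussion, so it is left as a parameter.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* True when the prog may run: at least 'headroom' bytes remain below rsp. */
static bool probe_with_headroom(uintptr_t rsp, uintptr_t bottom,
                                uintptr_t headroom)
{
    if (rsp < bottom)
        return false;                   /* already past the bottom */

    /* same condition as !(rsp < bottom + headroom), without wraparound */
    return rsp - bottom >= headroom;
}
```

With this form, a kfunc called after a successful probe still has the full headroom available before it could run past the bottom of the private stack.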




