Re: yet another approach Was: [PATCH bpf-next v3 4/5] bpf, x86: Add jit support for private stack

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Wed, 2 Oct 2024 at 03:26, Alexei Starovoitov
<alexei.starovoitov@xxxxxxxxx> wrote:
>
> On Tue, Oct 1, 2024 at 5:23 PM Kumar Kartikeya Dwivedi <memxor@xxxxxxxxx> wrote:
> >
> > Makes sense, though will we have cases where hierarchical scheduling
> > attaches the same prog at different points of the hierarchy?
>
> I'm not sure anyone was asking for such a use case.

I wondered because why would you then need a limit of 4 (say instead
of disallowing it)?

>
> > Then the
> > limit of 4 may not be enough (e.g. say with cgroup nested levels > 4).
>
> Well, 4 was the number from TJ.
>

Ok, then let's assume 4 would be enough.

> Anyway the proposed pseudo code:
>
> __bpf_prog_enter_recur_limited()
> {
>   cnt = this_cpu_inc_return(*(prog->active));
>   if (cnt > 4) {
>      inc_miss
>      return 0;
>   }
>  // pass cnt into bpf prog somehow, like %rdx ?
>  // or re-read prog->active from prog
> }
>
>
> then in the prologue emit:
>
> push rbp
> mov rbp, rsp
> if %rdx == 1
>    // main prog is called for the first time
>    mov rsp, pcpu_priv_stack_top
> else
>    // 2+nd time main prog is called or 1+ time subprog
>   sub rsp, stack_size
>   if rsp < pcpu_priv_stack_bottom
>     goto exit  // stack is too small, exit
> fi

I think we need just the second part for subprogs, right?
Since rdx is R3 (arg into subprog).
I guess that's what you meant in the pseudocode.
But otherwise sounds good.
The benefit with stack probing is we don't exactly limit to 4 cases.

Another option instead of the branch in main prog is to divide in 4
slots (as you said before) and choose the slot based on cnt.
But then we're stuck with a max limit of 4. Since we're allocating
stack size of bpf + extra (which I guess is 8K?). rdx can be used to
pass in the priv_stack address of the right slot.

So I think the probing version seems better. We can probably pass in
rdx = priv_stack and then test and cmov instead for main prog.

>
> Since stack bottom/top are known at JIT time we can
> generate reliable stack overflow checks.
> Much better than guard pages and -fstack-protector.
> The prog can alloc percpu
> (stack size of main prog + subprogs + extra) * 4

extra will be 8K, I guess (same as kernel stack size)?
Just confirming.

> and it likely will be enough.
> If not, the stack protection will gently exit the prog
> when the stack is too deep.

I like this stack probing version, since there's no hard limit on the
number of recursions, and it's safe against stack overflow as well.

> kfunc won't have such a check, so we need a buffer zone.
> Can have a guard page too, but feels like overkill.

I was leaning toward saying yes for a guard page, since we'll atleast
have a hard error instead of random corruption if the kfunc goes
beyond the bottom after probing succeeds.

But the better way might be doing if rsp < pcpu_priv_stack_bottom +
8K, so we leave max headroom we reserve for kernel stuff (or say add
4K instead, which should be good enough), and then skip execution.





[Index of Archives]     [Linux Samsung SoC]     [Linux Rockchip SoC]     [Linux Actions SoC]     [Linux for Synopsys ARC Processors]     [Linux NFS]     [Linux NILFS]     [Linux USB Devel]     [Video for Linux]     [Linux Audio Users]     [Yosemite News]     [Linux Kernel]     [Linux SCSI]


  Powered by Linux