Re: yet another approach Was: [PATCH bpf-next v3 4/5] bpf, x86: Add jit support for private stack

Kumar Kartikeya Dwivedi <memxor@xxxxxxxxx> · Wed, 9 Oct 2024 19:37:53 +0200

On Wed, 9 Oct 2024 at 18:38, Kumar Kartikeya Dwivedi <memxor@xxxxxxxxx> wrote:
>
> On Wed, 9 Oct 2024 at 18:36, Kumar Kartikeya Dwivedi <memxor@xxxxxxxxx> wrote:
> >
> > On Wed, 9 Oct 2024 at 04:06, Alexei Starovoitov
> > <alexei.starovoitov@xxxxxxxxx> wrote:
> > >
> > > On Tue, Oct 8, 2024 at 3:10 PM Alexei Starovoitov
> > > <alexei.starovoitov@xxxxxxxxx> wrote:
> > > >
> > > > We need to scrap this idea.
> > > > Let's go back to push/pop r11 around calls :(
> > >
> > > I didn't give up :)
> > >
> > > Here is a new idea that seems to work:
> > >
> > > [  131.472066]  dump_stack_lvl+0x53/0x70
> > > [  131.472066]  bpf_task_storage_get+0x3e/0x2f0
> > > [  131.472066]  ? bpf_task_storage_get+0x231/0x2f0
> > > [  131.472066]  bpf_prog_ed7a5f33cc9fefab_foo+0x30/0x32
> > > [  131.472066]  bpf_prog_8c4f9bc79da6c27e_socket_post_create+0x68/0x6d
> > > ...
> > > [  131.417145]  dump_stack_lvl+0x53/0x70
> > > [  131.417145]  bpf_task_storage_get+0x3e/0x2f0
> > > [  131.417145]  ? selinux_netlbl_socket_post_create+0xab/0x150
> > > [  131.417145]  bpf_prog_8c4f9bc79da6c27e_socket_post_create+0x60/0x6d
> > >
> > >
> > > The stack dump works fine out of main prog and out of subprog.
> > >
> > > The key difference it to pretend to have stack_depth=0,
> > > so there is no adjustment to %rsp,
> > > but point %rbp to per-cpu private stack and grow it _up_.
> > >
> > > For the main prog %rbp points to the bottom of priv stack
> > > plus stack_depth it needs,
> > > so all bpf insns that do r10-off access the bottom of that priv stack.
> > > When subprog is called it does 'add %rbp, its_stack_depth' and
> > > in turn it's using memory above the bottom of the priv stack.
> > >
> > > That seems to work, but exceptions and tailcalls are broken.
> >
> > I fixed exceptions, the reason it breaks is because we:
> > We get rsp and rbp for the main frame from unwinding.
> > rsp has undergone subtraction for: stack depth, push r12, push callee regs.
> >
> > When setting up the frame for exception cb, we need to pop saved
> > registers from stack and then 'reset stack frame' using mov rsp, rbp.
> > That effectively undoes the subtraction that happened for stack depth,
> > and at this point rsp == rbp. Then the verifier will set up the frame
> > for exception cb like it does normally for any prog: subtract stack
> > depth, and push callee saved regs.
> >
> > Now all of this was ok before, but this patch makes two changes:
> > stack_depth is not subtracted, and rbp is a per-cpu stack pointer.
> >
> > Therefore, at the top of the stack is just the callee saved regs and r12.
> > After popping those, it will be equal to the original rbp which was
> > overwritten with per-cpu stack pointer.
> >
> > Doing mov rsp, rbp for this patch will reset rsp to per-cpu stack
> > pointer. Instead, we do mov rbp, rsp. This restores the rbp to kernel
> > stack pointer, and then the subsequent leave etc. return control back
> > into the kernel.
> >
> > At least this seems to make everything work fine, and things no longer
> > crash, and it looks sane etc.
> >
> > I will dig into the tail call case a bit later, but most likely it's a
> > variation of this problem.
>
> FYI, this is the fix.
>
> diff --git a/arch/x86/net/bpf_jit_comp.c b/arch/x86/net/bpf_jit_comp.c
> index 268f7d37466c..02267adee14b 100644
> --- a/arch/x86/net/bpf_jit_comp.c
> +++ b/arch/x86/net/bpf_jit_comp.c
> @@ -523,7 +523,7 @@ static void emit_prologue(struct bpf_prog
> *bpf_prog, u8 **pprog, u32 stack_depth
>                 pop_callee_regs(&prog, all_callee_regs_used);
>                 pop_r12(&prog);
>                 /* Reset the stack frame. */
> -               EMIT3(0x48, 0x89, 0xEC); /* mov rsp, rbp */
> +               EMIT3(0x48, 0x89, 0xE5); /* mov rbp, rsp */
>         } else {
>                 EMIT1(0x55);             /* push rbp */
>                 if (tail_call_reachable || !bpf_prog->aux->priv_stack_ptr) {
>
> >

Figured out the problem with tail calls as well.
It was actually incorrect X86_TAIL_CALL_OFFSET, since a tail call
target using private stack will emit different instructions (to set up
private stack) now.
When it is actually called it has to skip over all that stuff, but the
offset to skip past the endbr is incorrect now.
Figured out since the page fault was happening to access the address
equal to per-cpu offset (say X) and the faulting instruction was in
the middle of add r11, gs:X.

Calculation is:
5 for emit_nops, 3 for emit_nops of 3 bytes (for alignment), 1 for
push rbp, and 22 bytes for the sequence to set up the private stack.
This, + ENDBR_INSN_SIZE.

diff --git a/arch/x86/net/bpf_jit_comp.c b/arch/x86/net/bpf_jit_comp.c
index 268f7d37466c..004e3ba1d338 100644
--- a/arch/x86/net/bpf_jit_comp.c
+++ b/arch/x86/net/bpf_jit_comp.c
@@ -323,7 +323,7 @@ struct jit_context {
 /* Number of bytes emit_patch() needs to generate instructions */
 #define X86_PATCH_SIZE         5
 /* Number of bytes that will be skipped on tailcall */
-#define X86_TAIL_CALL_OFFSET   (12 + ENDBR_INSN_SIZE)
+#define X86_TAIL_CALL_OFFSET   (5 + 3 + 1 + 22 + ENDBR_INSN_SIZE)

 static void push_r12(u8 **pprog)

 {

> > > I ran out of time today to debug.
> > > Pls see the attached patch.