On Wed, 9 Oct 2024 at 18:36, Kumar Kartikeya Dwivedi <memxor@xxxxxxxxx> wrote: > > On Wed, 9 Oct 2024 at 04:06, Alexei Starovoitov > <alexei.starovoitov@xxxxxxxxx> wrote: > > > > On Tue, Oct 8, 2024 at 3:10 PM Alexei Starovoitov > > <alexei.starovoitov@xxxxxxxxx> wrote: > > > > > > We need to scrap this idea. > > > Let's go back to push/pop r11 around calls :( > > > > I didn't give up :) > > > > Here is a new idea that seems to work: > > > > [ 131.472066] dump_stack_lvl+0x53/0x70 > > [ 131.472066] bpf_task_storage_get+0x3e/0x2f0 > > [ 131.472066] ? bpf_task_storage_get+0x231/0x2f0 > > [ 131.472066] bpf_prog_ed7a5f33cc9fefab_foo+0x30/0x32 > > [ 131.472066] bpf_prog_8c4f9bc79da6c27e_socket_post_create+0x68/0x6d > > ... > > [ 131.417145] dump_stack_lvl+0x53/0x70 > > [ 131.417145] bpf_task_storage_get+0x3e/0x2f0 > > [ 131.417145] ? selinux_netlbl_socket_post_create+0xab/0x150 > > [ 131.417145] bpf_prog_8c4f9bc79da6c27e_socket_post_create+0x60/0x6d > > > > > > The stack dump works fine out of main prog and out of subprog. > > > > The key difference it to pretend to have stack_depth=0, > > so there is no adjustment to %rsp, > > but point %rbp to per-cpu private stack and grow it _up_. > > > > For the main prog %rbp points to the bottom of priv stack > > plus stack_depth it needs, > > so all bpf insns that do r10-off access the bottom of that priv stack. > > When subprog is called it does 'add %rbp, its_stack_depth' and > > in turn it's using memory above the bottom of the priv stack. > > > > That seems to work, but exceptions and tailcalls are broken. > > I fixed exceptions, the reason it breaks is because we: > We get rsp and rbp for the main frame from unwinding. > rsp has undergone subtraction for: stack depth, push r12, push callee regs. > > When setting up the frame for exception cb, we need to pop saved > registers from stack and then 'reset stack frame' using mov rsp, rbp. > That effectively undoes the subtraction that happened for stack depth, > and at this point rsp == rbp. Then the verifier will set up the frame > for exception cb like it does normally for any prog: subtract stack > depth, and push callee saved regs. > > Now all of this was ok before, but this patch makes two changes: > stack_depth is not subtracted, and rbp is a per-cpu stack pointer. > > Therefore, at the top of the stack is just the callee saved regs and r12. > After popping those, it will be equal to the original rbp which was > overwritten with per-cpu stack pointer. > > Doing mov rsp, rbp for this patch will reset rsp to per-cpu stack > pointer. Instead, we do mov rbp, rsp. This restores the rbp to kernel > stack pointer, and then the subsequent leave etc. return control back > into the kernel. > > At least this seems to make everything work fine, and things no longer > crash, and it looks sane etc. > > I will dig into the tail call case a bit later, but most likely it's a > variation of this problem. FYI, this is the fix. diff --git a/arch/x86/net/bpf_jit_comp.c b/arch/x86/net/bpf_jit_comp.c index 268f7d37466c..02267adee14b 100644 --- a/arch/x86/net/bpf_jit_comp.c +++ b/arch/x86/net/bpf_jit_comp.c @@ -523,7 +523,7 @@ static void emit_prologue(struct bpf_prog *bpf_prog, u8 **pprog, u32 stack_depth pop_callee_regs(&prog, all_callee_regs_used); pop_r12(&prog); /* Reset the stack frame. */ - EMIT3(0x48, 0x89, 0xEC); /* mov rsp, rbp */ + EMIT3(0x48, 0x89, 0xE5); /* mov rbp, rsp */ } else { EMIT1(0x55); /* push rbp */ if (tail_call_reachable || !bpf_prog->aux->priv_stack_ptr) { > > > I ran out of time today to debug. > > Pls see the attached patch.