On Wed, 9 Oct 2024 at 04:06, Alexei Starovoitov <alexei.starovoitov@xxxxxxxxx> wrote: > > On Tue, Oct 8, 2024 at 3:10 PM Alexei Starovoitov > <alexei.starovoitov@xxxxxxxxx> wrote: > > > > We need to scrap this idea. > > Let's go back to push/pop r11 around calls :( > > I didn't give up :) > > Here is a new idea that seems to work: > > [ 131.472066] dump_stack_lvl+0x53/0x70 > [ 131.472066] bpf_task_storage_get+0x3e/0x2f0 > [ 131.472066] ? bpf_task_storage_get+0x231/0x2f0 > [ 131.472066] bpf_prog_ed7a5f33cc9fefab_foo+0x30/0x32 > [ 131.472066] bpf_prog_8c4f9bc79da6c27e_socket_post_create+0x68/0x6d > ... > [ 131.417145] dump_stack_lvl+0x53/0x70 > [ 131.417145] bpf_task_storage_get+0x3e/0x2f0 > [ 131.417145] ? selinux_netlbl_socket_post_create+0xab/0x150 > [ 131.417145] bpf_prog_8c4f9bc79da6c27e_socket_post_create+0x60/0x6d > > > The stack dump works fine out of main prog and out of subprog. > > The key difference it to pretend to have stack_depth=0, > so there is no adjustment to %rsp, > but point %rbp to per-cpu private stack and grow it _up_. > > For the main prog %rbp points to the bottom of priv stack > plus stack_depth it needs, > so all bpf insns that do r10-off access the bottom of that priv stack. > When subprog is called it does 'add %rbp, its_stack_depth' and > in turn it's using memory above the bottom of the priv stack. > > That seems to work, but exceptions and tailcalls are broken. I fixed exceptions, the reason it breaks is because we: We get rsp and rbp for the main frame from unwinding. rsp has undergone subtraction for: stack depth, push r12, push callee regs. When setting up the frame for exception cb, we need to pop saved registers from stack and then 'reset stack frame' using mov rsp, rbp. That effectively undoes the subtraction that happened for stack depth, and at this point rsp == rbp. Then the verifier will set up the frame for exception cb like it does normally for any prog: subtract stack depth, and push callee saved regs. Now all of this was ok before, but this patch makes two changes: stack_depth is not subtracted, and rbp is a per-cpu stack pointer. Therefore, at the top of the stack is just the callee saved regs and r12. After popping those, it will be equal to the original rbp which was overwritten with per-cpu stack pointer. Doing mov rsp, rbp for this patch will reset rsp to per-cpu stack pointer. Instead, we do mov rbp, rsp. This restores the rbp to kernel stack pointer, and then the subsequent leave etc. return control back into the kernel. At least this seems to make everything work fine, and things no longer crash, and it looks sane etc. I will dig into the tail call case a bit later, but most likely it's a variation of this problem. > I ran out of time today to debug. > Pls see the attached patch.