On 10/9/24 7:56 AM, Alexei Starovoitov wrote:
On Tue, Oct 8, 2024 at 11:31 PM Yonghong Song <yonghong.song@xxxxxxxxx> wrote:
On 10/8/24 7:06 PM, Alexei Starovoitov wrote:
On Tue, Oct 8, 2024 at 3:10 PM Alexei Starovoitov
<alexei.starovoitov@xxxxxxxxx> wrote:
We need to scrap this idea.
Let's go back to push/pop r11 around calls :(
I didn't give up :)
Here is a new idea that seems to work:
[ 131.472066] dump_stack_lvl+0x53/0x70
[ 131.472066] bpf_task_storage_get+0x3e/0x2f0
[ 131.472066] ? bpf_task_storage_get+0x231/0x2f0
[ 131.472066] bpf_prog_ed7a5f33cc9fefab_foo+0x30/0x32
[ 131.472066] bpf_prog_8c4f9bc79da6c27e_socket_post_create+0x68/0x6d
...
[ 131.417145] dump_stack_lvl+0x53/0x70
[ 131.417145] bpf_task_storage_get+0x3e/0x2f0
[ 131.417145] ? selinux_netlbl_socket_post_create+0xab/0x150
[ 131.417145] bpf_prog_8c4f9bc79da6c27e_socket_post_create+0x60/0x6d
The stack dump works fine out of main prog and out of subprog.
The key difference it to pretend to have stack_depth=0,
so there is no adjustment to %rsp,
but point %rbp to per-cpu private stack and grow it _up_.
For the main prog %rbp points to the bottom of priv stack
plus stack_depth it needs,
so all bpf insns that do r10-off access the bottom of that priv stack.
When subprog is called it does 'add %rbp, its_stack_depth' and
in turn it's using memory above the bottom of the priv stack.
That seems to work, but exceptions and tailcalls are broken.
I ran out of time today to debug.
Pls see the attached patch.
The core part of the code is below:
EMIT1(0x55); /* push rbp */ - EMIT3(0x48, 0x89, 0xE5); /* mov rbp, rsp
*/ + if (tail_call_reachable || !bpf_prog->aux->priv_stack_ptr) { +
EMIT3(0x48, 0x89, 0xE5); /* mov rbp, rsp */ + } else { + if
(!is_subprog) { + /* mov rsp, pcpu_priv_stack_bottom */ + void __percpu
*priv_frame_ptr = + bpf_prog->aux->priv_stack_ptr +
round_up(stack_depth, 8); + + /* movabs sp, priv_frame_ptr */ +
emit_mov_imm64(&prog, AUX_REG, (long) priv_frame_ptr >> 32, + (u32)
(long) priv_frame_ptr); + + /* add <aux_reg>, gs:[<off>] */ +
EMIT2(0x65, 0x4c); + EMIT3(0x03, 0x1c, 0x25); + EMIT((u32)(unsigned
long)&this_cpu_off, 4); + /* mov rbp, aux_reg */ + EMIT3(0x4c, 0x89,
0xdd); + } else { + /* add rbp, stack_depth */ + EMIT3_off32(0x48, 0x81,
0xC5, round_up(stack_depth, 8)); + } + }
your mailer garbled the diff.
Sorry, I just copy-paste from your attached code. It shows properly
when I send email. I guess, I need to ensure I use proper format
in my editor.
So for main program, we have
push rbp rbp = per_cpu_ptr(priv_stack_ptr + stack_size) ... What will
happen we have an interrupt like below? push rbp rbp =
per_cpu_ptr(priv_stack_ptr + stack_size) <=== interrupt happens here ...
If we need to dump the stack trace at interrupt point then unwinder may
have difficulty to find the proper stack trace since *rbp is a arbitrary
value and *(rbp + 8) will not have proper func return address. Does this
make sense?
Hard to read above... but I think you're saying that rbp will point
Sorry again. Formating issue again.
to priv stack, irq happens and unwinder cannot work ?
Yes. I was also expecting it to break, but orc unwinder
with fallback to fp somehow did it correctly. See above stack dumps.
For the top frame the unwinder starts from SP, so it's fine,
but for the subprog 'foo' above the 'push rbp' pushes the
addr of priv stack, so the chain should be broken,
but the printed stack is correct, so I'm puzzled why it worked :)
We still have issues here. With 'rbp = ...' approach, I got
stack:
[ 53.429814] Call Trace:
[ 53.430177] <TASK>
[ 53.430498] dump_stack_lvl+0x52/0x70
[ 53.431067] bpf_task_storage_get+0x41/0x120
[ 53.431680] bpf_prog_71392c3ef5437fd9_foo+0x30/0x32
[ 53.432404] bpf_prog_8c4f9bc79da6c27e_socket_post_create+0x68/0x6d
[ 53.433241] ? bpf_trampoline_6442549714+0x68/0x10d
[ 53.433879] ? bpf_lsm_socket_post_create+0x9/0x20
[ 53.434512] ? security_socket_post_create+0x6e/0xd0
[ 53.435166] ? __sock_create+0x19e/0x2d0
[ 53.435686] ? __sys_socket+0x56/0xd0
[ 53.436176] ? __x64_sys_socket+0x19/0x30
[ 53.436702] ? do_syscall_64+0x58/0xf0
[ 53.437201] ? clear_bhb_loop+0x45/0xa0
[ 53.437746] ? entry_SYSCALL_64_after_hwframe+0x76/0x7e
[ 53.438488] </TASK>
With the original kernel plus the following hack:
--- a/kernel/bpf/bpf_task_storage.c
+++ b/kernel/bpf/bpf_task_storage.c
@@ -255,6 +255,7 @@ BPF_CALL_5(bpf_task_storage_get, struct bpf_map *, map, struct task_struct *,
if (flags & ~BPF_LOCAL_STORAGE_GET_F_CREATE || !task)
return (unsigned long)NULL;
+ dump_stack();
bpf_task_storage_lock();
data = __bpf_task_storage_get(map, task, value, flags,
gfp_flags, true);
I got stack trace:
[ 32.146519] Call Trace:
[ 32.146979] <TASK>
[ 32.147356] dump_stack_lvl+0x52/0x70
[ 32.147984] bpf_task_storage_get+0x41/0x120
[ 32.148741] bpf_prog_3c50a12b50fe949a_socket_post_create+0x5d/0xaa
[ 32.149844] bpf_trampoline_6442512791+0x68/0x10d
[ 32.150679] bpf_lsm_socket_post_create+0x9/0x20
[ 32.151451] security_socket_post_create+0x6e/0xd0
[ 32.152320] __sock_create+0x19e/0x2d0
[ 32.153059] __sys_socket+0x56/0xd0
[ 32.153779] __x64_sys_socket+0x19/0x30
[ 32.154561] do_syscall_64+0x58/0xf0
[ 32.155225] ? clear_bhb_loop+0x45/0xa0
[ 32.155970] entry_SYSCALL_64_after_hwframe+0x76/0x7e
[ 32.156864] RIP: 0033:0x7f580d11385b
[ 32.157554] Code: 8b 54 24 08 64 48 2b 14 25 28 00 00 00 75 05 48 83 c4 18 c3 67 e8 65 d0 00 00 0f 1f 44 00 00 f3 0f 1e fa b8 29 00 00 00 0f 05 <48> 3d 8
[ 32.160990] RSP: 002b:00007f58005ffea8 EFLAGS: 00000246 ORIG_RAX: 0000000000000029
[ 32.162500] RAX: ffffffffffffffda RBX: 00007f5800600cdc RCX: 00007f580d11385b
[ 32.163907] RDX: 0000000000000000 RSI: 0000000000000001 RDI: 0000000000000002
[ 32.165292] RBP: 00007f58005ffed0 R08: 0000000000000000 R09: 00007f58006006c0
[ 32.166608] R10: 0000000000000008 R11: 0000000000000246 R12: ffffffffffffff80
[ 32.167898] R13: 0000000000000002 R14: 00007ffc461fba30 R15: 00007f57ffe00000
[ 32.169119] </TASK>
The difference is after bpf prog, the kernel stack trace
does not have '?' while with private stack and 'rbp = priv_stack_ptr'
approach, we have '?'.
The reason is that for private stack, when unwinder find the 'rbp', it
is not able to find the previous frame return address and previous proper 'rbp'.