Salvatore Benedetto reported an issue that when doing syscall tracepoint
tracing the kernel stack is empty. For example, using the following
command line

  bpftrace -e 'tracepoint:syscalls:sys_enter_read { print("Kernel Stack\n"); print(kstack()); }'

the output will be

  ===
  Kernel Stack
  ===

Further analysis shows that the pt_regs used for bpf syscall tracepoint
tracing is the one constructed during the user->kernel transition. The
call stack looks like

  perf_syscall_enter+0x88/0x7c0
  trace_sys_enter+0x41/0x80
  syscall_trace_enter+0x100/0x160
  do_syscall_64+0x38/0xf0
  entry_SYSCALL_64_after_hwframe+0x76/0x7e

The ip address stored in pt_regs is a user-space address, hence no
kernel stack is printed.

To fix the issue, a kernel address needs to be used for pt_regs. There
are already a few cases like this in the kernel repo. For example, in
kernel/trace/bpf_trace.c, several perf_fetch_caller_regs(fake_regs_ptr)
instances are used to supply the ip address or to use the ip address to
construct the call stack.

The patch follows the above example by using a fake pt_regs. The
pt_regs is stored on the local stack since syscall tracepoint tracing
runs in process context, so there is no possibility that different
concurrent syscall tracepoint tracings could interfere with each other.
This is similar to a perf_fetch_caller_regs() use case in
kernel/trace/trace_event_perf.c, in function perf_ftrace_function_call(),
where a local pt_regs is used.

With this patch, for the above bpftrace script, I got the following
output

  ===
  Kernel Stack
        syscall_trace_enter+407
        syscall_trace_enter+407
        do_syscall_64+74
        entry_SYSCALL_64_after_hwframe+75
  ===

Reported-by: Salvatore Benedetto <salvabenedetto@xxxxxxxx>
Suggested-by: Andrii Nakryiko <andrii@xxxxxxxxxx>
Signed-off-by: Yonghong Song <yonghong.song@xxxxxxxxx>
---
 kernel/trace/trace_syscalls.c | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/kernel/trace/trace_syscalls.c b/kernel/trace/trace_syscalls.c
index 9c581d6da843..063f51952d49 100644
--- a/kernel/trace/trace_syscalls.c
+++ b/kernel/trace/trace_syscalls.c
@@ -559,12 +559,15 @@ static int perf_call_bpf_enter(struct trace_event_call *call, struct pt_regs *re
 		int syscall_nr;
 		unsigned long args[SYSCALL_DEFINE_MAXARGS];
 	} __aligned(8) param;
+	struct pt_regs fake_regs;
 	int i;
 
 	BUILD_BUG_ON(sizeof(param.ent) < sizeof(void *));
 
 	/* bpf prog requires 'regs' to be the first member in the ctx (a.k.a. &param) */
-	*(struct pt_regs **)&param = regs;
+	memset(&fake_regs, 0, sizeof(fake_regs));
+	perf_fetch_caller_regs(&fake_regs);
+	*(struct pt_regs **)&param = &fake_regs;
 	param.syscall_nr = rec->nr;
 	for (i = 0; i < sys_data->nb_args; i++)
 		param.args[i] = rec->args[i];
-- 
2.43.5
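
For readers unfamiliar with the idiom, below is a minimal sketch of the
perf_fetch_caller_regs() pattern the patch relies on. It is illustrative
only: example_tp_probe() and its surroundings are hypothetical, not code
from the tree. The point is that a process-context probe can keep a
pt_regs on its local stack and have perf_fetch_caller_regs() fill it
with the caller's kernel ip/sp, so stack walkers start unwinding from a
kernel address instead of the user-space registers saved at syscall
entry.

  /*
   * Illustrative sketch only; example_tp_probe() is hypothetical.
   * The probe runs in process context, so a pt_regs on the local stack
   * cannot be clobbered by a concurrent invocation for another task.
   */
  #include <linux/perf_event.h>
  #include <linux/ptrace.h>
  #include <linux/string.h>

  static void example_tp_probe(void)
  {
  	struct pt_regs fake_regs;

  	memset(&fake_regs, 0, sizeof(fake_regs));
  	/* capture the caller's kernel ip/sp (e.g. regs->ip on x86) */
  	perf_fetch_caller_regs(&fake_regs);

  	/* hand &fake_regs to a consumer that needs a kernel unwind start */
  }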