On Mon, Sep 9, 2024 at 8:43 PM Yonghong Song <yonghong.song@xxxxxxxxx> wrote:
>
> Salvatore Benedetto reported an issue where the kernel stack is empty
> when doing syscall tracepoint tracing. For example, with the following
> command line
>
>   bpftrace -e 'tracepoint:syscalls:sys_enter_read { print("Kernel Stack\n"); print(kstack()); }'
>
> the output will be
>
>   ===
>   Kernel Stack
>   ===
>
> Further analysis shows that the pt_regs used for bpf syscall tracepoint
> tracing is the one constructed during the user->kernel transition.
> The call stack looks like
>
>   perf_syscall_enter+0x88/0x7c0
>   trace_sys_enter+0x41/0x80
>   syscall_trace_enter+0x100/0x160
>   do_syscall_64+0x38/0xf0
>   entry_SYSCALL_64_after_hwframe+0x76/0x7e
>
> The ip address stored in pt_regs is from user space, hence no kernel
> stack is printed.
>
> To fix the issue, we need to use a kernel address in pt_regs.
> There are already a few cases like this in the kernel repo. For example,
> in kernel/trace/bpf_trace.c, several perf_fetch_caller_regs(fake_regs_ptr)
> instances are used to supply the ip address or to construct a call stack.
>
> This patch follows the above example by using a fake pt_regs.
> The pt_regs is stored on the local stack, since syscall tracepoint
> tracing runs in process context and there is no possibility that
> concurrent syscall tracepoint tracing could interfere with each
> other. This is similar to a perf_fetch_caller_regs() use case in
> kernel/trace/trace_event_perf.c, in function perf_ftrace_function_call(),
> where a local pt_regs is used.
>
> With this patch, for the above bpftrace script, I got the following output
>
>   ===
>   Kernel Stack
>
>   syscall_trace_enter+407
>   syscall_trace_enter+407
>   do_syscall_64+74
>   entry_SYSCALL_64_after_hwframe+75
>   ===
>
> Reported-by: Salvatore Benedetto <salvabenedetto@xxxxxxxx>
> Suggested-by: Andrii Nakryiko <andrii@xxxxxxxxxx>
> Signed-off-by: Yonghong Song <yonghong.song@xxxxxxxxx>
> ---
>  kernel/trace/trace_syscalls.c | 5 ++++-
>  1 file changed, 4 insertions(+), 1 deletion(-)
>

Note, we need to solve the same for perf_call_bpf_exit().

pw-bot: cr

> diff --git a/kernel/trace/trace_syscalls.c b/kernel/trace/trace_syscalls.c
> index 9c581d6da843..063f51952d49 100644
> --- a/kernel/trace/trace_syscalls.c
> +++ b/kernel/trace/trace_syscalls.c
> @@ -559,12 +559,15 @@ static int perf_call_bpf_enter(struct trace_event_call *call, struct pt_regs *re

Let's also drop the struct pt_regs * argument from
perf_call_bpf_{enter,exit}(); it is not actually used anymore.

>  		int syscall_nr;
>  		unsigned long args[SYSCALL_DEFINE_MAXARGS];
>  	} __aligned(8) param;
> +	struct pt_regs fake_regs;
>  	int i;
>
>  	BUILD_BUG_ON(sizeof(param.ent) < sizeof(void *));
>
>  	/* bpf prog requires 'regs' to be the first member in the ctx (a.k.a. &param) */
> -	*(struct pt_regs **)&param = regs;
> +	memset(&fake_regs, 0, sizeof(fake_regs));

sizeof(struct pt_regs) == 168 on x86-64, and on arm64 it's a whopping
336 bytes, so these memset(0) calls are not free, for sure. But we
don't need to do this unnecessary work all the time.

I was initially going to suggest using get_bpf_raw_tp_regs() from
kernel/trace/bpf_trace.c to get a temporary pt_regs that is already
memset(0) and use it to initialize these minimal "fake regs". But it
turns out we don't even need to do that.

Note perf_trace_buf_alloc(): it has a `struct pt_regs **` second
argument, and if you pass a valid pointer there, it will return a
"fake regs" struct to be used.
We already use that functionality in perf_trace_##call() in
include/trace/perf.h (i.e., for non-syscall tracepoints), so this seems
to be a perfect fit.

> +	perf_fetch_caller_regs(&fake_regs);
> +	*(struct pt_regs **)&param = &fake_regs;
>  	param.syscall_nr = rec->nr;
>  	for (i = 0; i < sys_data->nb_args; i++)
>  		param.args[i] = rec->args[i];
> --
> 2.43.5
>