On Mon, Sep 9, 2024 at 8:43 PM Yonghong Song <yonghong.song@xxxxxxxxx> wrote:
>
> Salvatore Benedetto reported an issue where the kernel stack is empty
> when doing syscall tracepoint tracing. For example, with the following
> command line
>
>   bpftrace -e 'tracepoint:syscalls:sys_enter_read { print("Kernel Stack\n"); print(kstack()); }'
>
> the output will be
>
>   ===
>   Kernel Stack
>   ===
>
> Further analysis shows that the pt_regs used for bpf syscall tracepoint
> tracing is the one constructed during the user->kernel transition.
> The call stack looks like
>
>   perf_syscall_enter+0x88/0x7c0
>   trace_sys_enter+0x41/0x80
>   syscall_trace_enter+0x100/0x160
>   do_syscall_64+0x38/0xf0
>   entry_SYSCALL_64_after_hwframe+0x76/0x7e
>
> The ip address stored in pt_regs is from user space, hence no kernel
> stack is printed.
>
> To fix the issue, we need to use a kernel address in pt_regs.
> There are already a few cases like this in the kernel repo. For example,
> in kernel/trace/bpf_trace.c, several perf_fetch_caller_regs(fake_regs_ptr)
> instances are used to supply the ip address or to construct a call stack.
>
> This patch follows the above example by using a fake pt_regs.
> The pt_regs is stored on the local stack, since syscall tracepoint
> tracing runs in process context and there is no possibility that
> concurrent syscall tracepoint tracing could interfere with each
> other. This is similar to a perf_fetch_caller_regs() use case in
> kernel/trace/trace_event_perf.c, in function perf_ftrace_function_call(),
> where a local pt_regs is used.
>
> With this patch, for the above bpftrace script, I got the following output
>
>   ===
>   Kernel Stack
>
>   syscall_trace_enter+407
>   syscall_trace_enter+407
>   do_syscall_64+74
>   entry_SYSCALL_64_after_hwframe+75
>   ===
>
> Reported-by: Salvatore Benedetto <salvabenedetto@xxxxxxxx>
> Suggested-by: Andrii Nakryiko <andrii@xxxxxxxxxx>
> Signed-off-by: Yonghong Song <yonghong.song@xxxxxxxxx>
> ---
>  kernel/trace/trace_syscalls.c | 5 ++++-
>  1 file changed, 4 insertions(+), 1 deletion(-)
>

Note, we need to solve the same for perf_call_bpf_exit().

pw-bot: cr

> diff --git a/kernel/trace/trace_syscalls.c b/kernel/trace/trace_syscalls.c
> index 9c581d6da843..063f51952d49 100644
> --- a/kernel/trace/trace_syscalls.c
> +++ b/kernel/trace/trace_syscalls.c
> @@ -559,12 +559,15 @@ static int perf_call_bpf_enter(struct trace_event_call *call, struct pt_regs *re

Let's also drop the struct pt_regs * argument from
perf_call_bpf_{enter,exit}(); it is not actually used anymore.

>  		int syscall_nr;
>  		unsigned long args[SYSCALL_DEFINE_MAXARGS];
>  	} __aligned(8) param;
> +	struct pt_regs fake_regs;
>  	int i;
>
>  	BUILD_BUG_ON(sizeof(param.ent) < sizeof(void *));
>
>  	/* bpf prog requires 'regs' to be the first member in the ctx (a.k.a. &param) */
> -	*(struct pt_regs **)&param = regs;
> +	memset(&fake_regs, 0, sizeof(fake_regs));

sizeof(struct pt_regs) == 168 on x86-64, and on arm64 it's a whopping
336 bytes, so these memset(0) calls are not free, for sure. But we
don't need to do this unnecessary work all the time.

I was initially going to suggest using get_bpf_raw_tp_regs() from
kernel/trace/bpf_trace.c to get a temporary pt_regs that is already
memset(0) and use it to initialize these minimal "fake regs". But it
turns out we don't even need to do that.

Note perf_trace_buf_alloc(): it has a `struct pt_regs **` second
argument, and if you pass a valid pointer there, it will return a
"fake regs" struct to be used.
We already use that functionality in perf_trace_##call() in
include/trace/perf.h (i.e., for non-syscall tracepoints), so this seems
to be a perfect fit.

> +	perf_fetch_caller_regs(&fake_regs);
> +	*(struct pt_regs **)&param = &fake_regs;
>  	param.syscall_nr = rec->nr;
>  	for (i = 0; i < sys_data->nb_args; i++)
>  		param.args[i] = rec->args[i];
> --
> 2.43.5
>