Re: [PATCH bpf-next] bpf: Use fake pt_regs when doing bpf syscall tracepoint tracing

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 




On 9/9/24 10:42 PM, Andrii Nakryiko wrote:
On Mon, Sep 9, 2024 at 10:34 PM Andrii Nakryiko
<andrii.nakryiko@xxxxxxxxx> wrote:
On Mon, Sep 9, 2024 at 8:43 PM Yonghong Song <yonghong.song@xxxxxxxxx> wrote:
Salvatore Benedetto reported an issue that when doing syscall tracepoint
tracing the kernel stack is empty. For example, using the following
command line
   bpftrace -e 'tracepoint:syscalls:sys_enter_read { print("Kernel Stack\n"); print(kstack()); }'
the output will be
===
   Kernel Stack
===

Further analysis shows that pt_regs used for bpf syscall tracepoint
tracing is from the one constructed during user->kernel transition.
The call stack looks like
   perf_syscall_enter+0x88/0x7c0
   trace_sys_enter+0x41/0x80
   syscall_trace_enter+0x100/0x160
   do_syscall_64+0x38/0xf0
   entry_SYSCALL_64_after_hwframe+0x76/0x7e

The ip address stored in pt_regs is from user space hence no kernel
stack is printed.

To fix the issue, we need to use kernel address from pt_regs.
In kernel repo, there are already a few cases like this. For example,
in kernel/trace/bpf_trace.c, several perf_fetch_caller_regs(fake_regs_ptr)
instances are used to supply ip address or use ip address to construct
call stack.

The patch follows the above example by using a fake pt_regs.
The pt_regs is stored in local stack since the syscall tracepoint
tracing is in process context and there are no possibility that
different concurrent syscall tracepoint tracing could mess up with each
other. This is similar to a perf_fetch_caller_regs() use case in
kernel/trace/trace_event_perf.c with function perf_ftrace_function_call()
where a local pt_regs is used.

With this patch, for the above bpftrace script, I got the following output
===
   Kernel Stack

         syscall_trace_enter+407
         syscall_trace_enter+407
         do_syscall_64+74
         entry_SYSCALL_64_after_hwframe+75
===

Reported-by: Salvatore Benedetto <salvabenedetto@xxxxxxxx>
Suggested-by: Andrii Nakryiko <andrii@xxxxxxxxxx>
Signed-off-by: Yonghong Song <yonghong.song@xxxxxxxxx>
---
  kernel/trace/trace_syscalls.c | 5 ++++-
  1 file changed, 4 insertions(+), 1 deletion(-)

Note, we need to solve the same for perf_call_bpf_exit().

pw-bot: cr

BTW, we lived with this bug for years, so I suggest basing your fix on
top of bpf-next/master, no bpf/master, which will give people a bit of
time to validate that the fix works as expected and doesn't produce
any undesirable side effects, before this makes it into the final
Linux release.

Yes, I did. See I indeed use 'bpf-next' in subject above.


diff --git a/kernel/trace/trace_syscalls.c b/kernel/trace/trace_syscalls.c
index 9c581d6da843..063f51952d49 100644
--- a/kernel/trace/trace_syscalls.c
+++ b/kernel/trace/trace_syscalls.c
@@ -559,12 +559,15 @@ static int perf_call_bpf_enter(struct trace_event_call *call, struct pt_regs *re
let's also drop struct pt_regs * argument into
perf_call_bpf_{enter,exit}(), they are not actually used anymore

                 int syscall_nr;
                 unsigned long args[SYSCALL_DEFINE_MAXARGS];
         } __aligned(8) param;
+       struct pt_regs fake_regs;
         int i;

         BUILD_BUG_ON(sizeof(param.ent) < sizeof(void *));

         /* bpf prog requires 'regs' to be the first member in the ctx (a.k.a. &param) */
-       *(struct pt_regs **)&param = regs;
+       memset(&fake_regs, 0, sizeof(fake_regs));
sizeof(struct pt_regs) == 168 on x86-64, and on arm64 it's a whopping
336 bytes, so these memset(0) calls are not free for sure.

But we don't need to do this unnecessary work all the time.

I initially was going to suggest to use get_bpf_raw_tp_regs() from
kernel/trace/bpf_trace.c to get a temporary pt_regs that was already
memset(0) and used to initialize these minimal "fake regs".

But, it turns out we don't need to do even that. Note
perf_trace_buf_alloc(), it has `struct pt_regs **` second argument,
and if you pass a valid pointer there, it will return "fake regs"
struct to be used. We already use that functionality in
perf_trace_##call in include/trace/perf.h (i.e., non-syscall
tracepoints), so this seems to be a perfect fit.

+       perf_fetch_caller_regs(&fake_regs);
+       *(struct pt_regs **)&param = &fake_regs;
         param.syscall_nr = rec->nr;
         for (i = 0; i < sys_data->nb_args; i++)
                 param.args[i] = rec->args[i];
--
2.43.5





[Index of Archives]     [Linux Samsung SoC]     [Linux Rockchip SoC]     [Linux Actions SoC]     [Linux for Synopsys ARC Processors]     [Linux NFS]     [Linux NILFS]     [Linux USB Devel]     [Video for Linux]     [Linux Audio Users]     [Yosemite News]     [Linux Kernel]     [Linux SCSI]


  Powered by Linux