Re: [PATCH bpf-next] bpf: Use fake pt_regs when doing bpf syscall tracepoint tracing

Andrii Nakryiko <andrii.nakryiko@xxxxxxxxx> · Tue, 10 Sep 2024 11:25:15 -0700

On Tue, Sep 10, 2024 at 11:22 AM Yonghong Song <yonghong.song@xxxxxxxxx> wrote:
>
>
> On 9/10/24 9:50 AM, Andrii Nakryiko wrote:
> > On Tue, Sep 10, 2024 at 8:25 AM Yonghong Song <yonghong.song@xxxxxxxxx> wrote:
> >>
> >> On 9/9/24 10:42 PM, Andrii Nakryiko wrote:
> >>> On Mon, Sep 9, 2024 at 10:34 PM Andrii Nakryiko
> >>> <andrii.nakryiko@xxxxxxxxx> wrote:
> >>>> On Mon, Sep 9, 2024 at 8:43 PM Yonghong Song <yonghong.song@xxxxxxxxx> wrote:
> >>>>> Salvatore Benedetto reported an issue that when doing syscall tracepoint
> >>>>> tracing the kernel stack is empty. For example, using the following
> >>>>> command line
> >>>>>     bpftrace -e 'tracepoint:syscalls:sys_enter_read { print("Kernel Stack\n"); print(kstack()); }'
> >>>>> the output will be
> >>>>> ===
> >>>>>     Kernel Stack
> >>>>> ===
> >>>>>
> >>>>> Further analysis shows that pt_regs used for bpf syscall tracepoint
> >>>>> tracing is from the one constructed during user->kernel transition.
> >>>>> The call stack looks like
> >>>>>     perf_syscall_enter+0x88/0x7c0
> >>>>>     trace_sys_enter+0x41/0x80
> >>>>>     syscall_trace_enter+0x100/0x160
> >>>>>     do_syscall_64+0x38/0xf0
> >>>>>     entry_SYSCALL_64_after_hwframe+0x76/0x7e
> >>>>>
> >>>>> The ip address stored in pt_regs is from user space hence no kernel
> >>>>> stack is printed.
> >>>>>
> >>>>> To fix the issue, we need to use kernel address from pt_regs.
> >>>>> In kernel repo, there are already a few cases like this. For example,
> >>>>> in kernel/trace/bpf_trace.c, several perf_fetch_caller_regs(fake_regs_ptr)
> >>>>> instances are used to supply ip address or use ip address to construct
> >>>>> call stack.
> >>>>>
> >>>>> The patch follows the above example by using a fake pt_regs.
> >>>>> The pt_regs is stored in local stack since the syscall tracepoint
> >>>>> tracing is in process context and there are no possibility that
> >>>>> different concurrent syscall tracepoint tracing could mess up with each
> >>>>> other. This is similar to a perf_fetch_caller_regs() use case in
> >>>>> kernel/trace/trace_event_perf.c with function perf_ftrace_function_call()
> >>>>> where a local pt_regs is used.
> >>>>>
> >>>>> With this patch, for the above bpftrace script, I got the following output
> >>>>> ===
> >>>>>     Kernel Stack
> >>>>>
> >>>>>           syscall_trace_enter+407
> >>>>>           syscall_trace_enter+407
> >>>>>           do_syscall_64+74
> >>>>>           entry_SYSCALL_64_after_hwframe+75
> >>>>> ===
> >>>>>
> >>>>> Reported-by: Salvatore Benedetto <salvabenedetto@xxxxxxxx>
> >>>>> Suggested-by: Andrii Nakryiko <andrii@xxxxxxxxxx>
> >>>>> Signed-off-by: Yonghong Song <yonghong.song@xxxxxxxxx>
> >>>>> ---
> >>>>>    kernel/trace/trace_syscalls.c | 5 ++++-
> >>>>>    1 file changed, 4 insertions(+), 1 deletion(-)
> >>>>>
> >>>> Note, we need to solve the same for perf_call_bpf_exit().
> >>>>
> >>>> pw-bot: cr
> >>>>
> >>> BTW, we lived with this bug for years, so I suggest basing your fix on
> >>> top of bpf-next/master, no bpf/master, which will give people a bit of
> >>> time to validate that the fix works as expected and doesn't produce
> >>> any undesirable side effects, before this makes it into the final
> >>> Linux release.
> >> Yes, I did. See I indeed use 'bpf-next' in subject above.
> > Huh, strange, I actually tried to apply your patch to bpf-next/master
> > and it didn't apply cleanly. It did apply to bpf/master, though, which
> > is why I assumed you based it off of bpf/master.
>
> Interesting. The following is my git history:
>
> 7b71206057440d9559ecb9cd02d891f46927b272 (HEAD -> trace_syscall) bpf: Use fake pt_regs when doing bpf syscall tracepoint tracing
> 41d0c4677feee1ea063e0f2c2af72dc953b1f1cc (origin/master, origin/HEAD, master) libbpf: Fix some typos in comments
> 72d8508ecd3b081dba03ec00930c6b07c1ad55d3 MAINTAINERS: BPF ARC JIT: Update my e-mail address
> bee109b7b3e50739b88252a219fa07ecd78ad628 bpf: Fix error message on kfunc arg type mismatch
> ...
>
> Not sure what is going on ...
>

Doesn't matter, maybe my local bpf-next/master branch was screwed up.
Just send a v2 when you are ready and I'll try again :)

>
> >
> >>>>> diff --git a/kernel/trace/trace_syscalls.c b/kernel/trace/trace_syscalls.c
> >>>>> index 9c581d6da843..063f51952d49 100644
> >>>>> --- a/kernel/trace/trace_syscalls.c
> >>>>> +++ b/kernel/trace/trace_syscalls.c
> >>>>> @@ -559,12 +559,15 @@ static int perf_call_bpf_enter(struct trace_event_call *call, struct pt_regs *re
> >>>> let's also drop struct pt_regs * argument into
> >>>> perf_call_bpf_{enter,exit}(), they are not actually used anymore
> >>>>
> >>>>>                   int syscall_nr;
> >>>>>                   unsigned long args[SYSCALL_DEFINE_MAXARGS];
> >>>>>           } __aligned(8) param;
> >>>>> +       struct pt_regs fake_regs;
> >>>>>           int i;
> >>>>>
> >>>>>           BUILD_BUG_ON(sizeof(param.ent) < sizeof(void *));
> >>>>>
> >>>>>           /* bpf prog requires 'regs' to be the first member in the ctx (a.k.a. &param) */
> >>>>> -       *(struct pt_regs **)&param = regs;
> >>>>> +       memset(&fake_regs, 0, sizeof(fake_regs));
> >>>> sizeof(struct pt_regs) == 168 on x86-64, and on arm64 it's a whopping
> >>>> 336 bytes, so these memset(0) calls are not free for sure.
> >>>>
> >>>> But we don't need to do this unnecessary work all the time.
> >>>>
> >>>> I initially was going to suggest to use get_bpf_raw_tp_regs() from
> >>>> kernel/trace/bpf_trace.c to get a temporary pt_regs that was already
> >>>> memset(0) and used to initialize these minimal "fake regs".
> >>>>
> >>>> But, it turns out we don't need to do even that. Note
> >>>> perf_trace_buf_alloc(), it has `struct pt_regs **` second argument,
> >>>> and if you pass a valid pointer there, it will return "fake regs"
> >>>> struct to be used. We already use that functionality in
> >>>> perf_trace_##call in include/trace/perf.h (i.e., non-syscall
> >>>> tracepoints), so this seems to be a perfect fit.
> >>>>
> >>>>> +       perf_fetch_caller_regs(&fake_regs);
> >>>>> +       *(struct pt_regs **)&param = &fake_regs;
> >>>>>           param.syscall_nr = rec->nr;
> >>>>>           for (i = 0; i < sys_data->nb_args; i++)
> >>>>>                   param.args[i] = rec->args[i];
> >>>>> --
> >>>>> 2.43.5
> >>>>>