Re: [PATCH v2] tracing/uprobe: Add missing PID filter for uretprobe

Tianyi Liu <i.pear@xxxxxxxxxxx> · Sat, 24 Aug 2024 13:49:26 +0800

Hi Masami and Andrii:

I'd like to share some more information and ideas, though I may be wrong.

> > U(ret)probes are designed to be filterable using the PID, which is the
> > second parameter in the perf_event_open syscall. Currently, uprobe works
> > well with the filtering, but uretprobe is not affected by it. This often
> > leads to users being disturbed by events from uninterested processes while
> > using uretprobe.
> >
> > We found that the filter function was not invoked when uretprobe was
> > initially implemented, and this has been existing for ten years. We have
> > tested the patch under our workload, binding eBPF programs to uretprobe
> > tracepoints, and confirmed that it resolved our problem.
> 
> Is this eBPF related problem? It seems only perf record is also affected.
> Let me try.

I think this is a general issue rather than something specific to BPF,
because the BPF handler is only an event "consumer".

> 
> And trace one of them;
> 
> $ sudo ~/bin/perf trace record -e probe_malloc:malloc__return  -p 93928
> 

A key trigger here is that two tracer instances (either `bpftrace` or
`perf record`) must be running *simultaneously*. One of them should use
PID1 as its filter, while the other uses PID2.

I think tracing only PID1 fails to trigger the bug because uprobe uses
some form of copy-on-write mechanism to create independent .text pages
for the traced process. For example, if only PID1 is being traced, then
only PID1's copy of libc.so is patched. Other processes continue to use
the original, unpatched libc.so, so no event is triggered for them. In my
reproduction example, the two bpftrace instances must be running at the
same time.
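
To make the idea concrete, here is a toy userspace model in Python of what
I think is happening. It is only a sketch of my understanding, not kernel
code: `Consumer`, `ProbePoint`, `hit()` and the `filter_on_return` flag are
names I made up for illustration and do not correspond to real kernel
structures. It assumes a breakpoint exists only in processes that some
tracer asked for, and that every hit is offered to all consumers of the
probe.

```python
from dataclasses import dataclass, field


@dataclass
class Consumer:
    """One tracer instance (hypothetical), filtering on a single PID."""
    name: str
    pid_filter: int
    events: list = field(default_factory=list)


class ProbePoint:
    """Toy model of one probed location shared by several tracers."""

    def __init__(self):
        self.consumers = []

    def attach(self, consumer):
        self.consumers.append(consumer)

    def patched_pids(self):
        # Assumption: only processes requested by some tracer get a
        # patched (copy-on-write) text page.
        return {c.pid_filter for c in self.consumers}

    def hit(self, pid, is_return, filter_on_return):
        if pid not in self.patched_pids():
            return  # unpatched text: no trap, so no event at all
        for c in self.consumers:
            # Entry probes always filter; return probes filter only
            # when filter_on_return is set (the behaviour after a fix).
            apply_filter = (not is_return) or filter_on_return
            if apply_filter and c.pid_filter != pid:
                continue
            c.events.append(pid)


def run(tracer_pids, filter_on_return):
    """Attach one tracer per PID, then let PIDs 1, 2 and 3 each return
    from the probed function once."""
    probe = ProbePoint()
    tracers = [Consumer(f"tracer{p}", p) for p in tracer_pids]
    for t in tracers:
        probe.attach(t)
    for pid in (1, 2, 3):
        probe.hit(pid, is_return=True, filter_on_return=filter_on_return)
    return {t.name: t.events for t in tracers}


print(run([1], filter_on_return=False))     # one tracer only
print(run([1, 2], filter_on_return=False))  # two tracers, no return filter
print(run([1, 2], filter_on_return=True))   # two tracers, return filter
```

In this model a single tracer on PID1 only ever sees PID1, because nothing
else is patched. With two tracers and no filtering on the return path, each
tracer also receives the other PID's events, which matches what I see with
two bpftrace instances. PID3 is never reported in any case. The model does
not explain why `perf` behaves differently.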

> This is a bit confusing, because even if the kernel-side uretprobe
> handler doesn't do the filtering by itself, uprobe subsystem shouldn't
> install breakpoints on processes which don't have uretprobe requested
> for (unless I'm missing something, of course).

There are two tracers, one attached to PID1 and the other attached to
PID2, as described above.

> It still needs to be fixed like you do in your patch, though. Even
> more, we probably need a similar UPROBE_HANDLER_REMOVE handling in
> handle_uretprobe_chain() to clean up breakpoint for processes which
> don't have uretprobe attached anymore (but I think that's a separate
> follow up).

Agreed. The uretprobe implementation should be almost the same as the
uprobe one, but it seems uretprobe was overlooked in previous changes.

> $ sudo ~/bin/perf trace record -e probe_malloc:malloc__return  -p 93928
> ^C[ perf record: Woken up 1 times to write data ]
> [ perf record: Captured and wrote 0.031 MB perf.data (9 samples) ]
> 
> And dump the data;
> 
> $ sudo ~/bin/perf script
>       malloc-run   93928 [004] 351736.730649:       raw_syscalls:sys_exit: NR 230 = 0
>       malloc-run   93928 [004] 351736.730694: probe_malloc:malloc__return: (561cfdeb30c0 <- 561cfdeb3204)
>       malloc-run   93928 [004] 351736.730696:      raw_syscalls:sys_enter: NR 230 (0, 0, 7ffc7a5c5380, 7ffc7a5c5380, 561d2940f6b0,
>       malloc-run   93928 [004] 351738.730857:       raw_syscalls:sys_exit: NR 230 = 0
>       malloc-run   93928 [004] 351738.730869: probe_malloc:malloc__return: (561cfdeb30c0 <- 561cfdeb3204)
>       malloc-run   93928 [004] 351738.730883:      raw_syscalls:sys_enter: NR 230 (0, 0, 7ffc7a5c5380, 7ffc7a5c5380, 561d2940f6b0,
>       malloc-run   93928 [004] 351740.731110:       raw_syscalls:sys_exit: NR 230 = 0
>       malloc-run   93928 [004] 351740.731125: probe_malloc:malloc__return: (561cfdeb30c0 <- 561cfdeb3204)
>       malloc-run   93928 [004] 351740.731127:      raw_syscalls:sys_enter: NR 230 (0, 0, 7ffc7a5c5380, 7ffc7a5c5380, 561d2940f6b0,
> 
> Hmm, it seems to trace one pid data. (without this change)
> If this changes eBPF behavior, I would like to involve eBPF people to ask
> this is OK. As far as from the viewpoint of perf tool, current code works.

I tried this and also couldn't reproduce the bug. Even when running two
perf instances simultaneously, `perf record` (or perhaps `perf trace` for
convenience) only outputs events from the corresponding PID as expected.
I initially suspected that perf might be applying another filter in user
space, but that doesn't seem to be the case. I'll need to conduct further
debugging, which might take some time.

I also tried combining bpftrace with `perf trace`. Specifically, I used
`perf trace` for PID1 and bpftrace for PID2. `perf trace` still only
outputs events from PID1, but bpftrace prints events from both PIDs.
I'm not yet sure why this is happening.

Thanks so much,