On Sat, 24 Aug 2024 13:49:26 +0800
Tianyi Liu <i.pear@xxxxxxxxxxx> wrote:

> Hi Masami and Andrii:
> 
> I would like to share more information and ideas, but I'm possibly wrong.
> 
> > > U(ret)probes are designed to be filterable using the PID, which is the
> > > second parameter in the perf_event_open syscall. Currently, uprobe works
> > > well with the filtering, but uretprobe is not affected by it. This often
> > > leads to users being disturbed by events from processes they are not
> > > interested in while using uretprobe.
> > > 
> > > We found that the filter function was not invoked when uretprobe was
> > > initially implemented, and this has existed for ten years. We have
> > > tested the patch under our workload, binding eBPF programs to uretprobe
> > > tracepoints, and confirmed that it resolved our problem.
> > 
> > Is this an eBPF-related problem? It seems that perf record should also
> > be affected. Let me try.
> 
> I guess it should be a general issue and not specific to BPF, because
> the BPF handler is only an event "consumer".
> 
> > 
> > And trace one of them;
> > $ sudo ~/bin/perf trace record -e probe_malloc:malloc__return -p 93928
> > 
> 
> A key trigger here is that the two tracer instances (either `bpftrace` or
> `perf record`) must be running *simultaneously*. One of them should use
> PID1 as the filter, while the other uses PID2.

Even if I run two `perf trace record` instances simultaneously, the
events are filtered correctly. I just ran;

sudo ~/bin/perf trace record -e probe_malloc:malloc__return -p 93927 -o trace1.data -- sleep 50 &
sudo ~/bin/perf trace record -e probe_malloc:malloc__return -p 93928 -o trace2.data -- sleep 50 &

and dumped trace1.data and trace2.data with `perf script`.

> 
> I think the reason why tracing only PID1 fails to trigger the bug is that
> uprobe uses some form of copy-on-write mechanism to create independent
> .text pages for the traced process. For example, if only PID1 is being
> traced, then only the libc.so used by PID1 is patched. Other processes
> will continue to use the unpatched original libc.so, so the event isn't
> triggered. In my reproduction example, the two bpftrace instances must be
> running at the same moment.
> 
> > This is a bit confusing, because even if the kernel-side uretprobe
> > handler doesn't do the filtering by itself, the uprobe subsystem
> > shouldn't install breakpoints on processes which don't have a uretprobe
> > requested for them (unless I'm missing something, of course).
> 
> There are two tracers, one attached to PID1 and the other attached
> to PID2, as described above.

Yeah, but perf works fine. Maybe perf only enables its ring buffer for
the specific process and reads from that specific ring buffer (note:
perf has per-process ring buffers, IIRC). How does bpftrace implement
it?

> 
> > It still needs to be fixed like you do in your patch, though. Even
> > more, we probably need similar UPROBE_HANDLER_REMOVE handling in
> > handle_uretprobe_chain() to clean up breakpoints for processes which
> > no longer have a uretprobe attached (but I think that's a separate
> > follow-up).
> 
> Agreed, the implementation of uretprobe should be almost the same as
> uprobe's, but it seems uretprobe was ignored in previous modifications.

OK, I would just like to get a confirmation from the BPF people, because
I could not reproduce the issue with the perf tool.
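For reference, the UPROBE_HANDLER_REMOVE handling mentioned above could
look roughly like the sketch below, modeled on the existing
handler_chain() in kernel/events/uprobes.c. This is only an illustration
of the idea, not the submitted patch; the exact field and helper names
may differ depending on the kernel version.

/*
 * Sketch only: how handle_uretprobe_chain() could honor
 * UPROBE_HANDLER_REMOVE returned by ret_handler consumers, mirroring
 * what handler_chain() already does on the entry side. Names follow
 * the current uprobes code and are not guaranteed to match the patch.
 */
static void
handle_uretprobe_chain(struct return_instance *ri, struct pt_regs *regs)
{
        struct uprobe *uprobe = ri->uprobe;
        struct uprobe_consumer *uc;
        int remove = UPROBE_HANDLER_REMOVE;

        down_read(&uprobe->register_rwsem);
        for (uc = uprobe->consumers; uc; uc = uc->next) {
                int rc = 0;

                if (uc->ret_handler)
                        rc = uc->ret_handler(uc, ri->func, regs);

                /* keep the breakpoint unless every consumer asks to remove it */
                remove &= rc;
        }

        /* as in handler_chain(): drop the breakpoint from this mm */
        if (remove && uprobe->consumers)
                unapply_uprobe(uprobe, current->mm);

        up_read(&uprobe->register_rwsem);
}

Whether cleanup like this belongs in the same patch or in a separate
follow-up is, as noted above, a separate question.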
> 
> > $ sudo ~/bin/perf trace record -e probe_malloc:malloc__return -p 93928
> > ^C[ perf record: Woken up 1 times to write data ]
> > [ perf record: Captured and wrote 0.031 MB perf.data (9 samples) ]
> > 
> > And dump the data;
> > 
> > $ sudo ~/bin/perf script
> > malloc-run 93928 [004] 351736.730649: raw_syscalls:sys_exit: NR 230 = 0
> > malloc-run 93928 [004] 351736.730694: probe_malloc:malloc__return: (561cfdeb30c0 <- 561cfdeb3204)
> > malloc-run 93928 [004] 351736.730696: raw_syscalls:sys_enter: NR 230 (0, 0, 7ffc7a5c5380, 7ffc7a5c5380, 561d2940f6b0,
> > malloc-run 93928 [004] 351738.730857: raw_syscalls:sys_exit: NR 230 = 0
> > malloc-run 93928 [004] 351738.730869: probe_malloc:malloc__return: (561cfdeb30c0 <- 561cfdeb3204)
> > malloc-run 93928 [004] 351738.730883: raw_syscalls:sys_enter: NR 230 (0, 0, 7ffc7a5c5380, 7ffc7a5c5380, 561d2940f6b0,
> > malloc-run 93928 [004] 351740.731110: raw_syscalls:sys_exit: NR 230 = 0
> > malloc-run 93928 [004] 351740.731125: probe_malloc:malloc__return: (561cfdeb30c0 <- 561cfdeb3204)
> > malloc-run 93928 [004] 351740.731127: raw_syscalls:sys_enter: NR 230 (0, 0, 7ffc7a5c5380, 7ffc7a5c5380, 561d2940f6b0,
> > 
> > Hmm, it seems to trace only one PID's data (without this change).
> > If this changes eBPF behavior, I would like to involve the eBPF people
> > to ask whether this is OK. From the viewpoint of the perf tool, the
> > current code works.
> 
> I tried this and also couldn't reproduce the bug. Even when running two
> perf instances simultaneously, `perf record` (or perhaps `perf trace` for
> convenience) only outputs events from the corresponding PID as expected.
> I initially suspected that perf might be applying another filter in user
> space, but that doesn't seem to be the case. I'll need to conduct further
> debugging, which might take some time.
> 
> I also tried combining bpftrace with `perf trace`. Specifically, I used
> `perf trace` for PID1 and bpftrace for PID2. `perf trace` still only
> outputs events from PID1, but bpftrace prints events from both PIDs.
> I'm not yet sure why this is happening.

Yeah, if perf only reads the specific process's ring buffer, it should
work even without this filter.

Thanks,

> 
> Thanks so much,


-- 
Masami Hiramatsu (Google) <mhiramat@xxxxxxxxxx>