adding Jiri and bpftrace folks On Fri, Aug 23, 2024 at 10:44 AM Masami Hiramatsu <mhiramat@xxxxxxxxxx> wrote: > > On Fri, 23 Aug 2024 21:53:00 +0800 > Tianyi Liu <i.pear@xxxxxxxxxxx> wrote: > > > U(ret)probes are designed to be filterable using the PID, which is the > > second parameter in the perf_event_open syscall. Currently, uprobe works > > well with the filtering, but uretprobe is not affected by it. This often > > leads to users being disturbed by events from uninterested processes while > > using uretprobe. > > > > We found that the filter function was not invoked when uretprobe was > > initially implemented, and this has been existing for ten years. We have > > tested the patch under our workload, binding eBPF programs to uretprobe > > tracepoints, and confirmed that it resolved our problem. > > Is this eBPF related problem? It seems only perf record is also affected. > Let me try. > > > > > > Following are the steps to reproduce the issue: > > > > Step 1. Compile the following reproducer program: > > ``` > > > > int main() { > > printf("pid: %d\n", getpid()); > > while (1) { > > sleep(2); > > void *ptr = malloc(1024); > > free(ptr); > > } > > } > > ``` > > We will then use uretprobe to trace the `malloc` function. > > OK, and run perf probe to add an event on malloc's return. > > $ sudo ~/bin/perf probe -x ./malloc-run --add malloc%return > Added new event: > probe_malloc:malloc__return (on malloc%return in /home/mhiramat/ksrc/linux/malloc-run) > > You can now use it in all perf tools, such as: > > perf record -e probe_malloc:malloc__return -aR sleep 1 > > > > > Step 2. Run two instances of the reproducer program and record their PIDs. > > $ ./malloc-run & ./malloc-run & > [1] 93927 > [2] 93928 > pid: 93927 > pid: 93928 > > And trace one of them; > > $ sudo ~/bin/perf trace record -e probe_malloc:malloc__return -p 93928 > ^C[ perf record: Woken up 1 times to write data ] > [ perf record: Captured and wrote 0.031 MB perf.data (9 samples) ] > > And dump the data; > > $ sudo ~/bin/perf script > malloc-run 93928 [004] 351736.730649: raw_syscalls:sys_exit: NR 230 = 0 > malloc-run 93928 [004] 351736.730694: probe_malloc:malloc__return: (561cfdeb30c0 <- 561cfdeb3204) > malloc-run 93928 [004] 351736.730696: raw_syscalls:sys_enter: NR 230 (0, 0, 7ffc7a5c5380, 7ffc7a5c5380, 561d2940f6b0, > malloc-run 93928 [004] 351738.730857: raw_syscalls:sys_exit: NR 230 = 0 > malloc-run 93928 [004] 351738.730869: probe_malloc:malloc__return: (561cfdeb30c0 <- 561cfdeb3204) > malloc-run 93928 [004] 351738.730883: raw_syscalls:sys_enter: NR 230 (0, 0, 7ffc7a5c5380, 7ffc7a5c5380, 561d2940f6b0, > malloc-run 93928 [004] 351740.731110: raw_syscalls:sys_exit: NR 230 = 0 > malloc-run 93928 [004] 351740.731125: probe_malloc:malloc__return: (561cfdeb30c0 <- 561cfdeb3204) > malloc-run 93928 [004] 351740.731127: raw_syscalls:sys_enter: NR 230 (0, 0, 7ffc7a5c5380, 7ffc7a5c5380, 561d2940f6b0, > > Hmm, it seems to trace one pid data. (without this change) > If this changes eBPF behavior, I would like to involve eBPF people to ask > this is OK. As far as from the viewpoint of perf tool, current code works. > > But I agree that current code is a bit strange. Oleg, do you know anything? > > Thank you, > > > > > Step 3. Use uretprobe to trace each of the two running reproducers > > separately. We use bpftrace to make it easier to reproduce. Please run two > > instances of bpftrace simultaneously: the first instance filters events > > from PID1, and the second instance filters events from PID2. > > > > The expected behavior is that each bpftrace instance would only print > > events matching its respective PID filter. However, in practice, both > > bpftrace instances receive events from both processes, the PID filter is > > ineffective at this moment: > > > > Before: > > ``` > > PID1=55256 > > bpftrace -p $PID1 -e 'uretprobe:libc:malloc { printf("time=%llu pid=%d\n", elapsed / 1000000000, pid); }' > > Attaching 1 probe... > > time=0 pid=55256 > > time=2 pid=55273 > > time=2 pid=55256 > > time=4 pid=55273 > > time=4 pid=55256 > > time=6 pid=55273 > > time=6 pid=55256 > > > > PID2=55273 > > bpftrace -p $PID2 -e 'uretprobe:libc:malloc { printf("time=%llu pid=%d\n", elapsed / 1000000000, pid); }' > > Attaching 1 probe... > > time=0 pid=55273 > > time=0 pid=55256 > > time=2 pid=55273 > > time=2 pid=55256 > > time=4 pid=55273 > > time=4 pid=55256 > > time=6 pid=55273 > > time=6 pid=55256 > > ``` This is a bit confusing, because even if the kernel-side uretprobe handler doesn't do the filtering by itself, uprobe subsystem shouldn't install breakpoints on processes which don't have uretprobe requested for (unless I'm missing something, of course). It still needs to be fixed like you do in your patch, though. Even more, we probably need a similar UPROBE_HANDLER_REMOVE handling in handle_uretprobe_chain() to clean up breakpoint for processes which don't have uretprobe attached anymore (but I think that's a separate follow up). Anyways, I think fixing this won't affect BPF in the sense that this is clearly a bug and shouldn't happen. Furthermore, in multi-uprobe/multi-uretprobe implementation we already filter this out correctly, so I don't think anyone does (neither should) rely on this buggy behavior. > > > > After: Both bpftrace instances will show the expected behavior, only > > printing events from the PID specified by their respective filters: > > ``` > > PID1=1621 > > bpftrace -p $PID1 -e 'uretprobe:libc:malloc { printf("time=%llu pid=%d\n", elapsed / 1000000000, pid); }' > > Attaching 1 probe... > > time=0 pid=1621 > > time=2 pid=1621 > > time=4 pid=1621 > > time=6 pid=1621 > > > > PID2=1633 > > bpftrace -p $PID2 -e 'uretprobe:libc:malloc { printf("time=%llu pid=%d\n", elapsed / 1000000000, pid); }' > > Attaching 1 probe... > > time=0 pid=1633 > > time=2 pid=1633 > > time=4 pid=1633 > > time=6 pid=1633 > > ``` > > > > Fixes: c1ae5c75e103 ("uprobes/tracing: Introduce is_ret_probe() and uretprobe_dispatcher()") > > Cc: Alban Crequy <albancrequy@xxxxxxxxxxxxxxxxxxx> > > Signed-off-by: Francis Laniel <flaniel@xxxxxxxxxxxxxxxxxxx> > > Signed-off-by: Tianyi Liu <i.pear@xxxxxxxxxxx> > > --- > > Changes in v2: > > - Drop cover letter and update commit message. > > - Link to v1: https://lore.kernel.org/linux-trace-kernel/ME0P300MB04166144CDF92A72B9E1BAEA9D8F2@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx/ > > --- > > kernel/trace/trace_uprobe.c | 3 +++ > > 1 file changed, 3 insertions(+) > > > > diff --git a/kernel/trace/trace_uprobe.c b/kernel/trace/trace_uprobe.c > > index c98e3b3386ba..c7e2a0962928 100644 > > --- a/kernel/trace/trace_uprobe.c > > +++ b/kernel/trace/trace_uprobe.c > > @@ -1443,6 +1443,9 @@ static void uretprobe_perf_func(struct trace_uprobe *tu, unsigned long func, > > struct pt_regs *regs, > > struct uprobe_cpu_buffer **ucbp) > > { > > + if (!uprobe_perf_filter(&tu->consumer, 0, current->mm)) > > + return; > > + > > __uprobe_perf_func(tu, func, regs, ucbp); > > } > > > > -- > > 2.34.1 > > > > > -- > Masami Hiramatsu (Google) <mhiramat@xxxxxxxxxx> >