On Tue, Sep 3, 2024 at 11:09 AM Andrii Nakryiko <andrii.nakryiko@xxxxxxxxx> wrote:
>
> On Mon, Sep 2, 2024 at 2:11 AM Jiri Olsa <olsajiri@xxxxxxxxx> wrote:
> >
> > On Fri, Aug 30, 2024 at 08:51:12AM -0700, Andrii Nakryiko wrote:
> > > On Fri, Aug 30, 2024 at 6:34 AM Jiri Olsa <olsajiri@xxxxxxxxx> wrote:
> > > >
> > > > On Fri, Aug 30, 2024 at 12:12:09PM +0200, Oleg Nesterov wrote:
> > > > > The whole discussion was very confusing (yes, I too contributed to the
> > > > > confusion ;), let me try to summarise.
> > > > >
> > > > > > U(ret)probes are designed to be filterable using the PID, which is the
> > > > > > second parameter in the perf_event_open syscall. Currently, uprobe works
> > > > > > well with the filtering, but uretprobe is not affected by it.
> > > > >
> > > > > And this is correct. But the CONFIG_BPF_EVENTS code in __uprobe_perf_func()
> > > > > misunderstands the purpose of uprobe_perf_filter().
> > > > >
> > > > > Let's forget about BPF for the moment. It is not that uprobe_perf_filter()
> > > > > does the filtering by the PID, it doesn't. We can simply kill this function
> > > > > and perf will work correctly. The perf layer in __uprobe_perf_func() does
> > > > > the filtering when perf_event->hw.target != NULL.
> > > > >
> > > > > So why does uprobe_perf_func() call uprobe_perf_filter()? Not to avoid
> > > > > the __uprobe_perf_func() call (as the BPF code assumes), but to trigger
> > > > > unapply_uprobe() in handler_chain().
> > > > >
> > > > > Suppose you do, say,
> > > > >
> > > > > $ perf probe -x /path/to/libc some_hot_function
> > > > > or
> > > > > $ perf probe -x /path/to/libc some_hot_function%return
> > > > > then
> > > > > $ perf record -e ... -p 1
> > > > >
> > > > > to trace the usage of some_hot_function() in the init process. Everything
> > > > > will work just fine if we kill uprobe_perf_func()->uprobe_perf_filter().
> > > > >
> > > > > But. If INIT forks a child C, dup_mm() will copy the int3 installed by perf.
> > > > > So the child C will hit this breakpoint and call handle_swbp/etc for no
> > > > > reason every time it calls some_hot_function(), not good.
> > > > >
> > > > > That is why uprobe_perf_func() calls uprobe_perf_filter(), which returns
> > > > > UPROBE_HANDLER_REMOVE when C hits the breakpoint. handler_chain() will
> > > > > call unapply_uprobe(), which removes this breakpoint from C->mm.
> > > >
> > > > thanks for the info, I wasn't aware this was the intention
> > > >
> > > > uprobe_multi does not have the perf event mechanism/check, so it's using
> > > > the filter function to do the process filtering, which is not working
> > > > properly as you pointed out earlier
> > >
> > > So this part I don't completely get. I get that using the task->mm
> > > comparison is wrong due to CLONE_VM, but why is the same_thread_group()
> > > check wrong? I.e., why is the task->signal comparison wrong?
> >
> > the way I understand it is that we take the group leader task and
> > store it in bpf_uprobe_multi_link::task
> >
> > but it can exit while the rest of the threads are still running, so
> > uprobe_multi_link_filter won't match them (leader->mm is NULL)
>
> Aren't we conflating two things here? Yes, from what Oleg explained,
> it's clear that using task->mm is wrong. So that is what I feel is the
> main issue. We shouldn't use task->mm at all, only task->signal should
> be used instead. We should fix that (in bpf tree, please).
>
> But I don't get the concern about task->mm or task->signal becoming

correction, we shouldn't worry about *task->signal* becoming NULL.
task->mm can become NULL, but we don't care about that (once we fix
the filtering logic in multi-uprobe).

> NULL because of a task exiting. Look at put_task_struct(), it WILL
> NOT call __put_task_struct() (which then calls put_signal_struct()),
> so task->signal at least will be there and valid until multi-uprobe is
> detached and we call put_task_struct().
>
> So. Can you please send fixes against the bpf tree, switching to
> task->signal? And maybe also include the fix to prevent
> UPROBE_HANDLER_REMOVE from being returned from the BPF program?
>
> This thread is almost 50 emails deep now, we should break out of it.
> We can argue about your actual fixes. :)
> >
> > Oleg suggested the change below (in addition to the same_thread_group change)
> > to take that into account
> >
> > jirka
> >
> >
> > ---
> > diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
> > index 98e395f1baae..9e6b390aa6da 100644
> > --- a/kernel/trace/bpf_trace.c
> > +++ b/kernel/trace/bpf_trace.c
> > @@ -3235,9 +3235,23 @@ uprobe_multi_link_filter(struct uprobe_consumer *con, enum uprobe_filter_ctx ctx
> >                           struct mm_struct *mm)
> >  {
> >         struct bpf_uprobe *uprobe;
> > +       struct task_struct *task, *t;
> > +       bool ret = false;
> >
> >         uprobe = container_of(con, struct bpf_uprobe, consumer);
> > -       return uprobe->link->task->mm == mm;
> > +       task = uprobe->link->task;
> > +
> > +       rcu_read_lock();
> > +       for_each_thread(task, t) {
> > +               struct mm_struct *t_mm = READ_ONCE(t->mm);
> > +               if (t_mm) {
> > +                       ret = t_mm == mm;
> > +                       break;
> > +               }
> > +       }
> > +       rcu_read_unlock();
> > +
> > +       return ret;
> >  }
> >
> >  static int
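
For what it's worth, the "switch to task->signal" direction being asked for
above would look roughly like the sketch below. This is illustrative only,
not the actual fix and not the patch quoted above: the helper name
uprobe_multi_task_match() is made up for the example, and where exactly such
a check would live in kernel/trace/bpf_trace.c (handler path vs. the filter
callback) is an assumption.

	/*
	 * Sketch only: filter on thread-group membership (task->signal via
	 * same_thread_group()) instead of comparing ->mm pointers.  The
	 * multi-uprobe link holds a reference on link->task, so
	 * link->task->signal stays valid even after the group leader exits
	 * and its ->mm becomes NULL.
	 */
	#include <linux/sched/signal.h>		/* same_thread_group() */

	static bool uprobe_multi_task_match(struct bpf_uprobe_multi_link *link)
	{
		/* no PID filter attached -> run for every task */
		if (!link->task)
			return true;

		/* true iff current->signal == link->task->signal */
		return same_thread_group(current, link->task);
	}

The mm-based walk in the diff above answers a different question (does the
target thread group still have a live mm matching the one that hit the
probe), while a same_thread_group() check sidesteps the "leader exited,
leader->mm is NULL" problem entirely.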