On Mon, Sep 9, 2024 at 8:11 AM Mathieu Desnoyers <mathieu.desnoyers@xxxxxxxxxxxx> wrote:
>
> On 2024-09-04 21:21, Andrii Nakryiko wrote:
> > On Wed, Aug 28, 2024 at 7:42 AM Mathieu Desnoyers
> > <mathieu.desnoyers@xxxxxxxxxxxx> wrote:
> >>
> >> In preparation for converting system call enter/exit instrumentation
> >> into faultable tracepoints, make sure that bpf can handle registering to
> >> such tracepoints by explicitly disabling preemption within the bpf
> >> tracepoint probes to respect the current expectations within bpf tracing
> >> code.
> >>
> >> This change does not yet allow bpf to take page faults per se within its
> >> probe, but allows its existing probes to connect to faultable
> >> tracepoints.
> >>
> >> Link: https://lore.kernel.org/lkml/20231002202531.3160-1-mathieu.desnoyers@xxxxxxxxxxxx/
> >> Co-developed-by: Michael Jeanson <mjeanson@xxxxxxxxxxxx>
> >> Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@xxxxxxxxxxxx>
> >> Signed-off-by: Michael Jeanson <mjeanson@xxxxxxxxxxxx>
> >> Reviewed-by: Masami Hiramatsu (Google) <mhiramat@xxxxxxxxxx>
> >> Cc: Steven Rostedt <rostedt@xxxxxxxxxxx>
> >> Cc: Masami Hiramatsu <mhiramat@xxxxxxxxxx>
> >> Cc: Peter Zijlstra <peterz@xxxxxxxxxxxxx>
> >> Cc: Alexei Starovoitov <ast@xxxxxxxxxx>
> >> Cc: Yonghong Song <yhs@xxxxxx>
> >> Cc: Paul E. McKenney <paulmck@xxxxxxxxxx>
> >> Cc: Ingo Molnar <mingo@xxxxxxxxxx>
> >> Cc: Arnaldo Carvalho de Melo <acme@xxxxxxxxxx>
> >> Cc: Mark Rutland <mark.rutland@xxxxxxx>
> >> Cc: Alexander Shishkin <alexander.shishkin@xxxxxxxxxxxxxxx>
> >> Cc: Namhyung Kim <namhyung@xxxxxxxxxx>
> >> Cc: bpf@xxxxxxxxxxxxxxx
> >> Cc: Joel Fernandes <joel@xxxxxxxxxxxxxxxxx>
> >> ---
> >> Changes since v4:
> >> - Use DEFINE_INACTIVE_GUARD.
> >> - Add brackets to multiline 'if' statements.
> >> Changes since v5:
> >> - Rebased on v6.11-rc5.
> >> - Pass the TRACEPOINT_MAY_FAULT flag directly to tracepoint_probe_register_prio_flags.
> >> ---
> >>  include/trace/bpf_probe.h | 21 ++++++++++++++++-----
> >>  kernel/trace/bpf_trace.c  |  2 +-
> >>  2 files changed, 17 insertions(+), 6 deletions(-)
> >>
> >> diff --git a/include/trace/bpf_probe.h b/include/trace/bpf_probe.h
> >> index a2ea11cc912e..cc96dd1e7c3d 100644
> >> --- a/include/trace/bpf_probe.h
> >> +++ b/include/trace/bpf_probe.h
> >> @@ -42,16 +42,27 @@
> >>  /* tracepoints with more than 12 arguments will hit build error */
> >>  #define CAST_TO_U64(...) CONCATENATE(__CAST, COUNT_ARGS(__VA_ARGS__))(__VA_ARGS__)
> >>
> >> -#define __BPF_DECLARE_TRACE(call, proto, args) \
> >> +#define __BPF_DECLARE_TRACE(call, proto, args, tp_flags) \
> >>  static notrace void \
> >>  __bpf_trace_##call(void *__data, proto) \
> >>  { \
> >> -        CONCATENATE(bpf_trace_run, COUNT_ARGS(args))(__data, CAST_TO_U64(args)); \
> >> +        DEFINE_INACTIVE_GUARD(preempt_notrace, bpf_trace_guard); \
> >> +        \
> >> +        if ((tp_flags) & TRACEPOINT_MAY_FAULT) { \
> >> +                might_fault(); \
> >> +                activate_guard(preempt_notrace, bpf_trace_guard)(); \
> >> +        } \
> >> +        \
> >> +        CONCATENATE(bpf_trace_run, COUNT_ARGS(args))(__data, CAST_TO_U64(args)); \
> >>  }
> >>
> >>  #undef DECLARE_EVENT_CLASS
> >>  #define DECLARE_EVENT_CLASS(call, proto, args, tstruct, assign, print) \
> >> -        __BPF_DECLARE_TRACE(call, PARAMS(proto), PARAMS(args))
> >> +        __BPF_DECLARE_TRACE(call, PARAMS(proto), PARAMS(args), 0)
> >> +
> >> +#undef DECLARE_EVENT_CLASS_MAY_FAULT
> >> +#define DECLARE_EVENT_CLASS_MAY_FAULT(call, proto, args, tstruct, assign, print) \
> >> +        __BPF_DECLARE_TRACE(call, PARAMS(proto), PARAMS(args), TRACEPOINT_MAY_FAULT)
> >>
> >>  /*
> >>   * This part is compiled out, it is only here as a build time check
> >> @@ -105,13 +116,13 @@ static inline void bpf_test_buffer_##call(void) \
> >>
> >>  #undef DECLARE_TRACE
> >>  #define DECLARE_TRACE(call, proto, args) \
> >> -        __BPF_DECLARE_TRACE(call, PARAMS(proto), PARAMS(args)) \
> >> +        __BPF_DECLARE_TRACE(call, PARAMS(proto), PARAMS(args), 0) \
> >>          __DEFINE_EVENT(call, call, PARAMS(proto), PARAMS(args), 0)
> >>
> >>  #undef DECLARE_TRACE_WRITABLE
> >>  #define DECLARE_TRACE_WRITABLE(call, proto, args, size) \
> >>          __CHECK_WRITABLE_BUF_SIZE(call, PARAMS(proto), PARAMS(args), size) \
> >> -        __BPF_DECLARE_TRACE(call, PARAMS(proto), PARAMS(args)) \
> >> +        __BPF_DECLARE_TRACE(call, PARAMS(proto), PARAMS(args), 0) \
> >>          __DEFINE_EVENT(call, call, PARAMS(proto), PARAMS(args), size)
> >>
> >>  #include TRACE_INCLUDE(TRACE_INCLUDE_FILE)
> >> diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
> >> index c77eb80cbd7f..ed07283d505b 100644
> >> --- a/kernel/trace/bpf_trace.c
> >> +++ b/kernel/trace/bpf_trace.c
> >> @@ -2473,7 +2473,7 @@ int bpf_probe_register(struct bpf_raw_event_map *btp, struct bpf_raw_tp_link *li
> >>
> >>          return tracepoint_probe_register_prio_flags(tp, (void *)btp->bpf_func,
> >>                                                      link, TRACEPOINT_DEFAULT_PRIO,
> >> -                                                    TRACEPOINT_MAY_EXIST);
> >> +                                                    TRACEPOINT_MAY_EXIST | (tp->flags & TRACEPOINT_MAY_FAULT));
> >>  }
> >>
> >>  int bpf_probe_unregister(struct bpf_raw_event_map *btp, struct bpf_raw_tp_link *link)
> >> --
> >> 2.39.2
> >>
> >>
> >
> > I wonder if it would be better to just do this, instead of that
> > preempt guard. I think we don't strictly need preemption to be
> > disabled, we just need to stay on the same CPU, just like we do that
> > for many other program types.
>
> I'm worried about introducing any kind of subtle synchronization
> change in this series, and moving from preempt-off to migrate-disable
> definitely falls under that umbrella.
>
> I would recommend auditing all uses of this_cpu_*() APIs to make sure
> accesses to per-cpu data structures are using atomics and not just using
> operations that expect use of preempt-off to prevent concurrent threads
> from updating to the per-cpu data concurrently.
>
> So what you are suggesting may be a good idea, but I prefer to leave
> this kind of change to a separate bpf-specific series, and I would
> leave this work to someone who knows more about ebpf than me.
>

Yeah, that's ok. Switching to migrate_disable() is probably going a bit
too far too fast, but I think we should just add
preempt_disable()/preempt_enable() inside __bpf_trace_run() instead of
leaving it inside those hard-to-find-and-follow tracepoint macros. So
maybe you can just pass a bool into __bpf_trace_run() and do the preempt
guard (or an explicit disable/enable pair) there? Rough, untested sketch
at the bottom of this mail.

> Thanks,
>
> Mathieu
>
> >
> > We'll need some more BPF-specific plumbing to fully support faultable
> > (sleepable) tracepoints, but this should unblock your work, unless I'm
> > missing something. And we can take it from there, once your patches
> > land, to take advantage of faultable tracepoints in the BPF ecosystem.
> >
> > diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
> > index b69a39316c0c..415639b7c7a4 100644
> > --- a/kernel/trace/bpf_trace.c
> > +++ b/kernel/trace/bpf_trace.c
> > @@ -2302,7 +2302,8 @@ void __bpf_trace_run(struct bpf_raw_tp_link *link, u64 *args)
> >          struct bpf_run_ctx *old_run_ctx;
> >          struct bpf_trace_run_ctx run_ctx;
> >
> > -        cant_sleep();
> > +        migrate_disable();
> > +
> >          if (unlikely(this_cpu_inc_return(*(prog->active)) != 1)) {
> >                  bpf_prog_inc_misses_counter(prog);
> >                  goto out;
> > @@ -2318,6 +2319,8 @@ void __bpf_trace_run(struct bpf_raw_tp_link *link, u64 *args)
> >          bpf_reset_run_ctx(old_run_ctx);
> >  out:
> >          this_cpu_dec(*(prog->active));
> > +
> > +        migrate_enable();
> >  }
>
> --
> Mathieu Desnoyers
> EfficiOS Inc.
> https://www.efficios.com
>
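
For reference, roughly what I mean (completely untested sketch; the
may_fault argument is hypothetical, the plumbing through the
bpf_trace_run##x() wrappers and __BPF_DECLARE_TRACE() is omitted, and
the run_ctx handling in the middle of the function stays exactly as it
is today):

static __always_inline void
__bpf_trace_run(struct bpf_raw_tp_link *link, u64 *args, bool may_fault)
{
        struct bpf_prog *prog = link->link.prog;

        if (may_fault) {
                /* Faultable tracepoint: we may be invoked preemptible. */
                might_fault();
                /* Preserve bpf's current preempt-off expectations for now. */
                preempt_disable_notrace();
        } else {
                cant_sleep();
        }

        if (unlikely(this_cpu_inc_return(*(prog->active)) != 1)) {
                bpf_prog_inc_misses_counter(prog);
                goto out;
        }

        /* ... run_ctx setup and bpf_prog_run(prog, args), unchanged ... */

out:
        this_cpu_dec(*(prog->active));

        if (may_fault)
                preempt_enable_notrace();
}

That keeps the preempt-off expectation you mention for the this_cpu_*()
accesses, and if those get audited later, the preempt_disable_notrace()
pair can become migrate_disable()/migrate_enable() in one place, without
touching the tracepoint macros again.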