On Mon, Sep 9, 2024 at 8:11 AM Mathieu Desnoyers <mathieu.desnoyers@xxxxxxxxxxxx> wrote:
>
> On 2024-09-04 21:21, Andrii Nakryiko wrote:
> > On Wed, Aug 28, 2024 at 7:42 AM Mathieu Desnoyers
> > <mathieu.desnoyers@xxxxxxxxxxxx> wrote:
> >>
> >> In preparation for converting system call enter/exit instrumentation
> >> into faultable tracepoints, make sure that bpf can handle registering to
> >> such tracepoints by explicitly disabling preemption within the bpf
> >> tracepoint probes to respect the current expectations within bpf tracing
> >> code.
> >>
> >> This change does not yet allow bpf to take page faults per se within its
> >> probe, but allows its existing probes to connect to faultable
> >> tracepoints.
> >>
> >> Link: https://lore.kernel.org/lkml/20231002202531.3160-1-mathieu.desnoyers@xxxxxxxxxxxx/
> >> Co-developed-by: Michael Jeanson <mjeanson@xxxxxxxxxxxx>
> >> Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@xxxxxxxxxxxx>
> >> Signed-off-by: Michael Jeanson <mjeanson@xxxxxxxxxxxx>
> >> Reviewed-by: Masami Hiramatsu (Google) <mhiramat@xxxxxxxxxx>
> >> Cc: Steven Rostedt <rostedt@xxxxxxxxxxx>
> >> Cc: Masami Hiramatsu <mhiramat@xxxxxxxxxx>
> >> Cc: Peter Zijlstra <peterz@xxxxxxxxxxxxx>
> >> Cc: Alexei Starovoitov <ast@xxxxxxxxxx>
> >> Cc: Yonghong Song <yhs@xxxxxx>
> >> Cc: Paul E. McKenney <paulmck@xxxxxxxxxx>
> >> Cc: Ingo Molnar <mingo@xxxxxxxxxx>
> >> Cc: Arnaldo Carvalho de Melo <acme@xxxxxxxxxx>
> >> Cc: Mark Rutland <mark.rutland@xxxxxxx>
> >> Cc: Alexander Shishkin <alexander.shishkin@xxxxxxxxxxxxxxx>
> >> Cc: Namhyung Kim <namhyung@xxxxxxxxxx>
> >> Cc: bpf@xxxxxxxxxxxxxxx
> >> Cc: Joel Fernandes <joel@xxxxxxxxxxxxxxxxx>
> >> ---
> >> Changes since v4:
> >> - Use DEFINE_INACTIVE_GUARD.
> >> - Add brackets to multiline 'if' statements.
> >> Changes since v5:
> >> - Rebased on v6.11-rc5.
> >> - Pass the TRACEPOINT_MAY_FAULT flag directly to tracepoint_probe_register_prio_flags.
> >> ---
> >>  include/trace/bpf_probe.h | 21 ++++++++++++++++-----
> >>  kernel/trace/bpf_trace.c  |  2 +-
> >>  2 files changed, 17 insertions(+), 6 deletions(-)
> >>
> >> diff --git a/include/trace/bpf_probe.h b/include/trace/bpf_probe.h
> >> index a2ea11cc912e..cc96dd1e7c3d 100644
> >> --- a/include/trace/bpf_probe.h
> >> +++ b/include/trace/bpf_probe.h
> >> @@ -42,16 +42,27 @@
> >>  /* tracepoints with more than 12 arguments will hit build error */
> >>  #define CAST_TO_U64(...) CONCATENATE(__CAST, COUNT_ARGS(__VA_ARGS__))(__VA_ARGS__)
> >>
> >> -#define __BPF_DECLARE_TRACE(call, proto, args) \
> >> +#define __BPF_DECLARE_TRACE(call, proto, args, tp_flags) \
> >>  static notrace void \
> >>  __bpf_trace_##call(void *__data, proto) \
> >>  { \
> >> -        CONCATENATE(bpf_trace_run, COUNT_ARGS(args))(__data, CAST_TO_U64(args)); \
> >> +        DEFINE_INACTIVE_GUARD(preempt_notrace, bpf_trace_guard); \
> >> +        \
> >> +        if ((tp_flags) & TRACEPOINT_MAY_FAULT) { \
> >> +                might_fault(); \
> >> +                activate_guard(preempt_notrace, bpf_trace_guard)(); \
> >> +        } \
> >> +        \
> >> +        CONCATENATE(bpf_trace_run, COUNT_ARGS(args))(__data, CAST_TO_U64(args)); \
> >>  }
> >>
> >>  #undef DECLARE_EVENT_CLASS
> >>  #define DECLARE_EVENT_CLASS(call, proto, args, tstruct, assign, print) \
> >> -        __BPF_DECLARE_TRACE(call, PARAMS(proto), PARAMS(args))
> >> +        __BPF_DECLARE_TRACE(call, PARAMS(proto), PARAMS(args), 0)
> >> +
> >> +#undef DECLARE_EVENT_CLASS_MAY_FAULT
> >> +#define DECLARE_EVENT_CLASS_MAY_FAULT(call, proto, args, tstruct, assign, print) \
> >> +        __BPF_DECLARE_TRACE(call, PARAMS(proto), PARAMS(args), TRACEPOINT_MAY_FAULT)
> >>
> >>  /*
> >>   * This part is compiled out, it is only here as a build time check
> >> @@ -105,13 +116,13 @@ static inline void bpf_test_buffer_##call(void) \
> >>
> >>  #undef DECLARE_TRACE
> >>  #define DECLARE_TRACE(call, proto, args) \
> >> -        __BPF_DECLARE_TRACE(call, PARAMS(proto), PARAMS(args)) \
> >> +        __BPF_DECLARE_TRACE(call, PARAMS(proto), PARAMS(args), 0) \
> >>          __DEFINE_EVENT(call, call, PARAMS(proto), PARAMS(args), 0)
> >>
> >>  #undef DECLARE_TRACE_WRITABLE
> >>  #define DECLARE_TRACE_WRITABLE(call, proto, args, size) \
> >>          __CHECK_WRITABLE_BUF_SIZE(call, PARAMS(proto), PARAMS(args), size) \
> >> -        __BPF_DECLARE_TRACE(call, PARAMS(proto), PARAMS(args)) \
> >> +        __BPF_DECLARE_TRACE(call, PARAMS(proto), PARAMS(args), 0) \
> >>          __DEFINE_EVENT(call, call, PARAMS(proto), PARAMS(args), size)
> >>
> >>  #include TRACE_INCLUDE(TRACE_INCLUDE_FILE)
> >> diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
> >> index c77eb80cbd7f..ed07283d505b 100644
> >> --- a/kernel/trace/bpf_trace.c
> >> +++ b/kernel/trace/bpf_trace.c
> >> @@ -2473,7 +2473,7 @@ int bpf_probe_register(struct bpf_raw_event_map *btp, struct bpf_raw_tp_link *li
> >>
> >>          return tracepoint_probe_register_prio_flags(tp, (void *)btp->bpf_func,
> >>                                                      link, TRACEPOINT_DEFAULT_PRIO,
> >> -                                                    TRACEPOINT_MAY_EXIST);
> >> +                                                    TRACEPOINT_MAY_EXIST | (tp->flags & TRACEPOINT_MAY_FAULT));
> >>  }
> >>
> >>  int bpf_probe_unregister(struct bpf_raw_event_map *btp, struct bpf_raw_tp_link *link)
> >> --
> >> 2.39.2
> >>
> >>
> >
> > I wonder if it would be better to just do this, instead of that
> > preempt guard. I think we don't strictly need preemption to be
> > disabled, we just need to stay on the same CPU, just like we do that
> > for many other program types.
>
> I'm worried about introducing any kind of subtle synchronization
> change in this series, and moving from preempt-off to migrate-disable
> definitely falls under that umbrella.
>
> I would recommend auditing all uses of this_cpu_*() APIs to make sure
> accesses to per-cpu data structures are using atomics and not just using
> operations that expect use of preempt-off to prevent concurrent threads
> from updating to the per-cpu data concurrently.
>
> So what you are suggesting may be a good idea, but I prefer to leave
> this kind of change to a separate bpf-specific series, and I would
> leave this work to someone who knows more about ebpf than me.
>

Yeah, that's ok. Switching to migrate_disable() is probably going a bit
too far too fast, but I think we should just add
preempt_disable()/preempt_enable() inside __bpf_trace_run() instead of
leaving it inside those hard-to-find-and-follow tracepoint macros. So
maybe you can just pass a bool into __bpf_trace_run() and do the preempt
guard (or an explicit disable/enable pair) there? Rough, untested sketch
at the bottom of this mail.

> Thanks,
>
> Mathieu
>
> >
> > We'll need some more BPF-specific plumbing to fully support faultable
> > (sleepable) tracepoints, but this should unblock your work, unless I'm
> > missing something. And we can take it from there, once your patches
> > land, to take advantage of faultable tracepoints in the BPF ecosystem.
> >
> > diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
> > index b69a39316c0c..415639b7c7a4 100644
> > --- a/kernel/trace/bpf_trace.c
> > +++ b/kernel/trace/bpf_trace.c
> > @@ -2302,7 +2302,8 @@ void __bpf_trace_run(struct bpf_raw_tp_link *link, u64 *args)
> >          struct bpf_run_ctx *old_run_ctx;
> >          struct bpf_trace_run_ctx run_ctx;
> >
> > -        cant_sleep();
> > +        migrate_disable();
> > +
> >          if (unlikely(this_cpu_inc_return(*(prog->active)) != 1)) {
> >                  bpf_prog_inc_misses_counter(prog);
> >                  goto out;
> > @@ -2318,6 +2319,8 @@ void __bpf_trace_run(struct bpf_raw_tp_link *link, u64 *args)
> >          bpf_reset_run_ctx(old_run_ctx);
> >  out:
> >          this_cpu_dec(*(prog->active));
> > +
> > +        migrate_enable();
> >  }
>
> --
> Mathieu Desnoyers
> EfficiOS Inc.
> https://www.efficios.com
>
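
For reference, roughly what I mean (completely untested sketch; the
may_fault argument is hypothetical, the plumbing through the
bpf_trace_run##x() wrappers and __BPF_DECLARE_TRACE() is omitted, and
the run_ctx handling in the middle of the function stays exactly as it
is today):

static __always_inline void
__bpf_trace_run(struct bpf_raw_tp_link *link, u64 *args, bool may_fault)
{
        struct bpf_prog *prog = link->link.prog;

        if (may_fault) {
                /* Faultable tracepoint: we may be invoked preemptible. */
                might_fault();
                /* Preserve bpf's current preempt-off expectations for now. */
                preempt_disable_notrace();
        } else {
                cant_sleep();
        }

        if (unlikely(this_cpu_inc_return(*(prog->active)) != 1)) {
                bpf_prog_inc_misses_counter(prog);
                goto out;
        }

        /* ... run_ctx setup and bpf_prog_run(prog, args), unchanged ... */

out:
        this_cpu_dec(*(prog->active));

        if (may_fault)
                preempt_enable_notrace();
}

That keeps the preempt-off expectation you mention for the this_cpu_*()
accesses, and if those get audited later, the preempt_disable_notrace()
pair can become migrate_disable()/migrate_enable() in one place, without
touching the tracepoint macros again.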