On 2024-12-06 12:07, Steven Rostedt wrote:
Hi Mathieu,
Hi Steven,
Sebastian brought up a point at our RT Stable meeting. BPF hooks into tracepoints and can cause long latency on RT setups.
Indeed, as expected if BPF don't have BPF hook duration validations in place.
IIRC, tracepoints themselves do not need to have preemption disabled. It's just that some of the users of tracepoints expect preemption to be disabled.
Correct. Tracepoints need to have some mean of synchronizing callback iteration with the callback registration/unregistration (RCU). Which flavor is used is based on the constraints of the execution contexts in which tracepoints are inserted. Then the fact that tracer probe functions expect that preemption is disabled when called is merely a consequence of the current tracepoint implementation, but this contract between tracepoints and tracers can evolve as needed.
If we fix the users of tracepoints not to expect preemption to be disabled, then we could just switch the preempt_disable code (guard(preempt)) to rcu_read_lock()s for the tracepoint callbacks, right?
There are a few things to consider here about the constraints of the callsites where the tracepoints are inserted. In general, those need to be: - NMI-safe - notrace - usable from the scheduler (with rq lock held) - usable to trace the RCU implementation Hence the use of guard(preempt_notrace)(). So replacing this by a rcu_read_lock() would lose the "notrace", which may break some users. Other than that, I see that the PREEMPT_RCU implementation of rcu_read_lock/unlock works pretty much similarly to the urcu-mb flavor of liburcu: static void rcu_preempt_read_enter(void) { WRITE_ONCE(current->rcu_read_lock_nesting, READ_ONCE(current->rcu_read_lock_nesting) + 1); } static int rcu_preempt_read_exit(void) { int ret = READ_ONCE(current->rcu_read_lock_nesting) - 1; WRITE_ONCE(current->rcu_read_lock_nesting, ret); return ret; } Technically this was designed to be async-signal safe in userspace, so I expect this to work in NMI context. I suspect that the main thing we may be missing here is a rcu_read_lock/unlock_notrace that similarly to our use of preempt_disable/enable_notrace don't call into instrumented code from the instrumentation.
There's a one or two places in ftrace that expect it, but I don't know enough about perf. I don't think BPF needs preemption disabled, but just migration disabled. I know you had some patches to work around this.
Correct, BPF needs migration disabled AFAIU. Perf/ftrace/lttng would have to explicitly disable preemption within their callbacks, but that's easily fixable.
We need to get BPF working without preemption disabled for RT, I'm not sure how much you know about what needs to be fixed.
Well the first step would be to introduce a rcu_read_lock/unlock_notrace. This solves the problem at the tracepoint level, but requires that we initially move the preempt disable to the tracer callbacks. Then we can figure out within each tracer what needs to be done to further reduce the preempt off critical section.
I'm not asking for you to do this work, but can you remind me what you saw when you created the faultable tracepoints?
I saw the future! ;-) Thanks, Mathieu
Thanks, -- Steve
-- Mathieu Desnoyers EfficiOS Inc. https://www.efficios.com