On Thu, Apr 28, 2022 at 9:54 AM Delyan Kratunov <delyank@xxxxxx> wrote:
>
> uprobes work by raising a trap, setting a task flag from within the
> interrupt handler, and processing the actual work for the uprobe on
> the way back to userspace. As a result, uprobe handlers already
> execute in a user context. The primary obstacle to sleepable bpf
> uprobe programs is therefore on the bpf side.
>
> Namely, the bpf_prog_array attached to the uprobe is protected by
> normal rcu and runs with preemption disabled. In order for uprobe bpf
> programs to become actually sleepable, we need it to be protected by
> the tasks_trace rcu flavor instead (and kfree() called after a
> corresponding grace period).
>
> One way to achieve this is by tracking an
> array-has-contained-sleepable-prog flag in bpf_prog_array and
> switching rcu flavors based on it. However, this is deemed somewhat
> unwieldy and the rcu flavor transition would be hard to reason about.
>
> Instead, based on Alexei's proposal, we change the free path for
> bpf_prog_array to chain a tasks_trace and a normal grace period, one
> after the other. Users who iterate under a tasks_trace read section
> would be safe, as would users who iterate under normal read sections
> (from non-sleepable locations). The downside is that we take the
> tasks_trace latency unconditionally, but that's deemed acceptable
> under expected workloads.

One example where this actually can become a problem is cgroup BPF
programs. There you can make a single attachment to the root cgroup,
but it will create one "effective" prog_array for each descendant (and
will keep destroying and creating them as child cgroups are created).
So there is this invisible multiplier which we can't really control.

So paying the (however small) price of call_rcu_tasks_trace() in
bpf_prog_array_free(), which is used for purely non-sleepable cases,
seems unfortunate. But I think an alternative to tracking this "has
sleepable" bit on a per-bpf_prog_array basis is to have a
bpf_prog_array_free_sleepable() implementation independent of
bpf_prog_array_free() itself and call that sleepable variant from the
uprobe detach handler, limiting the impact to things that actually
might be running as sleepable and which most likely won't churn
through a huge number of arrays. WDYT? There's a rough sketch of what
I have in mind at the bottom of this mail.

Otherwise this all looks good and surprisingly straightforward, thanks
to the fact that uprobes already run in a sleepable context. Awesome!

>
> The other interesting implication is wrt non-sleepable uprobe
> programs. Because they need access to dynamically sized rcu-protected
> maps, we conditionally disable preemption and take an rcu read
> section around them, in addition to the overarching tasks_trace
> section.
>
> Signed-off-by: Delyan Kratunov <delyank@xxxxxx>
> ---
>  include/linux/bpf.h          | 60 ++++++++++++++++++++++++++++++++++++
>  include/linux/trace_events.h |  1 +
>  kernel/bpf/core.c            | 10 +++++-
>  kernel/trace/bpf_trace.c     | 23 ++++++++++++++
>  kernel/trace/trace_uprobe.c  |  4 +--
>  5 files changed, 94 insertions(+), 4 deletions(-)
>

[...]
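
To make the suggestion above concrete, here's a rough, untested sketch
of the split I have in mind (not the actual patch; helper names are
just for illustration, and it assumes bpf_prog_array keeps its embedded
rcu_head and the existing bpf_prog_array_free() stays as it is today):

/* kernel/bpf/core.c -- sketch only */

/* non-sleepable path stays unchanged: a single normal RCU grace period */
void bpf_prog_array_free(struct bpf_prog_array *progs)
{
	if (!progs || progs == &bpf_empty_prog_array.hdr)
		return;
	kfree_rcu(progs, rcu);
}

/* stage 2: after the tasks_trace grace period has elapsed, chain a
 * normal RCU grace period before the array is actually freed
 */
static void __bpf_prog_array_free_sleepable_cb(struct rcu_head *rcu)
{
	struct bpf_prog_array *progs;

	progs = container_of(rcu, struct bpf_prog_array, rcu);
	kfree_rcu(progs, rcu);
}

/* sleepable variant: tasks_trace grace period first, then normal RCU */
void bpf_prog_array_free_sleepable(struct bpf_prog_array *progs)
{
	if (!progs || progs == &bpf_empty_prog_array.hdr)
		return;
	call_rcu_tasks_trace(&progs->rcu, __bpf_prog_array_free_sleepable_cb);
}

The uprobe detach path in kernel/trace/trace_uprobe.c would then call
bpf_prog_array_free_sleepable() instead of bpf_prog_array_free() (plus
a declaration in include/linux/bpf.h), so every other user of
prog_arrays keeps paying only the normal RCU cost.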