Re: [PATCH bpf-next 2/5] bpf: implement sleepable uprobes by chaining tasks and normal rcu

Delyan Kratunov <delyank@xxxxxx> · Thu, 28 Apr 2022 19:15:04 +0000

On Thu, 2022-04-28 at 11:19 -0700, Andrii Nakryiko wrote:
> On Thu, Apr 28, 2022 at 9:54 AM Delyan Kratunov <delyank@xxxxxx> wrote:
> > 
> > uprobes work by raising a trap, setting a task flag from within the
> > interrupt handler, and processing the actual work for the uprobe on the
> > way back to userspace. As a result, uprobe handlers already execute in a
> > user context. The primary obstacle to sleepable bpf uprobe programs is
> > therefore on the bpf side.
> > 
> > Namely, the bpf_prog_array attached to the uprobe is protected by normal
> > rcu and runs with disabled preemption. In order for uprobe bpf programs
> > to become actually sleepable, we need it to be protected by the tasks_trace
> > rcu flavor instead (and kfree() called after a corresponding grace period).
> > 
> > One way to achieve this is by tracking an array-has-contained-sleepable-prog
> > flag in bpf_prog_array and switching rcu flavors based on it. However, this
> > is deemed somewhat unwieldly and the rcu flavor transition would be hard
> > to reason about.
> > 
> > Instead, based on Alexei's proposal, we change the free path for
> > bpf_prog_array to chain a tasks_trace and normal grace periods
> > one after the other. Users who iterate under tasks_trace read section would
> > be safe, as would users who iterate under normal read sections (from
> > non-sleepable locations). The downside is that we take the tasks_trace latency
> > unconditionally but that's deemed acceptable under expected workloads.
> 
> One example where this actually can become a problem is cgroup BPF
> programs. There you can make single attachment to root cgroup, but it
> will create one "effective" prog_array for each descendant (and will
> keep destroying and creating them as child cgroups are created). So
> there is this invisible multiplier which we can't really control.
> 
> So paying the (however small, but) price of call_rcu_tasks_trace() in
> bpf_prog_array_free() which is used for purely non-sleepable cases
> seems unfortunate. But I think an alternative to tracking this "has
> sleepable" bit on a per bpf_prog_array case is to have
> bpf_prog_array_free_sleepable() implementation independent of
> bpf_prog_array_free() itself and call that sleepable variant from
> uprobe detach handler, limiting the impact to things that actually
> might be running as sleepable and which most likely won't churn
> through a huge amount of arrays. WDYT?

Honestly, I don't like the idea of having two different APIs, where if you use the
wrong one, the program would only fail in rare and undebuggable circumstances. 

If we need specialization (and I'm not convinced we do - what's the rate of cgroup
creation that we can sustain?), we should track that in the bpf_prog_array itself. We
can have the allocation path set a flag and branch on it in free() to determine the
necessary grace periods.

> 
> Otherwise all looks good and surprisingly straightforward thanks to
> the fact uprobe is already running in a sleepable context, awesome!
> 
> > 
> > The other interesting implication is wrt non-sleepable uprobe
> > programs. Because they need access to dynamically sized rcu-protected
> > maps, we conditionally disable preemption and take an rcu read section
> > around them, in addition to the overarching tasks_trace section.
> > 
> > Signed-off-by: Delyan Kratunov <delyank@xxxxxx>
> > ---
> >  include/linux/bpf.h          | 60 ++++++++++++++++++++++++++++++++++++
> >  include/linux/trace_events.h |  1 +
> >  kernel/bpf/core.c            | 10 +++++-
> >  kernel/trace/bpf_trace.c     | 23 ++++++++++++++
> >  kernel/trace/trace_uprobe.c  |  4 +--
> >  5 files changed, 94 insertions(+), 4 deletions(-)
> > 
> 
> [...]