Re: [PATCH bpf-next 2/5] bpf: implement sleepable uprobes by chaining tasks and normal rcu

Alexei Starovoitov <alexei.starovoitov@xxxxxxxxx> · Thu, 28 Apr 2022 15:52:56 -0700

On Thu, Apr 28, 2022 at 2:35 PM Delyan Kratunov <delyank@xxxxxx> wrote:
>
> On Thu, 2022-04-28 at 13:58 -0700, Alexei Starovoitov wrote:
> > On Thu, Apr 28, 2022 at 12:15 PM Delyan Kratunov <delyank@xxxxxx> wrote:
> > >
> > > On Thu, 2022-04-28 at 11:19 -0700, Andrii Nakryiko wrote:
> > > > On Thu, Apr 28, 2022 at 9:54 AM Delyan Kratunov <delyank@xxxxxx> wrote:
> > > > >
> > > > > uprobes work by raising a trap, setting a task flag from within the
> > > > > interrupt handler, and processing the actual work for the uprobe on the
> > > > > way back to userspace. As a result, uprobe handlers already execute in a
> > > > > user context. The primary obstacle to sleepable bpf uprobe programs is
> > > > > therefore on the bpf side.
> > > > >
> > > > > Namely, the bpf_prog_array attached to the uprobe is protected by normal
> > > > > rcu and runs with disabled preemption. In order for uprobe bpf programs
> > > > > to become actually sleepable, we need it to be protected by the tasks_trace
> > > > > rcu flavor instead (and kfree() called after a corresponding grace period).
> > > > >
> > > > > One way to achieve this is by tracking an array-has-contained-sleepable-prog
> > > > > flag in bpf_prog_array and switching rcu flavors based on it. However, this
> > > > > is deemed somewhat unwieldly and the rcu flavor transition would be hard
> > > > > to reason about.
> > > > >
> > > > > Instead, based on Alexei's proposal, we change the free path for
> > > > > bpf_prog_array to chain a tasks_trace and normal grace periods
> > > > > one after the other. Users who iterate under tasks_trace read section would
> > > > > be safe, as would users who iterate under normal read sections (from
> > > > > non-sleepable locations). The downside is that we take the tasks_trace latency
> > > > > unconditionally but that's deemed acceptable under expected workloads.
> > > >
> > > > One example where this actually can become a problem is cgroup BPF
> > > > programs. There you can make single attachment to root cgroup, but it
> > > > will create one "effective" prog_array for each descendant (and will
> > > > keep destroying and creating them as child cgroups are created). So
> > > > there is this invisible multiplier which we can't really control.
> > > >
> > > > So paying the (however small, but) price of call_rcu_tasks_trace() in
> > > > bpf_prog_array_free() which is used for purely non-sleepable cases
> > > > seems unfortunate. But I think an alternative to tracking this "has
> > > > sleepable" bit on a per bpf_prog_array case is to have
> > > > bpf_prog_array_free_sleepable() implementation independent of
> > > > bpf_prog_array_free() itself and call that sleepable variant from
> > > > uprobe detach handler, limiting the impact to things that actually
> > > > might be running as sleepable and which most likely won't churn
> > > > through a huge amount of arrays. WDYT?
> > >
> > > Honestly, I don't like the idea of having two different APIs, where if you use the
> > > wrong one, the program would only fail in rare and undebuggable circumstances.
> > >
> > > If we need specialization (and I'm not convinced we do - what's the rate of cgroup
> > > creation that we can sustain?), we should track that in the bpf_prog_array itself. We
> > > can have the allocation path set a flag and branch on it in free() to determine the
> > > necessary grace periods.
> >
> > I think what Andrii is proposing is to leave bpf_prog_array_free() as-is
> > and introduce new bpf_prog_array_free_sleepable() (the way it is in
> > this patch) and call it from 2 places in kernel/trace/bpf_trace.c only.
> > This way cgroup won't be affected.
> >
> > The rcu_trace protection here applies to prog_array itself.
> > Normal progs are still rcu, sleepable progs are rcu_trace.
> > Regardless whether they're called via trampoline or this new prog_array.
>
> Right, I understand the proposal. My objection is that if tomorrow someone mistakenly
> keeps using bpf_prog_array_free when they actually need
> bpf_prog_array_free_sleepable, the resulting bug is really difficult to find and
> reason about. I can make it correct in this patch series but *keeping* things correct
> going forward will be harder when there's two free variants.

This kind of mistakes code reviews are supposed to catch.

> Instead, we can have a ARRAY_USE_TRACE_RCU flag which automatically chains the grace
> periods inside bpf_prog_array_free. That way we eliminate potential future confusion
> and instead of introducing subtle rcu bugs, at worst you can incur a performance
> penalty by using the flag where you don't need it. If we spend the time to one-way
> flip the flag only when you actually insert a sleepable program into the array, even
> that penalty is eliminated.

Are you suggesting to add such flag automatically when
sleepable bpf progs are added to prog_array?
That would not be correct.
Presence of sleepable progs has nothing to do with prog_array itself.
The bpf_prog_array_free[_sleepable]() frees that array.
Not the progs inside it.

If you mean adding this flag when prog_array is allocated
for uprobe attachment then I don't see how it helps code reviews.
Nothing automatic here. It's one place and single purpose flag.
Instead of dynamically checking it in free.
We can directly call the correct function.