Re: [PATCH bpf-next 2/5] bpf: implement sleepable uprobes by chaining tasks and normal rcu

Delyan Kratunov <delyank@xxxxxx> · Mon, 2 May 2022 17:31:01 +0000

On Thu, 2022-04-28 at 15:52 -0700, Alexei Starovoitov wrote:
> On Thu, Apr 28, 2022 at 2:35 PM Delyan Kratunov <delyank@xxxxxx> wrote:
> > 
> > On Thu, 2022-04-28 at 13:58 -0700, Alexei Starovoitov wrote:
> > > On Thu, Apr 28, 2022 at 12:15 PM Delyan Kratunov <delyank@xxxxxx> wrote:
> > > > 
> > > > On Thu, 2022-04-28 at 11:19 -0700, Andrii Nakryiko wrote:
> > > > > On Thu, Apr 28, 2022 at 9:54 AM Delyan Kratunov <delyank@xxxxxx> wrote:
> > > > > > 
> > > > > > uprobes work by raising a trap, setting a task flag from within the
> > > > > > interrupt handler, and processing the actual work for the uprobe on the
> > > > > > way back to userspace. As a result, uprobe handlers already execute in a
> > > > > > user context. The primary obstacle to sleepable bpf uprobe programs is
> > > > > > therefore on the bpf side.
> > > > > > 
> > > > > > Namely, the bpf_prog_array attached to the uprobe is protected by normal
> > > > > > rcu and runs with disabled preemption. In order for uprobe bpf programs
> > > > > > to become actually sleepable, we need it to be protected by the tasks_trace
> > > > > > rcu flavor instead (and kfree() called after a corresponding grace period).
> > > > > > 
> > > > > > One way to achieve this is by tracking an array-has-contained-sleepable-prog
> > > > > > flag in bpf_prog_array and switching rcu flavors based on it. However, this
> > > > > > is deemed somewhat unwieldly and the rcu flavor transition would be hard
> > > > > > to reason about.
> > > > > > 
> > > > > > Instead, based on Alexei's proposal, we change the free path for
> > > > > > bpf_prog_array to chain a tasks_trace and normal grace periods
> > > > > > one after the other. Users who iterate under tasks_trace read section would
> > > > > > be safe, as would users who iterate under normal read sections (from
> > > > > > non-sleepable locations). The downside is that we take the tasks_trace latency
> > > > > > unconditionally but that's deemed acceptable under expected workloads.
> > > > > 
> > > > > One example where this actually can become a problem is cgroup BPF
> > > > > programs. There you can make single attachment to root cgroup, but it
> > > > > will create one "effective" prog_array for each descendant (and will
> > > > > keep destroying and creating them as child cgroups are created). So
> > > > > there is this invisible multiplier which we can't really control.
> > > > > 
> > > > > So paying the (however small, but) price of call_rcu_tasks_trace() in
> > > > > bpf_prog_array_free() which is used for purely non-sleepable cases
> > > > > seems unfortunate. But I think an alternative to tracking this "has
> > > > > sleepable" bit on a per bpf_prog_array case is to have
> > > > > bpf_prog_array_free_sleepable() implementation independent of
> > > > > bpf_prog_array_free() itself and call that sleepable variant from
> > > > > uprobe detach handler, limiting the impact to things that actually
> > > > > might be running as sleepable and which most likely won't churn
> > > > > through a huge amount of arrays. WDYT?
> > > > 
> > > > Honestly, I don't like the idea of having two different APIs, where if you use the
> > > > wrong one, the program would only fail in rare and undebuggable circumstances.
> > > > 
> > > > If we need specialization (and I'm not convinced we do - what's the rate of cgroup
> > > > creation that we can sustain?), we should track that in the bpf_prog_array itself. We
> > > > can have the allocation path set a flag and branch on it in free() to determine the
> > > > necessary grace periods.
> > > 
> > > I think what Andrii is proposing is to leave bpf_prog_array_free() as-is
> > > and introduce new bpf_prog_array_free_sleepable() (the way it is in
> > > this patch) and call it from 2 places in kernel/trace/bpf_trace.c only.
> > > This way cgroup won't be affected.
> > > 
> > > The rcu_trace protection here applies to prog_array itself.
> > > Normal progs are still rcu, sleepable progs are rcu_trace.
> > > Regardless whether they're called via trampoline or this new prog_array.
> > 
> > Right, I understand the proposal. My objection is that if tomorrow someone mistakenly
> > keeps using bpf_prog_array_free when they actually need
> > bpf_prog_array_free_sleepable, the resulting bug is really difficult to find and
> > reason about. I can make it correct in this patch series but *keeping* things correct
> > going forward will be harder when there's two free variants.
> 
> This kind of mistakes code reviews are supposed to catch.

You definitely trust code review more than I do! :) 

> > Instead, we can have a ARRAY_USE_TRACE_RCU flag which automatically chains the grace
> > periods inside bpf_prog_array_free. That way we eliminate potential future confusion
> > and instead of introducing subtle rcu bugs, at worst you can incur a performance
> > penalty by using the flag where you don't need it. If we spend the time to one-way
> > flip the flag only when you actually insert a sleepable program into the array, even
> > that penalty is eliminated.
> 
> Are you suggesting to add such flag automatically when
> sleepable bpf progs are added to prog_array?
> That would not be correct.
> Presence of sleepable progs has nothing to do with prog_array itself.
> The bpf_prog_array_free[_sleepable]() frees that array.
> Not the progs inside it.

The only reason the array is under the tasks_trace rcu is the presence of sleepable
programs inside, so I'd say it is related. I'm reasonably certain we could
orchestrate the transition from one rcu flavor to the other safely. Alternatively, we
can have two rcu_heads and only take the tasks_trace one conditionally.

> If you mean adding this flag when prog_array is allocated
> for uprobe attachment then I don't see how it helps code reviews.
> Nothing automatic here. It's one place and single purpose flag.
> Instead of dynamically checking it in free.
> We can directly call the correct function.

Sure, I'll send a v2 with free_sleepable later today.