On Thu, Apr 28, 2022 at 12:15 PM Delyan Kratunov <delyank@xxxxxx> wrote:
>
> On Thu, 2022-04-28 at 11:19 -0700, Andrii Nakryiko wrote:
> > On Thu, Apr 28, 2022 at 9:54 AM Delyan Kratunov <delyank@xxxxxx> wrote:
> > >
> > > uprobes work by raising a trap, setting a task flag from within the
> > > interrupt handler, and processing the actual work for the uprobe on
> > > the way back to userspace. As a result, uprobe handlers already
> > > execute in a user context. The primary obstacle to sleepable bpf
> > > uprobe programs is therefore on the bpf side.
> > >
> > > Namely, the bpf_prog_array attached to the uprobe is protected by
> > > normal rcu and runs with preemption disabled. In order for uprobe bpf
> > > programs to become actually sleepable, we need it to be protected by
> > > the tasks_trace rcu flavor instead (and kfree() called after a
> > > corresponding grace period).
> > >
> > > One way to achieve this is by tracking an
> > > array-has-contained-sleepable-prog flag in bpf_prog_array and
> > > switching rcu flavors based on it. However, this is deemed somewhat
> > > unwieldy, and the rcu flavor transition would be hard to reason about.
> > >
> > > Instead, based on Alexei's proposal, we change the free path for
> > > bpf_prog_array to chain a tasks_trace and a normal grace period, one
> > > after the other. Users who iterate under a tasks_trace read section
> > > would be safe, as would users who iterate under normal read sections
> > > (from non-sleepable locations). The downside is that we take the
> > > tasks_trace latency unconditionally, but that's deemed acceptable
> > > under expected workloads.
> >
> > One example where this actually can become a problem is cgroup BPF
> > programs. There you can make a single attachment to the root cgroup,
> > but it will create one "effective" prog_array for each descendant (and
> > will keep destroying and creating them as child cgroups are created).
> > So there is this invisible multiplier which we can't really control.
> >
> > So paying the (however small) price of call_rcu_tasks_trace() in
> > bpf_prog_array_free(), which is used for purely non-sleepable cases,
> > seems unfortunate. But I think an alternative to tracking this "has
> > sleepable" bit on a per-bpf_prog_array basis is to have the
> > bpf_prog_array_free_sleepable() implementation independent of
> > bpf_prog_array_free() itself and call that sleepable variant from the
> > uprobe detach handler, limiting the impact to things that actually
> > might be running as sleepable and which most likely won't churn
> > through a huge amount of arrays. WDYT?
>
> Honestly, I don't like the idea of having two different APIs, where if
> you use the wrong one, the program would only fail in rare and
> undebuggable circumstances.
>
> If we need specialization (and I'm not convinced we do - what's the rate
> of cgroup creation that we can sustain?), we should track that in the
> bpf_prog_array itself. We can have the allocation path set a flag and
> branch on it in free() to determine the necessary grace periods.

I think what Andrii is proposing is to leave bpf_prog_array_free()
as-is and introduce a new bpf_prog_array_free_sleepable()
(the way it is in this patch) and call it from 2 places
in kernel/trace/bpf_trace.c only.
This way cgroup won't be affected.

The rcu_trace protection here applies to the prog_array itself.
Normal progs are still rcu; sleepable progs are rcu_trace,
regardless of whether they're called via trampoline or this new prog_array.
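
For concreteness, a minimal sketch of what such a chained free path could
look like, relying on the rcu_head that struct bpf_prog_array already
embeds (the callback name is illustrative and the exact patch may differ,
e.g. in how it special-cases a shared empty array):

#include <linux/bpf.h>
#include <linux/rcupdate_trace.h>

/* Runs after the tasks_trace grace period, so sleepable iterators
 * are done with the array. Chain a normal RCU grace period before
 * the actual kfree() so that non-sleepable iterators under
 * rcu_read_lock() are covered as well. Reusing the same rcu_head
 * is fine: it is idle again once this callback is invoked.
 */
static void __bpf_prog_array_free_sleepable_cb(struct rcu_head *rcu)
{
	struct bpf_prog_array *progs;

	progs = container_of(rcu, struct bpf_prog_array, rcu);
	kfree_rcu(progs, rcu);
}

void bpf_prog_array_free_sleepable(struct bpf_prog_array *progs)
{
	if (!progs)
		return;
	/* First wait out tasks_trace readers (sleepable bpf progs)... */
	call_rcu_tasks_trace(&progs->rcu, __bpf_prog_array_free_sleepable_cb);
}

The uprobe detach paths in kernel/trace/bpf_trace.c would then call this
variant, while bpf_prog_array_free() keeps its existing single
grace period and cgroup stays unaffected.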