On Mon, Mar 25, 2024 at 7:59 PM Andrii Nakryiko <andrii.nakryiko@xxxxxxxxx> wrote: > > On Mon, Mar 25, 2024 at 7:16 PM Alexei Starovoitov > <alexei.starovoitov@xxxxxxxxx> wrote: > > > > On Mon, Mar 25, 2024 at 2:51 PM Andrii Nakryiko <andrii@xxxxxxxxxx> wrote: > > > > > > BPF link for some program types is passed as a "context" which can be > > > used by those BPF programs to look up additional information. E.g., for > > > BPF raw tracepoints, link is used to fetch BPF cookie value, similarly > > > for BPF multi-kprobes and multi-uprobes. > > > > > > Because of this runtime dependency, when bpf_link refcnt drops to zero > > > that could be still active BPF programs running accessing link data > > > (cookie, program pointer, etc). > > > > > > This patch accommodates this by delaying freeing memory to after RCU GP, > > > which will fix BPF raw tp, multi-kprobe, and non-sleepable multi-uprobe. > > > > > > Perhaps a better approach would be to have a per-link flag specifying > > > desired behavior: no delay, RCU delay, or task_trace RCU delay? So > > > sending this as an RFC fix to discuss desired final solution. > > > > > > Fixes: d4dfc5700e86 ("bpf: pass whole link instead of prog when triggering raw tracepoint") > > > Reported-by: syzbot+981935d9485a560bfbcb@xxxxxxxxxxxxxxxxxxxxxxxxx > > > Reported-by: syzbot+2cb5a6c573e98db598cc@xxxxxxxxxxxxxxxxxxxxxxxxx > > > Reported-by: syzbot+62d8b26793e8a2bd0516@xxxxxxxxxxxxxxxxxxxxxxxxx > > > Signed-off-by: Andrii Nakryiko <andrii@xxxxxxxxxx> > > > --- > > > include/linux/bpf.h | 8 +++++++- > > > kernel/bpf/syscall.c | 12 ++++++++++-- > > > 2 files changed, 17 insertions(+), 3 deletions(-) > > > > > > diff --git a/include/linux/bpf.h b/include/linux/bpf.h > > > index 62762390c93d..d73a8978c800 100644 > > > --- a/include/linux/bpf.h > > > +++ b/include/linux/bpf.h > > > @@ -1573,7 +1573,13 @@ struct bpf_link { > > > enum bpf_link_type type; > > > const struct bpf_link_ops *ops; > > > struct bpf_prog *prog; > > > - struct work_struct work; > > > + /* rcu is used before freeing, work can be used to schedule that > > > + * RCU-based freeing before that, so they never overlap > > > + */ > > > + union { > > > + struct rcu_head rcu; > > > + struct work_struct work; > > > + }; > > > }; > > > > > > struct bpf_link_ops { > > > diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c > > > index e44c276e8617..af1591af10bb 100644 > > > --- a/kernel/bpf/syscall.c > > > +++ b/kernel/bpf/syscall.c > > > @@ -3024,6 +3024,14 @@ void bpf_link_inc(struct bpf_link *link) > > > atomic64_inc(&link->refcnt); > > > } > > > > > > +static void bpf_link_dealloc_deferred(struct rcu_head *rcu) > > > +{ > > > + struct bpf_link *link = container_of(rcu, struct bpf_link, rcu); > > > + > > > + /* free bpf_link and its containing memory */ > > > + link->ops->dealloc(link); > > > +} > > > + > > > /* bpf_link_free is guaranteed to be called from process context */ > > > static void bpf_link_free(struct bpf_link *link) > > > { > > > @@ -3033,8 +3041,8 @@ static void bpf_link_free(struct bpf_link *link) > > > link->ops->release(link); > > > bpf_prog_put(link->prog); > > > } > > > - /* free bpf_link and its containing memory */ > > > - link->ops->dealloc(link); > > > + /* schedule BPF link deallocation after RCU grace period */ > > > + call_rcu(&link->rcu, bpf_link_dealloc_deferred); > > > > Do we have any sleepable progs that access link as well? > > If so then call_rcu_tasks_trace ? > > yep, multi-uprobe, can be sleepable > > > and check rcu_trace_implies_rcu_gp to avoid extra call_rcu. > > yep, saw that for maps > > > > > Doing that for all link types like networking that don't look > > into link is not great. Probably needs to look into link type? > > Or look into prog->type and do the extra work only for raw_tp, > > multi-[ku]probe ? > > so I was thinking that bpf_link_init() can accept an enum or flags to > specify what needs to be done, instead of trying to guess/derive that. > That seems less error-prone, more explicit, easier to audit. WDYT? yeah. makes sense. > This should work ok given this is normally dependent on program type, > the only exception (that we currently don't have) would be if > LINK_UPDATE command replaces non-sleepable BPF program (say uprobe) > with sleepable one. But we don't support that for kprobe/uprobe and > other tracing programs, so it's a future problem (but we'll be able to > support it, when we get to this). that would be weird. future problem indeed.