Re: [PATCH RFC bpf-next] bpf: defer bpf_link dealloc to after RCU grace period

Alexei Starovoitov <alexei.starovoitov@xxxxxxxxx> · Mon, 25 Mar 2024 20:15:08 -0700

On Mon, Mar 25, 2024 at 7:59 PM Andrii Nakryiko
<andrii.nakryiko@xxxxxxxxx> wrote:
>
> On Mon, Mar 25, 2024 at 7:16 PM Alexei Starovoitov
> <alexei.starovoitov@xxxxxxxxx> wrote:
> >
> > On Mon, Mar 25, 2024 at 2:51 PM Andrii Nakryiko <andrii@xxxxxxxxxx> wrote:
> > >
> > > BPF link for some program types is passed as a "context" which can be
> > > used by those BPF programs to look up additional information. E.g., for
> > > BPF raw tracepoints, link is used to fetch BPF cookie value, similarly
> > > for BPF multi-kprobes and multi-uprobes.
> > >
> > > Because of this runtime dependency, when bpf_link refcnt drops to zero
> > > that could be still active BPF programs running accessing link data
> > > (cookie, program pointer, etc).
> > >
> > > This patch accommodates this by delaying freeing memory to after RCU GP,
> > > which will fix BPF raw tp, multi-kprobe, and non-sleepable multi-uprobe.
> > >
> > > Perhaps a better approach would be to have a per-link flag specifying
> > > desired behavior: no delay, RCU delay, or task_trace RCU delay? So
> > > sending this as an RFC fix to discuss desired final solution.
> > >
> > > Fixes: d4dfc5700e86 ("bpf: pass whole link instead of prog when triggering raw tracepoint")
> > > Reported-by: syzbot+981935d9485a560bfbcb@xxxxxxxxxxxxxxxxxxxxxxxxx
> > > Reported-by: syzbot+2cb5a6c573e98db598cc@xxxxxxxxxxxxxxxxxxxxxxxxx
> > > Reported-by: syzbot+62d8b26793e8a2bd0516@xxxxxxxxxxxxxxxxxxxxxxxxx
> > > Signed-off-by: Andrii Nakryiko <andrii@xxxxxxxxxx>
> > > ---
> > >  include/linux/bpf.h  |  8 +++++++-
> > >  kernel/bpf/syscall.c | 12 ++++++++++--
> > >  2 files changed, 17 insertions(+), 3 deletions(-)
> > >
> > > diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> > > index 62762390c93d..d73a8978c800 100644
> > > --- a/include/linux/bpf.h
> > > +++ b/include/linux/bpf.h
> > > @@ -1573,7 +1573,13 @@ struct bpf_link {
> > >         enum bpf_link_type type;
> > >         const struct bpf_link_ops *ops;
> > >         struct bpf_prog *prog;
> > > -       struct work_struct work;
> > > +       /* rcu is used before freeing, work can be used to schedule that
> > > +        * RCU-based freeing before that, so they never overlap
> > > +        */
> > > +       union {
> > > +               struct rcu_head rcu;
> > > +               struct work_struct work;
> > > +       };
> > >  };
> > >
> > >  struct bpf_link_ops {
> > > diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
> > > index e44c276e8617..af1591af10bb 100644
> > > --- a/kernel/bpf/syscall.c
> > > +++ b/kernel/bpf/syscall.c
> > > @@ -3024,6 +3024,14 @@ void bpf_link_inc(struct bpf_link *link)
> > >         atomic64_inc(&link->refcnt);
> > >  }
> > >
> > > +static void bpf_link_dealloc_deferred(struct rcu_head *rcu)
> > > +{
> > > +       struct bpf_link *link = container_of(rcu, struct bpf_link, rcu);
> > > +
> > > +       /* free bpf_link and its containing memory */
> > > +       link->ops->dealloc(link);
> > > +}
> > > +
> > >  /* bpf_link_free is guaranteed to be called from process context */
> > >  static void bpf_link_free(struct bpf_link *link)
> > >  {
> > > @@ -3033,8 +3041,8 @@ static void bpf_link_free(struct bpf_link *link)
> > >                 link->ops->release(link);
> > >                 bpf_prog_put(link->prog);
> > >         }
> > > -       /* free bpf_link and its containing memory */
> > > -       link->ops->dealloc(link);
> > > +       /* schedule BPF link deallocation after RCU grace period */
> > > +       call_rcu(&link->rcu, bpf_link_dealloc_deferred);
> >
> > Do we have any sleepable progs that access link as well?
> > If so then call_rcu_tasks_trace ?
>
> yep, multi-uprobe, can be sleepable
>
> > and check rcu_trace_implies_rcu_gp to avoid extra call_rcu.
>
> yep, saw that for maps
>
> >
> > Doing that for all link types like networking that don't look
> > into link is not great. Probably needs to look into link type?
> > Or look into prog->type and do the extra work only for raw_tp,
> > multi-[ku]probe ?
>
> so I was thinking that bpf_link_init() can accept an enum or flags to
> specify what needs to be done, instead of trying to guess/derive that.
> That seems less error-prone, more explicit, easier to audit. WDYT?

yeah. makes sense.

> This should work ok given this is normally dependent on program type,
> the only exception (that we currently don't have) would be if
> LINK_UPDATE command replaces non-sleepable BPF program (say uprobe)
> with sleepable one. But we don't support that for kprobe/uprobe and
> other tracing programs, so it's a future problem (but we'll be able to
> support it, when we get to this).

that would be weird. future problem indeed.