Re: [PATCH v2 bpf-next 3/7] bpf: Use bpf_mem_free_rcu when bpf_obj_dropping refcounted nodes

On Wed, Aug 23, 2023 at 6:38 PM Alexei Starovoitov
<alexei.starovoitov@xxxxxxxxx> wrote:
>
> On Wed, Aug 23, 2023 at 1:29 PM Yonghong Song <yonghong.song@xxxxxxxxx> wrote:
> >
> >
> >
> > On 8/23/23 9:20 AM, Alexei Starovoitov wrote:
> > > On Tue, Aug 22, 2023 at 11:26 PM Yonghong Song <yonghong.song@xxxxxxxxx> wrote:
> > >>
> > >>
> > >>
> > >> On 8/21/23 12:33 PM, Dave Marchevsky wrote:
> > >>> This is the final fix for the use-after-free scenario described in
> > >>> commit 7793fc3babe9 ("bpf: Make bpf_refcount_acquire fallible for
> > >>> non-owning refs"). That commit, by virtue of changing
> > >>> bpf_refcount_acquire's refcount_inc to a refcount_inc_not_zero, fixed
> > >>> the "refcount incr on 0" splat. The not_zero check in
> > >>> refcount_inc_not_zero, though, still occurs on memory that could have
> > >>> been free'd and reused, so the commit didn't properly fix the root
> > >>> cause.
> > >>>
> > >>> This patch actually fixes the issue by free'ing using the recently-added
> > >>> bpf_mem_free_rcu, which ensures that the memory is not reused until an
> > >>> RCU grace period has elapsed. If that has happened then
> > >>> there are no non-owning references alive that point to the
> > >>> recently-free'd memory, so it can be safely reused.
> > >>>
> > >>> Signed-off-by: Dave Marchevsky <davemarchevsky@xxxxxx>
> > >>> Acked-by: Yonghong Song <yonghong.song@xxxxxxxxx>
> > >>> ---
> > >>>    kernel/bpf/helpers.c | 6 +++++-
> > >>>    1 file changed, 5 insertions(+), 1 deletion(-)
> > >>>
> > >>> diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c
> > >>> index eb91cae0612a..945a85e25ac5 100644
> > >>> --- a/kernel/bpf/helpers.c
> > >>> +++ b/kernel/bpf/helpers.c
> > >>> @@ -1913,7 +1913,11 @@ void __bpf_obj_drop_impl(void *p, const struct btf_record *rec)
> > >>>
> > >>>        if (rec)
> > >>>                bpf_obj_free_fields(rec, p);
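The rest of the hunk is cut off by the quoting above. Judging from the
commit message and diffstat, the change is presumably along these lines,
where the rec->refcount_off test is a guess at how refcounted objects are
detected:

	/* guessed remainder of the hunk: defer the free of refcounted
	 * objects past an RCU grace period, keep the plain free otherwise */
	if (rec && rec->refcount_off >= 0)
		bpf_mem_free_rcu(&bpf_global_ma, p);
	else
		bpf_mem_free(&bpf_global_ma, p);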
> > >>
> > >> While reviewing my percpu kptr patch at the link below,
> > >>
> > >> https://lore.kernel.org/bpf/20230814172809.1361446-1-yonghong.song@xxxxxxxxx/T/#m2f7631b8047e9f5da60a0a9cd8717fceaf1adbb7
> > >> Kumar mentioned that although percpu memory itself is freed under an
> > >> RCU grace period, its record fields are freed immediately. This causes
> > >> a problem: there may still be active uses of these fields within an
> > >> RCU critical section, and after bpf_obj_free_fields(), some fields may
> > >> be re-initialized with new memory, which then never gets a chance to
> > >> be freed.
> > >>
> > >> Do we have problem here as well?
> > >
> > > I think it's not an issue here or in your percpu patch,
> > > since bpf_obj_free_fields() calls __bpf_obj_drop_impl() which will
> > > call bpf_mem_free_rcu() (after this patch set lands).
> >
> > The following is my understanding.
> >
> > void bpf_obj_free_fields(const struct btf_record *rec, void *obj)
> > {
> >          const struct btf_field *fields;
> >          int i;
> >
> >          if (IS_ERR_OR_NULL(rec))
> >                  return;
> >          fields = rec->fields;
> >          for (i = 0; i < rec->cnt; i++) {
> >                  struct btf_struct_meta *pointee_struct_meta;
> >                  const struct btf_field *field = &fields[i];
> >                  void *field_ptr = obj + field->offset;
> >                  void *xchgd_field;
> >
> >                  switch (fields[i].type) {
> >                  case BPF_SPIN_LOCK:
> >                          break;
> >                  case BPF_TIMER:
> >                          bpf_timer_cancel_and_free(field_ptr);
> >                          break;
> >                  case BPF_KPTR_UNREF:
> >                          WRITE_ONCE(*(u64 *)field_ptr, 0);
> >                          break;
> >                  case BPF_KPTR_REF:
> >                         ......
> >                          break;
> >                  case BPF_LIST_HEAD:
> >                          if (WARN_ON_ONCE(rec->spin_lock_off < 0))
> >                                  continue;
> >                          bpf_list_head_free(field, field_ptr, obj + rec->spin_lock_off);
> >                          break;
> >                  case BPF_RB_ROOT:
> >                          if (WARN_ON_ONCE(rec->spin_lock_off < 0))
> >                                  continue;
> >                          bpf_rb_root_free(field, field_ptr, obj + rec->spin_lock_off);
> >                          break;
> >                  case BPF_LIST_NODE:
> >                  case BPF_RB_NODE:
> >                  case BPF_REFCOUNT:
> >                          break;
> >                  default:
> >                          WARN_ON_ONCE(1);
> >                          continue;
> >                  }
> >          }
> > }
> >
> > For percpu kptr, the remaining possible actionable fields are
> >         BPF_LIST_HEAD and BPF_RB_ROOT
> >
> > So BPF_LIST_HEAD and BPF_RB_ROOT will try to go through all
> > list/rb nodes to unlink them from the list_head/rb_root.
> >
> > So yes, rb_nodes and list nodes will call __bpf_obj_drop_impl().
> > Depending on whether the corresponding rec
> > with rb_node/list_node has a ref count or not,
> > it may call bpf_mem_free() or bpf_mem_free_rcu(). If
> > bpf_mem_free() is called, then the field is immediately freed
> > but it may be used by some bpf prog (under rcu) concurrently,
> > could this be an issue?
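To make the interleaving concrete, here is a hypothetical timeline of the
hazard described above (with bpf_mem_free_rcu() the free on CPU0 would be
deferred past CPU1's read-side critical section):

	CPU0                                 CPU1
	----                                 ----
	bpf_obj_free_fields(rec, p)          rcu_read_lock()
	  bpf_rb_root_free(...)              n = (node reached under rcu)
	    __bpf_obj_drop_impl(n, ...)      ...
	      bpf_mem_free(&ma, n)           read n->key  <- use-after-free
	                                     rcu_read_unlock()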
>
> I see. Yeah. Looks like percpu makes such fields refcount-like.
> For non-percpu, non-refcounted objects only one bpf prog on one cpu can
> observe that object. That's why we're doing plain bpf_mem_free() for them.
>
> So this patch is a good fix for refcounted, but you and Kumar are
> correct that it's not sufficient for the case where a percpu struct
> includes multiple rb_roots, one for each cpu.
>
> > Changing bpf_mem_free() in
> > __bpf_obj_drop_impl() to bpf_mem_free_rcu() should fix this problem.
>
> I guess we can do that when the obj is either refcounted or can be
> inside percpu, but it might not be enough. More below.
>
> > Another thing is related to list_head/rb_root.
> > During bpf_obj_free_fields(), is it possible that another cpu
> > may allocate a list_node/rb_node and add it to the list_head/rb_root?
>
> It's not an issue for the single-owner case or for refcounted.
> Access to rb_root/list_head is always lock protected.
> For refcounted the obj needs to be acquired (from the verifier's pov),
> meaning it has refcount = 1, to be able to do spin_lock and
> operate on the list_head.
>
> But bpf_rb_root_free is indeed an issue for percpu, since each
> per-cpu copy has its own rb_root field with its own bpf_spin_lock, but
> for_each_cpu() {bpf_obj_free_fields();} violates that access contract.
>
> percpu and rb_root create such a maze of dependencies that
> I think it's better to disallow rb_root-s and kptr-s inside percpu
> for now.
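For illustration, the kind of layout in question would be something like
the following hypothetical struct in BPF selftest style (using the
__contains() annotation from bpf_experimental.h), where every per-cpu
copy of the value carries its own root and lock:

	struct node_data {
		struct bpf_rb_node node;
		long key;
	};

	/* as a percpu value, each cpu would get its own root + lock */
	struct percpu_val {
		struct bpf_spin_lock lock;
		struct bpf_rb_root root __contains(node_data, node);
	};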
>
> > If this is true, then we might have a memory leak.
> > But I don't know whether this is possible or not.
> >
> > I think local kptr has the same issue as percpu kptr.
>
> Let's tackle one at a time.
> I still think Dave's patch set is a good fix for refcounted,
> while we need to think more about the percpu case.

I'm planning to land the series though there could still be bugs to fix.
At least they're fixing known issues.
Please yell if you think we have to wait for another release.

Like I think the following is needed regardless:
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index bb78212fa5b2..efa6482b1c2c 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -7210,6 +7210,10 @@ static int process_spin_lock(struct bpf_verifier_env *env, int regno,
                        map ? map->name : "kptr");
                return -EINVAL;
        }
+       if (type_is_non_owning_ref(reg->type)) {
+               verbose(env, "Accessing locks in non-owning object is not allowed\n");
+               return -EINVAL;
+       }
        if (rec->spin_lock_off != val + reg->off) {
                verbose(env, "off %lld doesn't point to 'struct bpf_spin_lock' that is at %d\n",
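For context, the type_is_non_owning_ref() helper used above already
exists in verifier.c; its definition is roughly:

	static bool type_is_non_owning_ref(u32 type)
	{
		return type_is_ptr_alloc_obj(type) && type_flag(type) & NON_OWN_REF;
	}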

atm I don't see where we enforce such a restriction.
Since refcounted non-owning refs can have refcount == 0, it's not safe to
access non-scalar objects inside them.
Maybe stronger enforcement is needed?
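
A self-contained userspace sketch of that hazard (a simplified model, not
kernel code): the not-zero check can succeed against an unrelated object
that happens to reuse the freed memory, which is exactly what deferring
the free past an RCU grace period rules out.

	#include <stdatomic.h>
	#include <stdbool.h>

	struct obj { atomic_int refcount; long payload; };

	static bool acquire_not_zero(struct obj *o)
	{
		int r = atomic_load(&o->refcount);

		while (r != 0) {
			/* The load itself may already touch freed memory, and
			 * between the load and the CAS 'o' may be freed and
			 * reused; the CAS then bumps a refcount belonging to
			 * whatever new object now lives at this address. */
			if (atomic_compare_exchange_weak(&o->refcount, &r, r + 1))
				return true;
		}
		return false;
	}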




