On Wed, Aug 23, 2023 at 1:29 PM Yonghong Song <yonghong.song@xxxxxxxxx> wrote:
>
> On 8/23/23 9:20 AM, Alexei Starovoitov wrote:
> > On Tue, Aug 22, 2023 at 11:26 PM Yonghong Song <yonghong.song@xxxxxxxxx> wrote:
> >>
> >> On 8/21/23 12:33 PM, Dave Marchevsky wrote:
> >>> This is the final fix for the use-after-free scenario described in
> >>> commit 7793fc3babe9 ("bpf: Make bpf_refcount_acquire fallible for
> >>> non-owning refs"). That commit, by virtue of changing
> >>> bpf_refcount_acquire's refcount_inc to a refcount_inc_not_zero, fixed
> >>> the "refcount incr on 0" splat. The not_zero check in
> >>> refcount_inc_not_zero, though, still occurs on memory that could have
> >>> been free'd and reused, so the commit didn't properly fix the root
> >>> cause.
> >>>
> >>> This patch actually fixes the issue by free'ing using the recently-added
> >>> bpf_mem_free_rcu, which ensures that the memory is not reused until
> >>> an RCU grace period has elapsed. If that has happened then
> >>> there are no non-owning references alive that point to the
> >>> recently-free'd memory, so it can be safely reused.
> >>>
> >>> Signed-off-by: Dave Marchevsky <davemarchevsky@xxxxxx>
> >>> Acked-by: Yonghong Song <yonghong.song@xxxxxxxxx>
> >>> ---
> >>>  kernel/bpf/helpers.c | 6 +++++-
> >>>  1 file changed, 5 insertions(+), 1 deletion(-)
> >>>
> >>> diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c
> >>> index eb91cae0612a..945a85e25ac5 100644
> >>> --- a/kernel/bpf/helpers.c
> >>> +++ b/kernel/bpf/helpers.c
> >>> @@ -1913,7 +1913,11 @@ void __bpf_obj_drop_impl(void *p, const struct btf_record *rec)
> >>>
> >>>          if (rec)
> >>>                  bpf_obj_free_fields(rec, p);
> >>
> >> While reviewing my percpu kptr patch at
> >>
> >> https://lore.kernel.org/bpf/20230814172809.1361446-1-yonghong.song@xxxxxxxxx/T/#m2f7631b8047e9f5da60a0a9cd8717fceaf1adbb7
> >>
> >> Kumar mentioned that although the percpu memory itself is freed under
> >> rcu, its record fields are freed immediately. This causes a problem:
> >> there may be active uses of those fields within an rcu cs, and after
> >> bpf_obj_free_fields() some fields may be re-initialized with new
> >> memory, but that memory no longer has a chance to be freed.
> >>
> >> Do we have the same problem here as well?
> >
> > I think it's not an issue here or in your percpu patch,
> > since bpf_obj_free_fields() calls __bpf_obj_drop_impl() which will
> > call bpf_mem_free_rcu() (after this patch set lands).

The following is my understanding.

> void bpf_obj_free_fields(const struct btf_record *rec, void *obj)
> {
>         const struct btf_field *fields;
>         int i;
>
>         if (IS_ERR_OR_NULL(rec))
>                 return;
>         fields = rec->fields;
>         for (i = 0; i < rec->cnt; i++) {
>                 struct btf_struct_meta *pointee_struct_meta;
>                 const struct btf_field *field = &fields[i];
>                 void *field_ptr = obj + field->offset;
>                 void *xchgd_field;
>
>                 switch (fields[i].type) {
>                 case BPF_SPIN_LOCK:
>                         break;
>                 case BPF_TIMER:
>                         bpf_timer_cancel_and_free(field_ptr);
>                         break;
>                 case BPF_KPTR_UNREF:
>                         WRITE_ONCE(*(u64 *)field_ptr, 0);
>                         break;
>                 case BPF_KPTR_REF:
>                         ......
>                         break;
>                 case BPF_LIST_HEAD:
>                         if (WARN_ON_ONCE(rec->spin_lock_off < 0))
>                                 continue;
>                         bpf_list_head_free(field, field_ptr, obj + rec->spin_lock_off);
>                         break;
>                 case BPF_RB_ROOT:
>                         if (WARN_ON_ONCE(rec->spin_lock_off < 0))
>                                 continue;
>                         bpf_rb_root_free(field, field_ptr, obj + rec->spin_lock_off);
>                         break;
>                 case BPF_LIST_NODE:
>                 case BPF_RB_NODE:
>                 case BPF_REFCOUNT:
>                         break;
>                 default:
>                         WARN_ON_ONCE(1);
>                         continue;
>                 }
>         }
> }
>
> For percpu kptr, the remaining possible actionable fields are
> BPF_LIST_HEAD and BPF_RB_ROOT.
>
> So BPF_LIST_HEAD and BPF_RB_ROOT will go through all list/rb nodes
> to unlink them from the list_head/rb_root.
>
> So yes, rb_nodes and list_nodes will call __bpf_obj_drop_impl().
> Depending on whether the corresponding rec for the rb_node/list_node
> has a refcount or not, it may call bpf_mem_free() or bpf_mem_free_rcu().
> If bpf_mem_free() is called, then the field is immediately freed,
> but it may be in use by some bpf prog (under rcu) concurrently;
> could this be an issue?

I see. Yeah. Looks like percpu makes such fields refcount-like.
For non-percpu, non-refcount objects only one bpf prog on one cpu can
observe the object. That's why we're doing plain bpf_mem_free() for them.
So this patch is a good fix for refcounted, but you and Kumar are correct
that it's not sufficient for the case when a percpu struct includes
multiple rb_roots, one for each cpu.

> Changing bpf_mem_free() in
> __bpf_obj_drop_impl() to bpf_mem_free_rcu() should fix this problem.

I guess we can do that when the obj is either refcounted or can be inside
a percpu struct, but it might not be enough. More below.

> Another thing is related to list_head/rb_root.
> During bpf_obj_free_fields(), is it possible that another cpu
> may allocate a list_node/rb_node and add it to the list_head/rb_root?

It's not an issue for the single-owner case or for refcounted objects.
Access to rb_root/list_head is always lock protected.
For refcounted, the obj needs to be acquired (from the verifier's pov),
meaning it has refcount = 1, to be able to do spin_lock and operate on
the list_head.
But bpf_rb_root_free is indeed an issue for percpu, since each per-cpu
element has its own rb_root field with its own bpf_spin_lock, yet
for_each_cpu() { bpf_obj_free_fields(); } violates that access contract.

percpu and rb_root create such a maze of dependencies that I think
it's better to disallow rb_root-s and kptr-s inside percpu for now.

> If this is true, then we might have a memory leak.
> But I don't know whether this is possible or not.
>
> I think local kptr has the same issue as percpu kptr.

Let's tackle one at a time.
I still think Dave's patch set is a good fix for refcounted,
while we need to think more about the percpu case.
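
To make the refcounted half concrete: the hunk quoted at the top is
trimmed, but the shape of the fix is roughly the following sketch (my
approximation of the patch, not its exact text; it assumes
rec->refcount_off is negative for non-refcounted types, as in btf_record
today):

void __bpf_obj_drop_impl(void *p, const struct btf_record *rec)
{
        if (rec)
                bpf_obj_free_fields(rec, p);

        if (rec && rec->refcount_off >= 0)
                /* Refcounted object: non-owning refs created before the
                 * refcount hit zero may still be dereferenced inside an
                 * rcu cs, so don't let the memory be reused until a
                 * grace period has elapsed.
                 */
                bpf_mem_free_rcu(&bpf_global_ma, p);
        else
                /* Single-owner object: only one prog on one cpu can
                 * observe it, so an immediate free is safe.
                 */
                bpf_mem_free(&bpf_global_ma, p);
}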
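
The percpu half is harder to sketch without inventing the series' API,
but the access-contract violation can be shown in outline (hypothetical
helper name; the real free path lives in the series linked above):

static void percpu_obj_free_fields(void __percpu *pptr,
                                   const struct btf_record *rec)
{
        int cpu;

        /* Each cpu's copy carries its own rb_root/list_head guarded by
         * its own bpf_spin_lock. This loop walks and frees them without
         * taking those locks, while a prog on another cpu may, inside an
         * rcu cs, still lock its copy and insert nodes -- the
         * lock-protected access contract described above is violated.
         */
        for_each_possible_cpu(cpu)
                bpf_obj_free_fields(rec, per_cpu_ptr(pptr, cpu));
}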