Re: [PATCH bpf-next v4] bpf: Fix a race condition between btf_put() and map_free()

On 12/7/23 5:23 PM, Martin KaFai Lau wrote:
On 12/6/23 1:09 PM, Yonghong Song wrote:
When running `./test_progs -j` in my local vm with latest kernel,
I once hit a kasan error like below:

   [ 1887.184724] BUG: KASAN: slab-use-after-free in bpf_rb_root_free+0x1f8/0x2b0
   [ 1887.185599] Read of size 4 at addr ffff888106806910 by task kworker/u12:2/2830
   [ 1887.186498]
   [ 1887.186712] CPU: 3 PID: 2830 Comm: kworker/u12:2 Tainted: G           OEL     6.7.0-rc3-00699-g90679706d486-dirty #494
   [ 1887.188034] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014
   [ 1887.189618] Workqueue: events_unbound bpf_map_free_deferred
   [ 1887.190341] Call Trace:
   [ 1887.190666]  <TASK>
   [ 1887.190949]  dump_stack_lvl+0xac/0xe0
   [ 1887.191423]  ? nf_tcp_handle_invalid+0x1b0/0x1b0
   [ 1887.192019]  ? panic+0x3c0/0x3c0
   [ 1887.192449]  print_report+0x14f/0x720
   [ 1887.192930]  ? preempt_count_sub+0x1c/0xd0
   [ 1887.193459]  ? __virt_addr_valid+0xac/0x120
   [ 1887.194004]  ? bpf_rb_root_free+0x1f8/0x2b0
   [ 1887.194572]  kasan_report+0xc3/0x100
   [ 1887.195085]  ? bpf_rb_root_free+0x1f8/0x2b0
   [ 1887.195668]  bpf_rb_root_free+0x1f8/0x2b0
   [ 1887.196183]  ? __bpf_obj_drop_impl+0xb0/0xb0
   [ 1887.196736]  ? preempt_count_sub+0x1c/0xd0
   [ 1887.197270]  ? preempt_count_sub+0x1c/0xd0
   [ 1887.197802]  ? _raw_spin_unlock+0x1f/0x40
   [ 1887.198319]  bpf_obj_free_fields+0x1d4/0x260
   [ 1887.198883]  array_map_free+0x1a3/0x260
   [ 1887.199380]  bpf_map_free_deferred+0x7b/0xe0
   [ 1887.199943]  process_scheduled_works+0x3a2/0x6c0
   [ 1887.200549]  worker_thread+0x633/0x890
   [ 1887.201047]  ? __kthread_parkme+0xd7/0xf0
   [ 1887.201574]  ? kthread+0x102/0x1d0
   [ 1887.202020]  kthread+0x1ab/0x1d0
   [ 1887.202447]  ? pr_cont_work+0x270/0x270
   [ 1887.202954]  ? kthread_blkcg+0x50/0x50
   [ 1887.203444]  ret_from_fork+0x34/0x50
   [ 1887.203914]  ? kthread_blkcg+0x50/0x50
   [ 1887.204397]  ret_from_fork_asm+0x11/0x20
   [ 1887.204913]  </TASK>
   [ 1887.205209]
   [ 1887.205416] Allocated by task 2197:
   [ 1887.205881]  kasan_set_track+0x3f/0x60
   [ 1887.206366]  __kasan_kmalloc+0x6e/0x80
   [ 1887.206856]  __kmalloc+0xac/0x1a0
   [ 1887.207293]  btf_parse_fields+0xa15/0x1480
   [ 1887.207836]  btf_parse_struct_metas+0x566/0x670
   [ 1887.208387]  btf_new_fd+0x294/0x4d0
   [ 1887.208851]  __sys_bpf+0x4ba/0x600
   [ 1887.209292]  __x64_sys_bpf+0x41/0x50
   [ 1887.209762]  do_syscall_64+0x4c/0xf0
   [ 1887.210222]  entry_SYSCALL_64_after_hwframe+0x63/0x6b
   [ 1887.210868]
   [ 1887.211074] Freed by task 36:
   [ 1887.211460]  kasan_set_track+0x3f/0x60
   [ 1887.211951]  kasan_save_free_info+0x28/0x40
   [ 1887.212485]  ____kasan_slab_free+0x101/0x180
   [ 1887.213027]  __kmem_cache_free+0xe4/0x210
   [ 1887.213514]  btf_free+0x5b/0x130
   [ 1887.213918]  rcu_core+0x638/0xcc0
   [ 1887.214347]  __do_softirq+0x114/0x37e

The error happens at bpf_rb_root_free+0x1f8/0x2b0:

   00000000000034c0 <bpf_rb_root_free>:
   ; {
     34c0: f3 0f 1e fa                   endbr64
     34c4: e8 00 00 00 00                callq   0x34c9 <bpf_rb_root_free+0x9>
     34c9: 55                            pushq   %rbp
     34ca: 48 89 e5                      movq    %rsp, %rbp
   ...
   ;       if (rec && rec->refcount_off >= 0 &&
     36aa: 4d 85 ed                      testq   %r13, %r13
     36ad: 74 a9                         je      0x3658 <bpf_rb_root_free+0x198>
     36af: 49 8d 7d 10                   leaq    0x10(%r13), %rdi
     36b3: e8 00 00 00 00                callq   0x36b8 <bpf_rb_root_free+0x1f8>
                                         <==== kasan function
     36b8: 45 8b 7d 10                   movl    0x10(%r13), %r15d
                                         <==== use-after-free load
     36bc: 45 85 ff                      testl   %r15d, %r15d
     36bf: 78 8c                         js      0x364d <bpf_rb_root_free+0x18d>

So the problem is at rec->refcount_off in the above.

I did some source code analysis and found the reason:
                                 CPU A                        CPU B
   bpf_map_put:
     ...
     btf_put with rcu callback
     ...
     bpf_map_free_deferred
       with system_unbound_wq
     ...                         ...                          ...
     ...                         btf_free_rcu:                bpf_map_free_deferred:
     ...                         ...                          ...
     ...                         btf_struct_metas_free() <--+
     ...                         ...                        | race condition
     ...                         ...                        +--> map->ops->map_free()
     ...                         ...
     ...                         btf->struct_meta_tab = NULL

In the above, map_free() corresponds to array_map_free() and eventually
calls bpf_rb_root_free(), which calls:
   ...
   __bpf_obj_drop_impl(obj, field->graph_root.value_rec, false);
   ...

Here, 'value_rec' is assigned in btf_check_and_fixup_fields() with the following code:

   meta = btf_find_struct_meta(btf, btf_id);
   if (!meta)
     return -EFAULT;
   rec->fields[i].graph_root.value_rec = meta->record;

So basically, 'value_rec' is a pointer to a record in struct_meta_tab.
It is possible that this particular record has already been freed by
btf_struct_metas_free(), hence the kasan error above.
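
For reference, the freeing on the btf_free_rcu() side looks roughly like the
following (paraphrased from kernel/bpf/btf.c, exact details may differ between
versions). Every record in struct_meta_tab is freed here, while the map's
graph_root.value_rec may still point into it until map->ops->map_free() has run:

   static void btf_struct_metas_free(struct btf_struct_metas *tab)
   {
           int i;

           if (!tab)
                   return;
           /* frees the records that graph_root.value_rec may still point to */
           for (i = 0; i < tab->cnt; i++)
                   btf_record_free(tab->types[i].record);
           kfree(tab);
   }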

The failure is actually very hard to reproduce with the current bpf/bpf-next
code; I only hit the above error once. To make it easier to trigger, I added
a delay in bpf_map_free_deferred() to delay map->ops->map_free(), which
significantly increased reproducibility.

   diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
   index 5e43ddd1b83f..aae5b5213e93 100644
   --- a/kernel/bpf/syscall.c
   +++ b/kernel/bpf/syscall.c
   @@ -695,6 +695,7 @@ static void bpf_map_free_deferred(struct work_struct *work)
         struct bpf_map *map = container_of(work, struct bpf_map, work);
         struct btf_record *rec = map->record;

   +     mdelay(100);
         security_bpf_map_free(map);
         bpf_map_release_memcg(map);
         /* implementation dependent freeing */

To fix the problem, we need to have a reference on btf in order to
safeguard accessing field->graph_root.value_rec in map->ops->map_free().
The function btf_parse_graph_root() is the place to get a btf reference.
The following are rough call stacks reaching btf_parse_graph_root():

    btf_parse
      ...
        btf_parse_fields
          ...
            btf_parse_graph_root

    map_check_btf
      btf_parse_fields
        ...
          btf_parse_graph_root

Looking at the above call stacks, btf_parse_graph_root() is indirectly
called from either btf_parse() or map_check_btf().

We cannot take a reference in the btf_parse() case since, at that moment,
btf is still in the middle of self-validation and the initial reference
(refcount_set(&btf->refcnt, 1)) has not been taken yet.

Thanks for the detailed analysis and clear explanation. It helps a lot.

Sorry for jumping in late.

I am trying to avoid making a special case for "bool has_btf_ref;" and
"bool from_map_check". It seems a bit too much to deal with the error path
for btf_parse().

Would doing the refcount_set(&btf->refcnt, 1) earlier in btf_parse help?

No, it does not. The core reason is what Hao mentioned in
  https://lore.kernel.org/bpf/47ee3265-23f7-2130-ff28-27bfaf3f7877@xxxxxxxxxxxxxxx/
We simply cannot take a btf reference if called from btf_parse().
Let us say we move refcount_set(&btf->refcnt, 1) earlier in btf_parse()
so that we take a ref on btf during btf_parse_fields(); then we have
     btf_put    <=== expects refcount == 0 to start the destruction process
       ...
         btf_record_free    <=== in which, for graph_root, a btf reference is held
so btf_put will never be able to actually free the btf data.
Yes, the kasan problem would be resolved, but we would leak memory.
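
To make the cycle concrete, here is an illustration (not actual patch code)
of why the extra reference could never be dropped in that scenario:

   /*
    *   btf_put(btf)
    *     refcount_dec_and_test(&btf->refcnt)   // never reaches 0 ...
    *       call_rcu(&btf->rcu, btf_free_rcu)   // ... so destruction never starts
    *         btf_free()
    *           btf_struct_metas_free()
    *             btf_record_free()
    *               btf_put(graph_root.btf)     // ... because the extra ref would
    *                                           //     only be dropped here
    *
    * The records in btf's own struct_meta_tab would pin the btf itself, and
    * the only place that drops those references is on the destruction path
    * that can then never run.
    */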


It is also unnecessary to take a reference since the value_rec is
referring to a record in struct_meta_tab.

If we optimize for not taking a refcnt, how about not taking a refcnt for
all cases and postponing the btf_put(), instead of taking a refcnt in one
case but not another? Like your fix in v1. The failed selftest can be
changed or even removed if it does not make sense anymore.

After a couple of iterations, I think the taking-necessary-reference approach
sounds better, and it is consistent with how kptr is handled. For kptr,
btf_parse will ignore it. Unfortunately, for graph_root (list_head, rb_root),
btf_parse and map_check will both process it, which adds a little bit of
complexity.
Alexei also suggested the same taking reference approach:
  https://lore.kernel.org/bpf/CAADnVQL+uc6VV65_Ezgzw3WH=ME9z1Fdy8Pd6xd0oOq8rgwh7g@xxxxxxxxxxxxxx/



But we do need to take a reference for the map_check_btf() case, as
the value_rec refers to a record in struct_meta_tab and we want
to make sure btf is not freed until the value_rec usage at map_free
is done.

So the fix is to get a btf reference for the map_check_btf() case
whenever a graph_root is found. A boolean field 'has_btf_ref' is added to
struct btf_field_graph_root so later the btf reference can be properly
released.
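
For completeness, a rough sketch of the intended shape (illustrative only,
the actual field placement and hook points are in the patch itself):

   /* sketch, not the actual patch */
   struct btf_field_graph_root {
           struct btf *btf;
           u32 value_btf_id;
           u32 node_offset;
           struct btf_record *value_rec;
           bool has_btf_ref;       /* true if we hold a reference on 'btf' */
   };

   /* map_check_btf() path: pin the btf so struct_meta_tab (and hence
    * value_rec) stays valid until map->ops->map_free() has finished.
    */
           btf_get(btf);
           field->graph_root.has_btf_ref = true;

   /* when the btf_record is freed: drop the reference only if it was
    * actually taken, i.e. not in the btf_parse() case.
    */
           if (field->graph_root.has_btf_ref)
                   btf_put(field->graph_root.btf);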

I reran './test_progs -j' with the above mdelay() hack a couple of
times and did not observe the error.
Running Hou's test ([1]) was also successful.

   [1] https://lore.kernel.org/bpf/20231206110625.3188975-1-houtao@xxxxxxxxxxxxxxx/





