Re: [PATCH bpf] bpf: Fix a race condition between btf_put() and map_free()

On 12/4/23 8:31 PM, Hou Tao wrote:
Hi,

On 12/5/2023 8:42 AM, Andrii Nakryiko wrote:
On Mon, Dec 4, 2023 at 9:40 AM Yonghong Song <yonghong.song@xxxxxxxxx> wrote:
When running `./test_progs -j` in my local VM with the latest kernel,
I once hit a KASAN error like the one below:
SNIP

So the problem is at rec->refcount_off in the above.

I did some source code analysis and found the reason.
                                   CPU A                        CPU B
   bpf_map_put:
     ...
     btf_put with rcu callback
     ...
     bpf_map_free_deferred
       with system_unbound_wq
     ...                          ...                           ...
     ...                          btf_free_rcu:                 ...
     ...                          ...                           bpf_map_free_deferred:
     ...                          ...
     ...         --------->       btf_struct_metas_free()
     ...         | race condition ...
     ...         --------->                                     map->ops->map_free()
     ...
     ...                          btf->struct_meta_tab = NULL

In the above, map_free() corresponds to array_map_free(), which eventually
calls bpf_rb_root_free(), which in turn calls:
   ...
   __bpf_obj_drop_impl(obj, field->graph_root.value_rec, false);
   ...

Here, 'value_rec' is assigned in btf_check_and_fixup_fields() with the following code:

   meta = btf_find_struct_meta(btf, btf_id);
   if (!meta)
     return -EFAULT;
   rec->fields[i].graph_root.value_rec = meta->record;

So basically, 'value_rec' is a pointer to a record in struct_metas_tab.
It is possible that this particular record has already been freed by
btf_struct_metas_free(), hence the KASAN error.

Actually it is very hard to reproduce the failure with the current bpf/bpf-next
code; I only got the above error once. To make it easier to reproduce, I added
a delay in bpf_map_free_deferred() before map->ops->map_free(), which
significantly increased reproducibility.
I also found the problem while developing the "fix the release of inner
map" patch set. I have written a selftest that can reliably reproduce
the problem by using map-in-map + bpf_list_head. The reason for using
map-in-map is to delay the release of the inner map through call_rcu() as
well, so the free of the bpf_map happens after the release of the btf. Will
post it later.
   diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
   index 5e43ddd1b83f..aae5b5213e93 100644
   --- a/kernel/bpf/syscall.c
   +++ b/kernel/bpf/syscall.c
   @@ -695,6 +695,7 @@ static void bpf_map_free_deferred(struct work_struct *work)
         struct bpf_map *map = container_of(work, struct bpf_map, work);
         struct btf_record *rec = map->record;

   +     mdelay(100);
         security_bpf_map_free(map);
         bpf_map_release_memcg(map);
         /* implementation dependent freeing */

To fix the problem, I moved btf_put() after map->ops->map_free() to ensure
the struct_metas stay available during map_free(). I reran './test_progs -j'
with the above mdelay() hack a couple of times and did not observe the error.

Fixes: 958cf2e273f0 ("bpf: Introduce bpf_obj_new")
Signed-off-by: Yonghong Song <yonghong.song@xxxxxxxxx>
---
  kernel/bpf/syscall.c | 6 +++++-
  1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 0ed286b8a0f0..9c6c3738adfe 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -694,11 +694,16 @@ static void bpf_map_free_deferred(struct work_struct *work)
  {
         struct bpf_map *map = container_of(work, struct bpf_map, work);
         struct btf_record *rec = map->record;
+       struct btf *btf = map->btf;

         security_bpf_map_free(map);
         bpf_map_release_memcg(map);
         /* implementation dependent freeing */
         map->ops->map_free(map);
+       /* Delay freeing of btf for maps, as map_free callback may need
+        * struct_meta info which will be freed with btf_put().
+        */
+       btf_put(btf);
The change makes sense to me, but logically I'd put it after
btf_record_free(rec), just in case some of the btf records ever refer
back to the map's BTF or something (and in general to keep it the
very last thing that happens).


But it also seems like CI is not happy ([0]), please take a look, thanks!

   [0] https://github.com/kernel-patches/bpf/actions/runs/7090474333/job/19297672532
The patch delays the release of the BTF id of the bpf map, so test_btf_id
failed. Could we fix the problem by optionally pinning the btf in
btf_field_graph_root, just like btf_field_kptr, so the map BTF will still
be alive before the invocation of btf_record_free()? We need to do the
pinning optionally, because the btf_record may be contained in the btf
directly (namely btf->struct_meta_tab) and is freed through btf_free().

Thanks for the suggestion. I guess you want two cases:
  - if map->record won't access any btf data (e.g., btf->struct_meta_tab),
    we should keep the current btf_put() workflow,
  - if map->record accesses some btf data, we should call btf_put()
    immediately before or after btf_record_free().

This could be done, but we would need to be careful to find all cases
where btf data might be accessed through map->record. The current approach
is simpler. I will post v2 with the change Andrii suggested and
fix the failed test.

If people really want to fine-tune this along the lines of the above two
cases, I can investigate that too.


         /* Delay freeing of btf_record for maps, as map_free
          * callback usually needs access to them. It is better to do it here
          * than require each callback to do the free itself manually.
@@ -727,7 +732,6 @@ void bpf_map_put(struct bpf_map *map)
         if (atomic64_dec_and_test(&map->refcnt)) {
                 /* bpf_map_free_id() must be called first */
                 bpf_map_free_id(map);
-               btf_put(map->btf);
                 INIT_WORK(&map->work, bpf_map_free_deferred);
                 /* Avoid spawning kworkers, since they all might contend
                  * for the same mutex like slab_mutex.
--
2.34.1




