Re: [PATCH bpf] bpf: Fix a race condition between btf_put() and map_free()

Hi,

On 12/5/2023 12:15 PM, Yonghong Song wrote:
>
> On 12/4/23 8:31 PM, Hou Tao wrote:
>> Hi,
>>
>> On 12/5/2023 8:42 AM, Andrii Nakryiko wrote:
>>> On Mon, Dec 4, 2023 at 9:40 AM Yonghong Song
>>> <yonghong.song@xxxxxxxxx> wrote:
>>>> When running `./test_progs -j` in my local vm with latest kernel,
>>>> I once hit a kasan error like below:
>> SNIP
>>>>
>>>> So the problem is at rec->refcount_off in the above.
>>>>
>>>> I did some source code analysis and find the reason.
>>>>                                 CPU A                        CPU B
>>>>    bpf_map_put:
>>>>      ...
>>>>      btf_put with rcu callback
>>>>      ...
>>>>      bpf_map_free_deferred
>>>>        with system_unbound_wq
>>>>      ...                        ...                          ...
>>>>      ...                        btf_free_rcu:                bpf_map_free_deferred:
>>>>      ...                        ...                          ...
>>>>      ...         --------->     btf_struct_metas_free()
>>>>      ...         | race condition                            ...
>>>>      ...         --------->                                  map->ops->map_free()
>>>>      ...
>>>>      ...                        btf->struct_meta_tab = NULL
>>>>
>>>> In the above, map_free() corresponds to array_map_free(), which
>>>> eventually calls bpf_rb_root_free(), which in turn calls:
>>>>    ...
>>>>    __bpf_obj_drop_impl(obj, field->graph_root.value_rec, false);
>>>>    ...
>>>>
>>>> Here, 'value_rec' is assigned in btf_check_and_fixup_fields() with
>>>> following code:
>>>>
>>>>    meta = btf_find_struct_meta(btf, btf_id);
>>>>    if (!meta)
>>>>      return -EFAULT;
>>>>    rec->fields[i].graph_root.value_rec = meta->record;
>>>>
>>>> So basically, 'value_rec' is a pointer to the record in
>>>> struct_metas_tab.
>>>> And it is possible that that particular record has been freed by
>>>> btf_struct_metas_free() and hence we have a kasan error here.
>>>>
>>>> Actually it is very hard to reproduce the failure with current
>>>> bpf/bpf-next
>>>> code, I only got the above error once. To increase reproducibility,
>>>> I added
>>>> a delay in bpf_map_free_deferred() to delay map->ops->map_free(),
>>>> which
>>>> significantly increased reproducibility.
>> I also found this problem while developing the "fix the release of
>> inner map" patch set. I have written a selftest which can reliably
>> reproduce the problem by using map-in-map + bpf_list_head. The reason
>> for using map-in-map is to delay the release of the inner map through
>> call_rcu() as well, so the free of the bpf_map happens after the
>> release of the btf. Will post it later.
>>>>    diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
>>>>    index 5e43ddd1b83f..aae5b5213e93 100644
>>>>    --- a/kernel/bpf/syscall.c
>>>>    +++ b/kernel/bpf/syscall.c
>>>>    @@ -695,6 +695,7 @@ static void bpf_map_free_deferred(struct work_struct *work)
>>>>          struct bpf_map *map = container_of(work, struct bpf_map, work);
>>>>          struct btf_record *rec = map->record;
>>>>
>>>>    +     mdelay(100);
>>>>          security_bpf_map_free(map);
>>>>          bpf_map_release_memcg(map);
>>>>          /* implementation dependent freeing */
>>>>
>>>> To fix the problem, I moved btf_put() after map->ops->map_free() to
>>>> ensure struct_metas remain available during map_free(). I reran
>>>> './test_progs -j' with the above mdelay() hack a couple of times and
>>>> didn't observe the error.
>>>>
>>>> Fixes: 958cf2e273f0 ("bpf: Introduce bpf_obj_new")
>>>> Signed-off-by: Yonghong Song <yonghong.song@xxxxxxxxx>
>>>> ---
>>>>   kernel/bpf/syscall.c | 6 +++++-
>>>>   1 file changed, 5 insertions(+), 1 deletion(-)
>>>>
>>>> diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
>>>> index 0ed286b8a0f0..9c6c3738adfe 100644
>>>> --- a/kernel/bpf/syscall.c
>>>> +++ b/kernel/bpf/syscall.c
>>>> @@ -694,11 +694,16 @@ static void bpf_map_free_deferred(struct work_struct *work)
>>>>   {
>>>>          struct bpf_map *map = container_of(work, struct bpf_map, work);
>>>>          struct btf_record *rec = map->record;
>>>> +       struct btf *btf = map->btf;
>>>>
>>>>          security_bpf_map_free(map);
>>>>          bpf_map_release_memcg(map);
>>>>          /* implementation dependent freeing */
>>>>          map->ops->map_free(map);
>>>> +       /* Delay freeing of btf for maps, as map_free callback may need
>>>> +        * struct_meta info which will be freed with btf_put().
>>>> +        */
>>>> +       btf_put(btf);
>>> The change makes sense to me, but logically I'd put it after
>>> btf_record_free(rec), just in case some btf records ever refer back
>>> to the map's BTF (and in general to keep it the very last thing that
>>> happens).
>>>
>>>
>>> But it also seems like CI is not happy ([0]), please take a look,
>>> thanks!
>>>
>>>    [0]
>>> https://github.com/kernel-patches/bpf/actions/runs/7090474333/job/19297672532
>> The patch delays the release of the BTF id of the bpf map, so
>> test_btf_id failed. Can we fix the problem by optionally pinning the
>> btf in btf_field_graph_root, just like btf_field_kptr does, so the map
>> BTF will still be alive before the invocation of btf_record_free()?
>> We need to do the pinning optionally because a btf_record may be
>> contained in the btf directly (namely btf->struct_meta_tab) and is
>> freed through btf_free().
>
> Thanks for the suggestion, I guess you want two cases:
>   - if map->record won't access any btf data (e.g., btf->struct_meta_tab),
>     we should keep the current btf_put() workflow,
>   - if map->record accesses some btf data, we should call btf_put()
>     immediately before or after btf_record_free().

Er, that is not what I meant, although I have written a similar patch in
which bpf_map_put() calls btf_put() and sets map->btf to NULL if there
are no BPF_LIST_HEAD or BPF_RB_ROOT fields in map->record, and otherwise
calls btf_put() in bpf_map_free_deferred(). What I suggested is to
optionally pin the btf in graph_root.btf, just like btf_field_kptr does.
>
> This could be done, but we need to be careful to find all the cases
> where btf data might be accessed through map->record. The current
> approach is simpler. I will post v2 with the change Andrii suggested
> and fix the failed test.
>
> If people really want to fine-tune this along the lines of the two
> cases above, I can investigate too.
>
>>>
>>>>          /* Delay freeing of btf_record for maps, as map_free
>>>>           * callback usually needs access to them. It is better to do it
>>>>           * here than require each callback to do the free itself manually.
>>>> @@ -727,7 +732,6 @@ void bpf_map_put(struct bpf_map *map)
>>>>          if (atomic64_dec_and_test(&map->refcnt)) {
>>>>                  /* bpf_map_free_id() must be called first */
>>>>                  bpf_map_free_id(map);
>>>> -               btf_put(map->btf);
>>>>                  INIT_WORK(&map->work, bpf_map_free_deferred);
>>>>                  /* Avoid spawning kworkers, since they all might
>>>>                   * contend for the same mutex like slab_mutex.
>>>> -- 
>>>> 2.34.1
>>>>
>




