On 12/13/23 8:17 PM, Alexei Starovoitov wrote:
On Fri, Dec 8, 2023 at 9:07 AM Yonghong Song <yonghong.song@xxxxxxxxx> wrote:
On 12/8/23 12:30 AM, Hou Tao wrote:
Hi,
On 12/8/2023 12:02 PM, Yonghong Song wrote:
On 12/7/23 7:59 PM, Yonghong Song wrote:
On 12/7/23 5:23 PM, Martin KaFai Lau wrote:
On 12/6/23 1:09 PM, Yonghong Song wrote:
When running `./test_progs -j` in my local VM with the latest kernel,
I once hit a KASAN error like the one below:
SNIP
Here, 'value_rec' is assigned in btf_check_and_fixup_fields() with the
following code:

  meta = btf_find_struct_meta(btf, btf_id);
  if (!meta)
          return -EFAULT;
  rec->fields[i].graph_root.value_rec = meta->record;
So basically, 'value_rec' is a pointer into a record owned by
struct_metas_tab; no separate reference is taken. It is possible that this
particular record has already been freed by btf_struct_metas_free(), and
hence we get the KASAN error here.
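The lifetime problem can be seen with a tiny userspace analogy (all names
below are illustrative stand-ins, not kernel code; build with
-fsanitize=address to get a report similar to the KASAN one):

#include <stdio.h>
#include <stdlib.h>

/* stand-ins for btf->struct_meta_tab and btf_field_graph_root */
struct record { int cnt; };
struct metas_tab { struct record rec; };
struct graph_root { struct record *value_rec; /* borrowed, no refcount */ };

int main(void)
{
	struct metas_tab *tab = malloc(sizeof(*tab));
	struct graph_root root;

	tab->rec.cnt = 1;
	/* like the assignment in btf_check_and_fixup_fields() */
	root.value_rec = &tab->rec;

	/* like btf_struct_metas_free() running from btf_put() before
	 * map->ops->map_free() has finished */
	free(tab);

	/* like map_free() touching value_rec: a use-after-free */
	printf("%d\n", root.value_rec->cnt);
	return 0;
}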
The failure is actually very hard to reproduce with the current
bpf/bpf-next code; I only hit the above error once. To make it easier to
reproduce, I added a delay in bpf_map_free_deferred() before
map->ops->map_free(), which significantly increased reproducibility:
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 5e43ddd1b83f..aae5b5213e93 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -695,6 +695,7 @@ static void bpf_map_free_deferred(struct work_struct *work)
 	struct bpf_map *map = container_of(work, struct bpf_map, work);
 	struct btf_record *rec = map->record;
 
+	mdelay(100);
 	security_bpf_map_free(map);
 	bpf_map_release_memcg(map);
 	/* implementation dependent freeing */
To fix the problem, we need to hold a reference on the btf in order to
safeguard access to field->graph_root.value_rec in map->ops->map_free().
The function btf_parse_graph_root() is the place to take that btf
reference. The following are the rough call stacks reaching
btf_parse_graph_root():
  btf_parse
    ...
      btf_parse_fields
        ...
          btf_parse_graph_root

  map_check_btf
    btf_parse_fields
      ...
        btf_parse_graph_root
Looking at the above call stacks, btf_parse_graph_root() is indirectly
called from either btf_parse() or map_check_btf(). We cannot take a
reference in the btf_parse() case since, at that moment, the btf is still
in the middle of self-validation and the initial reference
(refcount_set(&btf->refcnt, 1)) has not been set yet.
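Concretely, the change under discussion amounts to something like the
sketch below (my illustration of the plumbing, not the actual patch; the
"from_map_check" flag is borrowed from the discussion further down):

--- a/kernel/bpf/btf.c
+++ b/kernel/bpf/btf.c
@@ static int btf_parse_graph_root(...)
 	field->graph_root.btf = (struct btf *)btf;
+	/* btf_parse() has not done refcount_set(&btf->refcnt, 1) yet,
+	 * so only take the reference on the map_check_btf() path.
+	 */
+	if (from_map_check)
+		btf_get(field->graph_root.btf);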
Thanks for the detailed analysis and clear explanation. It helps a lot.
Sorry for jumping in late.

I am trying to avoid making a special case for "bool has_btf_ref;" and
"bool from_map_check". It seems a bit too much to deal with the error
path for btf_parse().
Maybe we could move the common btf used by kptr and graph_root into
btf_record and let the callers of btf_parse_fields() and
btf_record_free() decide the life cycle of the btf in btf_record, so the
change would be less intrusive with fewer special cases. The following is
the [...]
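(The patch itself is snipped above; below is my own rough sketch of the
idea, not Hou Tao's actual patch. The field placement inside struct
btf_record is an assumption.)

--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ struct btf_record {
 	u32 cnt;
 	u32 field_mask;
+	/* owning btf; the callers of btf_parse_fields() and
+	 * btf_record_free() decide whether a reference is held on it
+	 * and when that reference is dropped.
+	 */
+	struct btf *btf;
 	int spin_lock_off;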
I didn't fully check the code, but it looks like we take a btf reference
at map_check_btf() and free it at the end of bpf_map_free_deferred().
This is similar to my v1 patch: not exactly the same, but similar in
that both do btf_put() at the end of bpf_map_free_deferred().

Through discussion, the on-demand btf_get()/btf_put() approach, similar
to the kptr approach, seems more favored. It also has the advantage of
freeing the btf at its earliest possible point.
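For reference, btf_record_free() already drops per-field btf references
for kptr fields; the on-demand approach would mirror that for graph_root
fields, roughly as sketched below (context abbreviated; the
BPF_LIST_HEAD/BPF_RB_ROOT arm is my illustration, not committed code):

--- a/kernel/bpf/btf.c
+++ b/kernel/bpf/btf.c
@@ void btf_record_free(struct btf_record *rec)
 		case BPF_KPTR_REF:
 			btf_put(rec->fields[i].kptr.btf);
 			break;
+		case BPF_LIST_HEAD:
+		case BPF_RB_ROOT:
+			/* mirror the kptr pattern: drop the reference
+			 * taken when the graph_root field was parsed */
+			btf_put(rec->fields[i].graph_root.btf);
+			break;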
Sorry, looks like I recommended the wrong path. The approach of
btf_parse_fields(... false | true) depending on where it's called and on
whether the returned struct btf_record * will be kept within a type or
within a map is pushing complexity too far. A year from now we'll forget
these subtle details.
There is an advantage to doing btf_put() early in bpf_map_put(), but in
the common case it would only be delayed until just after queue_work(),
which is a minor time delay. For free_after_mult_rcu_gp the delay is much
longer, but the savings from freeing the btf early are minor compared to
the map itself.
I think it's cleaner to go back to v1 and simply move btf_put() to
bpf_map_free_deferred(). A lot fewer things to worry about, especially
considering that BPF_RB_ROOT may not be the last such special
record-keeping type, and every new type would need to think hard about
whether it's BPF_RB_ROOT-like or BPF_LIST_NODE-like. v1 avoids this
future complexity.
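Roughly, moving btf_put() into bpf_map_free_deferred() would look like
the sketch below (my reconstruction of the v1 idea, not the actual
patch):

--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ static void bpf_map_free_deferred(struct work_struct *work)
 	struct bpf_map *map = container_of(work, struct bpf_map, work);
 	struct btf_record *rec = map->record;
+	struct btf *btf = map->btf;
 
 	security_bpf_map_free(map);
 	bpf_map_release_memcg(map);
 	/* implementation dependent freeing */
 	map->ops->map_free(map);
 	btf_record_free(rec);
+	/* delay btf_put() until after map_free(), which may still look
+	 * at rec->fields[i].graph_root.value_rec backed by this btf */
+	btf_put(btf);
@@ void bpf_map_put(struct bpf_map *map)
 		bpf_map_free_id(map);
-		btf_put(map->btf);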
No problem. Will send v6 tomorrow with v1.