Hi,

On 12/8/2023 12:02 PM, Yonghong Song wrote:
>
> On 12/7/23 7:59 PM, Yonghong Song wrote:
>>
>> On 12/7/23 5:23 PM, Martin KaFai Lau wrote:
>>> On 12/6/23 1:09 PM, Yonghong Song wrote:
>>>> When running `./test_progs -j` in my local vm with the latest kernel,
>>>> I once hit a kasan error like below:
>>>>
>>>> SNIP
>>>>
>>>> Here, 'value_rec' is assigned in btf_check_and_fixup_fields() with the
>>>> following code:
>>>>
>>>>     meta = btf_find_struct_meta(btf, btf_id);
>>>>     if (!meta)
>>>>         return -EFAULT;
>>>>     rec->fields[i].graph_root.value_rec = meta->record;
>>>>
>>>> So basically, 'value_rec' is a pointer to the record in struct_metas_tab.
>>>> It is possible that that particular record has already been freed by
>>>> btf_struct_metas_free(), and hence we get the kasan error here.
>>>>
>>>> It is actually very hard to reproduce the failure with the current
>>>> bpf/bpf-next code; I only got the above error once. To increase
>>>> reproducibility, I added a delay in bpf_map_free_deferred() to delay
>>>> map->ops->map_free(), which significantly increased reproducibility.
>>>>
>>>>   diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
>>>>   index 5e43ddd1b83f..aae5b5213e93 100644
>>>>   --- a/kernel/bpf/syscall.c
>>>>   +++ b/kernel/bpf/syscall.c
>>>>   @@ -695,6 +695,7 @@ static void bpf_map_free_deferred(struct work_struct *work)
>>>>        struct bpf_map *map = container_of(work, struct bpf_map, work);
>>>>        struct btf_record *rec = map->record;
>>>>
>>>>   +    mdelay(100);
>>>>        security_bpf_map_free(map);
>>>>        bpf_map_release_memcg(map);
>>>>        /* implementation dependent freeing */
>>>>
>>>> To fix the problem, we need to hold a reference on the btf in order to
>>>> safeguard accessing field->graph_root.value_rec in map->ops->map_free().
>>>> The function btf_parse_graph_root() is the place to take the btf
>>>> reference. The following are rough call stacks reaching
>>>> btf_parse_graph_root():
>>>>
>>>>     btf_parse
>>>>       ...
>>>>         btf_parse_fields
>>>>           ...
>>>>             btf_parse_graph_root
>>>>
>>>>     map_check_btf
>>>>       btf_parse_fields
>>>>         ...
>>>>           btf_parse_graph_root
>>>>
>>>> Looking at the above call stacks, btf_parse_graph_root() is indirectly
>>>> called from either btf_parse() or map_check_btf().
>>>>
>>>> We cannot take a reference in the btf_parse() case since at that moment
>>>> the btf is still in the middle of self-validation and the initial
>>>> reference (refcount_set(&btf->refcnt, 1)) has not been set yet.
>>>
>>> Thanks for the detailed analysis and clear explanation. It helps a lot.
>>>
>>> Sorry for jumping in late.
>>>
>>> I am trying to avoid making a special case for "bool has_btf_ref;"
>>> and "bool from_map_check". It seems a bit too much to deal with
>>> the error path for btf_parse().

Maybe we could move the common btf used by kptr and graph_root into
btf_record and let the callers of btf_parse_fields() and btf_record_free()
decide the life cycle of the btf in btf_record, so the change would be less
intrusive and need fewer special cases.
The following code snippet shows the idea (only lightly tested):

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 1ad0ed623f50..a0c4d696a231 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -230,6 +230,7 @@ struct btf_record {
 	int spin_lock_off;
 	int timer_off;
 	int refcount_off;
+	struct btf *btf;
 	struct btf_field fields[];
 };

diff --git a/kernel/bpf/btf.c b/kernel/bpf/btf.c
index 966dace27fae..08b4a2784ee4 100644
--- a/kernel/bpf/btf.c
+++ b/kernel/bpf/btf.c
@@ -3598,7 +3598,6 @@ static int btf_parse_kptr(struct btf *btf, struct btf_field *field,
 		field->kptr.dtor = NULL;
 		id = info->kptr.type_id;
 		kptr_btf = btf;
-		btf_get(kptr_btf);
 		goto found_dtor;
 	}
 	if (id < 0)
@@ -3753,6 +3752,7 @@ struct btf_record *btf_parse_fields(struct btf *btf, const struct btf_type *t,
 	if (!rec)
 		return ERR_PTR(-ENOMEM);

+	rec->btf = btf;
 	rec->spin_lock_off = -EINVAL;
 	rec->timer_off = -EINVAL;
 	rec->refcount_off = -EINVAL;

diff --git a/kernel/bpf/map_in_map.c b/kernel/bpf/map_in_map.c
index 8ef269e66ba5..d9f4198158b6 100644
--- a/kernel/bpf/map_in_map.c
+++ b/kernel/bpf/map_in_map.c
@@ -56,6 +56,8 @@ struct bpf_map *bpf_map_meta_alloc(int inner_map_ufd)
 		ret = PTR_ERR(inner_map_meta->record);
 		goto free;
 	}
+	if (inner_map_meta->record)
+		btf_get(inner_map_meta->record->btf);
+
 	/* Note: We must use the same BTF, as we also used btf_record_dup above
 	 * which relies on BTF being same for both maps, as some members like
 	 * record->fields.list_head have pointers like value_rec pointing into

diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 30ed7756fc71..d2641e51a1a7 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -516,7 +516,8 @@ void btf_record_free(struct btf_record *rec)
 		case BPF_KPTR_PERCPU:
 			if (rec->fields[i].kptr.module)
 				module_put(rec->fields[i].kptr.module);
-			btf_put(rec->fields[i].kptr.btf);
+			if (rec->fields[i].kptr.btf != rec->btf)
+				btf_put(rec->fields[i].kptr.btf);
 			break;
 		case BPF_LIST_HEAD:
 		case BPF_LIST_NODE:
@@ -537,8 +538,13 @@ void btf_record_free(struct btf_record *rec)

 void bpf_map_free_record(struct bpf_map *map)
 {
+	struct btf *btf = NULL;
+
+	if (!IS_ERR_OR_NULL(map->record))
+		btf = map->record->btf;
 	btf_record_free(map->record);
 	map->record = NULL;
+	btf_put(btf);
 }

 struct btf_record *btf_record_dup(const struct btf_record *rec)
@@ -561,7 +567,8 @@ struct btf_record *btf_record_dup(const struct btf_record *rec)
 		case BPF_KPTR_UNREF:
 		case BPF_KPTR_REF:
 		case BPF_KPTR_PERCPU:
-			btf_get(fields[i].kptr.btf);
+			if (fields[i].kptr.btf != rec->btf)
+				btf_get(fields[i].kptr.btf);
 			if (fields[i].kptr.module && !try_module_get(fields[i].kptr.module)) {
 				ret = -ENXIO;
 				goto free;
@@ -692,7 +699,10 @@ static void bpf_map_free_deferred(struct work_struct *work)
 {
 	struct bpf_map *map = container_of(work, struct bpf_map, work);
 	struct btf_record *rec = map->record;
+	struct btf *btf = NULL;

+	if (!IS_ERR_OR_NULL(rec))
+		btf = rec->btf;
 	security_bpf_map_free(map);
 	bpf_map_release_memcg(map);
 	/* implementation dependent freeing */
@@ -707,6 +717,7 @@ static void bpf_map_free_deferred(struct work_struct *work)
 	 * template bpf_map struct used during verification.
 	 */
 	btf_record_free(rec);
+	btf_put(btf);
 }

 static void bpf_map_put_uref(struct bpf_map *map)
@@ -1036,6 +1047,8 @@ static int map_check_btf(struct bpf_map *map, struct btf *btf,
 	if (!IS_ERR_OR_NULL(map->record)) {
 		int i;

+		btf_get(map->record->btf);
+
 		if (!bpf_capable()) {
 			ret = -EPERM;
 			goto free_map_tab;

WDYT ?

>>>
>>> Would doing the refcount_set(&btf->refcnt, 1) earlier in btf_parse
>>> help?
>>
>> No, it does not. The core reason is what Hao mentioned in
>> https://lore.kernel.org/bpf/47ee3265-23f7-2130-ff28-27bfaf3f7877@xxxxxxxxxxxxxxx/
>>
>> We simply cannot take a btf reference if called from btf_parse().
>> Let us say we move refcount_set(&btf->refcnt, 1) earlier in btf_parse()
>> so that we do take a ref on the btf during btf_parse_fields(). Then we have
>>
>>     btf_put           <=== expects refcount == 0 to start the destruction process
>>       ...
>>         btf_record_free  <=== in which, if there is a graph_root, a btf
>>                               reference is still being held
>>
>> so btf_put will never be able to actually free the btf data.
>> Yes, the kasan problem would be resolved, but we would leak memory.
>
> Let me send another version with a better commit message.
>
> [...]
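To make the lifetime argument above concrete, here is a minimal standalone
userspace sketch of the two scenarios being discussed: the use-after-free
kasan reported (a record whose value_rec points into btf-owned memory that
is freed first) and the leak that would result from letting the record pin
its own btf. The structs and refcounting below are simplified stand-ins that
only borrow names from the thread (btf, btf_record, value_rec, btf_put);
they are not the real kernel definitions.

    /*
     * Simplified userspace sketch -- not kernel code.
     */
    #include <stdio.h>
    #include <stdlib.h>

    struct btf {
            int refcnt;
            int *struct_metas;        /* stand-in for struct_metas_tab */
    };

    struct btf_record {
            int *value_rec;           /* points into btf-owned memory,
                                       * like graph_root.value_rec */
    };

    static struct btf *btf_alloc(void)
    {
            struct btf *btf = calloc(1, sizeof(*btf));

            btf->refcnt = 1;          /* like refcount_set(&btf->refcnt, 1) */
            btf->struct_metas = calloc(1, sizeof(int));
            return btf;
    }

    static void btf_put(struct btf *btf)
    {
            if (--btf->refcnt == 0) {
                    free(btf->struct_metas); /* like btf_struct_metas_free() */
                    free(btf);
            }
    }

    int main(void)
    {
            struct btf *btf = btf_alloc();
            struct btf_record rec = { .value_rec = btf->struct_metas };

            /* Map creation pins the btf that rec.value_rec points into,
             * as the btf_get() in map_check_btf() does in the snippet above. */
            btf->refcnt++;

            btf_put(btf);             /* creator's own reference dropped */

            /* map_free can still dereference value_rec safely ... */
            printf("value_rec still valid: %d\n", *rec.value_rec);

            btf_put(btf);             /* ... and map teardown frees the btf */

            /* Without the pinning reference, the first btf_put() would free
             * struct_metas and the printf() above would be the use-after-free
             * kasan reported.  Conversely, if the reference were taken from
             * btf_parse() and only dropped by btf_record_free() on the btf's
             * own destruction path, the refcount could never reach zero:
             * the leak described above. */
            return 0;
    }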