On Fri, Dec 8, 2023 at 9:07 AM Yonghong Song <yonghong.song@xxxxxxxxx> wrote:
>
>
> On 12/8/23 12:30 AM, Hou Tao wrote:
> > Hi,
> >
> > On 12/8/2023 12:02 PM, Yonghong Song wrote:
> >> On 12/7/23 7:59 PM, Yonghong Song wrote:
> >>> On 12/7/23 5:23 PM, Martin KaFai Lau wrote:
> >>>> On 12/6/23 1:09 PM, Yonghong Song wrote:
> >>>>> When running `./test_progs -j` in my local vm with latest kernel,
> >>>>> I once hit a kasan error like below:
> >>>>>
> >>>>>
> > SNIP
> >>>>> Here, 'value_rec' is assigned in btf_check_and_fixup_fields() with
> >>>>> the following code:
> >>>>>
> >>>>>     meta = btf_find_struct_meta(btf, btf_id);
> >>>>>     if (!meta)
> >>>>>         return -EFAULT;
> >>>>>     rec->fields[i].graph_root.value_rec = meta->record;
> >>>>>
> >>>>> So basically, 'value_rec' is a pointer to the record in
> >>>>> struct_metas_tab, and it is possible that that particular record
> >>>>> has been freed by btf_struct_metas_free() and hence we have a
> >>>>> kasan error here.
> >>>>>
> >>>>> Actually it is very hard to reproduce the failure with the current
> >>>>> bpf/bpf-next code; I only got the above error once. To increase
> >>>>> reproducibility, I added a delay in bpf_map_free_deferred() to
> >>>>> delay map->ops->map_free(), which significantly increased
> >>>>> reproducibility.
> >>>>>
> >>>>> diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
> >>>>> index 5e43ddd1b83f..aae5b5213e93 100644
> >>>>> --- a/kernel/bpf/syscall.c
> >>>>> +++ b/kernel/bpf/syscall.c
> >>>>> @@ -695,6 +695,7 @@ static void bpf_map_free_deferred(struct work_struct *work)
> >>>>>          struct bpf_map *map = container_of(work, struct bpf_map, work);
> >>>>>          struct btf_record *rec = map->record;
> >>>>>
> >>>>> +        mdelay(100);
> >>>>>          security_bpf_map_free(map);
> >>>>>          bpf_map_release_memcg(map);
> >>>>>          /* implementation dependent freeing */
> >>>>>
> >>>>> To fix the problem, we need to have a reference on btf in order to
> >>>>> safeguard accessing field->graph_root.value_rec in
> >>>>> map->ops->map_free(). The function btf_parse_graph_root() is the
> >>>>> place to get a btf reference. The following are rough call stacks
> >>>>> reaching btf_parse_graph_root():
> >>>>>
> >>>>>     btf_parse
> >>>>>       ...
> >>>>>         btf_parse_fields
> >>>>>           ...
> >>>>>             btf_parse_graph_root
> >>>>>
> >>>>>     map_check_btf
> >>>>>       btf_parse_fields
> >>>>>         ...
> >>>>>           btf_parse_graph_root
> >>>>>
> >>>>> Looking at the above call stacks, btf_parse_graph_root() is
> >>>>> indirectly called from btf_parse() or map_check_btf().
> >>>>>
> >>>>> We cannot take a reference in the btf_parse() case since at that
> >>>>> moment btf is still in the middle of self-validation and the
> >>>>> initial reference (refcount_set(&btf->refcnt, 1)) has not been
> >>>>> set yet.
> >>>> Thanks for the detailed analysis and clear explanation. It helps a lot.
> >>>>
> >>>> Sorry for jumping in late.
> >>>>
> >>>> I am trying to avoid making a special case for "bool has_btf_ref;"
> >>>> and "bool from_map_check". It seems a bit too much to deal with
> >>>> the error path for btf_parse().
> > Maybe we could move the common btf used by kptr and graph_root into
> > bpf_record and let the callers of btf_parse_fields() and
> > btf_record_free() decide the life cycle of btf in btf_record, so it
> > will be less intrusive and have fewer special cases. The following is the
>
> I didn't fully check the code, but it looks like we take a
> btf reference at map_check_btf() and free it at the end
> of bpf_map_free_deferred().
> This is similar to my v1 patch, not exactly the same but similar
> since they all do btf_put() at the end of bpf_map_free_deferred().
>
> Through discussion, the on-demand btf_get()/btf_put() approach,
> similar to the kptr approach, seems more favored. This also has the
> advantage of freeing btf at its earliest possible point.

Sorry. Looks like I recommended the wrong path.
The approach of btf_parse_fields(... false | true) depending on where
it's called and whether the returned struct btf_record * will be kept
within a type or within a map is pushing complexity too far.
A year from now we'll forget these subtle details.

There is an advantage to doing btf_put() early in bpf_map_put(), but
in the common case it would only be delayed until just after
queue_work(), which is a minor time delay. For free_after_mult_rcu_gp
it is much longer, but the savings from freeing btf early are minor
compared to the map itself.

I think it's cleaner to go back to v1 and simply move btf_put() to
bpf_map_free_deferred(). A lot fewer things to worry about, especially
considering that BPF_RB_ROOT may not be the last such special
record-keeping type, and every new type would need to think hard about
whether it's BPF_RB_ROOT-like or BPF_LIST_NODE-like. v1 avoids this
future complexity.
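To make it concrete, a rough, untested sketch of what moving btf_put()
into bpf_map_free_deferred() could look like, based on the version of
bpf_map_free_deferred() shown in the diff above (exact placement and
the comment wording are mine):

static void bpf_map_free_deferred(struct work_struct *work)
{
        struct bpf_map *map = container_of(work, struct bpf_map, work);
        struct btf_record *rec = map->record;
        struct btf *btf = map->btf;     /* save it; map_free() frees 'map' itself */

        security_bpf_map_free(map);
        bpf_map_release_memcg(map);
        /* implementation dependent freeing */
        map->ops->map_free(map);
        /* Free btf_record and drop the btf reference only after
         * map_free(), since map_free() may still dereference
         * field->graph_root.value_rec, which points into the btf's
         * struct_metas_tab.
         */
        btf_record_free(rec);
        btf_put(btf);
}

bpf_map_put() would then drop its btf_put(map->btf) call, so the btf
(and the struct_metas_tab that value_rec points into) is guaranteed to
outlive map->ops->map_free() regardless of which field types the map
record contains.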