On 12/13/23 8:17 PM, Alexei Starovoitov wrote:
On Fri, Dec 8, 2023 at 9:07 AM Yonghong Song <yonghong.song@xxxxxxxxx> wrote:
On 12/8/23 12:30 AM, Hou Tao wrote:
Hi,
On 12/8/2023 12:02 PM, Yonghong Song wrote:
On 12/7/23 7:59 PM, Yonghong Song wrote:
On 12/7/23 5:23 PM, Martin KaFai Lau wrote:
On 12/6/23 1:09 PM, Yonghong Song wrote:
When running `./test_progs -j` in my local VM with the latest kernel,
I once hit a KASAN error like the one below:
SNIP
Here, 'value_rec' is assigned in btf_check_and_fixup_fields() with the
following code:

  meta = btf_find_struct_meta(btf, btf_id);
  if (!meta)
          return -EFAULT;
  rec->fields[i].graph_root.value_rec = meta->record;
So basically, 'value_rec' is a pointer into a record owned by
struct_metas_tab; no separate reference is taken. It is possible that this
particular record has already been freed by btf_struct_metas_free(), and
hence we get the KASAN error here.
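The lifetime problem can be seen with a tiny userspace analogy (all names
below are illustrative stand-ins, not kernel code; build with
-fsanitize=address to get a report similar to the KASAN one):

#include <stdio.h>
#include <stdlib.h>

/* stand-ins for btf->struct_meta_tab and btf_field_graph_root */
struct record { int cnt; };
struct metas_tab { struct record rec; };
struct graph_root { struct record *value_rec; /* borrowed, no refcount */ };

int main(void)
{
	struct metas_tab *tab = malloc(sizeof(*tab));
	struct graph_root root;

	tab->rec.cnt = 1;
	/* like the assignment in btf_check_and_fixup_fields() */
	root.value_rec = &tab->rec;

	/* like btf_struct_metas_free() running from btf_put() before
	 * map->ops->map_free() has finished */
	free(tab);

	/* like map_free() touching value_rec: a use-after-free */
	printf("%d\n", root.value_rec->cnt);
	return 0;
}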
The failure is actually very hard to reproduce with the current
bpf/bpf-next code; I only hit the above error once. To make it easier to
reproduce, I added a delay in bpf_map_free_deferred() before
map->ops->map_free(), which significantly increased reproducibility:
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 5e43ddd1b83f..aae5b5213e93 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -695,6 +695,7 @@ static void bpf_map_free_deferred(struct work_struct *work)
 	struct bpf_map *map = container_of(work, struct bpf_map, work);
 	struct btf_record *rec = map->record;
 
+	mdelay(100);
 	security_bpf_map_free(map);
 	bpf_map_release_memcg(map);
 	/* implementation dependent freeing */
To fix the problem, we need to hold a reference on the btf in order to
safeguard access to field->graph_root.value_rec in map->ops->map_free().
The function btf_parse_graph_root() is the place to take that btf
reference. The following are the rough call stacks reaching
btf_parse_graph_root():
  btf_parse
    ...
      btf_parse_fields
        ...
          btf_parse_graph_root

  map_check_btf
    btf_parse_fields
      ...
        btf_parse_graph_root
Looking at the above call stacks, btf_parse_graph_root() is indirectly
called from either btf_parse() or map_check_btf(). We cannot take a
reference in the btf_parse() case since, at that moment, the btf is still
in the middle of self-validation and the initial reference
(refcount_set(&btf->refcnt, 1)) has not been set yet.
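Concretely, the change under discussion amounts to something like the
sketch below (my illustration of the plumbing, not the actual patch; the
"from_map_check" flag is borrowed from the discussion further down):

--- a/kernel/bpf/btf.c
+++ b/kernel/bpf/btf.c
@@ static int btf_parse_graph_root(...)
 	field->graph_root.btf = (struct btf *)btf;
+	/* btf_parse() has not done refcount_set(&btf->refcnt, 1) yet,
+	 * so only take the reference on the map_check_btf() path.
+	 */
+	if (from_map_check)
+		btf_get(field->graph_root.btf);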
Thanks for the detailed analysis and clear explanation. It helps a lot.
Sorry for jumping in late.

I am trying to avoid making a special case for "bool has_btf_ref;" and
"bool from_map_check". It seems a bit too much to deal with the error
path for btf_parse().
Maybe we could move the common btf used by kptr and graph_root into
btf_record and let the callers of btf_parse_fields() and
btf_record_free() decide the life cycle of the btf in btf_record, so the
change would be less intrusive with fewer special cases. The following is
the [...]
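(The patch itself is snipped above; below is my own rough sketch of the
idea, not Hou Tao's actual patch. The field placement inside struct
btf_record is an assumption.)

--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ struct btf_record {
 	u32 cnt;
 	u32 field_mask;
+	/* owning btf; the callers of btf_parse_fields() and
+	 * btf_record_free() decide whether a reference is held on it
+	 * and when that reference is dropped.
+	 */
+	struct btf *btf;
 	int spin_lock_off;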
I didn't fully check the code, but it looks like we take a btf reference
at map_check_btf() and free it at the end of bpf_map_free_deferred().
This is similar to my v1 patch: not exactly the same, but similar in
that both do btf_put() at the end of bpf_map_free_deferred().

Through discussion, the on-demand btf_get()/btf_put() approach, similar
to the kptr approach, seems more favored. It also has the advantage of
freeing the btf at its earliest possible point.
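For reference, btf_record_free() already drops per-field btf references
for kptr fields; the on-demand approach would mirror that for graph_root
fields, roughly as sketched below (context abbreviated; the
BPF_LIST_HEAD/BPF_RB_ROOT arm is my illustration, not committed code):

--- a/kernel/bpf/btf.c
+++ b/kernel/bpf/btf.c
@@ void btf_record_free(struct btf_record *rec)
 		case BPF_KPTR_REF:
 			btf_put(rec->fields[i].kptr.btf);
 			break;
+		case BPF_LIST_HEAD:
+		case BPF_RB_ROOT:
+			/* mirror the kptr pattern: drop the reference
+			 * taken when the graph_root field was parsed */
+			btf_put(rec->fields[i].graph_root.btf);
+			break;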
Sorry, looks like I recommended the wrong path. The approach of
btf_parse_fields(... false | true) depending on where it's called and on
whether the returned struct btf_record * will be kept within a type or
within a map is pushing complexity too far. A year from now we'll forget
these subtle details.
There is an advantage to doing btf_put() early in bpf_map_put(), but in
the common case it would only be delayed until just after queue_work(),
which is a minor time delay. For free_after_mult_rcu_gp the delay is much
longer, but the savings from freeing the btf early are minor compared to
the map itself.
I think it's cleaner to go back to v1 and simply move btf_put() to
bpf_map_free_deferred(). A lot fewer things to worry about, especially
considering that BPF_RB_ROOT may not be the last such special
record-keeping type, and every new type would need to think hard about
whether it's BPF_RB_ROOT-like or BPF_LIST_NODE-like. v1 avoids this
future complexity.
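Roughly, moving btf_put() into bpf_map_free_deferred() would look like
the sketch below (my reconstruction of the v1 idea, not the actual
patch):

--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ static void bpf_map_free_deferred(struct work_struct *work)
 	struct bpf_map *map = container_of(work, struct bpf_map, work);
 	struct btf_record *rec = map->record;
+	struct btf *btf = map->btf;
 
 	security_bpf_map_free(map);
 	bpf_map_release_memcg(map);
 	/* implementation dependent freeing */
 	map->ops->map_free(map);
 	btf_record_free(rec);
+	/* delay btf_put() until after map_free(), which may still look
+	 * at rec->fields[i].graph_root.value_rec backed by this btf */
+	btf_put(btf);
@@ void bpf_map_put(struct bpf_map *map)
 		bpf_map_free_id(map);
-		btf_put(map->btf);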
No problem. Will send v6 tomorrow with v1.