On 31/05/2024 03:22, Eduard Zingerman wrote:
> On Tue, 2024-05-28 at 13:24 +0100, Alan Maguire wrote:
>
> [...]
>
>> +/* Build a map from distilled base BTF ids to base BTF ids. To do so, iterate
>> + * through base BTF looking up distilled type (using binary search) equivalents.
>> + */
>> +static int btf_relocate_map_distilled_base(struct btf_relocate *r)
>> +{
>
> I have several observations about this algorithm.
>
> The algorithm does Base.Cnt * log2(Dist.Cnt) binary searches.
> However, it might be better to switch searches around
> and look for distilled {name,size} pairs in base btf,
> doing Dist.Cnt * log2(Base.cnt) searches instead.
> Suppose Base.Cnt = 2**20 and Dist.Cnt = 2**10, in such case:
> - Base.Cnt * log2(Dist.Cnt) = 2**20 * 10
> - Dist.Cnt * log2(Base.cnt) = 2**10 * 20, which is smaller
>

Hi Eduard,

I crunched some numbers on base and distilled base BTF to try and flesh
this out a bit more.

The number of struct, union, fwd, int, float, enum and enum64 types in
my vmlinux BTF is 14307. This is 11% of the overall number of types, and
it is this 11% we will be searching distilled BTF for matches (we avoid
searching for any other types, as we've previously verified that our
distilled base BTF consists of these kinds only).

In terms of distilled BTF sizes, I built a kernel forcing distilled base
BTF for all 2708 modules to help get a sense of distilled .BTF.base
sizes. We can sort the modules by .BTF.base size as follows:

modules=$(find . -name '*.ko' -depth -print)
for module in $modules ; do
	size=$(objdump -h $module | awk '/.BTF.base/ { print $3 }')
	printf '%s %d\n' $module "0x0$size"
done | sort -n -k2

With this process, we see the largest .BTF.base section is

./drivers/net/ethernet/chelsio/cxgb4/cxgb4.ko 15732

...but most others are much less than 1k bytes in size. The in-tree
kernel modules probably utilize more kernel types than out-of-tree ones,
so I suspect they give us a good sense of the worst-case scenario for
distilled BTF size.
Examining the .BTF.base section of cxgb4, we see 609 types. So in this
worst-case scenario I could generate:

Base.Cnt = 14307, Dist.Cnt = 600

Base.Cnt * log2(Dist.Cnt) = 14307 * 9  = 128763 comparisons
Dist.Cnt * log2(Base.Cnt) =   600 * 14 =   8400 comparisons

So that looks pretty cut-and-dried, but we also have to factor in the
initial sort comparisons. The in-kernel sort reports n*log2(n) + 0.37*n
+ o(n) comparisons on average; that means sorting requires at least

Base: 14307*14 + 0.37*14307 = 205592 comparisons
Dist:   600*9  + 0.37*600   =   5622 comparisons

So we get an inversion of the above results: unless I'm miscalculating
something, sorting distilled base BTF requires fewer comparisons overall
across both sort+search.

Sort comparisons	Search comparisons	Total
======================================================================
  5622 (distilled)	128763 (to base)	134385
205592 (base)		  8400 (to distilled)	213992

To validate that the distilled numbers are roughly right, I DTraced[1]
comparison functions when loading cxgb4; as above we expect around
134385 across sort and search:

$ sudo dtrace -n 'fbt::cmp_btf_name_size:entry { @c = count(); }'
dtrace: description 'fbt::cmp_btf_name_size:entry ' matched 1 probe
^C
	107971

So the number (107971 calls to cmp_btf_name_size) seems to be in the
ballpark overall; next I tried aggregating by stack to see if the
numbers look right in sort versus search:

$ sudo dtrace -n 'fbt::cmp_btf_name_size:entry { @c[stack()] = count(); }'
dtrace: description 'fbt::cmp_btf_name_size:entry ' matched 1 probe
^C

	vmlinux`cmp_btf_name_size+0x5
	vmlinux`sort+0x34
	vmlinux`btf_relocate_map_distilled_base+0xce
	vmlinux`btf_relocate+0x1e3
	vmlinux`btf_parse_module+0x24b
	vmlinux`btf_module_notify+0xee
	vmlinux`notifier_call_chain+0x65
	vmlinux`blocking_notifier_call_chain_robust+0x67
	vmlinux`load_module+0x7fa
	vmlinux`init_module_from_file+0x97
	vmlinux`idempotent_init_module+0x109
	vmlinux`__x64_sys_finit_module+0x64
	vmlinux`x64_sys_call+0x1480
	vmlinux`do_syscall_64+0x68
	vmlinux`entry_SYSCALL_64_after_hwframe+0x76
	5882

	vmlinux`cmp_btf_name_size+0x5
	vmlinux`btf_relocate_map_distilled_base+0x29f
	vmlinux`btf_relocate+0x1e3
	vmlinux`btf_parse_module+0x24b
	vmlinux`btf_module_notify+0xee
	vmlinux`notifier_call_chain+0x65
	vmlinux`blocking_notifier_call_chain_robust+0x67
	vmlinux`load_module+0x7fa
	vmlinux`init_module_from_file+0x97
	vmlinux`idempotent_init_module+0x109
	vmlinux`__x64_sys_finit_module+0x64
	vmlinux`x64_sys_call+0x1480
	vmlinux`do_syscall_64+0x68
	vmlinux`entry_SYSCALL_64_after_hwframe+0x76
	102089

Yep, looks right - 5882 sort comparisons versus 102089 search
comparisons.

I also traced the time for btf_relocate() to complete in-kernel,
collecting the module name, number of distilled base types, number of
base types and the time btf_relocate() took. I loaded 6 modules with
varying numbers of distilled base BTF types.

$ sudo dtrace -n 'fbt::btf_relocate:entry {
	self->start = timestamp;
	self->btf = (struct btf *)arg0;
	self->nr_dist_types = self->btf->base_btf->nr_types
}
fbt::btf_relocate:return
/self->start/
{
	@reloc[stringof(self->btf->name), self->nr_dist_types,
	       self->btf->base_btf->nr_types] = avg(timestamp - self->start);
}'
dtrace: description 'fbt::btf_relocate:entry ' matched 2 probes
^C

  hid_ite	 11	124271	4432193
  lib80211	 12	124271	4445910
  sctp		109	124271	5547703
  ib_core	153	124271	5803176
  bnxt_en	147	124271	5846436
  cxgb4		610	124271	8081113

So the overall relocation time - from 11 distilled types in hid_ite to
610 for cxgb4 - falls within a range from 4.5msec (4432193ns above) to
8msec. The times for relocation represent less than 50% of overall
module load times - the latter vary from 11-18msec across these modules.

It would be great to find some performance wins here, but I don't
_think_ swapping the sort/search targets will buy us much
unfortunately.

> The algorithm might not handle name duplicates in the distilled BTF well,
> e.g.
> in theory, the following is a valid C code
>
>     struct foo { int f; }; // sizeof(struct foo) == 4
>     typedef int foo;       // sizeof(foo) == 4
>
> Suppose that these types are a part of the distilled BTF.
> Depending on which one would end up first in 'dist_base_info_sorted'
> bsearch might fail to find one or the other.
>

In the case of distilled base BTF, only struct, union, enum, enum64,
int, float and fwd kinds can be present. Size matches would have to be
between one of these kinds I think, but are still possible nevertheless.

> Also, algorithm does not report an error if there are several
> types with the same name and size in the base BTF.
>

Yep, while we have to handle this, it only becomes an ambiguity problem
if distilled base BTF refers to one such type. On my vmlinux I see the
following duplicate name/size STRUCTs:

'd_partition' size=16
'elf_note_info' size=248
'getdents_callback' size=40
'instance_attribute' size=32
'intel_pinctrl_context' size=16
'intel_pinctrl' size=744
'perf_aux_event' size=16
'quirk_entry' size=8

Of these, 5 seem legit: d_partition, getdents_callback,
instance_attribute, perf_aux_event, quirk_entry.
A few seem to be identical, possibly dedup failures:

struct intel_pinctrl {
	struct device *dev;
	raw_spinlock_t lock;
	struct pinctrl_desc pctldesc;
	struct pinctrl_dev *pctldev;
	struct gpio_chip chip;
	const struct intel_pinctrl_soc_data *soc;
	struct intel_community *communities;
	size_t ncommunities;
	struct intel_pinctrl_context context;
	int irq;
};

struct intel_pinctrl___2 {
	struct device *dev;
	raw_spinlock_t lock;
	struct pinctrl_desc pctldesc;
	struct pinctrl_dev *pctldev;
	struct gpio_chip chip;
	const struct intel_pinctrl_soc_data *soc;
	struct intel_community *communities;
	size_t ncommunities;
	struct intel_pinctrl_context___2 context;
	int irq;
};

struct elf_thread_core_info;

struct elf_note_info {
	struct elf_thread_core_info *thread;
	struct memelfnote psinfo;
	struct memelfnote signote;
	struct memelfnote auxv;
	struct memelfnote files;
	siginfo_t csigdata;
	size_t size;
	int thread_notes;
};

struct elf_thread_core_info___2;

struct elf_note_info___2 {
	struct elf_thread_core_info___2 *thread;
	struct memelfnote psinfo;
	struct memelfnote signote;
	struct memelfnote auxv;
	struct memelfnote files;
	compat_siginfo_t csigdata;
	size_t size;
	int thread_notes;
};

Both of these pairs are self-referential, either directly or indirectly,
so maybe it's a corner case of dedup we're missing. I'll dig into these
later.

> I suggest to modify the algorithm as follows:
> - let 'base_info_sorted' be a set of tuples {kind,name,size,id}
>   corresponding to the base BTF, sorted by kind, name and size;

That was my first thought, but we can't always search by kind; for
example it's possible the distilled base has a fwd and vmlinux only has
a struct kind for the same type name; in such a case we'd want to
support a match provided the fwd's kflag indicated a struct fwd. In
fact, looking at the code, we're missing logic for the opposite
condition (fwd only in base, struct in distilled base); I'll fix that.
The other case is an enum in distilled base matching an enum64 or an
enum in base.
> - add a custom utility bsearch_unique, that behaves like bsearch,
>   but returns NULL if entry is non-unique with regards to current
>   predicate (e.g. use bsearch but also check neighbors);
> - for each type D in the distilled base:
>   - use bsearch_unique to find entry E in 'base_info_sorted'
>     that matches D.{kind,name,size} sub-tuple;
>   - if E exists, set id_map[D] := E.id;
>   - if E does not exist:
>     - if id_map[D] == BTF_IS_EMBEDDED, report an error;
>     - if id_map[D] != BTF_IS_EMBEDDED:
>       - use bsearch_unique to find entry E in 'base_info_sorted'
>         that matches D.{kind,name} sub-tuple;
>       - if E exists, set id_map[D] := E.id;
>       - otherwise, report an error.
>
> This allows to:
> - flip the search order, potentially gaining some speed;
> - drop the 'base_name_cnt' array and logic;
> - handle the above hypothetical name conflict example.
>

I think flipping the search order could gain search speed, but only at
the expense of slowing things down overall due to the extra cost of
having to sort so many more elements. I suspect it will mostly be a
wash, though the numbers above seem to suggest sorting distilled base
may have an edge when we consider both search and sort. The question is
probably which sort/search order is most amenable to handling the data
and helping us deal with edge cases like duplicates.
With the existing scheme, I think catching cases of name duplicates in
distilled base BTF and name/size duplicates in base BTF - for types we
want to relocate from distilled base BTF - and erroring out would
suffice; basically the following applied to this patch (patch 3 in the
series):

diff --git a/tools/lib/bpf/btf_relocate.c b/tools/lib/bpf/btf_relocate.c
index f2e91cdfb5cc..4e282ee8f183 100644
--- a/tools/lib/bpf/btf_relocate.c
+++ b/tools/lib/bpf/btf_relocate.c
@@ -113,6 +113,7 @@ static int btf_relocate_map_distilled_base(struct btf_relocate *r)
 {
 	struct btf_name_info *dist_base_info_sorted;
 	struct btf_type *base_t, *dist_t, *split_t;
+	const char *last_name = NULL;
 	__u8 *base_name_cnt = NULL;
 	int err = 0;
 	__u32 id;
@@ -136,6 +137,19 @@ static int btf_relocate_map_distilled_base(struct btf_relocate *r)
 	qsort(dist_base_info_sorted, r->nr_dist_base_types, sizeof(*dist_base_info_sorted),
 	      cmp_btf_name_size);
 
+	/* It is possible - though highly unlikely - that duplicate-named types
+	 * end up in distilled base BTF; error out if this is the case.
+	 */
+	for (id = 1; id < r->nr_dist_base_types; id++) {
+		if (last_name == dist_base_info_sorted[id].name) {
+			pr_warn("Multiple distilled base types [%u], [%u] share name '%s'; cannot relocate with base BTF.\n",
+				id - 1, id, last_name);
+			err = -EINVAL;
+			goto done;
+		}
+		last_name = dist_base_info_sorted[id].name;
+	}
+
 	/* Mark distilled base struct/union members of split BTF structs/unions
 	 * in id_map with BTF_IS_EMBEDDED; this signals that these types
 	 * need to match both name and size, otherwise embedding the base
@@ -272,6 +286,21 @@ static int btf_relocate_map_distilled_base(struct btf_relocate *r)
 		default:
 			continue;
 		}
+		if (r->id_map[dist_name_info->id] &&
+		    r->id_map[dist_name_info->id] != BTF_IS_EMBEDDED) {
+			/* We already have a match; this tells us that
+			 * multiple base types of the same name have the
+			 * same size, since for cases where multiple
+			 * types have the same name we match on name and
+			 * size. In this case, we have no way of
+			 * determining which to relocate to in base BTF,
+			 * so error out.
+			 */
+			pr_warn("distilled base BTF type '%s' [%u], size %u has multiple candidates of the same size (ids [%u, %u]) in base BTF\n",
+				base_name_info.name, dist_name_info->id, base_t->size,
+				id, r->id_map[dist_name_info->id]);
+			err = -EINVAL;
+			goto done;
+		}
 		/* map id and name */
 		r->id_map[dist_name_info->id] = id;
 		r->str_map[dist_t->name_off] = base_t->name_off;

With this change, I then tried using "bpftool btf dump -B vmlinux file
$module" for each of the 2700-odd modules I force-generated .BTF.base
sections for, to see if these conditions ever get triggered in practice
(your BTF parsing changes allow us to test relocation this way). It
seems they don't - all modules could relocate successfully with vmlinux
- which suggests that, at least initially, it might not be worth adding
additional complexity to the algorithm to handle them beyond error
checks like the above.

> Wdyt?
>

My personal take is that it would suffice to error out in some of the
edge cases, but I'm open to other approaches too. Hopefully some of the
data above helps us understand the costs of this approach at least.

Thanks!

Alan

[1] https://github.com/oracle/dtrace-utils

>> +	struct btf_name_info *dist_base_info_sorted;
>> +	struct btf_type *base_t, *dist_t, *split_t;
>> +	__u8 *base_name_cnt = NULL;
>> +	int err = 0;
>> +	__u32 id;
>> +
>> +	/* generate a sort index array of name/type ids sorted by name for
>> +	 * distilled base BTF to speed name-based lookups.
>> +	 */
>> +	dist_base_info_sorted = calloc(r->nr_dist_base_types, sizeof(*dist_base_info_sorted));
>> +	if (!dist_base_info_sorted) {
>> +		err = -ENOMEM;
>> +		goto done;
>> +	}
>> +	for (id = 0; id < r->nr_dist_base_types; id++) {
>> +		dist_t = btf_type_by_id(r->dist_base_btf, id);
>> +		dist_base_info_sorted[id].name = btf__name_by_offset(r->dist_base_btf,
>> +								     dist_t->name_off);
>> +		dist_base_info_sorted[id].id = id;
>> +		dist_base_info_sorted[id].size = dist_t->size;
>> +		dist_base_info_sorted[id].needs_size = true;
>> +	}
>> +	qsort(dist_base_info_sorted, r->nr_dist_base_types, sizeof(*dist_base_info_sorted),
>> +	      cmp_btf_name_size);
>> +
>> +	/* Mark distilled base struct/union members of split BTF structs/unions
>> +	 * in id_map with BTF_IS_EMBEDDED; this signals that these types
>> +	 * need to match both name and size, otherwise embedding the base
>> +	 * struct/union in the split type is invalid.
>> +	 */
>> +	for (id = r->nr_dist_base_types; id < r->nr_split_types; id++) {
>> +		split_t = btf_type_by_id(r->btf, id);
>> +		if (btf_is_composite(split_t)) {
>> +			err = btf_type_visit_type_ids(split_t, btf_mark_embedded_composite_type_ids,
>> +						      r);
>> +			if (err < 0)
>> +				goto done;
>> +		}
>> +	}
>> +
>> +	/* Collect name counts for composite types in base BTF. If multiple
>> +	 * instances of a struct/union of the same name exist, we need to use
>> +	 * size to determine which to map to since name alone is ambiguous.
>> +	 */
>> +	base_name_cnt = calloc(r->base_str_len, sizeof(*base_name_cnt));
>> +	if (!base_name_cnt) {
>> +		err = -ENOMEM;
>> +		goto done;
>> +	}
>> +	for (id = 1; id < r->nr_base_types; id++) {
>> +		base_t = btf_type_by_id(r->base_btf, id);
>> +		if (!btf_is_composite(base_t) || !base_t->name_off)
>> +			continue;
>> +		if (base_name_cnt[base_t->name_off] < 255)
>> +			base_name_cnt[base_t->name_off]++;
>> +	}
>> +
>> +	/* Now search base BTF for matching distilled base BTF types.
>> +	 */
>> +	for (id = 1; id < r->nr_base_types; id++) {
>> +		struct btf_name_info *dist_name_info, base_name_info = {};
>> +		int dist_kind, base_kind;
>> +
>> +		base_t = btf_type_by_id(r->base_btf, id);
>> +		/* distilled base consists of named types only. */
>> +		if (!base_t->name_off)
>> +			continue;
>> +		base_kind = btf_kind(base_t);
>> +		base_name_info.id = id;
>> +		base_name_info.name = btf__name_by_offset(r->base_btf, base_t->name_off);
>> +		switch (base_kind) {
>> +		case BTF_KIND_INT:
>> +		case BTF_KIND_FLOAT:
>> +		case BTF_KIND_ENUM:
>> +		case BTF_KIND_ENUM64:
>> +			/* These types should match both name and size */
>> +			base_name_info.needs_size = true;
>> +			base_name_info.size = base_t->size;
>> +			break;
>> +		case BTF_KIND_FWD:
>> +			/* No size considerations for fwds. */
>> +			break;
>> +		case BTF_KIND_STRUCT:
>> +		case BTF_KIND_UNION:
>> +			/* Size only needs to be used for struct/union if there
>> +			 * are multiple types in base BTF with the same name.
>> +			 * If there are multiple _distilled_ types with the same
>> +			 * name (a very unlikely scenario), that doesn't matter
>> +			 * unless corresponding _base_ types to match them are
>> +			 * missing.
>> +			 */
>> +			base_name_info.needs_size = base_name_cnt[base_t->name_off] > 1;
>> +			base_name_info.size = base_t->size;
>> +			break;
>> +		default:
>> +			continue;
>> +		}
>> +		dist_name_info = bsearch(&base_name_info, dist_base_info_sorted,
>> +					 r->nr_dist_base_types, sizeof(*dist_base_info_sorted),
>> +					 cmp_btf_name_size);
>> +		if (!dist_name_info)
>> +			continue;
>> +		if (!dist_name_info->id || dist_name_info->id > r->nr_dist_base_types) {
>> +			pr_warn("base BTF id [%d] maps to invalid distilled base BTF id [%d]\n",
>> +				id, dist_name_info->id);
>> +			err = -EINVAL;
>> +			goto done;
>> +		}
>> +		dist_t = btf_type_by_id(r->dist_base_btf, dist_name_info->id);
>> +		dist_kind = btf_kind(dist_t);
>> +
>> +		/* Validate that the found distilled type is compatible.
>> +		 * Do not error out on mismatch as another match may occur
>> +		 * for an identically-named type.
>> +		 */
>> +		switch (dist_kind) {
>> +		case BTF_KIND_FWD:
>> +			switch (base_kind) {
>> +			case BTF_KIND_FWD:
>> +				if (btf_kflag(dist_t) != btf_kflag(base_t))
>> +					continue;
>> +				break;
>> +			case BTF_KIND_STRUCT:
>> +				if (btf_kflag(base_t))
>> +					continue;
>> +				break;
>> +			case BTF_KIND_UNION:
>> +				if (!btf_kflag(base_t))
>> +					continue;
>> +				break;
>> +			default:
>> +				continue;
>> +			}
>> +			break;
>> +		case BTF_KIND_INT:
>> +			if (dist_kind != base_kind ||
>> +			    btf_int_encoding(base_t) != btf_int_encoding(dist_t))
>> +				continue;
>> +			break;
>> +		case BTF_KIND_FLOAT:
>> +			if (dist_kind != base_kind)
>> +				continue;
>> +			break;
>> +		case BTF_KIND_ENUM:
>> +			/* ENUM and ENUM64 are encoded as sized ENUM in
>> +			 * distilled base BTF.
>> +			 */
>> +			if (dist_kind != base_kind && base_kind != BTF_KIND_ENUM64)
>> +				continue;
>> +			break;
>> +		case BTF_KIND_STRUCT:
>> +		case BTF_KIND_UNION:
>> +			/* size verification is required for embedded
>> +			 * struct/unions.
>> +			 */
>> +			if (r->id_map[dist_name_info->id] == BTF_IS_EMBEDDED &&
>> +			    base_t->size != dist_t->size)
>> +				continue;
>> +			break;
>> +		default:
>> +			continue;
>> +		}
>> +		/* map id and name */
>> +		r->id_map[dist_name_info->id] = id;
>> +		r->str_map[dist_t->name_off] = base_t->name_off;
>> +	}
>> +	/* ensure all distilled BTF ids now have a mapping... */
>> +	for (id = 1; id < r->nr_dist_base_types; id++) {
>> +		const char *name;
>> +
>> +		if (r->id_map[id] && r->id_map[id] != BTF_IS_EMBEDDED)
>> +			continue;
>> +		dist_t = btf_type_by_id(r->dist_base_btf, id);
>> +		name = btf__name_by_offset(r->dist_base_btf, dist_t->name_off);
>> +		pr_warn("distilled base BTF type '%s' [%d] is not mapped to base BTF id\n",
>> +			name, id);
>> +		err = -EINVAL;
>> +		break;
>> +	}
>> +done:
>> +	free(base_name_cnt);
>> +	free(dist_base_info_sorted);
>> +	return err;
>> +}
>
> [...]