On Fri, 2024-05-31 at 16:38 +0100, Alan Maguire wrote: Hi Alan, > The in-kernel sort reports n*log2(n) + 0.37*n + o(n) comparisons on > average; for base BTF that means sorting requires at least > > Base: 14307*14+0.37*14307 = 205592 comparisons > Dist: 600*9+0.37*600 = 5622 comparisons > > So we get an inversion of the above results, with (unless I'm > miscalculating something), sorting distilled base BTF requiring less > comparisons overall across both sort+search. > > Sort Comparisons Search comparisons Total > ====================================================================== > 5622 (distilled) 128763 (to base) 134385 > 205592 (base) 8400 (to distilled) 213992 It was absolutely stupid of me to not include the base sort cost into the calculations, really embarrassing. Thank you for pointing this out and my apologies for suggesting such nonsense. [...] > > The algorithm might not handle name duplicates in the distilled BTF well, > > e.g. in theory, the following is a valid C code > > > > struct foo { int f; }; // sizeof(struct foo) == 4 > > typedef int foo; // sizeof(foo) == 4 > > > > Suppose that these types are a part of the distilled BTF. > > Depending on which one would end up first in 'dist_base_info_sorted' > > bsearch might fail to find one or the other. > > In the case of distilled base BTF, only struct, union, enum, enum64, > int, float and fwd can be present. Size matches would have to be between > one of these kinds I think, but are still possible nevertheless. As Andrii noted in a sibling reply, there is still a slim possibility for name duplicates in the distilled base. Imo, if we can catch the corner case we should. > > Also, algorithm does not report an error if there are several > > types with the same name and size in the base BTF. > > Yep, while we have to handle this, it only becomes an ambiguity problem > if distilled base BTF refers to one of such types. On my vmlinux I see > the following duplicate name/size STRUCTs As you noted, this situation is really easy to catch by checking if id_map slot is already occupied, so it should be checked. [...] > struct elf_thread_core_info___2; > > struct elf_note_info___2 { > struct elf_thread_core_info___2 *thread; > struct memelfnote psinfo; > struct memelfnote signote; > struct memelfnote auxv; > struct memelfnote files; > compat_siginfo_t csigdata; > size_t size; > int thread_notes; > }; > > Both of these share self-reference, either directly or indirectly so > maybe it's a corner-case of dedup we're missing. I'll dig into these later. This is interesting indeed. > > I suggest to modify the algorithm as follows: > > - let 'base_info_sorted' be a set of tuples {kind,name,size,id} > > corresponding to the base BTF, sorted by kind, name and size; > > That was my first thought, but we can't always search by kind; for > example it's possible the distilled base has a fwd and vmlinux only has > a struct kind for the same type name; in such a case we'd want to > support a match provided the fwd's kflag indicated a struct fwd. > > In fact looking at the code we're missing logic for the opposite > condition (fwd only in base, struct in distilled base). I'll fix that. > > The other case is an enum in distilled base matching an enum64 > or an enum. I think it could be possible to do some kinds normalization (e.g. represent fwd's as zero sized structs or unions in btf_name_info). I'll try to implement this and get back to you on Monday. [...] > I think flipping the search order could gain search speed, but only at > the expense of slowing things down overall due to the extra cost of > having to sort so many more elements. I suspect it will mostly be a > wash, though numbers above seem to suggest sorting distilled base may > have an edge when we consider both search and sort. The question is > probably which sort/search order is most amenable to handling the data > and helping us deal with the edge cases like duplicates. Yes, you are absolutely correct. [...] > @@ -136,6 +137,19 @@ static int btf_relocate_map_distilled_base(struct > btf_relocate *r) > qsort(dist_base_info_sorted, r->nr_dist_base_types, > sizeof(*dist_base_info_sorted), > cmp_btf_name_size); > > + /* It is possible - though highly unlikely - that > duplicate-named types > + * end up in distilled based BTF; error out if this is the case. > + */ > + for (id = 1; id < r->nr_dist_base_types; id++) { > + if (last_name == dist_base_info_sorted[id].name) { > + pr_warn("Multiple distilled base types [%u], > [%u] share name '%s'; cannot relocate with base BTF.\n", > + id - 1, id, last_name); > + err = -EINVAL; > + goto done; > + } > + last_name = dist_base_info_sorted[id].name; > + } > + Nit: this rejects a case when both distilled types are embedded and a counterpart for each could be found in base. But that's a bit inconvenient to check for in the current framework. Probably not important. > /* Mark distilled base struct/union members of split BTF > structs/unions > * in id_map with BTF_IS_EMBEDDED; this signals that these types > * need to match both name and size, otherwise embeddding the base > @@ -272,6 +286,21 @@ static int btf_relocate_map_distilled_base(struct > btf_relocate *r) > default: > continue; > } > + if (r->id_map[dist_name_info->id] && > + r->id_map[dist_name_info->id != BTF_IS_EMBEDDED) { > + /* we already have a match; this tells us that > + * multiple base types of the same name > + * have the same size, since for cases where > + * multiple types have the same name we match > + * on name and size. In this case, we have > + * no way of determining which to relocate > + * to in base BTF, so error out. > + */ > + pr_warn("distilled base BTF type '%s' [%u], size > %u has multiple candidates of the same size (ids [%u, %u]) in base BTF\n", > + base_name_info.name, dist_name_info->id, > base_t->size, > + id, r->id_map[dist_name_info->id]); > + err = -EINVAL; > + goto done; > + } I think this hunk should be added. [...] Best regards, Eduard