On 31/05/2024 03:22, Eduard Zingerman wrote:
> On Tue, 2024-05-28 at 13:24 +0100, Alan Maguire wrote:
>
> [...]
>
>> +/* Build a map from distilled base BTF ids to base BTF ids. To do so, iterate
>> + * through base BTF looking up distilled type (using binary search) equivalents.
>> + */
>> +static int btf_relocate_map_distilled_base(struct btf_relocate *r)
>> +{
>
> I have several observations about this algorithm.
>
> The algorithm does Base.Cnt * log2(Dist.Cnt) binary searches.
> However, it might be better to switch searches around
> and look for distilled {name,size} pairs in base btf,
> doing Dist.Cnt * log2(Base.cnt) searches instead.
> Suppose Base.Cnt = 2**20 and Dist.Cnt = 2**10, in such case:
> - Base.Cnt * log2(Dist.Cnt) = 2**20 * 10
> - Dist.Cnt * log2(Base.cnt) = 2**10 * 20, which is smaller
>

Hi Eduard,

I crunched some numbers on base and distilled base BTF to try and flesh
this out a bit more.

The number of struct, union, fwd, int, float, enum and enum64 types in
my vmlinux BTF is 14307. This is 11% of the overall number of types, and
it is this 11% we will be searching distilled BTF for matches (we avoid
searching for any other types, as we've previously verified that our
distilled base BTF consists of these kinds only).

In terms of distilled BTF sizes, I built a kernel forcing distilled base
BTF for all 2708 modules to help get a sense of distilled .BTF.base
sizes. We can sort the modules by .BTF.base size as follows:

modules=$(find . -name '*.ko' -depth -print)
for module in $modules ; do
	size=$(objdump -h $module | awk '/.BTF.base/ { print $3 }')
	printf '%s %d\n' $module "0x0$size"
done | sort -n -k2

With this process, we see the largest .BTF.base section is

./drivers/net/ethernet/chelsio/cxgb4/cxgb4.ko 15732

...but most others are much less than 1k bytes in size. The in-tree
kernel modules probably utilize more kernel types than out-of-tree ones,
so I suspect they give us a good sense of the worst-case scenario for
distilled BTF size.
Examining the .BTF.base section of cxgb4, we see 609 types. So in this
worst-case scenario I could generate:

Base.Cnt = 14307, Dist.Cnt = 600

Base.Cnt * log2(Dist.Cnt) = 14307 * 9  = 128763 comparisons
Dist.Cnt * log2(Base.Cnt) =   600 * 14 =   8400 comparisons

So that looks pretty cut-and-dried, but we also have to factor in the
initial sort comparisons. The in-kernel sort reports n*log2(n) + 0.37*n
+ o(n) comparisons on average; that means sorting requires at least

Base: 14307*14 + 0.37*14307 = 205592 comparisons
Dist:   600*9  + 0.37*600   =   5622 comparisons

So we get an inversion of the above results: unless I'm miscalculating
something, sorting distilled base BTF requires fewer comparisons overall
across both sort+search.

Sort comparisons	Search comparisons	Total
======================================================================
  5622 (distilled)	128763 (to base)	134385
205592 (base)		  8400 (to distilled)	213992

To validate that the distilled numbers are roughly right, I DTraced[1]
comparison functions when loading cxgb4; as above we expect around
134385 across sort and search:

$ sudo dtrace -n 'fbt::cmp_btf_name_size:entry { @c = count(); }'
dtrace: description 'fbt::cmp_btf_name_size:entry ' matched 1 probe
^C
	107971

So the number (107971 calls to cmp_btf_name_size) seems to be in the
ballpark overall; next I tried aggregating by stack to see if the
numbers look right in sort versus search:

$ sudo dtrace -n 'fbt::cmp_btf_name_size:entry { @c[stack()] = count(); }'
dtrace: description 'fbt::cmp_btf_name_size:entry ' matched 1 probe
^C

	vmlinux`cmp_btf_name_size+0x5
	vmlinux`sort+0x34
	vmlinux`btf_relocate_map_distilled_base+0xce
	vmlinux`btf_relocate+0x1e3
	vmlinux`btf_parse_module+0x24b
	vmlinux`btf_module_notify+0xee
	vmlinux`notifier_call_chain+0x65
	vmlinux`blocking_notifier_call_chain_robust+0x67
	vmlinux`load_module+0x7fa
	vmlinux`init_module_from_file+0x97
	vmlinux`idempotent_init_module+0x109
	vmlinux`__x64_sys_finit_module+0x64
	vmlinux`x64_sys_call+0x1480
	vmlinux`do_syscall_64+0x68
	vmlinux`entry_SYSCALL_64_after_hwframe+0x76
	5882

	vmlinux`cmp_btf_name_size+0x5
	vmlinux`btf_relocate_map_distilled_base+0x29f
	vmlinux`btf_relocate+0x1e3
	vmlinux`btf_parse_module+0x24b
	vmlinux`btf_module_notify+0xee
	vmlinux`notifier_call_chain+0x65
	vmlinux`blocking_notifier_call_chain_robust+0x67
	vmlinux`load_module+0x7fa
	vmlinux`init_module_from_file+0x97
	vmlinux`idempotent_init_module+0x109
	vmlinux`__x64_sys_finit_module+0x64
	vmlinux`x64_sys_call+0x1480
	vmlinux`do_syscall_64+0x68
	vmlinux`entry_SYSCALL_64_after_hwframe+0x76
	102089

Yep, looks right - 5882 sort comparisons versus 102089 search
comparisons.

I also traced the time for btf_relocate() to complete in-kernel,
collecting the module name, number of distilled base types, number of
base types and the time btf_relocate() took. I loaded 6 modules with
varying numbers of distilled base BTF types.

$ sudo dtrace -n 'fbt::btf_relocate:entry {
	self->start = timestamp;
	self->btf = (struct btf *)arg0;
	self->nr_dist_types = self->btf->base_btf->nr_types
}
fbt::btf_relocate:return
/self->start/
{
	@reloc[stringof(self->btf->name), self->nr_dist_types,
	       self->btf->base_btf->nr_types] = avg(timestamp - self->start);
}'
dtrace: description 'fbt::btf_relocate:entry ' matched 2 probes
^C

  hid_ite	 11	124271	4432193
  lib80211	 12	124271	4445910
  sctp		109	124271	5547703
  ib_core	153	124271	5803176
  bnxt_en	147	124271	5846436
  cxgb4		610	124271	8081113

So the overall relocation time - from 11 distilled types in hid_ite to
610 for cxgb4 - falls within a range from 4.5msec (4432193ns above) to
8msec. The times for relocation represent less than 50% of overall
module load times - the latter vary from 11-18msec across these modules.

It would be great to find some performance wins here, but I don't
_think_ swapping the sort/search targets will buy us much
unfortunately.

> The algorithm might not handle name duplicates in the distilled BTF well,
> e.g.
> in theory, the following is a valid C code
>
>     struct foo { int f; }; // sizeof(struct foo) == 4
>     typedef int foo;       // sizeof(foo) == 4
>
> Suppose that these types are a part of the distilled BTF.
> Depending on which one would end up first in 'dist_base_info_sorted'
> bsearch might fail to find one or the other.
>

In the case of distilled base BTF, only struct, union, enum, enum64,
int, float and fwd kinds can be present. Size matches would have to be
between one of these kinds I think, but are still possible nevertheless.

> Also, algorithm does not report an error if there are several
> types with the same name and size in the base BTF.
>

Yep, while we have to handle this, it only becomes an ambiguity problem
if distilled base BTF refers to one such type. On my vmlinux I see the
following duplicate name/size STRUCTs:

'd_partition' size=16
'elf_note_info' size=248
'getdents_callback' size=40
'instance_attribute' size=32
'intel_pinctrl_context' size=16
'intel_pinctrl' size=744
'perf_aux_event' size=16
'quirk_entry' size=8

Of these, 5 seem legit: d_partition, getdents_callback,
instance_attribute, perf_aux_event, quirk_entry.
A few seem to be identical, possibly dedup failures:

struct intel_pinctrl {
	struct device *dev;
	raw_spinlock_t lock;
	struct pinctrl_desc pctldesc;
	struct pinctrl_dev *pctldev;
	struct gpio_chip chip;
	const struct intel_pinctrl_soc_data *soc;
	struct intel_community *communities;
	size_t ncommunities;
	struct intel_pinctrl_context context;
	int irq;
};

struct intel_pinctrl___2 {
	struct device *dev;
	raw_spinlock_t lock;
	struct pinctrl_desc pctldesc;
	struct pinctrl_dev *pctldev;
	struct gpio_chip chip;
	const struct intel_pinctrl_soc_data *soc;
	struct intel_community *communities;
	size_t ncommunities;
	struct intel_pinctrl_context___2 context;
	int irq;
};

struct elf_thread_core_info;

struct elf_note_info {
	struct elf_thread_core_info *thread;
	struct memelfnote psinfo;
	struct memelfnote signote;
	struct memelfnote auxv;
	struct memelfnote files;
	siginfo_t csigdata;
	size_t size;
	int thread_notes;
};

struct elf_thread_core_info___2;

struct elf_note_info___2 {
	struct elf_thread_core_info___2 *thread;
	struct memelfnote psinfo;
	struct memelfnote signote;
	struct memelfnote auxv;
	struct memelfnote files;
	compat_siginfo_t csigdata;
	size_t size;
	int thread_notes;
};

Both of these pairs are self-referential, either directly or indirectly,
so maybe it's a corner case of dedup we're missing. I'll dig into these
later.

> I suggest to modify the algorithm as follows:
> - let 'base_info_sorted' be a set of tuples {kind,name,size,id}
>   corresponding to the base BTF, sorted by kind, name and size;

That was my first thought, but we can't always search by kind; for
example it's possible the distilled base has a fwd and vmlinux only has
a struct kind for the same type name; in such a case we'd want to
support a match provided the fwd's kflag indicated a struct fwd. In
fact, looking at the code, we're missing logic for the opposite
condition (fwd only in base, struct in distilled base); I'll fix that.
The other case is an enum in distilled base matching an enum64 or an
enum in base.
> - add a custom utility bsearch_unique, that behaves like bsearch,
>   but returns NULL if entry is non-unique with regards to current
>   predicate (e.g. use bsearch but also check neighbors);
> - for each type D in the distilled base:
>   - use bsearch_unique to find entry E in 'base_info_sorted'
>     that matches D.{kind,name,size} sub-tuple;
>   - if E exists, set id_map[D] := E.id;
>   - if E does not exist:
>     - if id_map[D] == BTF_IS_EMBEDDED, report an error;
>     - if id_map[D] != BTF_IS_EMBEDDED:
>       - use bsearch_unique to find entry E in 'base_info_sorted'
>         that matches D.{kind,name} sub-tuple;
>       - if E exists, set id_map[D] := E.id;
>       - otherwise, report an error.
>
> This allows to:
> - flip the search order, potentially gaining some speed;
> - drop the 'base_name_cnt' array and logic;
> - handle the above hypothetical name conflict example.
>

I think flipping the search order could gain search speed, but only at
the expense of slowing things down overall due to the extra cost of
having to sort so many more elements. I suspect it will mostly be a
wash, though the numbers above seem to suggest sorting distilled base
may have an edge when we consider both search and sort. The question is
probably which sort/search order is most amenable to handling the data
and helping us deal with edge cases like duplicates.
With the existing scheme, I think catching cases of name duplicates in
distilled base BTF and name/size duplicates in base BTF - for types we
want to relocate from distilled base BTF - and erroring out would
suffice; basically the following applied to this patch (patch 3 in the
series):

diff --git a/tools/lib/bpf/btf_relocate.c b/tools/lib/bpf/btf_relocate.c
index f2e91cdfb5cc..4e282ee8f183 100644
--- a/tools/lib/bpf/btf_relocate.c
+++ b/tools/lib/bpf/btf_relocate.c
@@ -113,6 +113,7 @@ static int btf_relocate_map_distilled_base(struct btf_relocate *r)
 {
 	struct btf_name_info *dist_base_info_sorted;
 	struct btf_type *base_t, *dist_t, *split_t;
+	const char *last_name = NULL;
 	__u8 *base_name_cnt = NULL;
 	int err = 0;
 	__u32 id;
@@ -136,6 +137,19 @@ static int btf_relocate_map_distilled_base(struct btf_relocate *r)
 	qsort(dist_base_info_sorted, r->nr_dist_base_types, sizeof(*dist_base_info_sorted),
 	      cmp_btf_name_size);
 
+	/* It is possible - though highly unlikely - that duplicate-named types
+	 * end up in distilled base BTF; error out if this is the case.
+	 */
+	for (id = 1; id < r->nr_dist_base_types; id++) {
+		if (last_name == dist_base_info_sorted[id].name) {
+			pr_warn("Multiple distilled base types [%u], [%u] share name '%s'; cannot relocate with base BTF.\n",
+				id - 1, id, last_name);
+			err = -EINVAL;
+			goto done;
+		}
+		last_name = dist_base_info_sorted[id].name;
+	}
+
 	/* Mark distilled base struct/union members of split BTF structs/unions
 	 * in id_map with BTF_IS_EMBEDDED; this signals that these types
 	 * need to match both name and size, otherwise embedding the base
@@ -272,6 +286,21 @@ static int btf_relocate_map_distilled_base(struct btf_relocate *r)
 		default:
 			continue;
 		}
+		if (r->id_map[dist_name_info->id] &&
+		    r->id_map[dist_name_info->id] != BTF_IS_EMBEDDED) {
+			/* We already have a match; this tells us that
+			 * multiple base types of the same name have the
+			 * same size, since for cases where multiple
+			 * types have the same name we match on name and
+			 * size. In this case, we have no way of
+			 * determining which to relocate to in base BTF,
+			 * so error out.
+			 */
+			pr_warn("distilled base BTF type '%s' [%u], size %u has multiple candidates of the same size (ids [%u, %u]) in base BTF\n",
+				base_name_info.name, dist_name_info->id, base_t->size,
+				id, r->id_map[dist_name_info->id]);
+			err = -EINVAL;
+			goto done;
+		}
 		/* map id and name */
 		r->id_map[dist_name_info->id] = id;
 		r->str_map[dist_t->name_off] = base_t->name_off;

With this change, I then tried using "bpftool btf dump -B vmlinux file
$module" for each of the 2700-odd modules I force-generated .BTF.base
sections for, to see if these conditions ever get triggered in practice
(your BTF parsing changes allow us to test relocation this way). It
seems they don't - all modules could relocate successfully with vmlinux
- which suggests that, at least initially, it might not be worth adding
additional complexity to the algorithm to handle them beyond error
checks like the above.

> Wdyt?
>

My personal take is that it would suffice to error out in some of the
edge cases, but I'm open to other approaches too. Hopefully some of the
data above helps us understand the costs of this approach at least.

Thanks!

Alan

[1] https://github.com/oracle/dtrace-utils

>> +	struct btf_name_info *dist_base_info_sorted;
>> +	struct btf_type *base_t, *dist_t, *split_t;
>> +	__u8 *base_name_cnt = NULL;
>> +	int err = 0;
>> +	__u32 id;
>> +
>> +	/* generate a sort index array of name/type ids sorted by name for
>> +	 * distilled base BTF to speed name-based lookups.
>> +	 */
>> +	dist_base_info_sorted = calloc(r->nr_dist_base_types, sizeof(*dist_base_info_sorted));
>> +	if (!dist_base_info_sorted) {
>> +		err = -ENOMEM;
>> +		goto done;
>> +	}
>> +	for (id = 0; id < r->nr_dist_base_types; id++) {
>> +		dist_t = btf_type_by_id(r->dist_base_btf, id);
>> +		dist_base_info_sorted[id].name = btf__name_by_offset(r->dist_base_btf,
>> +								     dist_t->name_off);
>> +		dist_base_info_sorted[id].id = id;
>> +		dist_base_info_sorted[id].size = dist_t->size;
>> +		dist_base_info_sorted[id].needs_size = true;
>> +	}
>> +	qsort(dist_base_info_sorted, r->nr_dist_base_types, sizeof(*dist_base_info_sorted),
>> +	      cmp_btf_name_size);
>> +
>> +	/* Mark distilled base struct/union members of split BTF structs/unions
>> +	 * in id_map with BTF_IS_EMBEDDED; this signals that these types
>> +	 * need to match both name and size, otherwise embedding the base
>> +	 * struct/union in the split type is invalid.
>> +	 */
>> +	for (id = r->nr_dist_base_types; id < r->nr_split_types; id++) {
>> +		split_t = btf_type_by_id(r->btf, id);
>> +		if (btf_is_composite(split_t)) {
>> +			err = btf_type_visit_type_ids(split_t, btf_mark_embedded_composite_type_ids,
>> +						      r);
>> +			if (err < 0)
>> +				goto done;
>> +		}
>> +	}
>> +
>> +	/* Collect name counts for composite types in base BTF. If multiple
>> +	 * instances of a struct/union of the same name exist, we need to use
>> +	 * size to determine which to map to since name alone is ambiguous.
>> +	 */
>> +	base_name_cnt = calloc(r->base_str_len, sizeof(*base_name_cnt));
>> +	if (!base_name_cnt) {
>> +		err = -ENOMEM;
>> +		goto done;
>> +	}
>> +	for (id = 1; id < r->nr_base_types; id++) {
>> +		base_t = btf_type_by_id(r->base_btf, id);
>> +		if (!btf_is_composite(base_t) || !base_t->name_off)
>> +			continue;
>> +		if (base_name_cnt[base_t->name_off] < 255)
>> +			base_name_cnt[base_t->name_off]++;
>> +	}
>> +
>> +	/* Now search base BTF for matching distilled base BTF types.
>> +	 */
>> +	for (id = 1; id < r->nr_base_types; id++) {
>> +		struct btf_name_info *dist_name_info, base_name_info = {};
>> +		int dist_kind, base_kind;
>> +
>> +		base_t = btf_type_by_id(r->base_btf, id);
>> +		/* distilled base consists of named types only. */
>> +		if (!base_t->name_off)
>> +			continue;
>> +		base_kind = btf_kind(base_t);
>> +		base_name_info.id = id;
>> +		base_name_info.name = btf__name_by_offset(r->base_btf, base_t->name_off);
>> +		switch (base_kind) {
>> +		case BTF_KIND_INT:
>> +		case BTF_KIND_FLOAT:
>> +		case BTF_KIND_ENUM:
>> +		case BTF_KIND_ENUM64:
>> +			/* These types should match both name and size */
>> +			base_name_info.needs_size = true;
>> +			base_name_info.size = base_t->size;
>> +			break;
>> +		case BTF_KIND_FWD:
>> +			/* No size considerations for fwds. */
>> +			break;
>> +		case BTF_KIND_STRUCT:
>> +		case BTF_KIND_UNION:
>> +			/* Size only needs to be used for struct/union if there
>> +			 * are multiple types in base BTF with the same name.
>> +			 * If there are multiple _distilled_ types with the same
>> +			 * name (a very unlikely scenario), that doesn't matter
>> +			 * unless corresponding _base_ types to match them are
>> +			 * missing.
>> +			 */
>> +			base_name_info.needs_size = base_name_cnt[base_t->name_off] > 1;
>> +			base_name_info.size = base_t->size;
>> +			break;
>> +		default:
>> +			continue;
>> +		}
>> +		dist_name_info = bsearch(&base_name_info, dist_base_info_sorted,
>> +					 r->nr_dist_base_types, sizeof(*dist_base_info_sorted),
>> +					 cmp_btf_name_size);
>> +		if (!dist_name_info)
>> +			continue;
>> +		if (!dist_name_info->id || dist_name_info->id > r->nr_dist_base_types) {
>> +			pr_warn("base BTF id [%d] maps to invalid distilled base BTF id [%d]\n",
>> +				id, dist_name_info->id);
>> +			err = -EINVAL;
>> +			goto done;
>> +		}
>> +		dist_t = btf_type_by_id(r->dist_base_btf, dist_name_info->id);
>> +		dist_kind = btf_kind(dist_t);
>> +
>> +		/* Validate that the found distilled type is compatible.
>> +		 * Do not error out on mismatch as another match may occur
>> +		 * for an identically-named type.
>> +		 */
>> +		switch (dist_kind) {
>> +		case BTF_KIND_FWD:
>> +			switch (base_kind) {
>> +			case BTF_KIND_FWD:
>> +				if (btf_kflag(dist_t) != btf_kflag(base_t))
>> +					continue;
>> +				break;
>> +			case BTF_KIND_STRUCT:
>> +				if (btf_kflag(base_t))
>> +					continue;
>> +				break;
>> +			case BTF_KIND_UNION:
>> +				if (!btf_kflag(base_t))
>> +					continue;
>> +				break;
>> +			default:
>> +				continue;
>> +			}
>> +			break;
>> +		case BTF_KIND_INT:
>> +			if (dist_kind != base_kind ||
>> +			    btf_int_encoding(base_t) != btf_int_encoding(dist_t))
>> +				continue;
>> +			break;
>> +		case BTF_KIND_FLOAT:
>> +			if (dist_kind != base_kind)
>> +				continue;
>> +			break;
>> +		case BTF_KIND_ENUM:
>> +			/* ENUM and ENUM64 are encoded as sized ENUM in
>> +			 * distilled base BTF.
>> +			 */
>> +			if (dist_kind != base_kind && base_kind != BTF_KIND_ENUM64)
>> +				continue;
>> +			break;
>> +		case BTF_KIND_STRUCT:
>> +		case BTF_KIND_UNION:
>> +			/* size verification is required for embedded
>> +			 * struct/unions.
>> +			 */
>> +			if (r->id_map[dist_name_info->id] == BTF_IS_EMBEDDED &&
>> +			    base_t->size != dist_t->size)
>> +				continue;
>> +			break;
>> +		default:
>> +			continue;
>> +		}
>> +		/* map id and name */
>> +		r->id_map[dist_name_info->id] = id;
>> +		r->str_map[dist_t->name_off] = base_t->name_off;
>> +	}
>> +	/* ensure all distilled BTF ids now have a mapping... */
>> +	for (id = 1; id < r->nr_dist_base_types; id++) {
>> +		const char *name;
>> +
>> +		if (r->id_map[id] && r->id_map[id] != BTF_IS_EMBEDDED)
>> +			continue;
>> +		dist_t = btf_type_by_id(r->dist_base_btf, id);
>> +		name = btf__name_by_offset(r->dist_base_btf, dist_t->name_off);
>> +		pr_warn("distilled base BTF type '%s' [%d] is not mapped to base BTF id\n",
>> +			name, id);
>> +		err = -EINVAL;
>> +		break;
>> +	}
>> +done:
>> +	free(base_name_cnt);
>> +	free(dist_base_info_sorted);
>> +	return err;
>> +}
>
> [...]