On 3/26/21 7:41 AM, Arnaldo Carvalho de Melo wrote:
Em Wed, Mar 24, 2021 at 11:53:32PM -0700, Yonghong Song escreveu:
This patch adds an option "merge_cus", which permits merging
all debug info cu's into one pahole cu.
For vmlinux built with clang thin-lto or lto, there exist
cross-cu type references. For example, you could have:
  compile unit 1:
    tag 10: type A
  compile unit 2:
    ...
    refers to type A (tag 10 in compile unit 1)
I only checked a few cases, but I have seen that type A may be a simple
type like "unsigned char" or a complex type like an array of base types.
There are two different ways to resolve this issue:
  (1) merge all compile units into one pahole cu so that tags/types
      can be resolved easily, or
  (2) do on-demand type traversal in other debuginfo cu's
      when we do die_process().
Method (2) is much more complicated, so I picked method (1).
An option "merge_cus" is added to permit such an operation.
Merging cu's will create a single cu with lots of types, tags
and functions. For example, with a clang thin-lto built vmlinux,
I saw 9M entries in the types table and 5.2M in the tags table.
Below are pahole wallclock times for different numbers of hashbits:
  command line: time pahole -J --merge_cus vmlinux
  # of hashbits   wallclock time in seconds
        15                   460
        16                   255
        17                   131
        18                    97
        19                    75
        20                    69
        21                    64
        22                    62
        23                    58
        24                    64
Note that 24 hashbits makes performance worse than 23.
The reason could be that 23 hashbits already covers 8M
buckets (close to the 9M entries in the types table);
a higher number of hashbits allocates more memory and
becomes less cache efficient than 23 hashbits.
This patch picks 21 hashbits as the starting value
and tries to allocate memory based on that; if the memory
allocation fails, we fall back to fewer hashbits until
we reach 15 hashbits, which is the default for the
non merge-cu case.
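As a rough sketch of that fallback (made-up names, not pahole's actual
identifiers), the allocation loop could look like this: start at 21 bits
and keep shrinking the table until calloc() succeeds, bottoming out at
the 15-bit default.

#include <stdint.h>
#include <stdlib.h>

#define HASHTAGS__BITS_DEFAULT	15	/* non merge-cu default */
#define HASHTAGS__BITS_MAX	21	/* starting value for --merge_cus */

/* Hypothetical helper: try the largest table first, halve on failure. */
static void *hashtags__try_alloc(uint32_t *bits)
{
	for (uint32_t b = HASHTAGS__BITS_MAX; b >= HASHTAGS__BITS_DEFAULT; --b) {
		void *table = calloc(1UL << b, sizeof(void *)); /* one list head per bucket */

		if (table != NULL) {
			*bits = b;	/* report which size we actually got */
			return table;
		}
	}
	return NULL;	/* even the 15-bit table could not be allocated */
}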
I'll probably add a way to specify the starting max_hashbits to be able
to use 'perf stat' to show what causes the performance difference.
The problem is with hashtags__find(), especially the loop:

	uint32_t bucket = hashtags__fn(id);
	const struct hlist_head *head = hashtable + bucket;

	/* walk the chain of entries hashed into this bucket */
	hlist_for_each_entry(tpos, pos, head, hash_node) {
		if (tpos->id == id)
			return tpos;
	}
Say we have 8M types and (1 << 15) buckets; that means
each bucket will have 256 elements, so each lookup will traverse
the loop 128 iterations on average.
If we have 1 << 21 buckets, then each bucket will have 4 elements,
and the average number of loop iterations for hashtags__find()
will be 2.
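The arithmetic above can be sanity checked with a tiny standalone program
(not pahole code, just back-of-the-envelope math assuming ~8M entries and
uniform hashing):

#include <stdio.h>

int main(void)
{
	const unsigned long nr_entries = 8UL << 20;	/* ~8M types */

	for (unsigned int bits = 15; bits <= 23; bits++) {
		unsigned long per_bucket = nr_entries >> bits;

		/* a successful lookup walks about half the chain on average */
		printf("hashbits %2u: ~%4lu entries/bucket, ~%5.1f iterations/lookup\n",
		       bits, per_bucket, per_bucket / 2.0);
	}
	return 0;
}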
If the patch needs a respin, I can add the above description
to the commit message.
I'm also adding the man page patch below; now to build the kernel with
your bpf-next patch to test it.
Thanks for adding the man page and testing; let me know if you
need any help!
- Arnaldo
[acme@five pahole]$ git diff
diff --git a/man-pages/pahole.1 b/man-pages/pahole.1
index cbbefbf22556412c..1be2a293ad4bcc50 100644
--- a/man-pages/pahole.1
+++ b/man-pages/pahole.1
@@ -208,6 +208,12 @@ information has float types.
.B \-\-btf_gen_all
Allow using all the BTF features supported by pahole.
+.TP
+.B \-\-merge_cus
+Merge all cus (except a possible types_cu) when loading DWARF. This is needed
+when processing files that have inter-CU references, which happens, for instance,
+when building the Linux kernel with clang using thin-LTO or LTO.
+
.TP
.B \-l, \-\-show_first_biggest_size_base_type_member
Show first biggest size base_type member.
[acme@five pahole]$