On Fri, Mar 26, 2021 at 08:18:07AM -0700, Yonghong Song wrote:
>
> On 3/26/21 7:41 AM, Arnaldo Carvalho de Melo wrote:
> > On Wed, Mar 24, 2021 at 11:53:32PM -0700, Yonghong Song wrote:
> > > This patch adds an option "merge_cus", which permits
> > > merging all debug info cu's into one pahole cu.
> > > For vmlinux built with clang thin-lto or lto, there exist
> > > cross cu type references. For example, you could have
> > >    compile unit 1:
> > >       tag 10:  type A
> > >    compile unit 2:
> > >       ...
> > >         refer to type A (tag 10 in compile unit 1)
> > > I only checked a few but have seen that type A may be a simple type
> > > like "unsigned char" or a complex type like an array of base types.
> > >
> > > There are two different ways to resolve this issue:
> > > (1). merge all compile units as one pahole cu so tags/types
> > >      can be resolved easily, or
> > > (2). try to do on-demand type traversal in other debuginfo cu's
> > >      when we do die_process().
> > > Method (2) is much more complicated so I picked method (1).
> > > An option "merge_cus" is added to permit such an operation.
> > >
> > > Merging cu's will create a single cu with lots of types, tags
> > > and functions. For example, with clang thin-lto built vmlinux,
> > > I saw 9M entries in the types table and 5.2M in the tags table.
> > > Below are pahole wallclock times for different hashbits:
> > > command line: time pahole -J --merge_cus vmlinux
> > >       # of hashbits      wallclock time in seconds
> > >           15                   460
> > >           16                   255
> > >           17                   131
> > >           18                    97
> > >           19                    75
> > >           20                    69
> > >           21                    64
> > >           22                    62
> > >           23                    58
> > >           24                    64
> > >
> > > Note that 24 hashbits makes performance worse
> > > than 23. The reason could be that 23 hashbits can cover 8M
> > > buckets (close to 9M for the number of entries in types table).
> > > A higher number of hash bits allocates more memory and becomes
> > > less cache efficient compared to 23 hashbits.
> > >
> > > This patch picks 21 hashbits as the starting value
> > > and tries to allocate memory based on that. If memory
> > > allocation fails, we go with fewer hashbits until
> > > we reach 15 hashbits, which is the default for the
> > > non merge-cu case.
> >
> > I'll probably add a way to specify the starting max_hashbits to be able
> > to use 'perf stat' to show what causes the performance difference.
>
> The problem is with hashtags__find(), esp. the loop
>
>         uint32_t bucket = hashtags__fn(id);
>         const struct hlist_head *head = hashtable + bucket;
>
>         hlist_for_each_entry(tpos, pos, head, hash_node) {
>                 if (tpos->id == id)
>                         return tpos;
>         }
>
> Say we have 8M types and (1 << 15) buckets, that means
> each bucket will have 256 elements. So each lookup will traverse
> the loop 128 iterations on average.
>
> If we have 1 << 21 buckets, then each bucket will have 4 elements,
> and the average number of loop iterations for hashtags__find()
> will be 2.
>
> If the patch needs respin, I can add the above descriptions
> in the commit message.

I can add that, as a comment.

- Arnaldo
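
For reference, the bucket arithmetic and the hashbits fallback discussed in
this thread could be sketched roughly as below. This is only an illustration
under stated assumptions, not pahole's code: avg_chain_len(),
avg_lookup_iterations(), alloc_table() and struct bucket are hypothetical
names, the hash is assumed to spread ids uniformly, and a successful lookup
is assumed to walk about half of a chain.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical names throughout; none of this is taken from pahole. */
#define MIN_HASHBITS 15 /* default for the non merge-cu case, per the thread */
#define MAX_HASHBITS 21 /* starting value picked by the patch, per the thread */

/* Average entries per bucket when nr_entries are spread uniformly
 * over (1 << hashbits) buckets. */
static uint64_t avg_chain_len(uint64_t nr_entries, unsigned int hashbits)
{
	assert(hashbits < 64);
	return nr_entries >> hashbits;
}

/* A successful lookup walks about half of the chain on average. */
static uint64_t avg_lookup_iterations(uint64_t nr_entries, unsigned int hashbits)
{
	return avg_chain_len(nr_entries, hashbits) / 2;
}

/* Stand-in for a list head; one pointer per bucket. */
struct bucket {
	void *first;
};

/* Start at MAX_HASHBITS and drop one bit at a time on allocation
 * failure, stopping at MIN_HASHBITS. Returns NULL if even the
 * smallest table cannot be allocated. */
static struct bucket *alloc_table(unsigned int *hashbits)
{
	for (unsigned int bits = MAX_HASHBITS; bits >= MIN_HASHBITS; bits--) {
		struct bucket *table = calloc((size_t)1 << bits, sizeof(*table));

		if (table != NULL) {
			*hashbits = bits;
			return table;
		}
	}
	return NULL;
}
```

With 8M entries (8 << 20), avg_chain_len() gives 256 entries per bucket at
15 hashbits and 4 at 21 hashbits, which matches the numbers quoted above.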