This patch series continues the work previously focused on sharing the
ELF functions table between BTF encoders [1]. A recap:

* I've been looking into ways to improve the performance of reproducible
  (--btf_features=reproducible_build) DWARF->BTF encoding in pahole [2].
* Originally, Eduard Zingerman suggested an approach of creating a
  btf_encoder for each compilation unit, and then merging the resulting
  BTFs in a particular order.
* This idea required significant changes in how BTF encoders access ELF
  information, which led to the "shared elf_functions" patch series [1].
* During review, Alan Maguire (and Eduard too, off-list) pointed out [3]
  that the shared elf_functions implementation can be simplified if
  individual function information is split into an immutable part (ELF)
  and a mutable part (BTF encoding). Alan also shared a draft of this
  change [4]. Three commits from that draft are included in this patch
  series unchanged.

At that point I had all the prerequisites to try the "btf_encoder per
CU" idea, and so I did [5]. Unfortunately, it turned out that this
approach significantly slows down the encoding, although it reduces
memory usage for parallel reproducible encoding.

A simple measurement script (courtesy of Eduard):

#!/bin/bash
for j in 1 4 16 64; do
    /usr/bin/time -f "jobs ${j}, mem %M Kb, time %e sec" \
        ./build/pahole -J -j$j \
        --btf_features=encode_force,var,float,enum64,decl_tag,type_tag,optimized_func,consistent_func,decl_tag_kfuncs,reproducible_build \
        --btf_encode_detached=/dev/null \
        --lang_exclude=rust \
        $vmlinux > /dev/null
done

Output for [5]:

jobs 1, mem 1225652 Kb, time 10.37 sec
jobs 4, mem 1242000 Kb, time 5.44 sec
jobs 16, mem 1271692 Kb, time 4.33 sec
jobs 64, mem 1410928 Kb, time 4.32 sec

Output for pahole/next (98d1f01):

jobs 1, mem 3378104 Kb, time 8.76 sec
jobs 4, mem 3378716 Kb, time 4.17 sec
jobs 16, mem 3378144 Kb, time 4.03 sec
jobs 64, mem 3378628 Kb, time 4.05 sec

The main reason for this, as far as I understand, is that each
individual encoder now needs to do more work. On top of that, there are
more encoders to merge, which increases the load on the sequential part
of a pahole run. For example, mutable BTF maintains a string set to
avoid string duplication, but when an encoder is only concerned with a
single CU out of 2k+, this optimization becomes moot.

While disappointing, these observations prompted me to look closer at
how exactly pahole spends its time when encoding BTF in parallel [6]. I
looked at the version with the shared elf_functions table [7], on the
assumption that it will land eventually. Here is a very rough
high-level breakdown (for vmlinux as input):

* 81% multithreaded part
  * 64% loading DWARF
  * 17% encoding BTF
* 20% sequential part
  * 14% BTF dedup
  * 6% BTF merge and everything else

The current encoding algorithm of each thread goes roughly like this:

* each thread has its own btf_encoder object
* a thread reads a CU from DWARF and converts it into the internal
  representation (struct cu); this is what takes 60%+ of the time
* the created CU is then passed to btf_encoder__encode_cu
* when the encoding is complete, the CU is deleted (this is important
  for reducing the memory footprint)
* when all CUs are processed, exit and proceed to single-threaded BTF
  merge and dedup

Overall, this is pretty fast. Except when we need the resulting BTF to
be deterministic (aka reproducible). Parallel reproducible encoding is
implemented by enforcing the order of CU processing between threads.
How? Well, a thread is only allowed to encode a CU in CU__LOADED state,
and BTF encoding happens under a lock, which means that the 17% in the
breakdown above is not actually multithreaded with the
reproducible_build flag.

To summarize the observations:

* BTF encoding is actually a minor part of the multithreaded work
  (DWARF loading is about 4x bigger)
* BTF encoding requires sequential post-processing (merge and dedup)
* BTF processing in libbpf is somewhat faster when it's one huge BTF
* BTF encoding is sequential with reproducible_build anyway

So why try so hard to make BTF encoding multithreaded?

This led me to an idea: what if we leave the BTF encoding part
sequential, but implement it in a way that does not stall DWARF
loading? In other words: let's have a single CU consumer and N-1 CU
producers. Producers are slower anyway, and we can also now get rid of
the BTF merge almost entirely (except for "add saved functions").
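To make the shape of this concrete, below is a minimal, self-contained
sketch of the single-consumer/N-1-producers pattern using plain
pthreads. It is an illustration of the idea only, not code from the
series: the names (fake_cu, slots[], NR_PRODUCERS, producer/consumer)
are made up for the example. Producer threads "load" CUs in parallel
and publish them tagged with their CU index; a single consumer encodes
them strictly in index order, so the output stays deterministic while
the loaders never stall on encoding:

/*
 * Hypothetical sketch of "single CU consumer, N-1 CU producers".
 * None of these names come from pahole itself.
 */
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define NR_CUS		32
#define NR_PRODUCERS	3	/* the "N-1" loader threads */

struct fake_cu {			/* stand-in for struct cu */
	int id;
};

static struct fake_cu *slots[NR_CUS];	/* slot i holds the CU with index i */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t cond = PTHREAD_COND_INITIALIZER;
static int next_cu_to_load;		/* shared cursor for producers */

static void *producer(void *arg)
{
	(void)arg;
	for (;;) {
		struct fake_cu *cu;
		int id;

		pthread_mutex_lock(&lock);
		if (next_cu_to_load >= NR_CUS) {
			pthread_mutex_unlock(&lock);
			return NULL;
		}
		id = next_cu_to_load++;
		pthread_mutex_unlock(&lock);

		/* "load DWARF": the expensive, parallel part */
		cu = malloc(sizeof(*cu));
		cu->id = id;
		usleep(1000);

		pthread_mutex_lock(&lock);
		slots[id] = cu;		/* publish the loaded CU */
		pthread_cond_broadcast(&cond);
		pthread_mutex_unlock(&lock);
	}
}

static void *consumer(void *arg)
{
	(void)arg;
	for (int next = 0; next < NR_CUS; next++) {
		struct fake_cu *cu;

		pthread_mutex_lock(&lock);
		while (slots[next] == NULL)	/* wait for CU 'next' only */
			pthread_cond_wait(&cond, &lock);
		cu = slots[next];
		pthread_mutex_unlock(&lock);

		/* "encode BTF" sequentially, in deterministic CU order */
		printf("encoding CU %d\n", cu->id);
		free(cu);			/* drop the CU right away */
	}
	return NULL;
}

int main(void)
{
	pthread_t prod[NR_PRODUCERS], cons;

	for (int i = 0; i < NR_PRODUCERS; i++)
		pthread_create(&prod[i], NULL, producer, NULL);
	pthread_create(&cons, NULL, consumer, NULL);

	for (int i = 0; i < NR_PRODUCERS; i++)
		pthread_join(prod[i], NULL);
	pthread_join(cons, NULL);
	return 0;
}

The sketch only shows the synchronization pattern; in the series the
consumer role is of course played by the BTF encoder.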
This is implemented by the last patch in this series, and the
measurements look promising (see the patch commit message).

Test results for this patch series:

  1: Validation of BTF encoding of functions; this may take some time: Ok
  2: Default BTF on a system without BTF: Ok
  3: Flexible arrays accounting:
     WARNING: still unsuported BTF_KIND_DECL_TAG(bpf_fastcall) for bpf_cast_to_kern_ctx already with attribute (bpf_kfunc), ignoring
     WARNING: still unsuported BTF_KIND_DECL_TAG(bpf_fastcall) for bpf_rdonly_cast already with attribute (bpf_kfunc), ignoring
     pahole: type 'nft_pipapo_elem' not found
     pahole: type 'ip6t_standard' not found
     pahole: type 'ip6t_error' not found
     pahole: type 'nft_rbtree_elem' not found
     pahole: type 'nft_rule_dp_last' not found
     pahole: type 'nft_bitmap_elem' not found
     pahole: type 'fuse_direntplus' not found
     pahole: type 'ipt_standard' not found
     pahole: type 'ipt_error' not found
     pahole: type 'tls_rec' not found
     pahole: type 'nft_rhash_elem' not found
     pahole: type 'nft_hash_elem' not found
     Ok
  4: Pretty printing of files using DWARF type information: Ok
  5: Parallel reproducible DWARF Loading/Serial BTF encoding: Ok

[1]: https://lore.kernel.org/dwarves/20241016001025.857970-1-ihor.solodrai@xxxxx/
[2]: https://github.com/theihor/dwarves/pull/3
[3]: https://lore.kernel.org/dwarves/8678ce40-3ce2-4ece-985b-a40427386d57@xxxxxxxxxx/
[4]: https://github.com/acmel/dwarves/compare/master...alan-maguire:dwarves:elf-prep
[5]: https://github.com/theihor/dwarves/pull/8
[6]: https://gist.github.com/theihor/f000ce89427828e61fdaa567b332649b
[7]: https://github.com/theihor/dwarves/pull/8/commits/a7bc67d79d90f98776c6dc5fdaf9f088eb09909d

Alan Maguire (3):
  btf_encoder: simplify function encoding
  btf_encoder: store,use section-relative addresses in ELF function representation
  btf_encoder: separate elf function, saved function representations

Ihor Solodrai (6):
  dwarf_loader: introduce pre_load_module hook to conf_load
  btf_encoder: introduce elf_functions struct type
  btf_encoder: collect elf_functions in btf_encoder__pre_load_module
  btf_encoder: switch to shared elf_functions table
  btf_encoder: introduce btf_encoding_context
  pahole: faster reproducible BTF encoding

 btf_encoder.c  | 661 ++++++++++++++++++++++++++++++-------------------
 btf_encoder.h  |   6 +
 dwarf_loader.c |  18 +-
 dwarves.c      |  47 ++--
 dwarves.h      |  16 +-
 pahole.c       | 265 +++++++++-----------
 6 files changed, 567 insertions(+), 446 deletions(-)

--
2.47.0