On Friday, October 11th, 2024 at 9:52 AM, Ihor Solodrai <ihor.solodrai@xxxxx> wrote:

[...]

> Hi Alan. Thank you for testing!
>
> I was going to run a couple more experiments today and respond, but
> you beat me to it.
>
> I am curious about the memory usage too. I'll try measuring how much
> is used by btf_encoders and the elf_functions specifically, similar to
> how I found the table was big before the function representation
> changes [1]. I'll share if I find anything interesting.
>
> [1] https://lore.kernel.org/dwarves/aQtEzThpIhRiwQpkgeZepPGr5Mt_zuGfPbxQxZgS6SSoPYoqM1afjDqEmIZkdH3YzvhWmWwqCS_8ZvFTcSZHvpkAeBpLRTAZEmrOhq0svfo=@pm.me/

Alan,

A heads up: there will definitely be a v3. Eduard discovered a bug when
building the kernel with pahole carrying this patch. I have already
debugged it and made a fix. TL;DR: there might be fewer encoders than
threads, which I didn't take into account here:

diff --git a/pahole.c b/pahole.c
index a45aa7a..90f712d 100644
--- a/pahole.c
+++ b/pahole.c
@@ -3257,29 +3257,34 @@ static int pahole_threads_collect(struct conf_load *conf, int nr_threads, void *
 {
 	struct thread_data **threads = (struct thread_data **)thr_data;
 	struct btf_encoder *encoders[nr_threads];
-	int i;
+	int nr_encoders = 0;
 	int err = 0;
+	int i;
 
 	if (error)
 		goto out;
 
-	for (i = 0; i < nr_threads; i++)
-		encoders[i] = threads[i]->encoder;
+	for (i = 0; i < nr_threads; i++) {
+		if (threads[i]->encoder) {
+			encoders[nr_encoders] = threads[i]->encoder;
+			nr_encoders++;
+		}
+	}

Regarding memory usage, I tried instrumenting with prints of rusage, but
that got too inconvenient quickly. So I ended up making flamegraphs [1]
for:

perf record -F max -e page-faults -g -- ./pahole -J -j8 \
    --btf_features=encode_force,var,float,enum64,decl_tag,type_tag,optimized_func,consistent_func,decl_tag_kfuncs \
    --btf_encode_detached=/dev/null \
    --lang_exclude=rust \
    ~/repo/bpf-dev-docker/linux/.tmp_vmlinux1

and compared next (91bcd1d) against this patchset.
If you compare the number of samples in btf_encoder__collect_symbols()
(a table per thread) against elf_functions__collect() (a shared table),
the results are what you might expect: for -j8 the shared table is about
6x smaller. However, even the duplicated tables have a relatively small
memory footprint: ~4-5% of the page faults.

The biggest memory eater is elf_getdata() called from
btf_encoder__tag_kfuncs(): more than 60%. Maybe that's expected, I don't
know. Certainly worth looking into.

Of course, these are page faults, not a direct memory measurement. If
anyone has a suggestion for how to get something better, please share.

[1]: https://gist.github.com/theihor/64bd4460073724a53d26009e7a474b64