On 11/10/2024 23:55, Ihor Solodrai wrote: > On Friday, October 11th, 2024 at 9:52 AM, Ihor Solodrai <ihor.solodrai@xxxxx> wrote: > > [...] > >> Hi Alan. Thank you for testing! >> >> I was going to run a couple more experiments today and respond, but >> you beat me to it. >> >> I am curious about the memory usage too. I'll try measuring how much >> is used by btf_encoders and the elf_functions specifically, similar to >> how I found the table was big before the function representation >> changes [1]. I'll share if I find anything interesting. >> >> [1] https://lore.kernel.org/dwarves/aQtEzThpIhRiwQpkgeZepPGr5Mt_zuGfPbxQxZgS6SSoPYoqM1afjDqEmIZkdH3YzvhWmWwqCS_8ZvFTcSZHvpkAeBpLRTAZEmrOhq0svfo=@pm.me/ > > Alan, > > A heads up: there will definitely be a v3. > > Eduard discovered a bug when building kernel using pahole with this > patch. I already debugged it and made a fix. > > TL;DR there might be fewer encoders than threads, which I didn't take > into account here: > > diff --git a/pahole.c b/pahole.c > index a45aa7a..90f712d 100644 > --- a/pahole.c > +++ b/pahole.c > @@ -3257,29 +3257,34 @@ static int pahole_threads_collect(struct conf_load *conf, int nr_threads, void * > { > struct thread_data **threads = (struct thread_data **)thr_data; > struct btf_encoder *encoders[nr_threads]; > - int i; > + int nr_encoders = 0; > int err = 0; > + int i; > > if (error) > goto out; > > - for (i = 0; i < nr_threads; i++) > - encoders[i] = threads[i]->encoder; > + for (i = 0; i < nr_threads; i++) { > + if (threads[i]->encoder) { > + encoders[nr_encoders] = threads[i]->encoder; > + nr_encoders++; > + } > + } > I tried the above, but still saw segmentation faults; the following was needed at my end to get module BTF working: iff --git a/pahole.c b/pahole.c index a45aa7a..bb17aec 100644 --- a/pahole.c +++ b/pahole.c @@ -3257,20 +3257,25 @@ static int pahole_threads_collect(struct conf_load *conf, int nr_threads, void * { struct thread_data **threads = (struct thread_data **)thr_data; struct btf_encoder *encoders[nr_threads]; - int i; + int nr_encoders = 0; int err = 0; + int i; if (error) goto out; - for (i = 0; i < nr_threads; i++) - encoders[i] = threads[i]->encoder; + for (i = 0; i < nr_threads; i++) { + if (threads[i]->encoder && threads[i]->encoder != btf_encoder) { + encoders[nr_encoders] = threads[i]->encoder; + nr_encoders++; + } + } /* * Merge content of the btf instances of worker threads to the btf * instance of the primary btf_encoder. */ - err = btf_encoder__merge_encoders(btf_encoder, encoders, nr_threads); + err = btf_encoder__merge_encoders(btf_encoder, encoders, nr_encoders); if (err < 0) goto out; err = 0; In particular, nr_encoders must be used at btf_encoder__merge_encoders(). With this in place, I hit another segmentation fault, due this time to the fact that we need to handle the case where there are no ELF functions (this occurs for one module when building the kernel). Fix is below: diff --git a/btf_encoder.c b/btf_encoder.c index 07d2031..b61a89c 100644 --- a/btf_encoder.c +++ b/btf_encoder.c @@ -2274,6 +2274,8 @@ static int elf_functions__collect(struct elf_functions *functions) functions->cnt, sizeof(functions->entries[0]), functions_cmp); + else + return 0; // Reallocate to the exact size functions->entries = realloc(functions->entries, functions->cnt * sizeof(struct elf_function)); > > Regarding memory usage, I tried instrumenting with prints of rusage, > but that got too inconvenient quickly. > > So I ended up making flamegraphs [1] for: > > perf record -F max -e page-faults -g -- > ./pahole -J -j8 \ > --btf_features=encode_force,var,float,enum64,decl_tag,type_tag,optimized_func,consistent_func,decl_tag_kfuncs \ > --btf_encode_detached=/dev/null \ > --lang_exclude=rust \ > ~/repo/bpf-dev-docker/linux/.tmp_vmlinux1 > > And compared between next (91bcd1d) and this patchset. > > If you check the number of samples of > btf_encoder__collect_symbols() - that's a table per thread > vs > elf_functions__collect() - a shared table > > the results are what you might expect: for -j8 shared table is about > 6x smaller. > > However, even the duplicated tables have relatively small memory > footprint: ~4-5% of the page faults. > > The biggest memory eater is elf_getdata() called from > btf_encoder__tag_kfuncs(): more than 60%. Maybe that's expected, I > don't know. Certainly worth looking into. > > Of course, these are page-faults not a direct memory measurement. > If anyone has suggestion how to get something better, please share. > > [1]: https://gist.github.com/theihor/64bd4460073724a53d26009e7a474b64 >