On 11/10/2024 23:55, Ihor Solodrai wrote:
> On Friday, October 11th, 2024 at 9:52 AM, Ihor Solodrai <ihor.solodrai@xxxxx> wrote:
>
> [...]
>
>> Hi Alan. Thank you for testing!
>>
>> I was going to run a couple more experiments today and respond, but
>> you beat me to it.
>>
>> I am curious about the memory usage too. I'll try measuring how much
>> is used by the btf_encoders and the elf_functions specifically, similar to
>> how I found the table was big before the function representation
>> changes [1]. I'll share if I find anything interesting.
>>
>> [1] https://lore.kernel.org/dwarves/aQtEzThpIhRiwQpkgeZepPGr5Mt_zuGfPbxQxZgS6SSoPYoqM1afjDqEmIZkdH3YzvhWmWwqCS_8ZvFTcSZHvpkAeBpLRTAZEmrOhq0svfo=@pm.me/
>
> Alan,
>
> A heads up: there will definitely be a v3.
>
> Eduard discovered a bug when building the kernel using pahole with this
> patch. I have already debugged it and made a fix.
>
> TL;DR: there might be fewer encoders than threads, which I didn't take
> into account here:
>
> diff --git a/pahole.c b/pahole.c
> index a45aa7a..90f712d 100644
> --- a/pahole.c
> +++ b/pahole.c
> @@ -3257,29 +3257,34 @@ static int pahole_threads_collect(struct conf_load *conf, int nr_threads, void *
>  {
>  	struct thread_data **threads = (struct thread_data **)thr_data;
>  	struct btf_encoder *encoders[nr_threads];
> -	int i;
> +	int nr_encoders = 0;
>  	int err = 0;
> +	int i;
>
>  	if (error)
>  		goto out;
>
> -	for (i = 0; i < nr_threads; i++)
> -		encoders[i] = threads[i]->encoder;
> +	for (i = 0; i < nr_threads; i++) {
> +		if (threads[i]->encoder) {
> +			encoders[nr_encoders] = threads[i]->encoder;
> +			nr_encoders++;
> +		}
> +	}
>
> Regarding memory usage, I tried instrumenting with prints of rusage,
> but that got too inconvenient quickly.
>
> So I ended up making flamegraphs [1] for:
>
> perf record -F max -e page-faults -g -- \
>     ./pahole -J -j8 \
>     --btf_features=encode_force,var,float,enum64,decl_tag,type_tag,optimized_func,consistent_func,decl_tag_kfuncs \
>     --btf_encode_detached=/dev/null \
>     --lang_exclude=rust \
>     ~/repo/bpf-dev-docker/linux/.tmp_vmlinux1
>
> and compared next (91bcd1d) with this patchset.
>
> If you check the number of samples of
> btf_encoder__collect_symbols() - that's a table per thread
> vs
> elf_functions__collect() - a shared table
>
> the results are what you might expect: for -j8 the shared table is about
> 6x smaller.
>
> However, even the duplicated tables have a relatively small memory
> footprint: ~4-5% of the page faults.
>
> The biggest memory eater is elf_getdata() called from
> btf_encoder__tag_kfuncs(): more than 60%. Maybe that's expected, I
> don't know. Certainly worth looking into.

Thanks for this analysis! I'll dig into it a bit more, but one saving we
can make is to use ELF_C_READ_MMAP here; in my measurements it saves
between 10% and 30% in peak memory utilization during vmlinux BTF
encoding, by mmap()ing rather than pre-malloc()ing memory for the ELF
sections.

I've posted a patch with that [1]; maybe it will help us see the
benefits of your series more clearly.

[1] https://lore.kernel.org/dwarves/20241014103142.2280044-1-alan.maguire@xxxxxxxxxx/

> Of course, these are page faults, not a direct memory measurement.
> If anyone has a suggestion for how to get something better, please share.
>
> [1]: https://gist.github.com/theihor/64bd4460073724a53d26009e7a474b64
>