On 11/10/2024 23:55, Ihor Solodrai wrote:
> On Friday, October 11th, 2024 at 9:52 AM, Ihor Solodrai <ihor.solodrai@xxxxx> wrote:
>
> [...]
>
>> Hi Alan. Thank you for testing!
>>
>> I was going to run a couple more experiments today and respond, but
>> you beat me to it.
>>
>> I am curious about the memory usage too. I'll try measuring how much
>> is used by the btf_encoders and the elf_functions specifically, similar to
>> how I found the table was big before the function representation
>> changes [1]. I'll share if I find anything interesting.
>>
>> [1] https://lore.kernel.org/dwarves/aQtEzThpIhRiwQpkgeZepPGr5Mt_zuGfPbxQxZgS6SSoPYoqM1afjDqEmIZkdH3YzvhWmWwqCS_8ZvFTcSZHvpkAeBpLRTAZEmrOhq0svfo=@pm.me/
>
> Alan,
>
> A heads up: there will definitely be a v3.
>
> Eduard discovered a bug when building the kernel using pahole with this
> patch. I have already debugged it and made a fix.
>
> TL;DR: there might be fewer encoders than threads, which I didn't take
> into account here:
>
> diff --git a/pahole.c b/pahole.c
> index a45aa7a..90f712d 100644
> --- a/pahole.c
> +++ b/pahole.c
> @@ -3257,29 +3257,34 @@ static int pahole_threads_collect(struct conf_load *conf, int nr_threads, void *
>  {
>  	struct thread_data **threads = (struct thread_data **)thr_data;
>  	struct btf_encoder *encoders[nr_threads];
> -	int i;
> +	int nr_encoders = 0;
>  	int err = 0;
> +	int i;
>
>  	if (error)
>  		goto out;
>
> -	for (i = 0; i < nr_threads; i++)
> -		encoders[i] = threads[i]->encoder;
> +	for (i = 0; i < nr_threads; i++) {
> +		if (threads[i]->encoder) {
> +			encoders[nr_encoders] = threads[i]->encoder;
> +			nr_encoders++;
> +		}
> +	}
>
> Regarding memory usage, I tried instrumenting with prints of rusage,
> but that got too inconvenient quickly.
>
> So I ended up making flamegraphs [1] for:
>
> perf record -F max -e page-faults -g -- \
>     ./pahole -J -j8 \
>     --btf_features=encode_force,var,float,enum64,decl_tag,type_tag,optimized_func,consistent_func,decl_tag_kfuncs \
>     --btf_encode_detached=/dev/null \
>     --lang_exclude=rust \
>     ~/repo/bpf-dev-docker/linux/.tmp_vmlinux1
>
> and compared next (91bcd1d) with this patchset.
>
> If you check the number of samples of
> btf_encoder__collect_symbols() - that's a table per thread
> vs
> elf_functions__collect() - a shared table
>
> the results are what you might expect: for -j8 the shared table is about
> 6x smaller.
>
> However, even the duplicated tables have a relatively small memory
> footprint: ~4-5% of the page faults.
>
> The biggest memory eater is elf_getdata() called from
> btf_encoder__tag_kfuncs(): more than 60%. Maybe that's expected, I
> don't know. Certainly worth looking into.

Thanks for this analysis! I'll dig into it a bit more, but one saving we
can make is to use ELF_C_READ_MMAP here; in my measurements it saves
between 10% and 30% in peak memory utilization during vmlinux BTF
encoding, by mmap()ing rather than pre-malloc()ing memory for the ELF
sections.

I've posted a patch with that [1]; maybe it will help us see the
benefits of your series more clearly.

[1] https://lore.kernel.org/dwarves/20241014103142.2280044-1-alan.maguire@xxxxxxxxxx/

> Of course, these are page faults, not a direct memory measurement.
> If anyone has a suggestion for how to get something better, please share.
>
> [1]: https://gist.github.com/theihor/64bd4460073724a53d26009e7a474b64
>