On Friday, October 11th, 2024 at 9:52 AM, Ihor Solodrai <ihor.solodrai@xxxxx> wrote:

[...]

> Hi Alan. Thank you for testing!
>
> I was going to run a couple more experiments today and respond, but
> you beat me to it.
>
> I am curious about the memory usage too. I'll try measuring how much
> is used by btf_encoders and the elf_functions specifically, similar to
> how I found the table was big before the function representation
> changes [1]. I'll share if I find anything interesting.
>
> [1] https://lore.kernel.org/dwarves/aQtEzThpIhRiwQpkgeZepPGr5Mt_zuGfPbxQxZgS6SSoPYoqM1afjDqEmIZkdH3YzvhWmWwqCS_8ZvFTcSZHvpkAeBpLRTAZEmrOhq0svfo=@pm.me/

Alan,

A heads up: there will definitely be a v3. Eduard discovered a bug when
building the kernel with pahole carrying this patch. I have already
debugged it and made a fix. TL;DR: there might be fewer encoders than
threads, which I didn't take into account here:

diff --git a/pahole.c b/pahole.c
index a45aa7a..90f712d 100644
--- a/pahole.c
+++ b/pahole.c
@@ -3257,29 +3257,34 @@ static int pahole_threads_collect(struct conf_load *conf, int nr_threads, void *
 {
 	struct thread_data **threads = (struct thread_data **)thr_data;
 	struct btf_encoder *encoders[nr_threads];
-	int i;
+	int nr_encoders = 0;
 	int err = 0;
+	int i;
 
 	if (error)
 		goto out;
 
-	for (i = 0; i < nr_threads; i++)
-		encoders[i] = threads[i]->encoder;
+	for (i = 0; i < nr_threads; i++) {
+		if (threads[i]->encoder) {
+			encoders[nr_encoders] = threads[i]->encoder;
+			nr_encoders++;
+		}
+	}

Regarding memory usage, I tried instrumenting with prints of rusage, but
that got too inconvenient quickly. So I ended up making flamegraphs [1]
for:

perf record -F max -e page-faults -g -- ./pahole -J -j8 \
    --btf_features=encode_force,var,float,enum64,decl_tag,type_tag,optimized_func,consistent_func,decl_tag_kfuncs \
    --btf_encode_detached=/dev/null \
    --lang_exclude=rust \
    ~/repo/bpf-dev-docker/linux/.tmp_vmlinux1

and compared next (91bcd1d) against this patchset.
If you compare the number of samples in btf_encoder__collect_symbols()
(a table per thread) against elf_functions__collect() (a shared table),
the results are what you might expect: for -j8 the shared table is about
6x smaller. However, even the duplicated tables have a relatively small
memory footprint: ~4-5% of the page faults.

The biggest memory eater is elf_getdata() called from
btf_encoder__tag_kfuncs(): more than 60%. Maybe that's expected, I don't
know. Certainly worth looking into.

Of course, these are page faults, not a direct memory measurement. If
anyone has a suggestion for how to get something better, please share.

[1]: https://gist.github.com/theihor/64bd4460073724a53d26009e7a474b64