+cc bpf@xxxxxxxxxxxxxxx

On Wed, Jan 19, 2022 at 5:08 PM Kui-Feng Lee <kuifeng@xxxxxx> wrote:
>
> Creating an instance of btf for each worker thread allows
> steal-function provided by pahole to add type info on multiple threads
> without a lock. The main thread merges the results of worker threads
> to the primary instance.
>
> Copying data from per-thread btf instances to the primary instance is
> expensive now. However, there is a patch landed at the bpf-next
> repository. [1] With the patch for bpf-next and this patch, they drop
> total runtime to 5.4s from 6.0s with "-j4" on my device to generate
> BTF for Linux.

Just a few more data points. I've tried this locally with 40 cores,
both with and without the libbpf's btf__add_btf() optimization.

BASELINE NON-PARALLEL
=====================
$ time ./pahole -J ~/linux-build/default/vmlinux
./pahole -J ~/linux-build/default/vmlinux  11.17s user 0.66s system 99% cpu 11.832 total

BASELINE PARALLEL
=================
$ time ./pahole -j40 -J ~/linux-build/default/vmlinux
./pahole -j40 -J ~/linux-build/default/vmlinux  13.85s user 0.75s system 290% cpu 5.023 total

THESE PATCHES WITHOUT LIBBPF SPEED-UP
=====================================
$ time ./pahole -j40 -J ~/linux-build/default/vmlinux
./pahole -j40 -J ~/linux-build/default/vmlinux  25.94s user 1.15s system 685% cpu 3.954 total

THESE PATCHES WITH LATEST LIBBPF SPEED-UP
=========================================
$ time ./pahole -j40 -J ~/linux-build/default/vmlinux
./pahole -j40 -J ~/linux-build/default/vmlinux  27.49s user 1.08s system 858% cpu 3.328 total

So on 40 cores, it's a speed up from 11.8 seconds non-parallel, to 5s
parallel without Kui-Feng's changes, to 4s with Kui-Feng's changes, to
3.3s after libbpf update (I did it locally, will sync this to Github
today).

4x speed up, not bad! But parallel mode is not currently enabled in
kernel build, let's enable parallel mode and save those seconds during
the kernel build!

>
> [1] https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git/commit/?id=d81283d27266
>
> Kui-Feng Lee (2):
>   dwarf_loader: Prepare and pass per-thread data to worker threads.
>   pahole: Use per-thread btf instances to avoid mutex locking.
>
>  btf_encoder.c  |   5 +++
>  btf_encoder.h  |   2 +
>  btf_loader.c   |   2 +-
>  ctf_loader.c   |   2 +-
>  dwarf_loader.c |  58 ++++++++++++++++++------
>  dwarves.h      |   9 +++-
>  pahole.c       | 120 ++++++++++++++++++++++++++++++++++++++++++++++---
>  pdwtags.c      |   3 +-
>  pfunct.c       |   4 +-
>  9 files changed, 180 insertions(+), 25 deletions(-)
>
> --
> 2.30.2
>
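
For anyone following the thread, the per-thread design described in the
cover letter can be sketched roughly as below. This is only an
illustration under stated assumptions, not code from pahole or libbpf:
struct type_store, struct worker, worker_main() and encode_parallel()
are hypothetical names, and a plain counter stands in for a real btf
instance. Each worker adds to its own private store without taking a
lock, and the main thread merges the stores into the primary one after
joining the workers.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

#define MAX_WORKERS 64

/* Hypothetical stand-in for a btf instance: it only counts added types. */
struct type_store {
	size_t nr_types;
};

/* Hypothetical per-thread data handed to each worker. */
struct worker {
	pthread_t tid;
	struct type_store local;	/* private to this thread, so no lock */
	size_t nr_units;		/* amount of work assigned to the thread */
};

static void *worker_main(void *arg)
{
	struct worker *w = arg;
	size_t i;

	/* Only this thread touches w->local, so no mutex is needed here. */
	for (i = 0; i < w->nr_units; i++)
		w->local.nr_types++;
	return NULL;
}

/*
 * Run nr_workers threads, then merge their private stores into the
 * primary store on the calling thread. Returns the merged type count.
 */
size_t encode_parallel(unsigned int nr_workers, size_t units_per_worker)
{
	struct worker workers[MAX_WORKERS];
	struct type_store primary = { 0 };
	unsigned int i, started = 0;

	assert(nr_workers <= MAX_WORKERS);

	for (i = 0; i < nr_workers; i++) {
		workers[i].local.nr_types = 0;
		workers[i].nr_units = units_per_worker;
		if (pthread_create(&workers[i].tid, NULL, worker_main,
				   &workers[i]) != 0)
			break;	/* merge only what was actually started */
		started++;
	}

	/* Merge step: single-threaded, runs after each worker has finished. */
	for (i = 0; i < started; i++) {
		pthread_join(workers[i].tid, NULL);
		primary.nr_types += workers[i].local.nr_types;
	}
	return primary.nr_types;
}
```

With this sketch, encode_parallel(4, 1000) leaves 4000 entries in the
primary store. In the actual patches the merge step copies type data
between btf instances rather than adding counters, and that copy is the
part the libbpf btf__add_btf() optimization is meant to make cheaper.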