In dwarf_loader, as nr_jobs grows, the wall-clock time of BTF encoding
starts to worsen past a certain point [1]. While some overhead from
additional threads is expected, it should not become noticeable unless
nr_jobs is set to an unreasonably large value.

It turns out that when there are "too many" threads decoding DWARF,
they start competing for memory allocation: a significant number of
cycles is spent in osq_lock, deep inside malloc called from
cu__zalloc. This suggests that many threads are trying to allocate
memory at the same time. See an example perf flamegraph for a run with
-j240 [2]. That was a 12-core machine, so the effect is small; on
machines with more cores the problem is worse.

Increasing the chunk size of the obstacks associated with CUs helps
reduce the performance penalty caused by this contention.

[1] https://lore.kernel.org/dwarves/C82bYTvJaV4bfT15o25EsBiUvFsj5eTlm17933Hvva76CXjIcu3gvpaOCWPgeZ8g3cZ-RMa8Vp0y1o_QMR2LhPB-LEUYfZCGuCfR_HvkIP8=@pm.me/
[2] https://gist.github.com/theihor/926af22417a78605fec8d85e1338920e

Signed-off-by: Ihor Solodrai <ihor.solodrai@xxxxx>
---
 dwarves.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/dwarves.c b/dwarves.c
index 7c3e878..105f81a 100644
--- a/dwarves.c
+++ b/dwarves.c
@@ -722,6 +722,8 @@ int cu__fprintf_ptr_table_stats_csv(struct cu *cu, FILE *fp)
 	return printed;
 }
 
+#define OBSTACK_CHUNK_SIZE (128*1024)
+
 struct cu *cu__new(const char *name, uint8_t addr_size,
 		   const unsigned char *build_id, int build_id_len,
 		   const char *filename, bool use_obstack)
@@ -733,7 +735,7 @@ struct cu *cu__new(const char *name, uint8_t addr_size,
 	cu->use_obstack = use_obstack;
 
 	if (cu->use_obstack)
-		obstack_init(&cu->obstack);
+		obstack_begin(&cu->obstack, OBSTACK_CHUNK_SIZE);
 
 	if (name == NULL || filename == NULL)
 		goto out_free;
-- 
2.47.1