On Mon, Nov 11, 2024 at 12:01:13AM -0800, Eduard Zingerman wrote: > On Sun, 2024-11-10 at 03:38 -0800, Eduard Zingerman wrote: > > [...] > > > Also, it appears there is some bug either in pahole or in libdw's > > implementation of dwarf_getlocation(). When I try both your patch-set > > and my variant there is a segfault once in a while: > > > > $ for i in $(seq 1 100); \ > > do echo "---> $i"; \ > > pahole -j --skip_encoding_btf_inconsistent_proto -J --btf_encode_detached=/dev/null vmlinux ; \ > > done > > ---> 1 > > ... > > ---> 71 > > Segmentation fault (core dumped) > > ... > > > > The segfault happens only when -j (multiple threads) is passed. > > If pahole is built with sanitizers > > (passing -DCMAKE_C_FLAGS="-fsanitize=undefined,address") > > the stack trace looks as follows: > > Did some additional research for these SEGFAULTs. > Looks like all we are in trouble. > > # TLDR > > libdw is not supposed to be used in a concurrent context. > libdw is a part of elfutils package, the configuration flag > making API thread-safe is documented as experimental: > --enable-thread-safety enable thread safety of libraries EXPERIMENTAL > At-least Fedora 40 does not ship elfutils built with this flag set. > This colours all current parallel DWARF decoding questionable. > > # Why segfault happens > > Any references to elfutils source code are for commit [1]. > The dwarf_getlocation() is one of a few libdw APIs that uses memory > allocation internally. The function dwarf_getlocation.c:__libdw_intern_expression > iterates over expression encodings in DWARF and allocates > a set of objects of type `struct loclist` and `Dwarf_Op`. > Pointers to allocated objects are put to a binary tree for caching, > see dwarf_getlocation.c:660, the call to eu_tsearch() function. > The eu_tsearch() is a wrapper around libc tsearch() function. > This wrapper provides locking for the tree, > but only if --enable-thread-safety was set during elfutils configuration. > The SEGFAULT happens inside tsearch() call because binary tree is malformed, e.g.: > > Thread 8 "pahole" received signal SIGSEGV, Segmentation fault. > [Switching to Thread 0x7fffd9c006c0 (LWP 2630074)] > 0x00007ffff7c5d200 in maybe_split_for_insert (...) at tsearch.c:228 > 228 if (parentp != NULL && RED(DEREFNODEPTR(parentp))) > (gdb) bt > #0 0x00007ffff7c5d200 in maybe_split_for_insert (...) at tsearch.c:228 > #1 0x00007ffff7c5d466 in __GI___tsearch (...) at tsearch.c:358 > #2 __GI___tsearch (...) at tsearch.c:290 > #3 0x000000000048f096 in __interceptor_tsearch () > #4 0x00007ffff7f5c482 in __libdw_intern_expression (...) at dwarf_getlocation.c:660 > #5 0x00007ffff7f5cf51 in getlocation (...) at dwarf_getlocation.c:678 > #6 getlocation (...) at dwarf_getlocation.c:667 > #7 dwarf_getlocation (..._ at dwarf_getlocation.c:708 > #8 0x00000000005a2ee5 in parameter.new () > #9 0x00000000005a0122 in die.process_function () > #10 0x0000000000597efd in __die__process_tag () > #11 0x0000000000595ad9 in die.process_unit () > #12 0x0000000000595436 in die.process () > #13 0x00000000005b0187 in dwarf_cus.process_cu () > #14 0x00000000005afa38 in dwarf_cus.process_cu_thread () > #15 0x00000000004c7b8d in asan_thread_start(void*) () > #16 0x00007ffff7bda6d7 in start_thread (arg=<optimized out>) at pthread_create.c:447 > #17 0x00007ffff7c5e60c in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:78 > (gdb) p parentp > $1 = (node *) 0x50300079d2a0 > (gdb) p *parentp > $2 = (node) 0x0 > > glibc provides a way to validate binary tree structure. > For this misc/tsearch.c has to be changed to define DEBUGGING variable. > (I used glibc 2.39 as provided by source rpm for Fedora 40 for experiments). > If this is done and custom glibc is used for pahole execution, > the following error is reported if '-j' flag is present: > > $ pahole -j --skip_encoding_btf_inconsistent_proto -J --btf_encode_detached=/home/eddy/work/tmp/my-new.btf vmlinux > Fatal glibc error: tsearch.c:164 (check_tree_recurse): assertion failed: d_sofar == d_total > Fatal glibc error: tsearch.c:164 (check_tree_recurse): assertion failed: d_sofar == d_total > Aborted (core dumped) > > Executing pahole using a custom-built libdw, > built with --enable-thread-safety resolves the issue. could we use libdw__lock around that? but I guess we use it on other places as well.. jirka > > [1] b2f225d6bff8 ("Consolidate and add files to clean target variables") > git://sourceware.org/git/elfutils.git > >