On Mon, Jul 24, 2023 at 01:12:43PM -0700, Ian Rogers wrote:
> Add a build flag, LTO=1, so that perf is built with the -flto
> flag. Address some build errors this configuration throws up.
>
> For me on my Debian derived OS, "CC=clang CXX=clang++ LD=ld.lld" works
> fine. With GCC LTO this fails with:
> ```
> lto-wrapper: warning: using serial compilation of 50 LTRANS jobs
> lto-wrapper: note: see the ‘-flto’ option documentation for more information
> /usr/bin/ld: /tmp/ccK8kXAu.ltrans10.ltrans.o:(.data.rel.ro+0x28): undefined reference to `memset_orig'
> /usr/bin/ld: /tmp/ccK8kXAu.ltrans10.ltrans.o:(.data.rel.ro+0x40): undefined reference to `__memset'
> /usr/bin/ld: /tmp/ccK8kXAu.ltrans10.ltrans.o:(.data.rel+0x28): undefined reference to `memcpy_orig'
> /usr/bin/ld: /tmp/ccK8kXAu.ltrans10.ltrans.o:(.data.rel+0x40): undefined reference to `__memcpy'
> /usr/bin/ld: /tmp/ccK8kXAu.ltrans44.ltrans.o: in function `test__arch_unwind_sample':
> /home/irogers/kernel.org/tools/perf/arch/x86/tests/dwarf-unwind.c:72: undefined reference to `perf_regs_load'
> collect2: error: ld returned 1 exit status
> ```
>
> The issue is that we build multiple .o files in a directory and then
> link them into a .o with "ld -r" (cmd_ld_multi). This early link step
> appears to trigger GCC to remove the .S file definition of the symbol
> and break the later link step (the perf-in.o shows perf_regs_load, for
> example, going from the text section to being undefined at the link
> step which doesn't happen with clang or without LTO).
> It is possible
> to work around this by taking the final perf link command and adding
> the .o files generated from .S back into it, namely:
> arch/x86/tests/regs_load.o
> bench/mem-memset-x86-64-asm.o
> bench/mem-memcpy-x86-64-asm.o
>
> A quick performance check and the performance improvements from LTO
> are noticeable:
>
> Non-LTO
> ```
> $ perf bench internals synthesize
> # Running 'internals/synthesize' benchmark:
> Computing performance of single threaded perf event synthesis by
> synthesizing events on the perf process itself:
> Average synthesis took: 202.216 usec (+- 0.160 usec)
> Average num. events: 51.000 (+- 0.000)
> Average time per event 3.965 usec
> Average data synthesis took: 230.875 usec (+- 0.285 usec)
> Average num. events: 271.000 (+- 0.000)
> Average time per event 0.852 usec
> ```
>
> LTO
> ```
> $ perf bench internals synthesize
> # Running 'internals/synthesize' benchmark:
> Computing performance of single threaded perf event synthesis by
> synthesizing events on the perf process itself:
> Average synthesis took: 104.530 usec (+- 0.074 usec)
> Average num. events: 51.000 (+- 0.000)
> Average time per event 2.050 usec
> Average data synthesis took: 112.660 usec (+- 0.114 usec)
> Average num. events: 273.000 (+- 0.000)
> Average time per event 0.413 usec

Cool stuff! Applied locally, test building now on the container suite.

- Arnaldo

> ```
>
> Ian Rogers (4):
>   perf stat: Avoid uninitialized use of perf_stat_config
>   perf parse-events: Avoid use uninitialized warning
>   perf test: Avoid weak symbol for arch_tests
>   perf build: Add LTO build option
>
>  tools/perf/Makefile.config      |  5 +++++
>  tools/perf/tests/builtin-test.c | 11 ++++++++++-
>  tools/perf/tests/stat.c         |  2 +-
>  tools/perf/util/parse-events.c  |  2 +-
>  tools/perf/util/stat.c          |  2 +-
>  5 files changed, 18 insertions(+), 4 deletions(-)
>
> --
> 2.41.0.487.g6d72f3e995-goog
>

--

- Arnaldo