On 12/5/24 9:45 AM, Miro Hrončok wrote:
> On 04. 12. 24 20:32, William Cohen wrote:
>> On 11/21/24 17:32, Miro Hrončok wrote:
>>> On 21. 11. 24 23:11, William Cohen wrote:
>>>> Sediment has been designed to work with the RPM build process.
>>>> Currently, one needs to use modified RPM macros. These can be created
>>>> quickly by writing the output of the sediment make_sediment_rpmmacros
>>>> command into ~/.rpmmacros. One will also need to set the pgo
>>>> macro to 1 for the rpmbuild process. The rpm spec file has minimal
>>>> modifications. It has the callgraph file stored as a source file and
>>>> defines the global call_graph as the source file that holds the call
>>>> graph.
>>>
>>> Hey Will,
>>>
>>> let's say I want to try this for Python. Where do I start? The README on https://github.com/wcohen/sediment is not very helpful.
>>>
>>> This is what I did based on your email:
>>>
>>> $ sudo dnf --enable-repo=updates-testing install sediment
>>> ...
>>> Installing sediment-0:0.9.3-1.fc41.noarch
>>>
>>> I run make_sediment_rpmmacros, it gives me some macros. Now I am supposed to put those to ~/.rpmmacros. Except I never build Python locally, I use Koji or mock. I can probably amend this to use %global and insert it to python3.14.spec. But what else do I need to do? Do you have a step by step kind of document I can follow?
>>>
>>
>>
>> Hi Miro,
>>
>> The tooling doesn't yet fit your workflow of building packages in
>> koji and mock. I am looking into ways of addressing that issue.
>>
>> In an earlier email I mentioned the important thing was to have good
>> profiling data. Do you have suggestions on some benchmarks that would
>> properly exercise the python interpreter? I have used pyperformance
>> (https://github.com/python/pyperformance) to get some call graph data
>> for python and added that to a python3.13 srpm available at
>> https://koji.fedoraproject.org/koji/taskinfo?taskID=126526066. Note:
>> Koji is NOT building the code layout optimization.
>> One would still need
>> to build python3.13-3.13.0-1.fc41.src.rpm locally with sediment-0.9.4
>> (https://koji.fedoraproject.org/koji/buildinfo?buildID=2596791)
>> installed and ~/.rpmmacros set up, using the following steps:
>>
>>   make_sediment_rpmmacros > ~/.rpmmacros
>>   rpm -Uvh python3.13-3.13.0-1.fc41.src.rpm
>>   cd ~/rpmbuild/SPECS
>>   rpmbuild -ba --define "pgo 1" python3.13.spec
>>
>> The notable difference in the python3.13.spec file is the addition of:
>>
>>   # Call graph information
>>   SOURCE12: perf_pybenchmark.gv
>>   %global call_graph %{SOURCE12}
>>
>> The perf_pybenchmark.gv was generated with these steps:
>>
>>   python3 -m pip install pyperformance
>>   perf record -e branches:u -j any_call -o perf_pybenchmark.data pyperformance run -f -o fc41_x86_python_baseline.json
>>   perf report -i perf_pybenchmark.data --no-demangle --sort=comm,dso_from,symbol_from,dso_to,symbol_to > perf_pybenchmark.out
>>   perf2gv < perf_pybenchmark.out > perf_pybenchmark.gv
>>
>> Then I added the file to the python srpm:
>>
>>   cp perf_pybenchmark.gv ~/rpmbuild/SOURCES/.
>>   # edit ~/rpmbuild/SPECS/python3.13.spec to add call graph info
>>
>> The results were mixed when comparing the code layout optimized python
>> against the baseline version on the pyperformance benchmarks. This can
>> be seen in the attached python_pgo.out generated by:
>>
>>   python3 -m pyperf compare_to fc41_x86_python_baseline.json fc41_x86_python_pgo.json --table > python_pgo.out
>>
>> It looks like a number of the benchmarks are microbenchmarks that are
>> unlikely to benefit much from the code layout optimizations.
>>
>> Are there other python performance tests that you would suggest that
>> have a larger footprint and would better gauge the possible
>> performance improvement from the code layout optimization?
>>
>> Are there better python code examples to collect profiling data on?
> Hey Will,
>
> thanks for looking into this.
>
> For your question: Upstream is using this for PGO:
>
> $ python3.14 -m test --pgo
>
> Or:
>
> $ python3.14 -m test --pgo-extended
>
> In spec, this can be used:
>
> LD_LIBRARY_PATH=./build/optimized ./build/optimized/python -m test ...
>
> ---
>
> What is the blocker to run this in Koji/mock?
>
> You do `make_sediment_rpmmacros > ~/.rpmmacros`.
>
> What's the issue with %defining such macros at spec level?
>

Hi,

I was able to do some experiments with the koji/mock-buildable
python3.13-3.13.0-1.fc41_opt.src.rpm
(https://koji.fedoraproject.org/koji/taskinfo?taskID=128437060) and,
with vstinner's suggestions for profiling python, get better
measurements of the performance impact. On a Lenovo P51 laptop running
Fedora 41 I built two versions of the RPMs.

The training data was collected on a pyperformance run and analyzed
with the sediment tool:

  python3 -m pip install pyperformance
  perf record -e branches:u -j any_call -o perf_pybenchmark.data pyperformance run -f -o fc41_x86_python_baseline.json
  perf report -i perf_pybenchmark.data --no-demangle --sort=comm,dso_from,symbol_from,dso_to,symbol_to > perf_pybenchmark.out
  perf2gv < perf_pybenchmark.out > perf_pybenchmark.gv

I installed the srpm, went into the SPECS directory, and built the code
layout optimized RPMs (these have _opt added to their names) with:

  rpm -Uvh python3.13-3.13.0-1.fc41_opt.src.rpm
  cd ~/rpmbuild/SPECS
  rpmbuild -ba python3.13.spec

Then I built the RPMs without the code layout optimization (no _opt in
the RPM names):

  rpmbuild --without opt -ba python3.13.spec

I installed the code layout optimized RPMs, set up the environment for
benchmarking, and ran the tests:

  sudo dnf install ~/rpmbuild/RPMS/x86_64/python*fc41_opt* ~/rpmbuild/RPMS/noarch/python-unversioned-command-3.13.0-1.fc41_opt.noarch.rpm
  sudo python3 -m pyperf system tune
  pyperformance run -f -o fc41_x86_python_opt20250131.json >& fc41_pybench_opt_20250131.log

Then I collected data for the non-optimized version of the RPMs:

  sudo dnf install ~/rpmbuild/RPMS/x86_64/python*fc41.* ~/rpmbuild/RPMS/noarch/python-unversioned-command-3.13.0-1.fc41.noarch.rpm
  sudo python3 -m pyperf system tune
  pyperformance run -f -o fc41_x86_python_20250131.json >& fc41_pybench_20250131.log

Once done, I compared the data between the runs with:

  python3 -m pyperf compare_to fc41_x86_python_20250131.json fc41_x86_python_opt20250131.json --table > python_opt.out

Below is the comparison between the two versions (python_opt.out). For
the vast majority of the benchmarks the optimized code is slightly
faster, typically by about 1%. The regex_* benchmarks appear to benefit
the most: regex_v8 is 1.05x faster and regex_dna is 1.04x faster. A few
benchmarks are slightly slower (1-2%): pickle, pickle_dict,
create_gc_cycles, spectral_norm, and typing_runtime_protocols.
unpack_sequence was the worst, 1.12x slower with the optimized code.

The improvements are not as noticeable as those seen with postgresql. I
suspect this is because pyperformance contains many microbenchmarks,
which do not put as much pressure on the iTLB as the large postgresql
binary does.
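The perf2gv step turns the perf report output into the Graphviz call graph that ends up as the call_graph source. Below is a minimal Python sketch of that kind of conversion, not the actual perf2gv code. It assumes each data line of perf_pybenchmark.out has the form `<percent>% <comm> <dso_from> [.] <symbol_from> <dso_to> [.] <symbol_to>` (the column order requested with --sort), and it labels each caller->callee edge with the summed percentage. The function name perf_report_to_dot and the edge labelling are illustrative choices; sediment's real .gv format may differ.

```python
import re
from collections import defaultdict

# Assumed layout of one data line in perf_pybenchmark.out (hypothetical;
# check it against the output of your perf version):
#   <pct>%  <comm>  <dso_from>  [.] <symbol_from>  <dso_to>  [.] <symbol_to>
LINE_RE = re.compile(
    r"^\s*(?P<pct>[\d.]+)%\s+(?P<comm>\S+)\s+"
    r"(?P<dso_from>\S+)\s+\[.\]\s+(?P<sym_from>\S+)\s+"
    r"(?P<dso_to>\S+)\s+\[.\]\s+(?P<sym_to>\S+)\s*$"
)


def perf_report_to_dot(lines):
    """Sum caller->callee percentages and return a Graphviz digraph as text."""
    edges = defaultdict(float)
    for line in lines:
        if not line.strip() or line.lstrip().startswith("#"):
            continue  # skip blank lines and perf's comment header
        m = LINE_RE.match(line)
        if m is None:
            continue  # skip lines that do not fit the assumed layout
        edges[(m["sym_from"], m["sym_to"])] += float(m["pct"])

    out = ["digraph callgraph {"]
    # Heaviest edges first, so the hottest calls are easy to spot.
    for (caller, callee), pct in sorted(edges.items(), key=lambda kv: -kv[1]):
        out.append(f'  "{caller}" -> "{callee}" [label="{pct:.2f}%"];')
    out.append("}")
    return "\n".join(out)
```

For example, print(perf_report_to_dot(open("perf_pybenchmark.out"))) would write the graph to stdout. Because --no-demangle is used, symbol names contain no spaces, which is what lets the sketch split on whitespace.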
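In python_opt.out each "Nx faster" figure reads as baseline time divided by optimized time, and, assuming each Geometric mean row is the geometric mean of the per-benchmark ratios (the regex numbers are consistent with that), it is the n-th root of the product of the n ratios. The sketch below recomputes the regex group's 1.03x from the rounded mean times printed in that table. It is only approximate: pyperf works from the full measurements in the JSON files, so a single recomputed ratio can be off by 0.01 (253/242 is about 1.045, while pyperf prints regex_dna as 1.04x faster). The helper names are made up for illustration.

```python
import math


def speedup(baseline, optimized):
    """Ratio > 1 means the optimized build took less time."""
    return baseline / optimized


def describe(ratio):
    """Phrase a ratio the way pyperf's table cells do."""
    if ratio >= 1:
        return f"{ratio:.2f}x faster"
    return f"{1 / ratio:.2f}x slower"


def geometric_mean(ratios):
    """n-th root of the product of n ratios, computed in log space."""
    ratios = list(ratios)
    return math.exp(sum(math.log(r) for r in ratios) / len(ratios))


# Rounded mean times in ms from the regex table: (baseline, optimized).
regex_rows = {
    "regex_compile": (197, 194),
    "regex_dna": (253, 242),
    "regex_effbot": (4.60, 4.49),
    "regex_v8": (33.6, 31.9),
}

ratios = {name: speedup(b, o) for name, (b, o) in regex_rows.items()}
gm = geometric_mean(ratios.values())
print("Geometric mean:", describe(gm))
```

Working in log space keeps the product from drifting when many ratios are multiplied, which matters more for the 64-row "All benchmarks" table than for these four rows.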
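The list of slower benchmarks can also be pulled out of python_opt.out mechanically rather than by eye. A rough sketch, assuming the pyperf --table layout used here, where each data row is `| name | baseline | optimized: N.NNx faster/slower |`; the function name slower_benchmarks is hypothetical.

```python
import re

# Last cell of a compare_to --table data row, e.g. "15.1 us: 1.01x slower".
# Geometric mean rows have no ':' in that cell, so they never match.
RESULT_RE = re.compile(r":\s*([\d.]+)x\s+(faster|slower)\s*$")


def slower_benchmarks(table_text):
    """Return {benchmark: slowdown factor} for rows reported as slower."""
    slower = {}
    for line in table_text.splitlines():
        # Data rows look like "| name | baseline | optimized: N.NNx ... |".
        cells = [c.strip() for c in line.strip().strip("|").split("|")]
        if len(cells) != 3:
            continue  # separator lines, headings, hidden-benchmark lists
        name, _baseline, result = cells
        m = RESULT_RE.search(result)
        if m and m.group(2) == "slower":
            slower[name] = float(m.group(1))
    return slower
```

Usage: slower_benchmarks(open("python_opt.out").read()). Since the per-tag tables repeat rows from "All benchmarks", keying the result by name keeps a single entry per benchmark.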
Benchmarks with tag 'apps':
===========================

+----------------+--------------------------+-----------------------------+
| Benchmark      | fc41_x86_python_20250131 | fc41_x86_python_opt20250131 |
+================+==========================+=============================+
| 2to3           | 378 ms                   | 374 ms: 1.01x faster        |
+----------------+--------------------------+-----------------------------+
| chameleon      | 10.2 ms                  | 10.1 ms: 1.01x faster       |
+----------------+--------------------------+-----------------------------+
| docutils       | 3.56 sec                 | 3.52 sec: 1.01x faster      |
+----------------+--------------------------+-----------------------------+
| html5lib       | 91.4 ms                  | 90.4 ms: 1.01x faster       |
+----------------+--------------------------+-----------------------------+
| tornado_http   | 180 ms                   | 178 ms: 1.01x faster        |
+----------------+--------------------------+-----------------------------+
| Geometric mean | (ref)                    | 1.01x faster                |
+----------------+--------------------------+-----------------------------+

Benchmarks with tag 'asyncio':
==============================

+---------------------+--------------------------+-----------------------------+
| Benchmark           | fc41_x86_python_20250131 | fc41_x86_python_opt20250131 |
+=====================+==========================+=============================+
| async_tree_none     | 462 ms                   | 459 ms: 1.01x faster        |
+---------------------+--------------------------+-----------------------------+
| async_tree_eager    | 159 ms                   | 158 ms: 1.01x faster        |
+---------------------+--------------------------+-----------------------------+
| async_tree_eager_tg | 104 ms                   | 102 ms: 1.01x faster        |
+---------------------+--------------------------+-----------------------------+
| Geometric mean      | (ref)                    | 1.01x faster                |
+---------------------+--------------------------+-----------------------------+

Benchmark hidden because not significant (13): async_tree_cpu_io_mixed, async_tree_cpu_io_mixed_tg, async_tree_eager_cpu_io_mixed, async_tree_eager_cpu_io_mixed_tg, async_tree_eager_io, async_tree_eager_io_tg, async_tree_eager_memoization, async_tree_eager_memoization_tg, async_tree_io, async_tree_io_tg, async_tree_memoization, async_tree_memoization_tg, async_tree_none_tg

Benchmarks with tag 'math':
===========================

+----------------+--------------------------+-----------------------------+
| Benchmark      | fc41_x86_python_20250131 | fc41_x86_python_opt20250131 |
+================+==========================+=============================+
| pidigits       | 249 ms                   | 249 ms: 1.00x faster        |
+----------------+--------------------------+-----------------------------+
| Geometric mean | (ref)                    | 1.00x faster                |
+----------------+--------------------------+-----------------------------+

Benchmark hidden because not significant (2): float, nbody

Benchmarks with tag 'regex':
============================

+----------------+--------------------------+-----------------------------+
| Benchmark      | fc41_x86_python_20250131 | fc41_x86_python_opt20250131 |
+================+==========================+=============================+
| regex_compile  | 197 ms                   | 194 ms: 1.01x faster        |
+----------------+--------------------------+-----------------------------+
| regex_dna      | 253 ms                   | 242 ms: 1.04x faster        |
+----------------+--------------------------+-----------------------------+
| regex_effbot   | 4.60 ms                  | 4.49 ms: 1.02x faster       |
+----------------+--------------------------+-----------------------------+
| regex_v8       | 33.6 ms                  | 31.9 ms: 1.05x faster       |
+----------------+--------------------------+-----------------------------+
| Geometric mean | (ref)                    | 1.03x faster                |
+----------------+--------------------------+-----------------------------+

Benchmarks with tag 'serialize':
================================

+----------------------+--------------------------+-----------------------------+
| Benchmark            | fc41_x86_python_20250131 | fc41_x86_python_opt20250131 |
+======================+==========================+=============================+
| json_loads           | 37.2 us                  | 36.8 us: 1.01x faster       |
+----------------------+--------------------------+-----------------------------+
| pickle               | 14.9 us                  | 15.1 us: 1.01x slower       |
+----------------------+--------------------------+-----------------------------+
| pickle_dict          | 40.9 us                  | 41.9 us: 1.02x slower       |
+----------------------+--------------------------+-----------------------------+
| pickle_pure_python   | 428 us                   | 420 us: 1.02x faster        |
+----------------------+--------------------------+-----------------------------+
| tomli_loads          | 3.02 sec                 | 2.96 sec: 1.02x faster      |
+----------------------+--------------------------+-----------------------------+
| unpickle_pure_python | 307 us                   | 305 us: 1.01x faster        |
+----------------------+--------------------------+-----------------------------+
| xml_etree_process    | 84.8 ms                  | 84.3 ms: 1.01x faster       |
+----------------------+--------------------------+-----------------------------+
| Geometric mean       | (ref)                    | 1.00x slower                |
+----------------------+--------------------------+-----------------------------+

Benchmark hidden because not significant (7): json_dumps, pickle_list, unpickle, unpickle_list, xml_etree_parse, xml_etree_iterparse, xml_etree_generate

Benchmarks with tag 'startup':
==============================

+------------------------+--------------------------+-----------------------------+
| Benchmark              | fc41_x86_python_20250131 | fc41_x86_python_opt20250131 |
+========================+==========================+=============================+
| python_startup         | 15.5 ms                  | 15.5 ms: 1.00x faster       |
+------------------------+--------------------------+-----------------------------+
| python_startup_no_site | 10.2 ms                  | 10.2 ms: 1.00x faster       |
+------------------------+--------------------------+-----------------------------+
| Geometric mean         | (ref)                    | 1.00x faster                |
+------------------------+--------------------------+-----------------------------+

Benchmarks with tag 'template':
===============================

+-----------------+--------------------------+-----------------------------+
| Benchmark       | fc41_x86_python_20250131 | fc41_x86_python_opt20250131 |
+=================+==========================+=============================+
| django_template | 53.3 ms                  | 51.9 ms: 1.03x faster       |
+-----------------+--------------------------+-----------------------------+
| genshi_text     | 34.5 ms                  | 33.7 ms: 1.02x faster       |
+-----------------+--------------------------+-----------------------------+
| genshi_xml      | 75.5 ms                  | 73.5 ms: 1.03x faster       |
+-----------------+--------------------------+-----------------------------+
| mako            | 17.0 ms                  | 16.8 ms: 1.01x faster       |
+-----------------+--------------------------+-----------------------------+
| Geometric mean  | (ref)                    | 1.02x faster                |
+-----------------+--------------------------+-----------------------------+

All benchmarks:
===============

+--------------------------+--------------------------+-----------------------------+
| Benchmark                | fc41_x86_python_20250131 | fc41_x86_python_opt20250131 |
+==========================+==========================+=============================+
| 2to3                     | 378 ms                   | 374 ms: 1.01x faster        |
+--------------------------+--------------------------+-----------------------------+
| async_tree_none          | 462 ms                   | 459 ms: 1.01x faster        |
+--------------------------+--------------------------+-----------------------------+
| async_tree_eager         | 159 ms                   | 158 ms: 1.01x faster        |
+--------------------------+--------------------------+-----------------------------+
| async_tree_eager_tg      | 104 ms                   | 102 ms: 1.01x faster        |
+--------------------------+--------------------------+-----------------------------+
| asyncio_tcp_ssl          | 1.84 sec                 | 1.83 sec: 1.01x faster      |
+--------------------------+--------------------------+-----------------------------+
| chameleon                | 10.2 ms                  | 10.1 ms: 1.01x faster       |
+--------------------------+--------------------------+-----------------------------+
| chaos                    | 84.6 ms                  | 83.6 ms: 1.01x faster       |
+--------------------------+--------------------------+-----------------------------+
| coroutines               | 33.4 ms                  | 32.8 ms: 1.02x faster       |
+--------------------------+--------------------------+-----------------------------+
| crypto_pyaes             | 105 ms                   | 103 ms: 1.02x faster        |
+--------------------------+--------------------------+-----------------------------+
| deepcopy                 | 545 us                   | 530 us: 1.03x faster        |
+--------------------------+--------------------------+-----------------------------+
| deepcopy_reduce          | 4.81 us                  | 4.68 us: 1.03x faster       |
+--------------------------+--------------------------+-----------------------------+
| deepcopy_memo            | 60.0 us                  | 59.4 us: 1.01x faster       |
+--------------------------+--------------------------+-----------------------------+
| deltablue                | 4.48 ms                  | 4.41 ms: 1.02x faster       |
+--------------------------+--------------------------+-----------------------------+
| django_template          | 53.3 ms                  | 51.9 ms: 1.03x faster       |
+--------------------------+--------------------------+-----------------------------+
| docutils                 | 3.56 sec                 | 3.52 sec: 1.01x faster      |
+--------------------------+--------------------------+-----------------------------+
| fannkuch                 | 557 ms                   | 548 ms: 1.02x faster        |
+--------------------------+--------------------------+-----------------------------+
| create_gc_cycles         | 1.57 ms                  | 1.58 ms: 1.01x slower       |
+--------------------------+--------------------------+-----------------------------+
| gc_traversal             | 4.47 ms                  | 4.47 ms: 1.00x slower       |
+--------------------------+--------------------------+-----------------------------+
| generators               | 40.5 ms                  | 40.1 ms: 1.01x faster       |
+--------------------------+--------------------------+-----------------------------+
| genshi_text              | 34.5 ms                  | 33.7 ms: 1.02x faster       |
+--------------------------+--------------------------+-----------------------------+
| genshi_xml               | 75.5 ms                  | 73.5 ms: 1.03x faster       |
+--------------------------+--------------------------+-----------------------------+
| go                       | 201 ms                   | 197 ms: 1.02x faster        |
+--------------------------+--------------------------+-----------------------------+
| html5lib                 | 91.4 ms                  | 90.4 ms: 1.01x faster       |
+--------------------------+--------------------------+-----------------------------+
| json_loads               | 37.2 us                  | 36.8 us: 1.01x faster       |
+--------------------------+--------------------------+-----------------------------+
| logging_silent           | 146 ns                   | 144 ns: 1.01x faster        |
+--------------------------+--------------------------+-----------------------------+
| logging_simple           | 9.00 us                  | 8.84 us: 1.02x faster       |
+--------------------------+--------------------------+-----------------------------+
| mako                     | 17.0 ms                  | 16.8 ms: 1.01x faster       |
+--------------------------+--------------------------+-----------------------------+
| mdp                      | 3.59 sec                 | 3.55 sec: 1.01x faster      |
+--------------------------+--------------------------+-----------------------------+
| meteor_contest           | 148 ms                   | 144 ms: 1.03x faster        |
+--------------------------+--------------------------+-----------------------------+
| nqueens                  | 122 ms                   | 121 ms: 1.01x faster        |
+--------------------------+--------------------------+-----------------------------+
| pathlib                  | 27.3 ms                  | 26.6 ms: 1.02x faster       |
+--------------------------+--------------------------+-----------------------------+
| pickle                   | 14.9 us                  | 15.1 us: 1.01x slower       |
+--------------------------+--------------------------+-----------------------------+
| pickle_dict              | 40.9 us                  | 41.9 us: 1.02x slower       |
+--------------------------+--------------------------+-----------------------------+
| pickle_pure_python       | 428 us                   | 420 us: 1.02x faster        |
+--------------------------+--------------------------+-----------------------------+
| pidigits                 | 249 ms                   | 249 ms: 1.00x faster        |
+--------------------------+--------------------------+-----------------------------+
| pprint_safe_repr         | 1.09 sec                 | 1.08 sec: 1.01x faster      |
+--------------------------+--------------------------+-----------------------------+
| pprint_pformat           | 2.21 sec                 | 2.19 sec: 1.01x faster      |
+--------------------------+--------------------------+-----------------------------+
| pyflate                  | 642 ms                   | 635 ms: 1.01x faster        |
+--------------------------+--------------------------+-----------------------------+
| python_startup           | 15.5 ms                  | 15.5 ms: 1.00x faster       |
+--------------------------+--------------------------+-----------------------------+
| python_startup_no_site   | 10.2 ms                  | 10.2 ms: 1.00x faster       |
+--------------------------+--------------------------+-----------------------------+
| raytrace                 | 371 ms                   | 369 ms: 1.01x faster        |
+--------------------------+--------------------------+-----------------------------+
| regex_compile            | 197 ms                   | 194 ms: 1.01x faster        |
+--------------------------+--------------------------+-----------------------------+
| regex_dna                | 253 ms                   | 242 ms: 1.04x faster        |
+--------------------------+--------------------------+-----------------------------+
| regex_effbot             | 4.60 ms                  | 4.49 ms: 1.02x faster       |
+--------------------------+--------------------------+-----------------------------+
| regex_v8                 | 33.6 ms                  | 31.9 ms: 1.05x faster       |
+--------------------------+--------------------------+-----------------------------+
| richards                 | 66.7 ms                  | 65.3 ms: 1.02x faster       |
+--------------------------+--------------------------+-----------------------------+
| richards_super           | 74.9 ms                  | 73.5 ms: 1.02x faster       |
+--------------------------+--------------------------+-----------------------------+
| scimark_fft              | 560 ms                   | 539 ms: 1.04x faster        |
+--------------------------+--------------------------+-----------------------------+
| scimark_monte_carlo      | 97.7 ms                  | 96.7 ms: 1.01x faster       |
+--------------------------+--------------------------+-----------------------------+
| scimark_sor              | 191 ms                   | 186 ms: 1.03x faster        |
+--------------------------+--------------------------+-----------------------------+
| spectral_norm            | 166 ms                   | 167 ms: 1.01x slower        |
+--------------------------+--------------------------+-----------------------------+
| sqlglot_normalize        | 159 ms                   | 157 ms: 1.01x faster        |
+--------------------------+--------------------------+-----------------------------+
| sqlglot_optimize         | 79.1 ms                  | 78.7 ms: 1.01x faster       |
+--------------------------+--------------------------+-----------------------------+
| sqlglot_transpile        | 2.23 ms                  | 2.21 ms: 1.01x faster       |
+--------------------------+--------------------------+-----------------------------+
| sympy_expand             | 683 ms                   | 675 ms: 1.01x faster        |
+--------------------------+--------------------------+-----------------------------+
| sympy_integrate          | 28.4 ms                  | 28.2 ms: 1.01x faster       |
+--------------------------+--------------------------+-----------------------------+
| sympy_str                | 404 ms                   | 400 ms: 1.01x faster        |
+--------------------------+--------------------------+-----------------------------+
| telco                    | 11.6 ms                  | 11.5 ms: 1.01x faster       |
+--------------------------+--------------------------+-----------------------------+
| tomli_loads              | 3.02 sec                 | 2.96 sec: 1.02x faster      |
+--------------------------+--------------------------+-----------------------------+
| tornado_http             | 180 ms                   | 178 ms: 1.01x faster        |
+--------------------------+--------------------------+-----------------------------+
| typing_runtime_protocols | 236 us                   | 239 us: 1.02x slower        |
+--------------------------+--------------------------+-----------------------------+
| unpack_sequence          | 61.8 ns                  | 68.9 ns: 1.12x slower       |
+--------------------------+--------------------------+-----------------------------+
| unpickle_pure_python     | 307 us                   | 305 us: 1.01x faster        |
+--------------------------+--------------------------+-----------------------------+
| xml_etree_process        | 84.8 ms                  | 84.3 ms: 1.01x faster       |
+--------------------------+--------------------------+-----------------------------+
| Geometric mean           | (ref)                    | 1.01x faster                |
+--------------------------+--------------------------+-----------------------------+

Benchmark hidden because not significant (38): async_generators, async_tree_cpu_io_mixed, async_tree_cpu_io_mixed_tg, async_tree_eager_cpu_io_mixed, async_tree_eager_cpu_io_mixed_tg, async_tree_eager_io, async_tree_eager_io_tg, async_tree_eager_memoization, async_tree_eager_memoization_tg, async_tree_io, async_tree_io_tg, async_tree_memoization, async_tree_memoization_tg, async_tree_none_tg, asyncio_tcp, asyncio_websockets, comprehensions, bench_mp_pool, bench_thread_pool, coverage, dask, dulwich_log, float, hexiom, json_dumps, logging_format, nbody, pickle_list, scimark_lu, scimark_sparse_mat_mult, sqlglot_parse, sqlite_synth, sympy_sum, unpickle, unpickle_list, xml_etree_parse, xml_etree_iterparse, xml_etree_generate

-Will Cohen
--
_______________________________________________
devel mailing list -- devel@xxxxxxxxxxxxxxxxxxxxxxx
To unsubscribe send an email to devel-leave@xxxxxxxxxxxxxxxxxxxxxxx
Fedora Code of Conduct: https://docs.fedoraproject.org/en-US/project/code-of-conduct/
List Guidelines: https://fedoraproject.org/wiki/Mailing_list_guidelines
List Archives: https://lists.fedoraproject.org/archives/list/devel@xxxxxxxxxxxxxxxxxxxxxxx
Do not reply to spam, report it: https://pagure.io/fedora-infrastructure/new_issue