Re: Applying code layout optimization to postgresql16 RPMs in Fedora 41 gave a 3%-6% improvement in IPC

William Cohen <wcohen@xxxxxxxxxx> · Wed, 4 Dec 2024 14:32:35 -0500

On 11/21/24 17:32, Miro Hrončok wrote:
> On 21. 11. 24 23:11, William Cohen wrote:
>> Sediment has been designed to work with the RPM build process.
>> Currently, one needs to use modified RPM macros.  These can be created
>> quickly by writing the output of the sediment make_sediment_rpmmacros
>> command into ~/.rpmmacros.  One will also need to set the pgo
>> macro to 1 for the rpmbuild process.  The rpm spec file has minimal
>> modifications.  It has the call graph file stored as a source file and
>> defines the global call_graph to point to the source file that holds
>> the call graph.
> 
> Hey Will,
> 
> let's say I want to try this for Python. Where do I start? The README on https://github.com/wcohen/sediment is not very helpful.
> 
> This is what I did based on your email:
> 
> $ sudo dnf --enable-repo=updates-testing install sediment
> ...
> Installing sediment-0:0.9.3-1.fc41.noarch
> 
> I run make_sediment_rpmmacros, it gives me some macros. Now I am supposed to put those into ~/.rpmmacros. Except I never build Python locally, I use Koji or mock. I can probably amend this to use %global and insert it into python3.14.spec. But what else do I need to do? Do you have a step-by-step kind of document I can follow?
> 

Hi Miro,

The tooling doesn't yet fit your workflow of building packages in
Koji and mock.  I am looking into ways of addressing that issue.

In an earlier email I mentioned that the important thing is to have
good profiling data.  Do you have suggestions for benchmarks that would
properly exercise the python interpreter?  I have used pyperformance
(https://github.com/python/pyperformance) to get some call graph data
for python and added that to a python3.13 srpm available at
https://koji.fedoraproject.org/koji/taskinfo?taskID=126526066.  Note
that the Koji build does NOT apply the code layout optimization.  One
would still need to build python3.13-3.13.0-1.fc41.src.rpm locally with
sediment-0.9.4
(https://koji.fedoraproject.org/koji/buildinfo?buildID=2596791)
installed and ~/.rpmmacros set up, using the following steps:

   make_sediment_rpmmacros > ~/.rpmmacros
   rpm -Uvh python3.13-3.13.0-1.fc41.src.rpm
   cd ~/rpmbuild/SPECS
   rpmbuild -ba --define "pgo 1" python3.13.spec

The notable difference in the python3.13.spec file is the addition of:

# Call graph information
SOURCE12: perf_pybenchmark.gv
%global call_graph %{SOURCE12}

The perf_pybenchmark.gv file was generated with the following steps:

   python3 -m pip install pyperformance
   perf record -e branches:u -j any_call -o perf_pybenchmark.data pyperformance run -f -o fc41_x86_python_baseline.json
   perf report -i perf_pybenchmark.data --no-demangle --sort=comm,dso_from,symbol_from,dso_to,symbol_to > perf_pybenchmark.out
   perf2gv < perf_pybenchmark.out > perf_pybenchmark.gv
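
As a rough sketch, the three commands above could be parameterized so
the same pipeline can be pointed at a different workload.  The names
Step, call_graph_steps and render are hypothetical helpers made up for
this illustration and are not part of sediment or perf.  The script only
prints the command lines, it does not run them:

```python
import shlex
from typing import List, NamedTuple, Optional


class Step(NamedTuple):
    """One command plus optional shell redirections (hypothetical helper)."""
    argv: List[str]
    stdin: Optional[str] = None
    stdout: Optional[str] = None


def call_graph_steps(stem: str, workload: List[str]) -> List[Step]:
    """Build the perf record / perf report / perf2gv steps for a workload."""
    data, report, graph = f"{stem}.data", f"{stem}.out", f"{stem}.gv"
    return [
        # Sample user-space call branches while the workload runs.
        Step(["perf", "record", "-e", "branches:u", "-j", "any_call",
              "-o", data, *workload]),
        # Dump caller/callee pairs as text.
        Step(["perf", "report", "-i", data, "--no-demangle",
              "--sort=comm,dso_from,symbol_from,dso_to,symbol_to"],
             stdout=report),
        # Convert the text report into the .gv call graph file.
        Step(["perf2gv"], stdin=report, stdout=graph),
    ]


def render(step: Step) -> str:
    """Format a step as a shell command line with safe quoting."""
    line = shlex.join(step.argv)
    if step.stdin:
        line += " < " + shlex.quote(step.stdin)
    if step.stdout:
        line += " > " + shlex.quote(step.stdout)
    return line


steps = call_graph_steps(
    "perf_pybenchmark",
    ["pyperformance", "run", "-f", "-o", "fc41_x86_python_baseline.json"],
)
for step in steps:
    print(render(step))
```

Passing a different stem and workload list would give the matching
commands for another benchmark without editing each file name by hand.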

I then added the file to the python srpm:

   cp  perf_pybenchmark.gv ~/rpmbuild/SOURCES/.
   # edit ~/rpmbuild/SPECS/python3.13.spec to add call graph info

The pyperformance results for the code layout optimized python were
mixed compared with the baseline version.  This can be seen in the
attached python_pgo.out generated by:

   python3 -m pyperf compare_to fc41_x86_python_baseline.json fc41_x86_python_pgo.json --table > python_pgo.out
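
To make "mixed" a little more concrete, here is a minimal sketch that
reads rows in the table format of python_pgo.out, counts how many
benchmarks got faster or slower, and recomputes a geometric mean.  It
assumes the "Geometric mean" row is the geometric mean of the
per-benchmark time ratios, and it works from the rounded values printed
in the table, so it can only approximate what pyperf reports.  The
function names are made up for this example:

```python
import math
import re

# Rows copied from the 'serialize' table in python_pgo.out.
TABLE = """\
| json_dumps           | 11.8 ms                  | 12.1 ms: 1.03x slower |
| json_loads           | 28.9 us                  | 28.7 us: 1.01x faster |
| pickle               | 11.9 us                  | 11.5 us: 1.03x faster |
| pickle_dict          | 34.1 us                  | 31.5 us: 1.08x faster |
| pickle_list          | 5.05 us                  | 4.82 us: 1.05x faster |
| unpickle             | 16.2 us                  | 16.4 us: 1.01x slower |
| unpickle_pure_python | 236 us                   | 241 us: 1.02x slower  |
| xml_etree_iterparse  | 108 ms                   | 107 ms: 1.01x faster  |
| Geometric mean       | (ref)                    | 1.01x faster          |
"""

# name | baseline value unit | optimized value unit: ...
ROW = re.compile(
    r"^\|\s*(\S+)\s*\|\s*([\d.]+)\s+(\w+)\s*\|\s*([\d.]+)\s+(\w+):"
)


def speedups(text):
    """Map benchmark name to baseline_time / optimized_time (>1 is faster)."""
    result = {}
    for line in text.splitlines():
        match = ROW.match(line)
        if not match:
            continue  # header, separator, or the "Geometric mean" row
        name, base, base_unit, new, new_unit = match.groups()
        if base_unit != new_unit:
            continue  # this sketch does not convert between units
        result[name] = float(base) / float(new)
    return result


def geometric_mean(values):
    """Geometric mean computed via the mean of the logarithms."""
    values = list(values)
    return math.exp(sum(math.log(v) for v in values) / len(values))


ratios = speedups(TABLE)
faster = sum(1 for r in ratios.values() if r > 1.0)
slower = sum(1 for r in ratios.values() if r < 1.0)
overall = geometric_mean(ratios.values())
print(f"{faster} faster, {slower} slower, "
      f"geometric mean speedup {overall:.3f}x")
```

Feeding the whole python_pgo.out file to speedups() instead of the
embedded sample would give the same tally across all the tables, with
benchmarks that appear in more than one table counted once.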

It looks like a number of the benchmarks are microbenchmarks that are
unlikely to benefit much from the code layout optimizations.

Are there other python performance tests that you would suggest that
have a larger footprint and would better gauge the possible
performance improvement from the code layout optimization?

Are there better python code examples to collect profiling data on?

-Will

Benchmarks with tag 'apps':
===========================

Benchmark hidden because not significant (5): 2to3, chameleon, docutils, html5lib, tornado_http

Benchmarks with tag 'asyncio':
==============================

+-------------------------+--------------------------+-----------------------+
| Benchmark               | fc41_x86_python_baseline | fc41_x86_python_pgo   |
+=========================+==========================+=======================+
| async_tree_cpu_io_mixed | 613 ms                   | 621 ms: 1.01x slower  |
+-------------------------+--------------------------+-----------------------+
| async_tree_eager        | 120 ms                   | 123 ms: 1.03x slower  |
+-------------------------+--------------------------+-----------------------+
| async_tree_eager_tg     | 77.1 ms                  | 78.3 ms: 1.02x slower |
+-------------------------+--------------------------+-----------------------+
| Geometric mean          | (ref)                    | 1.01x slower          |
+-------------------------+--------------------------+-----------------------+

Benchmark hidden because not significant (13): async_tree_none, async_tree_cpu_io_mixed_tg, async_tree_eager_cpu_io_mixed, async_tree_eager_cpu_io_mixed_tg, async_tree_eager_io, async_tree_eager_io_tg, async_tree_eager_memoization, async_tree_eager_memoization_tg, async_tree_io, async_tree_io_tg, async_tree_memoization, async_tree_memoization_tg, async_tree_none_tg

Benchmarks with tag 'math':
===========================

+----------------+--------------------------+-----------------------+
| Benchmark      | fc41_x86_python_baseline | fc41_x86_python_pgo   |
+================+==========================+=======================+
| float          | 92.5 ms                  | 91.3 ms: 1.01x faster |
+----------------+--------------------------+-----------------------+
| Geometric mean | (ref)                    | 1.00x faster          |
+----------------+--------------------------+-----------------------+

Benchmark hidden because not significant (2): nbody, pidigits

Benchmarks with tag 'regex':
============================

+----------------+--------------------------+-----------------------+
| Benchmark      | fc41_x86_python_baseline | fc41_x86_python_pgo   |
+================+==========================+=======================+
| regex_compile  | 151 ms                   | 148 ms: 1.01x faster  |
+----------------+--------------------------+-----------------------+
| regex_dna      | 194 ms                   | 188 ms: 1.03x faster  |
+----------------+--------------------------+-----------------------+
| regex_effbot   | 3.55 ms                  | 3.44 ms: 1.03x faster |
+----------------+--------------------------+-----------------------+
| regex_v8       | 25.7 ms                  | 24.3 ms: 1.06x faster |
+----------------+--------------------------+-----------------------+
| Geometric mean | (ref)                    | 1.03x faster          |
+----------------+--------------------------+-----------------------+

Benchmarks with tag 'serialize':
================================

+----------------------+--------------------------+-----------------------+
| Benchmark            | fc41_x86_python_baseline | fc41_x86_python_pgo   |
+======================+==========================+=======================+
| json_dumps           | 11.8 ms                  | 12.1 ms: 1.03x slower |
+----------------------+--------------------------+-----------------------+
| json_loads           | 28.9 us                  | 28.7 us: 1.01x faster |
+----------------------+--------------------------+-----------------------+
| pickle               | 11.9 us                  | 11.5 us: 1.03x faster |
+----------------------+--------------------------+-----------------------+
| pickle_dict          | 34.1 us                  | 31.5 us: 1.08x faster |
+----------------------+--------------------------+-----------------------+
| pickle_list          | 5.05 us                  | 4.82 us: 1.05x faster |
+----------------------+--------------------------+-----------------------+
| unpickle             | 16.2 us                  | 16.4 us: 1.01x slower |
+----------------------+--------------------------+-----------------------+
| unpickle_pure_python | 236 us                   | 241 us: 1.02x slower  |
+----------------------+--------------------------+-----------------------+
| xml_etree_iterparse  | 108 ms                   | 107 ms: 1.01x faster  |
+----------------------+--------------------------+-----------------------+
| Geometric mean       | (ref)                    | 1.01x faster          |
+----------------------+--------------------------+-----------------------+

Benchmark hidden because not significant (6): pickle_pure_python, tomli_loads, unpickle_list, xml_etree_parse, xml_etree_generate, xml_etree_process

Benchmarks with tag 'startup':
==============================

+------------------------+--------------------------+-----------------------+
| Benchmark              | fc41_x86_python_baseline | fc41_x86_python_pgo   |
+========================+==========================+=======================+
| python_startup         | 13.5 ms                  | 12.9 ms: 1.05x faster |
+------------------------+--------------------------+-----------------------+
| python_startup_no_site | 9.18 ms                  | 8.48 ms: 1.08x faster |
+------------------------+--------------------------+-----------------------+
| Geometric mean         | (ref)                    | 1.07x faster          |
+------------------------+--------------------------+-----------------------+

Benchmarks with tag 'template':
===============================

+----------------+--------------------------+-----------------------+
| Benchmark      | fc41_x86_python_baseline | fc41_x86_python_pgo   |
+================+==========================+=======================+
| mako           | 13.1 ms                  | 13.3 ms: 1.02x slower |
+----------------+--------------------------+-----------------------+
| Geometric mean | (ref)                    | 1.01x slower          |
+----------------+--------------------------+-----------------------+

Benchmark hidden because not significant (3): django_template, genshi_text, genshi_xml

All benchmarks:
===============

+-------------------------+--------------------------+------------------------+
| Benchmark               | fc41_x86_python_baseline | fc41_x86_python_pgo    |
+=========================+==========================+========================+
| async_tree_cpu_io_mixed | 613 ms                   | 621 ms: 1.01x slower   |
+-------------------------+--------------------------+------------------------+
| async_tree_eager        | 120 ms                   | 123 ms: 1.03x slower   |
+-------------------------+--------------------------+------------------------+
| async_tree_eager_tg     | 77.1 ms                  | 78.3 ms: 1.02x slower  |
+-------------------------+--------------------------+------------------------+
| chaos                   | 67.0 ms                  | 65.7 ms: 1.02x faster  |
+-------------------------+--------------------------+------------------------+
| comprehensions          | 19.4 us                  | 19.7 us: 1.02x slower  |
+-------------------------+--------------------------+------------------------+
| bench_mp_pool           | 9.91 ms                  | 17.2 ms: 1.74x slower  |
+-------------------------+--------------------------+------------------------+
| bench_thread_pool       | 1.45 ms                  | 1.52 ms: 1.05x slower  |
+-------------------------+--------------------------+------------------------+
| coroutines              | 24.7 ms                  | 25.0 ms: 1.01x slower  |
+-------------------------+--------------------------+------------------------+
| crypto_pyaes            | 81.6 ms                  | 80.8 ms: 1.01x faster  |
+-------------------------+--------------------------+------------------------+
| deepcopy                | 412 us                   | 416 us: 1.01x slower   |
+-------------------------+--------------------------+------------------------+
| deepcopy_reduce         | 3.72 us                  | 3.61 us: 1.03x faster  |
+-------------------------+--------------------------+------------------------+
| deepcopy_memo           | 46.0 us                  | 45.5 us: 1.01x faster  |
+-------------------------+--------------------------+------------------------+
| float                   | 92.5 ms                  | 91.3 ms: 1.01x faster  |
+-------------------------+--------------------------+------------------------+
| create_gc_cycles        | 1.22 ms                  | 1.20 ms: 1.01x faster  |
+-------------------------+--------------------------+------------------------+
| gc_traversal            | 3.30 ms                  | 3.25 ms: 1.01x faster  |
+-------------------------+--------------------------+------------------------+
| generators              | 31.1 ms                  | 30.7 ms: 1.01x faster  |
+-------------------------+--------------------------+------------------------+
| hexiom                  | 6.85 ms                  | 6.79 ms: 1.01x faster  |
+-------------------------+--------------------------+------------------------+
| json_dumps              | 11.8 ms                  | 12.1 ms: 1.03x slower  |
+-------------------------+--------------------------+------------------------+
| json_loads              | 28.9 us                  | 28.7 us: 1.01x faster  |
+-------------------------+--------------------------+------------------------+
| logging_format          | 7.48 us                  | 7.63 us: 1.02x slower  |
+-------------------------+--------------------------+------------------------+
| mako                    | 13.1 ms                  | 13.3 ms: 1.02x slower  |
+-------------------------+--------------------------+------------------------+
| mdp                     | 2.78 sec                 | 2.63 sec: 1.06x faster |
+-------------------------+--------------------------+------------------------+
| nqueens                 | 93.0 ms                  | 92.5 ms: 1.01x faster  |
+-------------------------+--------------------------+------------------------+
| pickle                  | 11.9 us                  | 11.5 us: 1.03x faster  |
+-------------------------+--------------------------+------------------------+
| pickle_dict             | 34.1 us                  | 31.5 us: 1.08x faster  |
+-------------------------+--------------------------+------------------------+
| pickle_list             | 5.05 us                  | 4.82 us: 1.05x faster  |
+-------------------------+--------------------------+------------------------+
| pyflate                 | 504 ms                   | 499 ms: 1.01x faster   |
+-------------------------+--------------------------+------------------------+
| python_startup          | 13.5 ms                  | 12.9 ms: 1.05x faster  |
+-------------------------+--------------------------+------------------------+
| python_startup_no_site  | 9.18 ms                  | 8.48 ms: 1.08x faster  |
+-------------------------+--------------------------+------------------------+
| raytrace                | 288 ms                   | 291 ms: 1.01x slower   |
+-------------------------+--------------------------+------------------------+
| regex_compile           | 151 ms                   | 148 ms: 1.01x faster   |
+-------------------------+--------------------------+------------------------+
| regex_dna               | 194 ms                   | 188 ms: 1.03x faster   |
+-------------------------+--------------------------+------------------------+
| regex_effbot            | 3.55 ms                  | 3.44 ms: 1.03x faster  |
+-------------------------+--------------------------+------------------------+
| regex_v8                | 25.7 ms                  | 24.3 ms: 1.06x faster  |
+-------------------------+--------------------------+------------------------+
| richards                | 51.3 ms                  | 52.0 ms: 1.01x slower  |
+-------------------------+--------------------------+------------------------+
| scimark_lu              | 128 ms                   | 130 ms: 1.01x slower   |
+-------------------------+--------------------------+------------------------+
| scimark_sor             | 147 ms                   | 148 ms: 1.01x slower   |
+-------------------------+--------------------------+------------------------+
| scimark_sparse_mat_mult | 5.89 ms                  | 5.81 ms: 1.01x faster  |
+-------------------------+--------------------------+------------------------+
| sqlglot_normalize       | 120 ms                   | 123 ms: 1.02x slower   |
+-------------------------+--------------------------+------------------------+
| sqlite_synth            | 2.42 us                  | 2.50 us: 1.04x slower  |
+-------------------------+--------------------------+------------------------+
| telco                   | 9.00 ms                  | 8.78 ms: 1.03x faster  |
+-------------------------+--------------------------+------------------------+
| unpickle                | 16.2 us                  | 16.4 us: 1.01x slower  |
+-------------------------+--------------------------+------------------------+
| unpickle_pure_python    | 236 us                   | 241 us: 1.02x slower   |
+-------------------------+--------------------------+------------------------+
| xml_etree_iterparse     | 108 ms                   | 107 ms: 1.01x faster   |
+-------------------------+--------------------------+------------------------+
| Geometric mean          | (ref)                    | 1.00x slower           |
+-------------------------+--------------------------+------------------------+

Benchmark hidden because not significant (58): 2to3, async_generators, async_tree_none, async_tree_cpu_io_mixed_tg, async_tree_eager_cpu_io_mixed, async_tree_eager_cpu_io_mixed_tg, async_tree_eager_io, async_tree_eager_io_tg, async_tree_eager_memoization, async_tree_eager_memoization_tg, async_tree_io, async_tree_io_tg, async_tree_memoization, async_tree_memoization_tg, async_tree_none_tg, asyncio_tcp, asyncio_tcp_ssl, asyncio_websockets, chameleon, coverage, dask, deltablue, django_template, docutils, dulwich_log, fannkuch, genshi_text, genshi_xml, go, html5lib, logging_silent, logging_simple, meteor_contest, nbody, pathlib, pickle_pure_python, pidigits, pprint_safe_repr, pprint_pformat, richards_super, scimark_fft, scimark_monte_carlo, spectral_norm, sqlglot_optimize, sqlglot_parse, sqlglot_transpile, sympy_expand, sympy_integrate, sympy_sum, sympy_str, tomli_loads, tornado_http, typing_runtime_protocols, unpack_sequence, unpickle_list, xml_etree_parse, xml_etree_generate, xml_etree_process