Re: Applying code layout optimization to postgresql16 RPMs in Fedora 41 gave a 3%-6% improvement in IPC

William Cohen <wcohen@xxxxxxxxxx> · Fri, 31 Jan 2025 20:10:57 -0500

On 12/5/24 9:45 AM, Miro Hrončok wrote:
> On 04. 12. 24 20:32, William Cohen wrote:
>> On 11/21/24 17:32, Miro Hrončok wrote:
>>> On 21. 11. 24 23:11, William Cohen wrote:
>>>> Sediment has been designed to work with the RPM build process.
>>>> Currently, one needs to use modified RPM macros.  These can be created
>>>> quickly by writing the output of the sediment make_sediment_rpmmacros
>>>> command into ~/.rpmmacros.  One will also need to set the pgo
>>>> macro to 1 for the rpmbuild process.  The rpm spec file has minimal
>>>> modifications.  It has the call graph file stored as a source file and
>>>> defines the global call_graph as the source file that holds the call
>>>> graph.
>>>
>>> Hey Will,
>>>
>>> let's say I want to try this for Python. Where do I start? The README on https://github.com/wcohen/sediment is not very helpful.
>>>
>>> This is what I did based on your email:
>>>
>>> $ sudo dnf --enable-repo=updates-testing install sediment
>>> ...
>>> Installing sediment-0:0.9.3-1.fc41.noarch
>>>
>>> I ran make_sediment_rpmmacros and it gave me some macros. Now I am supposed to put those into ~/.rpmmacros. Except I never build Python locally; I use Koji or mock. I can probably amend this to use %global and insert it into python3.14.spec. But what else do I need to do? Do you have a step-by-step document I can follow?
>>>
>>
>>
>> Hi Miro,
>>
>> The tooling doesn't yet fit your workflow of building packages in
>> koji and mock.  I am looking into ways of addressing that issue.
>>
>> In an earlier email I mentioned that the important thing was to have good
>> profiling data.  Do you have suggestions for benchmarks that would
>> properly exercise the python interpreter?  I have used pyperformance
>> (https://github.com/python/pyperformance) to get some call graph data
>> for python and added that to a python3.13 srpm available at
>> https://koji.fedoraproject.org/koji/taskinfo?taskID=126526066.  Note that
>> Koji is NOT building with the code layout optimization.  One would still
>> need to build python3.13-3.13.0-1.fc41.src.rpm locally with sediment-0.9.4
>> (https://koji.fedoraproject.org/koji/buildinfo?buildID=2596791)
>> installed and ~/.rpmmacros set up, following these steps:
>>
>>     make_sediment_rpmmacros > ~/.rpmmacros
>>     rpm -Uvh python3.13-3.13.0-1.fc41.src.rpm
>>     cd ~/rpmbuild/SPECS
>>     rpmbuild -ba --define "pgo 1" python3.13.spec
>>
>> The notable difference in the python3.13.spec file is the addition of:
>>
>> # Call graph information
>> SOURCE12: perf_pybenchmark.gv
>> %global call_graph %{SOURCE12}
>>
>> The perf_pybenchmark.gv was generated with steps:
>>
>>     python3 -m pip install pyperformance
>>     perf record -e branches:u -j any_call -o perf_pybenchmark.data pyperformance run -f -o fc41_x86_python_baseline.json
>>     perf report -i perf_pybenchmark.data --no-demangle --sort=comm,dso_from,symbol_from,dso_to,symbol_to > perf_pybenchmark.out
>>     perf2gv < perf_pybenchmark.out > perf_pybenchmark.gv
>>
>> Added the file to the python srpm:
>>
>>     cp  perf_pybenchmark.gv ~/rpmbuild/SOURCES/.
>>     # edit ~/rpmbuild/SPECS/python3.13.spec to add call graph info
>>
>> The improvements were mixed between the code layout optimized python
>> and the baseline version of the pyperformance benchmarks.  This can be
>> seen in the attached python_pgo.out generated by:
>>
>>     python3 -m pyperf compare_to fc41_x86_python_baseline.json fc41_x86_python_pgo.json --table > python_pgo.out
>>
>> It looks like a number of the benchmarks are microbenchmarks that are
>> unlikely to benefit much from the code layout optimizations.
>>
>> Are there other python performance tests that you would suggest that
>> have a larger footprint and would better gauge the possible
>> performance improvement from the code layout optimization?
>>
>> Are there better python code examples to collect profiling data on?
> Hey Will,
> 
> thanks for looking into this.
> 
> For your question: Upstream is using this for PGO:
> 
>   $ python3.14 -m test --pgo
> 
> Or:
> 
>   $ python3.14 -m test --pgo-extended
> 
> In spec, this can be used:
> 
>   LD_LIBRARY_PATH=./build/optimized ./build/optimized/python -m test ...
> 
> ---
> 
> What is the blocker to run this in Koji/mock?
> 
> You do `make_sediment_rpmmacros > ~/.rpmmacros`.
> 
> What's the issue with %defining such macros at spec level?
> 

Hi,

I was able to do some experiments with the koji/mock buildable python3.13-3.13.0-1.fc41_opt.src.rpm (https://koji.fedoraproject.org/koji/taskinfo?taskID=128437060) and, with vstinner's suggestions for profiling python, get better measurements of the performance impact. On a Lenovo P51 laptop running Fedora 41 I built two versions of the RPMs. The training data was collected from a pyperformance run and analyzed with the sediment tools using:

   python3 -m pip install pyperformance
   perf record -e branches:u -j any_call -o perf_pybenchmark.data pyperformance run -f -o fc41_x86_python_baseline.json
   perf report -i perf_pybenchmark.data --no-demangle --sort=comm,dso_from,symbol_from,dso_to,symbol_to > perf_pybenchmark.out
   perf2gv < perf_pybenchmark.out > perf_pybenchmark.gv

Installed the srpm, went into the SPECS directory, and built the code-layout-optimized RPMs (which have an added _opt in their names) with:

   rpm -Uvh python3.13-3.13.0-1.fc41_opt.src.rpm 
   cd ~/rpmbuild/SPECS
   rpmbuild -ba python3.13.spec

Built RPMs without the code-layout optimization (no _opt in the RPM names):

  rpmbuild --without opt -ba python3.13.spec 

Installed the code-layout-optimized RPMs, set up the environment for benchmarking, and ran the tests:

  sudo dnf install ~/rpmbuild/RPMS/x86_64/python*fc41_opt* ~/rpmbuild/RPMS/noarch/python-unversioned-command-3.13.0-1.fc41_opt.noarch.rpm
  sudo python3 -m pyperf system tune
  pyperformance run -f -o fc41_x86_python_opt20250131.json >& fc41_pybench_opt_20250131.log

Then collected data for the non-optimized version of the rpms:

  sudo dnf install ~/rpmbuild/RPMS/x86_64/python*fc41.* ~/rpmbuild/RPMS/noarch/python-unversioned-command-3.13.0-1.fc41.noarch.rpm
  sudo python3 -m pyperf system tune
  pyperformance run -f -o fc41_x86_python_20250131.json >& fc41_pybench_20250131.log

Once done, I compared the data between the runs with:

 python3 -m pyperf compare_to fc41_x86_python_20250131.json  fc41_x86_python_opt20250131.json --table > python_opt.out

Below is the comparison between the two versions (python_opt.out). For the vast majority of the benchmarks the optimized code is slightly faster, typically by about 1%.  The regex_* benchmarks showed the largest benefit, with regex_v8 being 1.05x faster and regex_dna 1.04x faster. Several benchmarks are slightly slower: pickle, pickle_dict, create_gc_cycles, spectral_norm, and typing_runtime_protocols.  unpack_sequence was the worst, being 1.12x slower with the optimized code.  The improvements are not as noticeable as what was seen with postgresql.  I suspect this is because pyperformance has many microbenchmarks and does not put as much pressure on the iTLB as the large postgresql binary does.
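
The slower benchmarks named above can be pulled out of python_opt.out mechanically. The following is only a sketch: it assumes the pyperf --table layout shown below, and slower_benchmarks is a hypothetical helper name, not part of pyperf or sediment.

```shell
# List the benchmarks that the optimized build ran slower, worst first.
# Assumes rows of the form "| name | baseline | time: 1.12x slower |".
# Rows appear once per tag table and again under "All benchmarks",
# so duplicates are dropped with sort -u.
# Usage: slower_benchmarks python_opt.out
slower_benchmarks() {
    awk -F'|' '/x slower/ && $2 !~ /Geometric mean/ {
        name = $2; gsub(/ /, "", name)          # strip column padding
        split($4, parts, ": ")                  # "68.9 ns" / "1.12x slower"
        factor = parts[2]; sub(/x slower.*/, "", factor)
        print factor, name
    }' "$1" | sort -u | sort -k1,1nr
}
```

Each output line is the slowdown factor followed by the benchmark name, so the largest regression is on the first line.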

Benchmarks with tag 'apps':
===========================

+----------------+--------------------------+-----------------------------+
| Benchmark      | fc41_x86_python_20250131 | fc41_x86_python_opt20250131 |
+================+==========================+=============================+
| 2to3           | 378 ms                   | 374 ms: 1.01x faster        |
+----------------+--------------------------+-----------------------------+
| chameleon      | 10.2 ms                  | 10.1 ms: 1.01x faster       |
+----------------+--------------------------+-----------------------------+
| docutils       | 3.56 sec                 | 3.52 sec: 1.01x faster      |
+----------------+--------------------------+-----------------------------+
| html5lib       | 91.4 ms                  | 90.4 ms: 1.01x faster       |
+----------------+--------------------------+-----------------------------+
| tornado_http   | 180 ms                   | 178 ms: 1.01x faster        |
+----------------+--------------------------+-----------------------------+
| Geometric mean | (ref)                    | 1.01x faster                |
+----------------+--------------------------+-----------------------------+

Benchmarks with tag 'asyncio':
==============================

+---------------------+--------------------------+-----------------------------+
| Benchmark           | fc41_x86_python_20250131 | fc41_x86_python_opt20250131 |
+=====================+==========================+=============================+
| async_tree_none     | 462 ms                   | 459 ms: 1.01x faster        |
+---------------------+--------------------------+-----------------------------+
| async_tree_eager    | 159 ms                   | 158 ms: 1.01x faster        |
+---------------------+--------------------------+-----------------------------+
| async_tree_eager_tg | 104 ms                   | 102 ms: 1.01x faster        |
+---------------------+--------------------------+-----------------------------+
| Geometric mean      | (ref)                    | 1.01x faster                |
+---------------------+--------------------------+-----------------------------+

Benchmark hidden because not significant (13): async_tree_cpu_io_mixed, async_tree_cpu_io_mixed_tg, async_tree_eager_cpu_io_mixed, async_tree_eager_cpu_io_mixed_tg, async_tree_eager_io, async_tree_eager_io_tg, async_tree_eager_memoization, async_tree_eager_memoization_tg, async_tree_io, async_tree_io_tg, async_tree_memoization, async_tree_memoization_tg, async_tree_none_tg

Benchmarks with tag 'math':
===========================

+----------------+--------------------------+-----------------------------+
| Benchmark      | fc41_x86_python_20250131 | fc41_x86_python_opt20250131 |
+================+==========================+=============================+
| pidigits       | 249 ms                   | 249 ms: 1.00x faster        |
+----------------+--------------------------+-----------------------------+
| Geometric mean | (ref)                    | 1.00x faster                |
+----------------+--------------------------+-----------------------------+

Benchmark hidden because not significant (2): float, nbody

Benchmarks with tag 'regex':
============================

+----------------+--------------------------+-----------------------------+
| Benchmark      | fc41_x86_python_20250131 | fc41_x86_python_opt20250131 |
+================+==========================+=============================+
| regex_compile  | 197 ms                   | 194 ms: 1.01x faster        |
+----------------+--------------------------+-----------------------------+
| regex_dna      | 253 ms                   | 242 ms: 1.04x faster        |
+----------------+--------------------------+-----------------------------+
| regex_effbot   | 4.60 ms                  | 4.49 ms: 1.02x faster       |
+----------------+--------------------------+-----------------------------+
| regex_v8       | 33.6 ms                  | 31.9 ms: 1.05x faster       |
+----------------+--------------------------+-----------------------------+
| Geometric mean | (ref)                    | 1.03x faster                |
+----------------+--------------------------+-----------------------------+

Benchmarks with tag 'serialize':
================================

+----------------------+--------------------------+-----------------------------+
| Benchmark            | fc41_x86_python_20250131 | fc41_x86_python_opt20250131 |
+======================+==========================+=============================+
| json_loads           | 37.2 us                  | 36.8 us: 1.01x faster       |
+----------------------+--------------------------+-----------------------------+
| pickle               | 14.9 us                  | 15.1 us: 1.01x slower       |
+----------------------+--------------------------+-----------------------------+
| pickle_dict          | 40.9 us                  | 41.9 us: 1.02x slower       |
+----------------------+--------------------------+-----------------------------+
| pickle_pure_python   | 428 us                   | 420 us: 1.02x faster        |
+----------------------+--------------------------+-----------------------------+
| tomli_loads          | 3.02 sec                 | 2.96 sec: 1.02x faster      |
+----------------------+--------------------------+-----------------------------+
| unpickle_pure_python | 307 us                   | 305 us: 1.01x faster        |
+----------------------+--------------------------+-----------------------------+
| xml_etree_process    | 84.8 ms                  | 84.3 ms: 1.01x faster       |
+----------------------+--------------------------+-----------------------------+
| Geometric mean       | (ref)                    | 1.00x slower                |
+----------------------+--------------------------+-----------------------------+

Benchmark hidden because not significant (7): json_dumps, pickle_list, unpickle, unpickle_list, xml_etree_parse, xml_etree_iterparse, xml_etree_generate

Benchmarks with tag 'startup':
==============================

+------------------------+--------------------------+-----------------------------+
| Benchmark              | fc41_x86_python_20250131 | fc41_x86_python_opt20250131 |
+========================+==========================+=============================+
| python_startup         | 15.5 ms                  | 15.5 ms: 1.00x faster       |
+------------------------+--------------------------+-----------------------------+
| python_startup_no_site | 10.2 ms                  | 10.2 ms: 1.00x faster       |
+------------------------+--------------------------+-----------------------------+
| Geometric mean         | (ref)                    | 1.00x faster                |
+------------------------+--------------------------+-----------------------------+

Benchmarks with tag 'template':
===============================

+-----------------+--------------------------+-----------------------------+
| Benchmark       | fc41_x86_python_20250131 | fc41_x86_python_opt20250131 |
+=================+==========================+=============================+
| django_template | 53.3 ms                  | 51.9 ms: 1.03x faster       |
+-----------------+--------------------------+-----------------------------+
| genshi_text     | 34.5 ms                  | 33.7 ms: 1.02x faster       |
+-----------------+--------------------------+-----------------------------+
| genshi_xml      | 75.5 ms                  | 73.5 ms: 1.03x faster       |
+-----------------+--------------------------+-----------------------------+
| mako            | 17.0 ms                  | 16.8 ms: 1.01x faster       |
+-----------------+--------------------------+-----------------------------+
| Geometric mean  | (ref)                    | 1.02x faster                |
+-----------------+--------------------------+-----------------------------+

All benchmarks:
===============

+--------------------------+--------------------------+-----------------------------+
| Benchmark                | fc41_x86_python_20250131 | fc41_x86_python_opt20250131 |
+==========================+==========================+=============================+
| 2to3                     | 378 ms                   | 374 ms: 1.01x faster        |
+--------------------------+--------------------------+-----------------------------+
| async_tree_none          | 462 ms                   | 459 ms: 1.01x faster        |
+--------------------------+--------------------------+-----------------------------+
| async_tree_eager         | 159 ms                   | 158 ms: 1.01x faster        |
+--------------------------+--------------------------+-----------------------------+
| async_tree_eager_tg      | 104 ms                   | 102 ms: 1.01x faster        |
+--------------------------+--------------------------+-----------------------------+
| asyncio_tcp_ssl          | 1.84 sec                 | 1.83 sec: 1.01x faster      |
+--------------------------+--------------------------+-----------------------------+
| chameleon                | 10.2 ms                  | 10.1 ms: 1.01x faster       |
+--------------------------+--------------------------+-----------------------------+
| chaos                    | 84.6 ms                  | 83.6 ms: 1.01x faster       |
+--------------------------+--------------------------+-----------------------------+
| coroutines               | 33.4 ms                  | 32.8 ms: 1.02x faster       |
+--------------------------+--------------------------+-----------------------------+
| crypto_pyaes             | 105 ms                   | 103 ms: 1.02x faster        |
+--------------------------+--------------------------+-----------------------------+
| deepcopy                 | 545 us                   | 530 us: 1.03x faster        |
+--------------------------+--------------------------+-----------------------------+
| deepcopy_reduce          | 4.81 us                  | 4.68 us: 1.03x faster       |
+--------------------------+--------------------------+-----------------------------+
| deepcopy_memo            | 60.0 us                  | 59.4 us: 1.01x faster       |
+--------------------------+--------------------------+-----------------------------+
| deltablue                | 4.48 ms                  | 4.41 ms: 1.02x faster       |
+--------------------------+--------------------------+-----------------------------+
| django_template          | 53.3 ms                  | 51.9 ms: 1.03x faster       |
+--------------------------+--------------------------+-----------------------------+
| docutils                 | 3.56 sec                 | 3.52 sec: 1.01x faster      |
+--------------------------+--------------------------+-----------------------------+
| fannkuch                 | 557 ms                   | 548 ms: 1.02x faster        |
+--------------------------+--------------------------+-----------------------------+
| create_gc_cycles         | 1.57 ms                  | 1.58 ms: 1.01x slower       |
+--------------------------+--------------------------+-----------------------------+
| gc_traversal             | 4.47 ms                  | 4.47 ms: 1.00x slower       |
+--------------------------+--------------------------+-----------------------------+
| generators               | 40.5 ms                  | 40.1 ms: 1.01x faster       |
+--------------------------+--------------------------+-----------------------------+
| genshi_text              | 34.5 ms                  | 33.7 ms: 1.02x faster       |
+--------------------------+--------------------------+-----------------------------+
| genshi_xml               | 75.5 ms                  | 73.5 ms: 1.03x faster       |
+--------------------------+--------------------------+-----------------------------+
| go                       | 201 ms                   | 197 ms: 1.02x faster        |
+--------------------------+--------------------------+-----------------------------+
| html5lib                 | 91.4 ms                  | 90.4 ms: 1.01x faster       |
+--------------------------+--------------------------+-----------------------------+
| json_loads               | 37.2 us                  | 36.8 us: 1.01x faster       |
+--------------------------+--------------------------+-----------------------------+
| logging_silent           | 146 ns                   | 144 ns: 1.01x faster        |
+--------------------------+--------------------------+-----------------------------+
| logging_simple           | 9.00 us                  | 8.84 us: 1.02x faster       |
+--------------------------+--------------------------+-----------------------------+
| mako                     | 17.0 ms                  | 16.8 ms: 1.01x faster       |
+--------------------------+--------------------------+-----------------------------+
| mdp                      | 3.59 sec                 | 3.55 sec: 1.01x faster      |
+--------------------------+--------------------------+-----------------------------+
| meteor_contest           | 148 ms                   | 144 ms: 1.03x faster        |
+--------------------------+--------------------------+-----------------------------+
| nqueens                  | 122 ms                   | 121 ms: 1.01x faster        |
+--------------------------+--------------------------+-----------------------------+
| pathlib                  | 27.3 ms                  | 26.6 ms: 1.02x faster       |
+--------------------------+--------------------------+-----------------------------+
| pickle                   | 14.9 us                  | 15.1 us: 1.01x slower       |
+--------------------------+--------------------------+-----------------------------+
| pickle_dict              | 40.9 us                  | 41.9 us: 1.02x slower       |
+--------------------------+--------------------------+-----------------------------+
| pickle_pure_python       | 428 us                   | 420 us: 1.02x faster        |
+--------------------------+--------------------------+-----------------------------+
| pidigits                 | 249 ms                   | 249 ms: 1.00x faster        |
+--------------------------+--------------------------+-----------------------------+
| pprint_safe_repr         | 1.09 sec                 | 1.08 sec: 1.01x faster      |
+--------------------------+--------------------------+-----------------------------+
| pprint_pformat           | 2.21 sec                 | 2.19 sec: 1.01x faster      |
+--------------------------+--------------------------+-----------------------------+
| pyflate                  | 642 ms                   | 635 ms: 1.01x faster        |
+--------------------------+--------------------------+-----------------------------+
| python_startup           | 15.5 ms                  | 15.5 ms: 1.00x faster       |
+--------------------------+--------------------------+-----------------------------+
| python_startup_no_site   | 10.2 ms                  | 10.2 ms: 1.00x faster       |
+--------------------------+--------------------------+-----------------------------+
| raytrace                 | 371 ms                   | 369 ms: 1.01x faster        |
+--------------------------+--------------------------+-----------------------------+
| regex_compile            | 197 ms                   | 194 ms: 1.01x faster        |
+--------------------------+--------------------------+-----------------------------+
| regex_dna                | 253 ms                   | 242 ms: 1.04x faster        |
+--------------------------+--------------------------+-----------------------------+
| regex_effbot             | 4.60 ms                  | 4.49 ms: 1.02x faster       |
+--------------------------+--------------------------+-----------------------------+
| regex_v8                 | 33.6 ms                  | 31.9 ms: 1.05x faster       |
+--------------------------+--------------------------+-----------------------------+
| richards                 | 66.7 ms                  | 65.3 ms: 1.02x faster       |
+--------------------------+--------------------------+-----------------------------+
| richards_super           | 74.9 ms                  | 73.5 ms: 1.02x faster       |
+--------------------------+--------------------------+-----------------------------+
| scimark_fft              | 560 ms                   | 539 ms: 1.04x faster        |
+--------------------------+--------------------------+-----------------------------+
| scimark_monte_carlo      | 97.7 ms                  | 96.7 ms: 1.01x faster       |
+--------------------------+--------------------------+-----------------------------+
| scimark_sor              | 191 ms                   | 186 ms: 1.03x faster        |
+--------------------------+--------------------------+-----------------------------+
| spectral_norm            | 166 ms                   | 167 ms: 1.01x slower        |
+--------------------------+--------------------------+-----------------------------+
| sqlglot_normalize        | 159 ms                   | 157 ms: 1.01x faster        |
+--------------------------+--------------------------+-----------------------------+
| sqlglot_optimize         | 79.1 ms                  | 78.7 ms: 1.01x faster       |
+--------------------------+--------------------------+-----------------------------+
| sqlglot_transpile        | 2.23 ms                  | 2.21 ms: 1.01x faster       |
+--------------------------+--------------------------+-----------------------------+
| sympy_expand             | 683 ms                   | 675 ms: 1.01x faster        |
+--------------------------+--------------------------+-----------------------------+
| sympy_integrate          | 28.4 ms                  | 28.2 ms: 1.01x faster       |
+--------------------------+--------------------------+-----------------------------+
| sympy_str                | 404 ms                   | 400 ms: 1.01x faster        |
+--------------------------+--------------------------+-----------------------------+
| telco                    | 11.6 ms                  | 11.5 ms: 1.01x faster       |
+--------------------------+--------------------------+-----------------------------+
| tomli_loads              | 3.02 sec                 | 2.96 sec: 1.02x faster      |
+--------------------------+--------------------------+-----------------------------+
| tornado_http             | 180 ms                   | 178 ms: 1.01x faster        |
+--------------------------+--------------------------+-----------------------------+
| typing_runtime_protocols | 236 us                   | 239 us: 1.02x slower        |
+--------------------------+--------------------------+-----------------------------+
| unpack_sequence          | 61.8 ns                  | 68.9 ns: 1.12x slower       |
+--------------------------+--------------------------+-----------------------------+
| unpickle_pure_python     | 307 us                   | 305 us: 1.01x faster        |
+--------------------------+--------------------------+-----------------------------+
| xml_etree_process        | 84.8 ms                  | 84.3 ms: 1.01x faster       |
+--------------------------+--------------------------+-----------------------------+
| Geometric mean           | (ref)                    | 1.01x faster                |
+--------------------------+--------------------------+-----------------------------+

Benchmark hidden because not significant (38): async_generators, async_tree_cpu_io_mixed, async_tree_cpu_io_mixed_tg, async_tree_eager_cpu_io_mixed, async_tree_eager_cpu_io_mixed_tg, async_tree_eager_io, async_tree_eager_io_tg, async_tree_eager_memoization, async_tree_eager_memoization_tg, async_tree_io, async_tree_io_tg, async_tree_memoization, async_tree_memoization_tg, async_tree_none_tg, asyncio_tcp, asyncio_websockets, comprehensions, bench_mp_pool, bench_thread_pool, coverage, dask, dulwich_log, float, hexiom, json_dumps, logging_format, nbody, pickle_list, scimark_lu, scimark_sparse_mat_mult, sqlglot_parse, sqlite_synth, sympy_sum, unpickle, unpickle_list, xml_etree_parse, xml_etree_iterparse, xml_etree_generate
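
For anyone wanting to sanity-check the "Geometric mean" rows, a geometric mean of speedup factors is exp(mean(log(x))). Here is a minimal sketch with awk; geomean is a hypothetical helper name, and pyperf itself works from the unrounded timings, so feeding it the rounded factors from the table will only approximate what pyperf prints.

```shell
# Geometric mean of one positive number per line on stdin:
# exp of the arithmetic mean of the natural logs.
# Non-positive inputs are skipped, since log() is undefined for them.
geomean() {
    awk '$1 > 0 { sum += log($1); n++ }
         END { if (n > 0) printf "%.2f\n", exp(sum / n) }'
}
```

For example, the four rounded regex factors, fed in as printf '1.01\n1.04\n1.02\n1.05\n' | geomean, give 1.03. A "slower" factor would need to be entered as its reciprocal (1/1.12 for unpack_sequence) to combine it with the "faster" ones.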

-Will Cohen
