Re: Applying code layout optimization to postgresql16 RPMs in Fedora 41 gave a 3%-6% improvement in IPC

Charalampos Stratakis <cstratak@xxxxxxxxxx> · Tue, 4 Feb 2025 03:21:07 +0100

On Sat, Feb 1, 2025 at 1:49 PM Miro Hrončok <mhroncok@xxxxxxxxxx> wrote:
On 01. 02. 25 2:10, William Cohen wrote:

> On 12/5/24 9:45 AM, Miro Hrončok wrote:

>> On 04. 12. 24 20:32, William Cohen wrote:

>>> On 11/21/24 17:32, Miro Hrončok wrote:

>>>> On 21. 11. 24 23:11, William Cohen wrote:

>>>>> Sediment has been designed to work with the RPM build process.

>>>>> Currently, one needs to use modified RPM macros.  These can be created

>>>>> quickly by writing the output of the sediment make_sediment_rpmmacros

>>>>> command into ~/.rpmmacros.  One will also need to set the pgo

>>>>> macro to 1 for the rpmbuild process.  The rpm spec file has minimal

>>>>> modifications.  It has the callgraph files stored as a source file and

>>>>> defines the global call_graph to the source file that holds the call

>>>>> graph.

>>>>

>>>> Hey Will,

>>>>

>>>> let's say I want to try this for Python. Where do I start? The README on https://github.com/wcohen/sediment is not very helpful.

>>>>

>>>> This is what I did based on your email:

>>>>

>>>> $ sudo dnf --enable-repo=updates-testing install sediment

>>>> ...

>>>> Installing sediment-0:0.9.3-1.fc41.noarch

>>>>

>>>> I run make_sediment_rpmmacros, it gives me some macros. Now I am supposed to put those into ~/.rpmmacros. Except I never build Python locally, I use Koji or mock. I can probably amend this to use %global and insert it into python3.14.spec. But what else do I need to do? Do you have a step-by-step kind of document I can follow?

>>>>

>>>

>>>

>>> Hi Miro,

>>>

>>> The tooling doesn't yet fit your work flow of building packages in

>>> koji and mock.  I am looking into ways of addressing that issue.

>>>

>>> In an earlier email I mentioned the important thing was to have good

>>> profiling data.  Do you have suggestions on some benchmarks that would

>>> properly exercise the python interpreter?  I have used pyperformance

>>> (https://github.com/python/pyperformance) to get some call graph data

>>> for python and added that to a python3.13 srpm available at

>>> https://koji.fedoraproject.org/koji/taskinfo?taskID=126526066.  Note

>>> Koji is NOT building with code layout optimization.  One would still need

>>> to build locally python3.13-3.13.0-1.fc41.src.rpm with sediment-0.9.4

>>> (https://koji.fedoraproject.org/koji/buildinfo?buildID=2596791)

>>> installed and ~/.rpmmacros set up, following these steps:

>>>

>>>      make_sediment_rpmmacros > ~/.rpmmacros

>>>      rpm -Uvh python3.13-3.13.0-1.fc41.src.rpm

>>>      cd ~/rpmbuild/SPECS

>>>      rpmbuild -ba --define "pgo 1" python3.13.spec

>>>

>>> The notable difference in the python3.13.spec file is the addition of:

>>>

>>> # Call graph information

>>> SOURCE12: perf_pybenchmark.gv

>>> %global call_graph %{SOURCE12}

>>>

>>> The perf_pybenchmark.gv was generated with steps:

>>>

>>>      python3 -m pip install pyperformance

>>>      perf record -e branches:u -j any_call -o perf_pybenchmark.data pyperformance run -f -o fc41_x86_python_baseline.json

>>>      perf report -i perf_pybenchmark.data --no-demangle --sort=comm,dso_from,symbol_from,dso_to,symbol_to > perf_pybenchmark.out

>>>      perf2gv < perf_pybenchmark.out > perf_pybenchmark.gv

>>>

>>> Added the file to the python srpm:

>>>

>>>      cp  perf_pybenchmark.gv ~/rpmbuild/SOURCES/.

>>>      # edit ~/rpmbuild/SPECS/python3.13.spec to add call graph info

>>> The improvements were mixed between the code layout optimized python

>>> and the baseline version of the pyperformance benchmarks.  This can be

>>> seen in the attached python_pgo.out generated by:

>>>

>>>      python3 -m pyperf compare_to fc41_x86_python_baseline.json fc41_x86_python_pgo.json --table > python_pgo.out

>>>

>>> It looks like a number of the benchmarks are microbenchmarks that are

>>> unlikely to benefit much from the code layout optimizations.

>>>

>>> Are there other python performance tests that you would suggest that

>>> have a larger footprint and would better gauge the possible

>>> performance improvement from the code layout optimization?

>>>

>>> Are there better python code examples to collect profiling data on?

>> Hey Will,

>>

>> thanks for looking into this.

>>

>> For your question: Upstream is using this for PGO:

>>

>>    $ python3.14 -m test --pgo

>>

>> Or:

>>

>>    $ python3.14 -m test --pgo-extended

>>

>> In spec, this can be used:

>>

>>    LD_LIBRARY_PATH=./build/optimized ./build/optimized/python -m test ...

>>

>> ---

>>

>> What is the blocker to run this in Koji/mock?

>>

>> You do `make_sediment_rpmmacros > ~/.rpmmacros`.

>>

>> What's the issue with %defining such macros at spec level?

>>

> 

> Hi,

> 

> I was able to do some experiments with the koji/mock buildable python3.13-3.13.0-1.fc41_opt.src.rpm (https://koji.fedoraproject.org/koji/taskinfo?taskID=128437060) and get better measurements of the performance impact with vstinner's suggestions for profiling python. On a Lenovo P51 laptop running Fedora 41 I built two versions of the rpms. Training data was collected from a pyperformance run and analyzed using the sediment tools with:

> 

>     python3 -m pip install pyperformance

>     perf record -e branches:u -j any_call -o perf_pybenchmark.data pyperformance run -f -o fc41_x86_python_baseline.json

>     perf report -i perf_pybenchmark.data --no-demangle --sort=comm,dso_from,symbol_from,dso_to,symbol_to > perf_pybenchmark.out

>     perf2gv < perf_pybenchmark.out > perf_pybenchmark.gv

> 

> Installed the srpm, went into the SPECS directory, and built the code layout optimized RPMs (these have an added _opt in the names) with:

> 

>     rpm -Uvh python3.13-3.13.0-1.fc41_opt.src.rpm

>     cd ~/rpmbuild/SPECS

>     rpmbuild -ba python3.13.spec

> 

> Built RPMs without the code-layout optimization (no _opt in the RPM names):

> 

>    rpmbuild --without opt -ba python3.13.spec

> 

> Installed the code-layout RPMs, set up the environment for benchmarking, and ran the tests:

> 

>    sudo dnf install ~/rpmbuild/RPMS/x86_64/python*fc41_opt* ~/rpmbuild/RPMS/noarch/python-unversioned-command-3.13.0-1.fc41_opt.noarch.rpm

>    sudo python3 -m pyperf system tune

>    pyperformance run -f -o fc41_x86_python_opt20250131.json >& fc41_pybench_opt_20250131.log

> 

> Then collected data for the non-optimized version of the rpms:

> 

>    sudo dnf install ~/rpmbuild/RPMS/x86_64/python*fc41.* ~/rpmbuild/RPMS/noarch/python-unversioned-command-3.13.0-1.fc41.noarch.rpm

>    sudo python3 -m pyperf system tune

>    pyperformance run -f -o fc41_x86_python_20250131.json >& fc41_pybench_20250131.log

> 

> Once done, compared the data between the runs with:

> 

>   python3 -m pyperf compare_to fc41_x86_python_20250131.json  fc41_x86_python_opt20250131.json --table > python_opt.out

> 

> Below is the comparison between the two versions (python_opt.out). For the vast majority of the benchmarks the optimized code is slightly faster, typically by about 1%.  The regex_* benchmarks appeared to benefit the most, with regex_dna being 1.04x faster. Several benchmarks are slightly slower: pickle, pickle_dict, create_gc_cycles, spectral_norm, and typing_runtime_protocols.  The unpack_sequence benchmark was the worst, being 1.12x slower for the optimized code.  The improvements are not as noticeable as what was seen with postgresql.  I suspect this is because pyperformance consists largely of microbenchmarks and does not put as much pressure on the iTLB as the large postgresql binary.
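
For reading the "N.NNx faster/slower" figures quoted above, here is a minimal sketch of the arithmetic. It assumes the label is the ratio of mean run times between the two builds; the function names and the timings in the usage note are made up for illustration and are not taken from python_opt.out.

```python
def describe(baseline_s: float, optimized_s: float) -> str:
    """Label a pair of mean timings as 'N.NNx faster' or 'N.NNx slower'.

    Assumption: the label is the ratio of the two mean run times,
    expressed so that the number is always >= 1.
    """
    ratio = optimized_s / baseline_s
    if ratio == 1.0:
        return "no change"
    if ratio > 1.0:
        return f"{ratio:.2f}x slower"
    return f"{1.0 / ratio:.2f}x faster"


def time_change_percent(baseline_s: float, optimized_s: float) -> float:
    """Relative change in run time in percent (negative means faster)."""
    return (optimized_s / baseline_s - 1.0) * 100.0
```

With hypothetical timings, describe(1.00, 1.12) gives "1.12x slower", i.e. 12% more run time, while describe(1.04, 1.00) gives "1.04x faster", which is roughly a 3.8% reduction in run time rather than 4%.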

Thank you, Will!

I've CC'ed Charalampos, who is now looking into Python performance in Fedora+EL.

-- 

Miro Hrončok

-- 

Phone: +420777974800

Fedora Matrix: mhroncok

That's interesting, actually, although the main issue would be gathering representative perf data; Python has supported perf profiling since 3.12. I suspect the tests run for PGO would make for an interesting case here, but it will require some experimentation.

William, do I understand correctly that sediment uses the profile data to reorder the functions with "--section-ordering-file"? If so, could it be used in conjunction with AutoFDO (i.e., with the same profiling data)? Also, have you encountered any conflicts or issues with LTO and/or PLO?
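
To make the question concrete, the general idea of call-graph-driven function ordering can be sketched as below. This is NOT sediment's algorithm (I have not checked what it does internally); it is a simplified greedy heuristic in the spirit of classic hot-edge clustering, and the function names and edge weights are hypothetical.

```python
def layout_order(edges):
    """Return a function order that keeps hot caller/callee pairs adjacent.

    edges maps (caller, callee) -> call count. Illustrative greedy
    clustering only; not taken from sediment.
    """
    # Every function starts in its own single-element cluster.
    cluster_of = {}
    for caller, callee in edges:
        for func in (caller, callee):
            cluster_of.setdefault(func, [func])

    # Visit edges from hottest to coldest (names break ties deterministically)
    # and append the callee's cluster to the caller's cluster.
    for (caller, callee), _weight in sorted(
        edges.items(), key=lambda item: (-item[1], item[0])
    ):
        head, tail = cluster_of[caller], cluster_of[callee]
        if head is tail:
            continue  # already placed together
        head.extend(tail)
        for func in tail:
            cluster_of[func] = head

    # Emit each surviving cluster once, in first-seen order.
    order, seen = [], set()
    for func in cluster_of:
        cluster = cluster_of[func]
        if id(cluster) not in seen:
            seen.add(id(cluster))
            order.extend(cluster)
    return order


# Hypothetical call counts, for illustration only.
example_edges = {
    ("main", "log"): 1,
    ("main", "parse"): 10,
    ("parse", "lex"): 90,
    ("eval", "lookup"): 50,
}
example_order = layout_order(example_edges)
```

In this example the hot parse/lex pair ends up adjacent and the rarely called log is pushed behind them. An order like this is the kind of input a linker-level ordering mechanism would then consume.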

I'd have a look down the line when I get some free cycles.

Side note: You mention the GCC python plugin in your docs; however, it has not seen any active development for a long time, so the data extracted from it can be inconclusive.
-- 
Regards,

Charalampos Stratakis
Senior Software Engineer
Python Maintenance Team, Red Hat
