Re: RFC: Use x86_64-v2 architecture

Mateusz Jończyk via arch-general <arch-general@xxxxxxxxxxxxxxxxxxx> · Sun, 14 Mar 2021 21:39:19 +0100

Hello,

I have run the benchmarks and here are the results:

https://openbenchmarking.org/result/2103142-HA-UARCHLEVE55

TL;DR:

- there is no benefit, or only a negligible one, from *-march=nehalem*, which
corresponds to x86_64-v2,

- there is a moderate benefit from *-march=haswell* (x86_64-v3): around
10%-20% compared to the baseline for the tests performed.

Geometric Mean Of All Test Results
Result Composite
Geometric Mean > Higher Is Better
O1_generic ....... 367.99
O3_generic ....... 459.84
O3_march_nehalem . 462.89
O3_march_haswell . 531.99
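
The relative differences between these composites can be checked with a few
lines of shell. This is only a sketch of the arithmetic; the helper name
speedup_pct is made up for illustration and is not part of the Phoronix Test
Suite. The numbers are the geometric means listed above.

```shell
# Percentage by which one composite score exceeds another (higher is better).
# speedup_pct is a hypothetical helper, not a Phoronix Test Suite command.
speedup_pct() {
    # $1 = score of the configuration under test, $2 = reference score
    LC_ALL=C awk -v new="$1" -v ref="$2" \
        'BEGIN { printf "%.1f\n", (new / ref - 1) * 100 }'
}

# Geometric means copied from the table above.
echo "O3_generic       vs O1_generic: $(speedup_pct 459.84 367.99)%"
echo "O3_march_nehalem vs O3_generic: $(speedup_pct 462.89 459.84)%"
echo "O3_march_haswell vs O3_generic: $(speedup_pct 531.99 459.84)%"
```

With these inputs, going from -O1 to -O3 is worth about 25%, nehalem adds
under 1% on top of -O3, and haswell adds roughly 15.7%.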

x86_64-v2:

There were only two tests in which march=nehalem was meaningfully faster than
march=x86-64 (the baseline architecture): "graphicsmagick/Swirl" and
"FLAC audio encoding". The FLAC results were quite noisy (click the "Result
confidence" button above the pie chart to show the data), so the benefit may not
be statistically significant. Swirl appeared to be only around 4% faster. I was
surprised, because I had expected benefits of somewhere around 5-10%.
It looks like GCC's autovectorisation does not make much use of the
instructions added in SSE3/SSSE3/SSE4.

x86_64-v3:

The geometric mean of the test results was around 15% higher with march=haswell
than with the baseline x86_64. Apart from john-the-ripper/md5, the tests were up
to 36% faster, with a median performance increase of around 10%. [1]

As described in my previous email, I have excluded tests that use dedicated code
paths for processors supporting AVX/AVX2/etc., as I saw little point in
benchmarking them. I have also excluded some tests with little difference
between the -O1 and -O3 optimization levels, as it appears that the compiler has
little work to do there. So the real-world performance benefit of compiling the
whole of Arch for x86_64-v3 would probably be smaller.

I think that many workloads of a "typical user" are I/O bound. The limiting
factor is likely to be the HDD/SSD, network throughput or latency, or memory
speed.

Limitations:

- GCC 9.3.0 was used, which is not the most recent compiler available.

Further research:

- benchmarking web browser performance, as this is what matters most for many users,
- comparing battery usage (Phoronix Test Suite has support for this when running
benchmarks). I do not think it will differ much from the performance data, though.

How to reproduce:

    export CFLAGS="-O1 -mtune=generic -march=x86-64"
    export CXXFLAGS="-O1 -mtune=generic -march=x86-64"
    phoronix-test-suite benchmark 2103142-HA-UARCHLEVE55
    export CFLAGS="-O3 -mtune=generic -march=x86-64"
    export CXXFLAGS="-O3 -mtune=generic -march=x86-64"
    phoronix-test-suite benchmark $name_of_test_identifier_specified_before
    #etc.
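
The steps above could be wrapped in a loop along the following lines. This is
a sketch under assumptions, not the exact procedure that was used: only the two
generic flag sets are given above, so the nehalem and haswell flag lines are
guesses based on the -march values named in the TL;DR, and flags_for, run_all
and DRY_RUN are illustrative names. Whether already installed tests get rebuilt
with the new flags is not handled here.

```shell
# Map a configuration name from the results to compiler flags.
flags_for() {
    case "$1" in
        O1_generic)       echo "-O1 -mtune=generic -march=x86-64" ;;
        O3_generic)       echo "-O3 -mtune=generic -march=x86-64" ;;
        # Assumed: the exact flags for these two are not given in the post.
        O3_march_nehalem) echo "-O3 -march=nehalem" ;;
        O3_march_haswell) echo "-O3 -march=haswell" ;;
        *) echo "unknown configuration: $1" >&2; return 1 ;;
    esac
}

# $1 = the name called $name_of_test_identifier_specified_before above.
# Only prints the commands unless DRY_RUN=0 is set.
run_all() {
    target="2103142-HA-UARCHLEVE55"   # first run starts from the public result
    for config in O1_generic O3_generic O3_march_nehalem O3_march_haswell; do
        CFLAGS="$(flags_for "$config")" || return 1
        CXXFLAGS="$CFLAGS"
        export CFLAGS CXXFLAGS
        echo "[$config] CFLAGS=\"$CFLAGS\" phoronix-test-suite benchmark $target"
        if [ "${DRY_RUN:-1}" = 0 ]; then
            phoronix-test-suite benchmark "$target"
        fi
        target="$1"                   # later runs reuse the name chosen earlier
    done
}
```

Calling run_all with a name of your choice prints the four invocations; setting
DRY_RUN=0 in the environment makes it actually start the benchmarks.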

Conflict of interest:

I am opposed to increasing baseline x86_64 requirements in general-purpose
distributions.

Greetings,

Mateusz

[1] Visit

https://openbenchmarking.org/result/2103142-HA-UARCHLEVE55&rmm=O1_generic%2CO3_march_nehalem

and scroll down slightly.