Hello,
I'm looking for a helpful explanation as to why GCC 8.3 with -O3 on
architectures bdver1-4 and btver1&2 uses a conditional alignment for
functions, whereas for other CPUs it doesn't.
For bdver1:
$> gcc -Q --help=optimizer -march=bdver1 -O3|grep 'falign.*='
-falign-functions= 11
-falign-jumps= 16
-falign-labels= 0
-falign-loops= 16
Others:
$> for n in core2 corei7 westmere ivybridge haswell broadwell skylake \
     opteron athlon64 athlon-fx znver1 x86-64; do
     echo "$n: $(cc -Q --help=optimizer -march=$n -O3 | grep 'falign-func.*=')"
   done
core2: -falign-functions= 16
corei7: -falign-functions= 16
westmere: -falign-functions= 16
ivybridge: -falign-functions= 16
haswell: -falign-functions= 16
broadwell: -falign-functions= 16
skylake: -falign-functions= 16
opteron: -falign-functions= 16
athlon64: -falign-functions= 16
athlon-fx: -falign-functions= 16
znver1: -falign-functions= 16
x86-64: -falign-functions= 16
According to
https://www.amd.com/system/files/TechDocs/47414_15h_sw_opt_guide.pdf#page=36
AMD Fam. 15h Optimization Guide, 2.7 Instruction Fetch and Decode:
"... Because the fetch unit provides instructions to the decode unit in
aligned 16-byte blocks, aligning instruction blocks to 16-byte
boundaries is important to achieve full decode performance. ..."
And
https://www.amd.com/system/files/TechDocs/47414_15h_sw_opt_guide.pdf#page=95,
5.8 Code Padding with Operand-Size Override and Multibyte NOP (page 94-95):
"... Note that NOP instructions which contain more than three prefix
bytes degrade performance; in this case, use two NOPs to implement the
alignment. ..."
Thus, the only explanation I can think of is that the value was meant to
be "-falign-loops=11" rather than "-falign-functions=11", chosen because
NOP instructions of up to 11 bytes don't suffer a penalty.
Judging by how these values were implemented internally in the past
(with an array assignment), could it be a simple bug, two numbers swapped?
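To make the "conditional alignment" concrete, here is a small sketch of
the arithmetic as I understand it from the GCC documentation. It assumes
that -falign-functions=11 is emitted as ".p2align 4,,10" (align to 16
bytes, but only if that takes at most 10 padding bytes) and that
-falign-functions=16 pads unconditionally (at most 15 bytes). The
padding helper is hypothetical and only models the assembler's decision;
it is not taken from GCC or binutils.

```shell
#!/bin/sh
# padding OFFSET ALIGN MAX_SKIP
# Prints how many padding bytes would be inserted before a function that
# would otherwise start at OFFSET, for an alignment request of ALIGN
# bytes that may skip at most MAX_SKIP bytes.
padding() {
    offset=$1
    align=$2
    max_skip=$3
    # Bytes needed to reach the next ALIGN boundary (0 if already aligned).
    need=$(( (align - offset % align) % align ))
    # Pad only if the gap does not exceed MAX_SKIP; otherwise do nothing.
    if [ "$need" -le "$max_skip" ]; then
        echo "$need"
    else
        echo 0
    fi
}

# Assumed mapping: -falign-functions=11 -> align 16, max skip 10
#                  -falign-functions=16 -> align 16, max skip 15
for off in 0 1 5 6 15; do
    echo "offset $off: =11 pads $(padding "$off" 16 10), =16 pads $(padding "$off" 16 15)"
done
```

Under these assumptions the two settings differ only when a function
would start 1 to 5 bytes past a 16-byte boundary: there, a value of 11
leaves it unaligned, while a value of 16 inserts 11 to 15 bytes of
padding.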
Looking at the upcoming GCC 9.0 and its new and improved alignment
handling, which now uses a string representation, it does in fact no
longer use a value of 16 for aligning loops but 11, while it continues
to hold on to the same old value for function alignment.
The difference it makes in practice (11 or 16 for -falign-functions) is
very small, and I've found it often lies within the margin of error.
Can someone please provide an explanation for why the architectures
bdver and btver differ from the rest?
Sven