function alignment for AMD Bulldozer/Piledriver

"Sven C. Dack" <sdack@xxxxxxx> · Sun, 7 Apr 2019 14:57:31 +0100

Hello,

I'm looking for a helpful explanation as to why GCC 8.3 with -O3 on 
architectures bdver1-4 and btver1&2 uses a conditional alignment for 
functions, whereas for other CPUs it doesn't.

For bdver1:

$> gcc -Q --help=optimizer -march=bdver1 -O3|grep 'falign.*='
  -falign-functions=                  11
  -falign-jumps=                      16
  -falign-labels=                     0
  -falign-loops=                      16
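As I read the GCC 8 documentation, a value of 11 means "align to the next 16-byte boundary, but only if that takes at most 10 bytes of padding". Below is a minimal shell sketch of that rule. The function name pad_bytes is made up for illustration, and the "skip at most n-1 bytes" reading is my assumption, not something I have checked against the GCC source.

```shell
# pad_bytes ADDR N
# Hypothetical helper: prints how many padding bytes would be inserted in
# front of a function starting at address ADDR under -falign-functions=N,
# assuming "align to the next power of two >= N, skip at most N-1 bytes".
pad_bytes() {
    addr=$1
    n=$2

    # Smallest power of two that is >= n (16 for n=11 and for n=16).
    boundary=1
    while [ "$boundary" -lt "$n" ]; do
        boundary=$((boundary * 2))
    done

    # Bytes needed to reach the next boundary (0 if already aligned).
    need=$(( (boundary - addr % boundary) % boundary ))

    # Pad only if the gap fits within n-1 bytes; otherwise leave as-is.
    if [ "$need" -le $((n - 1)) ]; then
        echo "$need"
    else
        echo 0
    fi
}
```

For example, "pad_bytes 4101 16" pads by 11 bytes, whereas "pad_bytes 4101 11" leaves the function unaligned because 11 bytes exceed the 10-byte limit. If this reading is right, I would expect the assembly from -march=bdver1 to carry a ".p2align 4,,10" in front of functions instead of ".p2align 4,,15".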

Others:

$> for n in core2 corei7 westmere ivybridge haswell broadwell skylake \
     opteron athlon64 athlon-fx znver1 x86-64; do \
     echo "$n: $(cc -Q --help=optimizer -march=$n -O3|grep 'falign-func.*=')"; done
core2:   -falign-functions=                  16
corei7:   -falign-functions=                  16
westmere:   -falign-functions=                  16
ivybridge:   -falign-functions=                  16
haswell:   -falign-functions=                  16
broadwell:   -falign-functions=                  16
skylake:   -falign-functions=                  16
opteron:   -falign-functions=                  16
athlon64:   -falign-functions=                  16
athlon-fx:   -falign-functions=                  16
znver1:   -falign-functions=                  16
x86-64:   -falign-functions=                  16

According to 
https://www.amd.com/system/files/TechDocs/47414_15h_sw_opt_guide.pdf#page=36 
AMD Fam. 15h Optimization Guide, 2.7 Instruction Fetch and Decode:

"... Because the fetch unit provides instructions to the decode unit in 
aligned 16-byte blocks, aligning instruction blocks to 16-byte 
boundaries is important to achieve full decode performance. ..."

And 
https://www.amd.com/system/files/TechDocs/47414_15h_sw_opt_guide.pdf#page=95, 
5.8 Code Padding with Operand-Size Override and Multibyte NOP (pages 94-95):

"... Note that NOP instructions which contain more than three prefix 
bytes degrade performance; in this case, use two NOPs to implement the 
alignment. ..."

Thus, the only explanation I can think of is that it should have been 
"-falign-loops=11" instead of "-falign-functions=11", because NOP 
instructions of up to 11 bytes don't suffer a penalty.
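To spell out the arithmetic behind that guess, here is a small shell sketch. The helper name nop_count is hypothetical, and the 11-byte limit per NOP is my own reading of the guide (an 8-byte NOP plus three prefix bytes), not a figure the guide states directly.

```shell
# nop_count PAD [MAX]
# Hypothetical helper: prints how many NOP instructions are needed to fill
# PAD bytes of padding when a single NOP may be at most MAX bytes long
# (default 11, which is an assumption).
nop_count() {
    pad=$1
    max=${2:-11}
    # Ceiling division: 0 bytes need 0 NOPs, 1..max bytes need 1, and so on.
    echo $(( (pad + max - 1) / max ))
}
```

With these assumptions any padding of up to 11 bytes fits in one NOP, while the worst case of 15 bytes for a 16-byte alignment needs two. That only matters where the padding is actually executed, as it is on the fall-through path into a loop, which is why the value would make more sense for loops than for functions.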

Judging by how these values were implemented internally in the past 
(with an array assignment), could it be a simple bug, a mix-up of numbers?

Looking at the upcoming GCC 9.0 and its new and improved alignment 
handling, which now uses a string representation, it in fact no longer 
uses a value of 16 but of 11 for aligning loops, yet it continues to 
hold on to the same old value for function alignment.

The difference it makes in practice (11 or 16 for -falign-functions) is 
very small, and I've found it often lies within the margin of error.

Can someone please explain why the architectures bdver and btver differ 
from the rest?

Sven