function alignment for AMD Bulldozer/Piledriver

Hello,

I'm looking for an explanation of why GCC 8.3 with -O3 uses a conditional alignment for functions when targeting bdver1-4 and btver1/btver2, whereas it doesn't for other CPUs.

For bdver1:

$> gcc -Q --help=optimizer -march=bdver1 -O3|grep 'falign.*='
  -falign-functions=                  11
  -falign-jumps=                      16
  -falign-labels=                     0
  -falign-loops=                      16

Others:

$> for n in core2 corei7 westmere ivybridge haswell broadwell skylake opteron athlon64 athlon-fx znver1 x86-64; do echo "$n: $(cc -Q --help=optimizer -march=$n -O3|grep 'falign-func.*=')"; done
core2:   -falign-functions=                  16
corei7:   -falign-functions=                  16
westmere:   -falign-functions=                  16
ivybridge:   -falign-functions=                  16
haswell:   -falign-functions=                  16
broadwell:   -falign-functions=                  16
skylake:   -falign-functions=                  16
opteron:   -falign-functions=                  16
athlon64:   -falign-functions=                  16
athlon-fx:   -falign-functions=                  16
znver1:   -falign-functions=                  16
x86-64:   -falign-functions=                  16

According to the AMD Family 15h Software Optimization Guide, section 2.7 "Instruction Fetch and Decode" (https://www.amd.com/system/files/TechDocs/47414_15h_sw_opt_guide.pdf#page=36):

"... Because the fetch unit provides instructions to the decode unit in aligned 16-byte blocks, aligning instruction blocks to 16-byte boundaries is important to achieve full decode performance. ..."

And section 5.8, "Code Padding with Operand-Size Override and Multibyte NOP" (pages 94-95, https://www.amd.com/system/files/TechDocs/47414_15h_sw_opt_guide.pdf#page=95):

"... Note that NOP instructions which contain more than three prefix bytes degrade performance; in this case, use two NOPs to implement the alignment. ..."

Thus, the only explanation I can think of is that it was meant to be "-falign-loops=11" rather than "-falign-functions=11", on the grounds that NOP instructions of up to 11 bytes don't suffer a penalty.

Judging by how these values were implemented internally in the past (with an array assignment), could it be a simple bug, with two numbers swapped?

The upcoming GCC 9.0, with its new alignment handling based on a string representation, does in fact no longer use a value of 16 but of 11 for aligning loops, yet it keeps the same old value for function alignment.

The difference it makes in practice (11 or 16 for -falign-functions) is very small, and I've found it often lies within the margin of error.

Can someone please explain why the bdver and btver architectures differ from the rest?

Sven




