inline bug (?)

Jorge PEREZ <jorge.perez@xxxxxxxx> · Wed, 17 Nov 2010 17:47:09 +0100

Hello,

When compiling for SPARC with GCC 4.5.1 and the option -fwhole-program,
the inlining decision sometimes seems to be wrong, since the resulting
code is bigger (I assume that the objective of inlining is to reduce
the size of the code and the number of executed instructions).

The following example should help clarify the point:

    short f1(char a, short b){
      int i;
      short c=0;
      for (i=0; i<b; i++){
        c+=a;
      }
      return a+c;
    }

    int main(){
      volatile short b=1;
      volatile char a=1, c=1;
      b=f1(a,b);
      b=f1(c,b);
      b=f1(a,b);
      b=f1(c,b);
      b=f1(a,b);          //NC=5
    /*  b=f1(c,b);   */   //NC=6
      return 0;
    }

The reasoning is described below (I apologize in advance if it is not
rigorous or precise enough):

* function f1 takes 10 assembler instructions when it is not inlined
* compiling with -fwhole-program allows GCC to inline the function f1

------- Number of calls
* it is NOT advantageous to inline if there are more than 2 calls to f1
* f1 is inlined when there are 5 or fewer calls to f1
* f1 is NOT inlined when there are 6 or more calls to f1

------- Number of function parameters
* function f1 has two input parameters
* when f1 is modified to have more input parameters, it is inlined
more readily and the code size is even bigger
* when f1 is modified to have fewer input parameters, it is inlined
less readily and the code size is acceptable

------- Hypothesis about the inline:
NC=Number of Calls=5 or 6 for this test case
      b=f1(a,b);
      b=f1(c,b);
      b=f1(a,b);
      b=f1(c,b);
      b=f1(a,b);          //NC=5
    /*  b=f1(c,b);   */   //Uncomment for NC=6

CC=Cost of each call=1(call itself) +1 (delay slot)=2

CP=Cost of each parameter (i.e. instructions required for preparing the
inputs, a bit re-arranged)=ldub+sll+sra=3
    40000018:    c6 17 bf fc     lduh  [ %fp + -4 ], %g3
    40000020:    da 0f bf fe     ldub  [ %fp + -2 ], %o5
    40000024:    87 28 e0 18     sll  %g3, 0x18, %g3
    40000028:    9b 2b 60 18     sll  %o5, 0x18, %o5
    4000002c:    89 38 e0 18     sra  %g3, 0x18, %g4
    40000034:    9b 3b 60 18     sra  %o5, 0x18, %o5

FT=Total size of the function=10 (not inlined)
    40000000 <f1>:
    40000000:    82 10 20 00     clr  %g1
    40000004:    84 10 20 00     clr  %g2
    40000008:    10 80 00 03     b  40000014 <f1+0x14>
    4000000c:    86 10 00 08     mov  %o0, %g3
    40000010:    84 00 a0 01     inc  %g2
    40000014:    80 a0 80 09     cmp  %g2, %o1
    40000018:    26 bf ff fe     bl,a   40000010 <f1+0x10>
    4000001c:    82 00 c0 01     add  %g3, %g1, %g1
    40000020:    81 c3 e0 08     retl
    40000024:    90 02 00 01     add  %o0, %g1, %o0

FC=Size of the function's core=10 (when inlined)
    4000001c:    82 10 20 00     clr  %g1
    40000014:    84 10 20 00     clr  %g2
    40000030:    10 80 00 03     b  4000003c <main+0x3c>
    40000038:    84 00 a0 01     inc  %g2
    4000003c:    80 a0 80 0d     cmp  %g2, %o5
    40000040:    26 bf ff fe     bl,a   40000038 <main+0x38>
    40000044:    82 01 00 01     add  %g4, %g1, %g1
    40000048:    87 38 e0 18     sra  %g3, 0x18, %g3
    4000004c:    82 00 40 03     add  %g1, %g3, %g1
    40000050:    84 10 20 00     clr  %g2

NP=Number of function parameters=2
    short f1(char a, short b){

$$$$$$$ (Apparently) Actual Inlining condition $$$$$$$
NC*(CC+NP*CP) + FT>=NC*FC

so basically it should do the inline only if the resulting code size is
no larger than the code size without inlining.

It is true when NC=5 and false when NC=6:
    CC+NP*CP=2+2*3=8

    Test NC=5
        5*8+10 >= 5*10 (Condition is true, therefore DO INLINE)

    Test NC=6
        6*8+10 >= 6*10 (Condition is false, therefore DO NOT INLINE)

However, it should be false when NC=5, since the code with f1 inlined is
a lot bigger than when inlining is not done (using -fno-inline).

The whole point seems to be the cost of the function parameters CP
(possibly the casts done in the caller), which is taken into account
in the size estimation only before inlining. It seems that the
compiler believes that after inlining such casts are not going to be
done, and therefore does not take them into account for the decision.
However, after compilation, the parameter set-up is included anyway
along with the inlined code.

$$$$$$$ Correct Inlining condition $$$$$$$
NC*(CC+NP*CP) + FT>=NC*(FC+CP)

If this condition is applied for the inline decision:
    CC+NP*CP=8
    FC+CP=10+3=13

    Test NC=5
        5*8+10 >= 5*13 (Condition is false, therefore DO NOT INLINE)

    Test NC=6
        6*8+10 >= 6*13 (Condition is false, therefore DO NOT INLINE)

Using the latter condition, it is easy to see that inlining is not
worthwhile in either case (it would only be worthwhile for 1 or 2 calls
to f1). Surely the actual method used for taking such a decision is much
more evolved and complicated than this simple hypothesis, but I strongly
believe something similar might be happening in whatever model GCC uses.

I appreciate any feedback or suggestions you have about this. Maybe I'm
doing it all wrong from the beginning, but the fact that inlining
increases the size of the code seemed weird to me.

Thanks in advance,

Jorge