loop optimized in 4.7.2, not in 4.8.1?

Anthony Foiani <tkil@xxxxxxxxx> · Sat, 10 Aug 2013 18:06:26 -0600

Greetings.

While I was chatting on IRC today (freenode #gcc), someone brought up
the following example code:

  #include <stdio.h>

  template <typename T>
  inline T const& max (T const& a, T const& b)
  {
      // if a < b then use b else use a
      return a<b?b:a;
  }

  int main()
  {
      long long unsigned sum = 0;
      for(int x = 1; x <= 100000000; x++)
      {
          sum+=max(x,x+1);
      }
      printf("%llu\n", sum);
  }

They noticed that their earlier compiler (4.6.3 -O3) successfully
reduced the loop, while 4.8.1 didn't.

I tested it with 4.7.2 (on Fedora 18 x86-64).  Without optimization,
the loop was preserved in the object code:

  00000000004005ec <main>:
    ...
    400605:        8b 45 f0               mov    -0x10(%rbp),%eax
    400608:        83 c0 01               add    $0x1,%eax
    40060b:        89 45 f4               mov    %eax,-0xc(%rbp)
    40060e:        48 8d 55 f4            lea    -0xc(%rbp),%rdx
    400612:        48 8d 45 f0            lea    -0x10(%rbp),%rax
    400616:        48 89 d6               mov    %rdx,%rsi
    400619:        48 89 c7               mov    %rax,%rdi
    40061c:        e8 3d 00 00 00         callq  40065e <_Z3maxIiERKT_S2_S2_>
    400621:        8b 00                  mov    (%rax),%eax
    400623:        48 98                  cltq   
    400625:        48 01 45 f8            add    %rax,-0x8(%rbp)
    400629:        8b 45 f0               mov    -0x10(%rbp),%eax
    40062c:        83 c0 01               add    $0x1,%eax
    40062f:        89 45 f0               mov    %eax,-0x10(%rbp)
    400632:        8b 45 f0               mov    -0x10(%rbp),%eax
    400635:        3d 00 e1 f5 05         cmp    $0x5f5e100,%eax
    40063a:        0f 9e c0               setle  %al
    40063d:        84 c0                  test   %al,%al
    40063f:        75 c4                  jne    400605 <main+0x19>

Using -O3, 4.7.2 successfully reduced the loop:

  0000000000400500 <main>:
    400500:        48 83 ec 08            sub    $0x8,%rsp
    400504:        48 be 80 51 d1 40 79   movabs $0x11c37940d15180,%rsi
    40050b:        c3 11 00 
    40050e:        bf c0 06 40 00         mov    $0x4006c0,%edi
    400513:        31 c0                  xor    %eax,%eax
    400515:        e8 b6 ff ff ff         callq  4004d0 <printf@plt>
    40051a:        31 c0                  xor    %eax,%eax
    40051c:        48 83 c4 08            add    $0x8,%rsp
    400520:        c3                     retq   
    400521:        0f 1f 00               nopl   (%rax)

Unfortunately, 4.8.1 -O3 does *not* collapse the loop.  It vectorizes
the computation using the xmm registers, but it still runs a loop (so
far as I can tell -- I'm not very good at reading assembler):

  00000000004004f0 <main>:
    ...
    400518:        66 0f 6f cc            movdqa %xmm4,%xmm1
    40051c:        66 0f 6f d5            movdqa %xmm5,%xmm2
    400520:        83 c0 01               add    $0x1,%eax
    400523:        66 0f 6f e1            movdqa %xmm1,%xmm4
    400527:        66 0f fe ce            paddd  %xmm6,%xmm1
    40052b:        3d 40 78 7d 01         cmp    $0x17d7840,%eax
    400530:        66 0f 66 d1            pcmpgtd %xmm1,%xmm2
    400534:        66 0f 6f d9            movdqa %xmm1,%xmm3
    400538:        66 0f fe e7            paddd  %xmm7,%xmm4
    40053c:        66 0f 62 da            punpckldq %xmm2,%xmm3
    400540:        66 0f 6a ca            punpckhdq %xmm2,%xmm1
    400544:        66 0f d4 c3            paddq  %xmm3,%xmm0
    400548:        66 0f d4 c1            paddq  %xmm1,%xmm0
    40054c:        75 ca                  jne    400518 <main+0x28>
    ...

While this is noticeably faster than the unoptimized 4.7.2 loop, it's
still an order of magnitude slower than the fully-collapsed loop given
by -O3 on 4.7.2 and 4.6.3 (times are wall-clock nanoseconds per run,
averaged over 10 runs and including process startup) [1]:

  maxtest-4.7.2-O0:  332,635,216
  maxtest-4.7.2-O3:    1,754,238
  maxtest-4.8.1-O3:   28,128,557

Is this expected?  Is it a side-effect of some of the other loop work
that landed in 4.8?  Are there other options that can restore the
pre-4.8 behavior in this case?

Thanks,
Tony

[1] Quick and dirty script for generating those numbers:

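  # Requires bash (brace expansion) and GNU date (%N = nanoseconds).
  for i in maxtest-* ; do
    t0=$(date +'%s%N')
    for j in {0..9} ; do
      "./$i" > /dev/null
    done
    t1=$(date +'%s%N')
    # Average nanoseconds per run over the 10 runs.
    printf '%s: ' "$i"
    echo "($t1-$t0)/10" | bc
  done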