Greetings.

Chatting on IRC today (freenode #gcc), someone brought up the following example code:

    #include <stdio.h>

    template <typename T>
    inline T const& max (T const& a, T const& b)
    {
      // if a < b then use b else use a
      return a<b?b:a;
    }

    int main()
    {
      long long unsigned sum = 0;
      for(int x = 1; x <= 100000000; x++) {
        sum+=max(x,x+1);
      }
      printf("%llu\n", sum);
    }

They noticed that their earlier compiler (4.6.3 -O3) successfully reduced the loop, while 4.8.1 didn't. I tested it with 4.7.2 (on Fedora 18 x86-64). Without optimization, the loop was preserved in the object code:

    00000000004005ec <main>:
    ...
      400605:  8b 45 f0              mov    -0x10(%rbp),%eax
      400608:  83 c0 01              add    $0x1,%eax
      40060b:  89 45 f4              mov    %eax,-0xc(%rbp)
      40060e:  48 8d 55 f4           lea    -0xc(%rbp),%rdx
      400612:  48 8d 45 f0           lea    -0x10(%rbp),%rax
      400616:  48 89 d6              mov    %rdx,%rsi
      400619:  48 89 c7              mov    %rax,%rdi
      40061c:  e8 3d 00 00 00        callq  40065e <_Z3maxIiERKT_S2_S2_>
      400621:  8b 00                 mov    (%rax),%eax
      400623:  48 98                 cltq
      400625:  48 01 45 f8           add    %rax,-0x8(%rbp)
      400629:  8b 45 f0              mov    -0x10(%rbp),%eax
      40062c:  83 c0 01              add    $0x1,%eax
      40062f:  89 45 f0              mov    %eax,-0x10(%rbp)
      400632:  8b 45 f0              mov    -0x10(%rbp),%eax
      400635:  3d 00 e1 f5 05        cmp    $0x5f5e100,%eax
      40063a:  0f 9e c0              setle  %al
      40063d:  84 c0                 test   %al,%al
      40063f:  75 c4                 jne    400605 <main+0x19>

Using -O3, 4.7.2 successfully reduced the loop:

    0000000000400500 <main>:
      400500:  48 83 ec 08           sub    $0x8,%rsp
      400504:  48 be 80 51 d1 40 79  movabs $0x11c37940d15180,%rsi
      40050b:  c3 11 00
      40050e:  bf c0 06 40 00        mov    $0x4006c0,%edi
      400513:  31 c0                 xor    %eax,%eax
      400515:  e8 b6 ff ff ff        callq  4004d0 <printf@plt>
      40051a:  31 c0                 xor    %eax,%eax
      40051c:  48 83 c4 08           add    $0x8,%rsp
      400520:  c3                    retq
      400521:  0f 1f 00              nopl   (%rax)

Unfortunately, 4.8.1 -O3 does *not* collapse the loop. It uses the xmm registers, but it still does a loop (so far as I can tell -- I'm not very good at reading assembler):

    00000000004004f0 <main>:
    ...
      400518:  66 0f 6f cc           movdqa %xmm4,%xmm1
      40051c:  66 0f 6f d5           movdqa %xmm5,%xmm2
      400520:  83 c0 01              add    $0x1,%eax
      400523:  66 0f 6f e1           movdqa %xmm1,%xmm4
      400527:  66 0f fe ce           paddd  %xmm6,%xmm1
      40052b:  3d 40 78 7d 01        cmp    $0x17d7840,%eax
      400530:  66 0f 66 d1           pcmpgtd %xmm1,%xmm2
      400534:  66 0f 6f d9           movdqa %xmm1,%xmm3
      400538:  66 0f fe e7           paddd  %xmm7,%xmm4
      40053c:  66 0f 62 da           punpckldq %xmm2,%xmm3
      400540:  66 0f 6a ca           punpckhdq %xmm2,%xmm1
      400544:  66 0f d4 c3           paddq  %xmm3,%xmm0
      400548:  66 0f d4 c1           paddq  %xmm1,%xmm0
      40054c:  75 ca                 jne    400518 <main+0x28>
    ...

While this is noticeably faster than the unoptimized (-O0) 4.7.2 loop, it's still an order of magnitude slower than the fully-collapsed loop given by -O3 on 4.7.2 and 4.6.3 (times are in nanoseconds) [1]:

    maxtest-4.7.2-O0: 332,635,216
    maxtest-4.7.2-O3:   1,754,238
    maxtest-4.8.1-O3:  28,128,557

Is this expected? Is it a side-effect of some of the other loop work that landed in 4.8? Are there other options that can restore the pre-4.8 behavior in this case?

Thanks,
Tony

[1] Quick and dirty script for generating those numbers:

    # Average wall-clock time per run (in nanoseconds) over 10 runs of each binary
    for i in maxtest-* ; do
      t0=`date +'%s%N'`
      for j in {0..9} ; do ./$i > /dev/null ; done
      t1=`date +'%s%N'`
      echo -n "$i: "
      echo "($t1-$t0)/10" | bc
    done