Re: Optimized program four times slower.

Tim Prince <tprince@xxxxxxxxxxxxx> · Sat, 14 Jul 2007 18:58:29 -0700

Brian Dessent wrote:
"Sebastien M." wrote:

The loop which call "empty_ptr()" is four times slower with 02.
Is there a logical reason for this ?

The only significant difference between the two seems to be that with
-O2 gcc chooses esi as the register to hold the function pointer.

-O0:

     time = clock();
  40136c:       e8 ef 02 00 00          call   401660 <_clock>
  401371:       89 45 f8                mov    %eax,0xfffffff8(%ebp)
     for(int i=0;i<numtests;i++) empty_ptr(x);
  401374:       c7 45 e4 00 00 00 00    movl   $0x0,0xffffffe4(%ebp)
  40137b:       8b 45 e4                mov    0xffffffe4(%ebp),%eax
  40137e:       3b 45 f4                cmp    0xfffffff4(%ebp),%eax
  401381:       7d 14                   jge    401397 <_main+0xaf>
  401383:       dd 45 e8                fldl   0xffffffe8(%ebp)
  401386:       dd 1c 24                fstpl  (%esp)
  401389:       8b 45 fc                mov    0xfffffffc(%ebp),%eax
  40138c:       ff d0                   call   *%eax
  40138e:       dd d8                   fstp   %st(0)
  401390:       8d 45 e4                lea    0xffffffe4(%ebp),%eax
  401393:       ff 00                   incl   (%eax)
  401395:       eb e4                   jmp    40137b <_main+0x93>
     printf("Time : empty_ptr = %ld\n", clock() - time);
  401397:       e8 c4 02 00 00          call   401660 <_clock>
  40139c:       2b 45 f8                sub    0xfffffff8(%ebp),%eax
  40139f:       89 44 24 04             mov    %eax,0x4(%esp)
  4013a3:       c7 04 24 14 30 40 00    movl   $0x403014,(%esp)
  4013aa:       e8 a1 02 00 00          call   401650 <_printf>

-O2:

     time = clock();
  40133f:       e8 ec 02 00 00          call   401630 <_clock>
  401344:       89 c7                   mov    %eax,%edi
  401346:       8d 76 00                lea    0x0(%esi),%esi
  401349:       8d bc 27 00 00 00 00    lea    0x0(%edi),%edi
     for(int i=0;i<numtests;i++) empty_ptr(x);
  401350:       c7 04 24 00 00 00 00    movl   $0x0,(%esp)
  401357:       b8 00 00 08 40          mov    $0x40080000,%eax
  40135c:       89 44 24 04             mov    %eax,0x4(%esp)
  401360:       ff d6                   call   *%esi
  401362:       dd d8                   fstp   %st(0)
  401364:       4b                      dec    %ebx
  401365:       79 e9                   jns    401350 <_main+0x60>
     printf("Time : empty_ptr = %ld\n", clock() - time);
  401367:       e8 c4 02 00 00          call   401630 <_clock>
  40136c:       c7 04 24 14 30 40 00    movl   $0x403014,(%esp)
  401373:       29 f8                   sub    %edi,%eax
  401375:       89 44 24 04             mov    %eax,0x4(%esp)
  401379:       e8 a2 02 00 00          call   401620 <_printf>

So it could just be a bad decision by the register allocator.  I don't
know what the performance difference between call *%eax and call *%esi
is.

Also note that gcc 4.3 with -O2 is smart enough to recognise that
neither of these loops do anything and remove them both entirely:

I don't have sufficient memory to recall whether the gcc version, 
hardware, or compile flags information was in the original post.  If you 
are running on a machine where the sequence dec ebx; jns is subject to a 
partial flags stall, that might be an explanation.  If you used an up to 
date gcc (evidently not), and wrote a loop which was not trivial to 
remove (evidently not), and got such code when specifying -march for a 
machine which doesn't like that sequence, this would be a bug for which 
you should file a PR.  Note that dec, rather than add -1, has been a 
standard way to write code which might be OK on AMD but dead slow on 
Intel, since 6 years ago.