A Case Where CMOVcc Is Slower Than Jcc

"Lingchuan (LC) Meng" <lingchuanmeng@xxxxxxxxx> · Tue, 21 Jul 2009 16:42:46 -0400

Hi all,

I'm not sure whether this topic fits the gcc-help mailing list. Please
remove it if it violates any rules.

I managed to turn off "-fif-conversion" under -O2. To my surprise, I
found that the code with Jcc is slightly faster (over a certain large
range) than the code with CMOVcc. I'm trying to explain why the Jcc
version is faster in my case.

My C code is auto-generated and is full of sections like the following,
typically inside 'for' loops:

"
        s30480 = X[i20863];
        a59062 = ( 512 + i20863 >= 12289 ) ? ( 512 + i20863 - 12289 )
                                           : ( 512 + i20863 );
        s30481 = X[a59062];
        t55216 = ( s30480 + s30481 >= 12289 ) ? ( s30480 + s30481 - 12289 )
                                              : ( s30480 + s30481 );
        a59063 = ( 12289 - s30481 );
        t55217 = ( s30480 + a59063 >= 12289 ) ? ( s30480 + a59063 - 12289 )
                                              : ( s30480 + a59063 );
        a59064 = ( 256 + i20863 >= 12289 ) ? ( 256 + i20863 - 12289 )
                                           : ( 256 + i20863 );
"
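
For illustration, the pattern above boils down to a conditional
subtract, i.e. addition and subtraction modulo 12289. Below is a minimal
sketch. The names Q, mod_add and mod_sub are made up here and do not
appear in the generated code, and the sketch assumes both operands are
already in the range [0, 12289):

```c
/* Modulus taken from the excerpt above; the name Q is hypothetical. */
enum { Q = 12289 };

/* (a + b) mod Q via one conditional subtract.
 * Assumes 0 <= a < Q and 0 <= b <= Q, so a + b < 2*Q. */
int mod_add(int a, int b)
{
    int s = a + b;
    return (s >= Q) ? (s - Q) : s;
}

/* (a - b) mod Q, written the way the excerpt does it:
 * add (Q - b) and reuse the same conditional subtract. */
int mod_sub(int a, int b)
{
    return mod_add(a, Q - b);
}
```

Each ternary in the excerpt is one instance of mod_add, with the index
computations (512 + i, 256 + i) following the same shape.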

I think the ternary operations are mapped to Jcc's with very short
branches. I also looked into AMD's tech doc "Software Optimization
Guide for AMD Family 10h Processors" (I'm using a Phenom II X4).
I noticed that the CPU fetches instructions 16 bytes at a time. So my
guess is that most of these short Jcc branches fall within a single
fetch, and that GCC's "-fif-conversion2" optimization applies
"Conditional Execution" wherever possible.

Thus, even though a pipeline flush is very expensive, most of the time
the Jcc branches don't trigger one, because conditional execution has
already converted those branches into branchless code.
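
To make the two variants concrete, here is a minimal sketch of the same
conditional subtract written once with an explicit branch and once
without one. The function names are hypothetical. Which instructions
GCC actually emits for either form (Jcc, CMOVcc, or something else)
depends on the compiler version, target and flags, so the assembly from
-S is the only reliable check:

```c
#include <stdint.h>

#define Q 12289  /* modulus from the generated code */

/* Source-level branch: a candidate for a short Jcc, or for
 * if-conversion into CMOVcc when the compiler decides to do so. */
int32_t mod_add_branch(int32_t a, int32_t b)
{
    int32_t s = a + b;
    if (s >= Q)
        s -= Q;
    return s;
}

/* Branch-free at the source level: build an all-ones mask when the
 * subtraction went negative and use it to add Q back.
 * Assumes 0 <= a, b < Q, so nothing overflows. */
int32_t mod_add_mask(int32_t a, int32_t b)
{
    int32_t d = a + b - Q;            /* negative iff a + b < Q */
    int32_t mask = -(int32_t)(d < 0); /* -1 (all ones) or 0 */
    return d + (mask & Q);
}
```

Both functions return the same value for in-range inputs, so timing
them against each other separates the cost of the branch from the cost
of the arithmetic.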

I also observed in the assembly that, for some of the larger code, a
few logically adjacent labels are not placed next to each other. Past a
certain threshold this narrows the gap between Jcc and CMOVcc, because
pipeline flushes become more frequent.

Another reason for the shrinking gap could be that the data outgrows
the cache, so that memory access comes to dominate performance.

Okay, sorry for the long email and the potentially off-topic subject.
Please advise if you have any thoughts on this problem.

Thank you.

LC

-- 
Best regards,

Lingchuan Meng