Re: Generated ASM of a typical clamp

Marc Glisse <marc.glisse@xxxxxxxx> · Mon, 10 Nov 2014 15:23:16 +0100 (CET)

On Mon, 10 Nov 2014, NightStrike wrote:

On Mon, Oct 20, 2014 at 7:18 PM, NightStrike <nightstrike@xxxxxxxxx> wrote:
I have been studying the asm generated by a typical clamping function,
and I am confused about the results.  This is done on an Opteron 6k
series compiled with -fverbose-asm, -O3 and -march=native.

float clamp(float const x, float const min, float const max) {
#if defined (BRANCH)
  if ( x > max )
    return max;
  else if ( x < min )
    return min;
  else
    return x;
#elif defined (BRANCH2)
  return x > max ? max : ( x < min ? min : x );
#elif defined (CALL)
  return __builtin_fminf(__builtin_fmaxf(x, min), max);
#else
  float const t = x < min ? min : x;
  return t> max ? max : t;
#endif
}

-DBRANCH / -DBRANCH2:
The first two approaches are obviously identical, and produce:

clamp:
.LFB0:
        .cfi_startproc
        vucomiss        %xmm2, %xmm0    # max, x
        ja      .L3     #,
        vmaxss  %xmm0, %xmm1, %xmm0     # x, min, D.2214
        ret
        .p2align 4,,7
        .p2align 3
.L3:
        vmovaps %xmm2, %xmm0    # max, D.2214
        ret
        .cfi_endproc

-DCALL:
This one I figured would be great, given the use of builtins:

clamp:
.LFB0:
        .cfi_startproc
        subq    $24, %rsp       #,
        .cfi_def_cfa_offset 32
        vmovss  %xmm2, 12(%rsp) # max, %sfp
        call    fmaxf   #
        vmovss  12(%rsp), %xmm2 # %sfp, max
        addq    $24, %rsp       #,
        .cfi_def_cfa_offset 8
        vmovaps %xmm2, %xmm1    # max,
        jmp     fminf   #
        .cfi_endproc

I guess -ffast-math (or some weaker option) would let it generate the same 
as the next version.

But then we have what appears to be the best of them all....  just a
couple instructions, no branches, no calls, nothing:

.LFB0:
        .cfi_startproc
        vmaxss  %xmm0, %xmm1, %xmm0     # x, min, D.2219
        vminss  %xmm0, %xmm2, %xmm0     # D.2219, max, D.2219
        ret
        .cfi_endproc

So I'm curious.... why is the last approach optimized better than the
naive approach of some nested if statements?

Optimization is done locally, it detects a "max" pattern and a "min" 
pattern. The BRANCH versions are a more complicated pattern. There is code 
in phiopt that is supposed to handle it, but apparently it requires that 
we can prove at compile-time that min<=max. If you believe it can be 
generalized, you could file an enhancement PR with details about what to 
transform to what, and why this is valid whatever the ordering of x, min 
and max.

--
Marc Glisse