Re: Using bt,bts

Ondřej Bílka <neleai@xxxxxxxxx> · Fri, 28 Sep 2012 17:40:58 +0200



On Thu, Sep 27, 2012 at 10:52:48AM -0700, Ian Lance Taylor wrote:
> On Thu, Sep 27, 2012 at 12:35 AM, Ondřej Bílka <neleai@xxxxxxxxx> wrote:
> > On Wed, Sep 26, 2012 at 04:20:52PM -0700, Ian Lance Taylor wrote:
> >> On Wed, Sep 26, 2012 at 10:34 AM, Ondřej Bílka <neleai@xxxxxxxxx> wrote:
> >>
> >> > is there a reason why for example
> >> > x=x|(1<<11);
> >> > is not expanded into
> >> > bts rax,11
> >> > ?
> >>
> >> The bts instruction is never faster than the corresponding or
> >> instruction.  There's no reason to use it when setting a bit in the
> >> low 32 bits.
> >>
> >> Ian
> > The following benchmark says otherwise. On Ivy Bridge the bts variant is
> > twice as fast as the or variant.
> >
> > I used
> >
> >  for(i=0;i<1000000;i++)
> >     x=x|(1<<i);
> 
> That is a rather odd benchmark.  Almost all of the loop iterations
> will do nothing because the 1 will be left shifted into nothingness.
From the Intel reference manual:
Description
Shifts the bits in the first operand (destination operand) to the left
or right by the number of bits specified in the second operand (count
operand). Bits shifted beyond the destination operand boundary are first
shifted into the CF flag, then discarded. At the end of the shift
operation, the CF flag contains the last bit shifted out of the
destination operand.
The destination operand can be a register or a memory location. The
count operand can be an immediate value or the CL register. The count is
masked to 5 bits (or 6 bits if in 64-bit mode and REX.W is used). The
count range is limited to 0 to 31 (or 63 if 64-bit mode and REX.W is
used). A special opcode encoding is provided for a count of 1.
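
As a rough illustration of the count masking described in that passage,
here is a minimal C sketch. The helper names shl32_masked and
accumulate_bits are hypothetical and appear nowhere in the benchmark.
ISO C leaves 1 << i undefined once i reaches the width of int, so the
sketch applies the 5-bit mask explicitly instead of relying on what the
hardware does.

```c
#include <stdint.h>

/* Hypothetical helper: models a 32-bit SHL with a register count.
   The count is reduced to its low 5 bits before shifting, as the
   manual text describes. */
static uint32_t shl32_masked(uint32_t value, uint32_t count)
{
    return value << (count & 31u);
}

/* Hypothetical helper: ORs together (1 << i) for i = 0 .. n-1 using
   the masked shift, mirroring the shape of the benchmark loop. */
static uint32_t accumulate_bits(uint32_t n)
{
    uint32_t x = 0;
    for (uint32_t i = 0; i < n; i++)
        x |= shl32_masked(1u, i);
    return x;
}
```

Under this model a count of 33 behaves like a count of 1, so
shl32_masked(1u, 33) yields 2, and after the first 32 iterations
accumulate_bits has set every bit; later iterations keep setting bits
that are already set.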
> 
> And if you look back at what I said, I said they were equivalent when
> setting one of the low order 32 bits, which is what was happening in
> your original code.
I did not say that I set one of the lower 32 bits, nor did I say that
the position I set is constant.
> 
> 
> > implemented as
> >
> > .globl main
> >   .type main, @function
> > main:
> > .LFB0:
> >   .cfi_startproc
> >   xorl  %eax, %eax
> >   xorl  %ecx, %ecx
> >   movl  $1, %edx
> >   .p2align 4,,10
> >   .p2align 3
> > .L2:
> >   bts %ecx, %edx
> >   addl  $1, %ecx
> >   cmpl  $100000000, %ecx
> >   jne .L2
> >   rep
> >   ret
> > .cfi_endproc
> >
> > and
> >
> > .globl main
> >   .type main, @function
> > main:
> > .LFB0:
> >   .cfi_startproc
> >   xorl  %eax, %eax
> >   xorl  %ecx, %ecx
> >   movl  $1, %edx
> >   .p2align 4,,10
> >   .p2align 3
> > .L2:
> >   movl  %edx, %esi
> >   sall  %cl, %esi
> >   addl  $1, %ecx
> >   orl %esi, %eax
> >   cmpl  $100000000, %ecx
> >   jne .L2
> >   rep
> >   ret
> > .cfi_endproc
> 
> Those loops are not equivalent even apart from bts vs. ori.  One has
> four instructions, the other has six.
Two functions are equivalent if and only if they produce the same output
for every input. That one consists of 10 instructions while the other
has 8 is irrelevant.
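
A minimal sketch of that notion of equivalence, in portable C rather
than the assembly above. All three function names are hypothetical.
set_bit_bts_model assumes the register form of bts, where the bit offset
is taken modulo the operand size. The check covers only a handful of
sample values and positions 0 to 255, so it illustrates the definition
and does not prove equivalence for every input.

```c
#include <stddef.h>
#include <stdint.h>

/* Shift-and-or variant, with the count masked to 5 bits. */
static uint32_t set_bit_shift_or(uint32_t x, uint32_t pos)
{
    return x | (UINT32_C(1) << (pos & 31u));
}

/* Model of bts on a 32-bit register: the offset is taken modulo 32.
   The mask is built by repeated doubling so that this variant does
   not share code with the one above. */
static uint32_t set_bit_bts_model(uint32_t x, uint32_t pos)
{
    uint32_t mask = 1;
    for (uint32_t k = pos % 32u; k > 0; k--)
        mask += mask;
    return x | mask;
}

/* Returns 1 if both variants agree on every sampled input, else 0. */
static int variants_agree(void)
{
    static const uint32_t samples[] = {
        0u, 1u, 0x80000000u, 0xdeadbeefu, 0xffffffffu
    };
    for (size_t s = 0; s < sizeof samples / sizeof samples[0]; s++)
        for (uint32_t pos = 0; pos < 256; pos++)
            if (set_bit_shift_or(samples[s], pos) !=
                set_bit_bts_model(samples[s], pos))
                return 0;
    return 1;
}
```

The two variants take different numbers of steps but return the same
value for each sampled input, which is the only property the definition
cares about.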
> 
> Ian