On Thu, Sep 27, 2012 at 10:52:48AM -0700, Ian Lance Taylor wrote:
> On Thu, Sep 27, 2012 at 12:35 AM, Ondřej Bílka <neleai@xxxxxxxxx> wrote:
> > On Wed, Sep 26, 2012 at 04:20:52PM -0700, Ian Lance Taylor wrote:
> >> On Wed, Sep 26, 2012 at 10:34 AM, Ondřej Bílka <neleai@xxxxxxxxx> wrote:
> >>
> >> > is there a reason why for example
> >> > x=x|(1<<11);
> >> > is not expanded into
> >> > bts rax,11
> >> > ?
> >>
> >> The bts instruction is never faster than the corresponding or
> >> instruction. There's no reason to use it when setting a bit in the
> >> low 32 bits.
> >>
> >> Ian
> > Following benchmarks tells otherwise. On ivy bridge bts variant is twice
> > faster than doing or.
> >
> > I used
> >
> > for(i=0;i<1000000;i++)
> >   x=x|(1<<i);
>
> That is a rather odd benchmark. Almost all of the loop iterations
> will do nothing because the 1 will be left shifted into nothingness.

From the Intel reference manual:

Description
Shifts the bits in the first operand (destination operand) to the left
or right by the number of bits specified in the second operand (count
operand). Bits shifted beyond the destination operand boundary are first
shifted into the CF flag, then discarded. At the end of the shift
operation, the CF flag contains the last bit shifted out of the
destination operand.
The destination operand can be a register or a memory location. The
count operand can be an immediate value or the CL register. The count is
masked to 5 bits (or 6 bits if in 64-bit mode and REX.W is used). The
count range is limited to 0 to 31 (or 63 if 64-bit mode and REX.W is
used). A special opcode encoding is provided for a count of 1.

>
> And if you look back at what I said, I said they were equivalent when
> setting one of the low order 32 bits, which is what was happening in
> your original code.

I did not say that I set the lower 32 bits, nor did I say that the
position I set is constant.
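
To make the count masking quoted from the manual concrete, here is a
minimal C sketch. The names shl32_x86 and run_or_loop are hypothetical and
only illustrate the point. ISO C leaves 1 << i undefined once i reaches the
width of int, so the 5-bit mask is written out explicitly instead of
relying on what the hardware happens to do.

```c
#include <stdint.h>

/* Emulates a 32-bit x86 SHL: the shift count is masked to 5 bits, so a
   count of 32 behaves like 0, 33 like 1, and so on. (Hypothetical helper,
   not taken from the thread.) */
uint32_t shl32_x86(uint32_t value, unsigned count)
{
    return value << (count & 31u);
}

/* Runs the benchmark loop x = x | (1 << i) for i in [0, iterations),
   using the masked shift above, and returns the final x. */
uint32_t run_or_loop(unsigned iterations)
{
    uint32_t x = 0;
    for (unsigned i = 0; i < iterations; i++)
        x |= shl32_x86(1u, i);
    return x;
}
```

Under this model the 1 is never shifted out: iterations past 31 hit bit
positions 0 to 31 again, so x is already all ones after the first 32
iterations.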
> >
> > implemented as
> >
> >         .globl main
> >         .type main, @function
> > main:
> > .LFB0:
> >         .cfi_startproc
> >         xorl %eax, %eax
> >         xorl %ecx, %ecx
> >         movl $1, %edx
> >         .p2align 4,,10
> >         .p2align 3
> > .L2:
> >         bts %ecx, %edx
> >         addl $1, %ecx
> >         cmpl $100000000, %ecx
> >         jne .L2
> >         rep
> >         ret
> >         .cfi_endproc
> >
> > and
> >
> >         .globl main
> >         .type main, @function
> > main:
> > .LFB0:
> >         .cfi_startproc
> >         xorl %eax, %eax
> >         xorl %ecx, %ecx
> >         movl $1, %edx
> >         .p2align 4,,10
> >         .p2align 3
> > .L2:
> >         movl %edx, %esi
> >         sall %cl, %esi
> >         addl $1, %ecx
> >         orl %esi, %eax
> >         cmpl $100000000, %ecx
> >         jne .L2
> >         rep
> >         ret
> >         .cfi_endproc
>
> Those loops are not equivalent even apart from bts vs. ori. One has
> four instructions, the other has six.

Two functions are equivalent if and only if they produce the same output
for every input. That one consists of 10 instructions while the other
has 8 is irrelevant.

>
> Ian
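
The equivalence criterion stated above (same output for every input) can
be spot-checked in portable C. This is only a sketch: set_bit_or,
set_bit_bts and count_mismatches are hypothetical names, the bts variant
is an emulation that assumes the register form of bts takes the bit offset
modulo 32 for a 32-bit operand, and the check samples a handful of
starting values rather than proving anything. It also says nothing about
which variant is faster.

```c
#include <stddef.h>
#include <stdint.h>

/* "or" variant: build the mask with a shift whose count is masked to
   5 bits (as a 32-bit SHL does), then OR it into x. */
uint32_t set_bit_or(uint32_t x, uint32_t i)
{
    return x | (UINT32_C(1) << (i & 31u));
}

/* "bts" variant, emulated: the bit offset is reduced modulo 32 (assumed
   register-operand behaviour) and the bit is added only if it is clear.
   Deliberately written differently from set_bit_or. */
uint32_t set_bit_bts(uint32_t x, uint32_t i)
{
    uint32_t mask = UINT32_C(1) << (i % 32u);
    return (x & mask) ? x : x + mask;
}

/* Counts the sampled inputs on which the two variants disagree, for bit
   indices in [0, max_index) and a few arbitrary starting values. */
unsigned count_mismatches(uint32_t max_index)
{
    static const uint32_t seeds[] = {
        0u, 1u, 0x80000000u, 0xFFFFFFFFu, 0x12345678u
    };
    unsigned mismatches = 0;
    for (size_t s = 0; s < sizeof seeds / sizeof seeds[0]; s++)
        for (uint32_t i = 0; i < max_index; i++)
            if (set_bit_or(seeds[s], i) != set_bit_bts(seeds[s], i))
                mismatches++;
    return mismatches;
}
```

A result of zero from count_mismatches means the two variants agreed on
every sampled input, even though they are written with a different number
of operations.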