On Thu, Sep 27, 2012 at 12:35 AM, Ondřej Bílka <neleai@xxxxxxxxx> wrote:
> On Wed, Sep 26, 2012 at 04:20:52PM -0700, Ian Lance Taylor wrote:
>> On Wed, Sep 26, 2012 at 10:34 AM, Ondřej Bílka <neleai@xxxxxxxxx> wrote:
>>
>> > is there a reason why for example
>> > x=x|(1<<11);
>> > is not expanded into
>> > bts rax,11
>> > ?
>>
>> The bts instruction is never faster than the corresponding or
>> instruction. There's no reason to use it when setting a bit in the
>> low 32 bits.
>>
>> Ian
> Following benchmarks tells otherwise. On ivy bridge bts variant is twice
> faster than doing or.
>
> I used
>
> for(i=0;i<1000000;i++)
>   x=x|(1<<i);

That is a rather odd benchmark. Almost all of the loop iterations
will do nothing because the 1 will be left shifted into nothingness.

And if you look back at what I said, I said they were equivalent when
setting one of the low order 32 bits, which is what was happening in
your original code.

> implemented as
>
>         .globl  main
>         .type   main, @function
> main:
> .LFB0:
>         .cfi_startproc
>         xorl    %eax, %eax
>         xorl    %ecx, %ecx
>         movl    $1, %edx
>         .p2align 4,,10
>         .p2align 3
> .L2:
>         bts     %ecx, %edx
>         addl    $1, %ecx
>         cmpl    $100000000, %ecx
>         jne     .L2
>         rep
>         ret
>         .cfi_endproc
>
> and
>
>         .globl  main
>         .type   main, @function
> main:
> .LFB0:
>         .cfi_startproc
>         xorl    %eax, %eax
>         xorl    %ecx, %ecx
>         movl    $1, %edx
>         .p2align 4,,10
>         .p2align 3
> .L2:
>         movl    %edx, %esi
>         sall    %cl, %esi
>         addl    $1, %ecx
>         orl     %esi, %eax
>         cmpl    $100000000, %ecx
>         jne     .L2
>         rep
>         ret
>         .cfi_endproc

Those loops are not equivalent even apart from bts vs. or. One has
four instructions, the other has six.

Ian
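
For illustration only, here is a rough sketch of a loop body that stays within the low 32 bits on every iteration, which is the case under discussion. In C, a shift is only defined when the count is smaller than the width of the type, so the sketch reduces the index modulo 32 and uses an unsigned 32-bit value. The names set_low_bit and run_loop are hypothetical and do not come from the thread, and this is not the benchmark that was actually run.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper, not from the thread: set bit i of x.
 * Precondition: i < 32, so the shift count is always valid in C. */
static uint32_t set_low_bit(uint32_t x, unsigned i)
{
    assert(i < 32u);
    return x | (UINT32_C(1) << i);
}

/* Hypothetical benchmark body: n iterations, each setting bit (i mod 32).
 * Reducing the index keeps every iteration within the low 32 bits. */
static uint32_t run_loop(unsigned long n)
{
    uint32_t x = 0;
    for (unsigned long i = 0; i < n; i++)
        x = set_low_bit(x, (unsigned)(i % 32u));
    return x;
}
```

Calling run_loop(1000000) returns 0xffffffff, since every bit has been set after the first 32 iterations. A real timing run would likely also need some way to stop the compiler from folding the loop away, and the two variants would have to be compared with otherwise identical loop bodies.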