On 05/10/2017 01:33, Mikhail Maltsev wrote:

> On Tue, Oct 3, 2017 at 9:59 PM, Mason wrote:
>
>> On 03/10/2017 19:09, David Wohlferd wrote:
>>
>>> On 10/3/2017 6:53 AM, Mason wrote:
>>>
>>>> Consider the following code:
>>>>
>>>> int my_ctz(unsigned int arg) { return __builtin_ctz(arg); }
>>>>
>>>> which "gcc-7 -O -S -march=skylake" compiles to:
>>>>
>>>> my_ctz:
>>>>     xorl    %eax, %eax
>>>>     tzcntl  %edi, %eax
>>>>     ret
>>>>
>>>> I don't understand why GCC clears eax before executing tzcnt.
>>>> (Actually, this happens for other built-ins as well: clz, popcount.)
>>>>
>>>> tzcnt (or bsf) will write their result to eax.
>>>>
>>>> http://www.felixcloutier.com/x86/TZCNT.html
>>>> http://www.felixcloutier.com/x86/BSF.html
>>>>
>>>> Does it have to do with partial register write stalls?
>>>> Probably not, because the zero-ing remains even when the call
>>>> is inlined, and gcc "sees" there are no partial register writes.
>>>
>>> Quoting from the docs on tzcnt:
>>>
>>> "in the case of BSF instruction, if source operand is zero, the
>>> content of destination operand are undefined. On processors that do
>>> not support TZCNT, the instruction byte encoding is executed as BSF."
>>>
>>> So BSF leaves the contents of eax undefined, and TZCNT might execute as
>>> BSF. Given the trivial nature of xor eax, eax, this seems a sensible
>>> precaution.
>>
>> Hello David,
>>
>> Your answer makes sense, but falls apart given the following:
>>
>> As I stated, "gcc-7 -O -S -march=skylake" generates
>>
>> my_ctz:
>>     xorl    %eax, %eax
>>     tzcntl  %edi, %eax
>>     ret
>>
>> But "gcc-7 -O -S -march=barcelona" generates
>>
>> my_ctz:
>>     bsfl    %edi, %eax
>>     ret
>>
>> AMD Barcelona does not support tzcnt, yet GCC doesn't clear
>> eax before executing bsf. The mystery remains :-)
>
> It might be because of the workaround for this hardware problem:
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=62011

Hello Mikhail,

I think you've hit the nail on the head! :-)

Regards.
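
P.S. A small sketch for anyone reading this in the archives, on why the "undefined result for a zero source" explanation was unlikely from the start. As far as I can tell from the GCC documentation, the result of __builtin_ctz is itself undefined when the argument is 0, so the compiler is under no obligation to produce a defined value in that case. Code that needs a defined answer for 0 has to handle it explicitly. The function names my_ctz_checked and my_ctz_defined, and the choice of returning the bit width for a zero argument, are my own assumptions for illustration and do not come from GCC or from the thread.

```c
#include <assert.h>
#include <limits.h>

/* Hypothetical variant of my_ctz from the thread: the precondition
 * (arg != 0) is made explicit, because __builtin_ctz(0) is undefined. */
int my_ctz_checked(unsigned int arg)
{
    assert(arg != 0);
    return __builtin_ctz(arg);
}

/* Hypothetical wrapper that defines the zero case itself.
 * Returning the bit width of unsigned int for 0 is an assumed
 * convention, chosen here only for illustration. */
int my_ctz_defined(unsigned int arg)
{
    return arg ? __builtin_ctz(arg) : (int)(sizeof arg * CHAR_BIT);
}
```

Neither variant says anything about the xorl itself; that part is explained by the bug report Mikhail linked, not by the zero-input case.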