On 03/10/2017 19:09, David Wohlferd wrote:

> On 10/3/2017 6:53 AM, Mason wrote:
>
>> Consider the following code:
>>
>> int my_ctz(unsigned int arg) { return __builtin_ctz(arg); }
>>
>> which "gcc-7 -O -S -march=skylake" compiles to:
>>
>> my_ctz:
>>     xorl    %eax, %eax
>>     tzcntl  %edi, %eax
>>     ret
>>
>> I don't understand why GCC clears eax before executing tzcnt.
>> (Actually, this happens for other built-ins as well: clz, popcount.)
>>
>> tzcnt (or bsf) will write their result to eax.
>>
>> http://www.felixcloutier.com/x86/TZCNT.html
>> http://www.felixcloutier.com/x86/BSF.html
>>
>> Does it have to do with partial register write stalls?
>> Probably not, because the zero-ing remains even when the call
>> is inlined, and gcc "sees" there are no partial register writes.
>
> Quoting from the docs on tzcnt:
>
> "in the case of BSF instruction, if source operand is zero, the
> content of destination operand are undefined. On processors that do
> not support TZCNT, the instruction byte encoding is executed as BSF."
>
> So BSF leaves the contents of eax undefined, and TZCNT might execute as
> BSF. Given the trivial nature of xor eax, eax, this seems a sensible
> precaution.

Hello David,

Your answer makes sense, but falls apart given the following:

As I stated, "gcc-7 -O -S -march=skylake" generates

my_ctz:
    xorl    %eax, %eax
    tzcntl  %edi, %eax
    ret

But "gcc-7 -O -S -march=barcelona" generates

my_ctz:
    bsfl    %edi, %eax
    ret

AMD Barcelona does not support tzcnt, yet GCC doesn't clear eax
before executing bsf.

The mystery remains :-)

Regards.