On Tue, Oct 3, 2017 at 9:59 PM, Mason <slash.tmp@xxxxxxx> wrote:
> On 03/10/2017 19:09, David Wohlferd wrote:
>
> > On 10/3/2017 6:53 AM, Mason wrote:
> >
> >> Consider the following code:
> >>
> >> int my_ctz(unsigned int arg) { return __builtin_ctz(arg); }
> >>
> >> which "gcc-7 -O -S -march=skylake" compiles to:
> >>
> >> my_ctz:
> >>         xorl    %eax, %eax
> >>         tzcntl  %edi, %eax
> >>         ret
> >>
> >> I don't understand why GCC clears eax before executing tzcnt.
> >> (Actually, this happens for other built-ins as well: clz, popcount.)
> >>
> >> tzcnt (or bsf) will write their result to eax.
> >>
> >> http://www.felixcloutier.com/x86/TZCNT.html
> >> http://www.felixcloutier.com/x86/BSF.html
> >>
> >> Does it have to do with partial register write stalls?
> >> Probably not, because the zero-ing remains even when the call
> >> is inlined, and gcc "sees" there are no partial register writes.
> >
> > Quoting from the docs on tzcnt:
> >
> > "in the case of BSF instruction, if source operand is zero, the
> > content of destination operand are undefined. On processors that do
> > not support TZCNT, the instruction byte encoding is executed as BSF."
> >
> > So BSF leaves the contents of eax undefined, and TZCNT might execute as
> > BSF. Given the trivial nature of xor eax, eax, this seems a sensible
> > precaution.
>
> Hello David,
>
> Your answer makes sense, but falls apart given the following:
>
> As I stated, "gcc-7 -O -S -march=skylake" generates
>
> my_ctz:
>         xorl    %eax, %eax
>         tzcntl  %edi, %eax
>         ret
>
> But "gcc-7 -O -S -march=barcelona" generates
>
> my_ctz:
>         bsfl    %edi, %eax
>         ret
>
>
> AMD Barcelona does not support tzcnt, yet GCC doesn't clear
> eax before executing bsf. The mystery remains :-)
>

It might be because of the workaround for this hardware problem:
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=62011

-- 
Regards,
Mikhail Maltsev
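
For anyone who wants to experiment with this, here is a minimal sketch in C. As far as I understand the linked report, it concerns a false dependency on the destination register for popcnt on some Intel processors; if tzcnt is affected in the same way, zeroing eax first would break that dependency, which would matter most in a loop like the one below. The names ctz_ref and sum_ctz are hypothetical helpers for illustration and do not appear in the thread. Note that __builtin_ctz is documented as undefined for a zero argument, so the sketch assumes nonzero inputs.

```c
#include <assert.h>

/* The function from the thread; the result is undefined for arg == 0. */
int my_ctz(unsigned int arg) { return __builtin_ctz(arg); }

/* Hypothetical reference version: counts trailing zero bits one at a time. */
int ctz_ref(unsigned int arg)
{
    int n = 0;

    assert(arg != 0);            /* mirror the builtin's precondition */
    while ((arg & 1u) == 0) {
        arg >>= 1;
        n++;
    }
    return n;
}

/* Hypothetical loop in which each ctz result is independent of the
 * previous one; every v[i] is assumed to be nonzero. */
unsigned int sum_ctz(const unsigned int *v, int len)
{
    unsigned int sum = 0;

    for (int i = 0; i < len; i++)
        sum += (unsigned int)my_ctz(v[i]);
    return sum;
}
```

Compiling this with "gcc -O2 -S" for different -march values and comparing the code generated for sum_ctz should show whether the xor is emitted inside the loop on a given target.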