Hello,

Consider the following code:

    int my_ctz(unsigned int arg)
    {
        return __builtin_ctz(arg);
    }

which "gcc-7 -O -S -march=skylake" compiles to:

    my_ctz:
            xorl    %eax, %eax
            tzcntl  %edi, %eax
            ret

I don't understand why GCC clears eax before executing tzcnt. (Actually, this happens for other built-ins as well: clz, popcount.)

tzcnt (or bsf) writes its result to eax anyway, so the xor looks redundant to me.

http://www.felixcloutier.com/x86/TZCNT.html
http://www.felixcloutier.com/x86/BSF.html

Does it have to do with partial register write stalls? Probably not, because the zeroing remains even when the call is inlined, and gcc "sees" that there are no partial register writes.

Regards.
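
P.S. In case it helps to reproduce the inlined case, here is a minimal sketch of the kind of test I mean. The caller sum_ctz and its loop are purely illustrative (hypothetical names, not taken from my real code); the only assumption is that my_ctz gets inlined into the loop at -O.

```c
/* Wrapper around the built-in. Note that __builtin_ctz is undefined
 * for arg == 0, so callers must pass nonzero values. */
static inline int my_ctz(unsigned int arg)
{
    return __builtin_ctz(arg);
}

/* Hypothetical caller (illustrative only): sums the trailing-zero
 * counts of n nonzero values, so that my_ctz is a candidate for
 * inlining into the loop body. */
int sum_ctz(const unsigned int *vals, int n)
{
    int total = 0;
    for (int i = 0; i < n; i++)
        total += my_ctz(vals[i]);
    return total;
}
```

Compiling this with the same "gcc-7 -O -S -march=skylake" and looking at the loop body should show whether the xorl is still emitted in front of the tzcntl once the call has been inlined.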