Re: Why does __builtin_ctz clear eax on amd64 targets

Mason <slash.tmp@xxxxxxx> · Tue, 3 Oct 2017 20:59:30 +0200

On 03/10/2017 19:09, David Wohlferd wrote:

> On 10/3/2017 6:53 AM, Mason wrote:
>
>> Consider the following code:
>>
>> int my_ctz(unsigned int arg) { return __builtin_ctz(arg); }
>>
>> which "gcc-7 -O -S -march=skylake" compiles to:
>>
>> my_ctz:
>> 	xorl	%eax, %eax
>> 	tzcntl	%edi, %eax
>> 	ret
>>
>> I don't understand why GCC clears eax before executing tzcnt.
>> (Actually, this happens for other built-ins as well: clz, popcount.)
>>
>> tzcnt (or bsf) will write its result to eax anyway.
>>
>> http://www.felixcloutier.com/x86/TZCNT.html
>> http://www.felixcloutier.com/x86/BSF.html
>>
>> Does it have to do with partial register write stalls?
>> Probably not, because the zeroing remains even when the call
>> is inlined, and gcc "sees" there are no partial register writes.
>
> Quoting from the docs on tzcnt:
> 
> "in the case of BSF instruction, if source operand is zero, the
> content of destination operand are undefined. On processors that do
> not support TZCNT, the instruction byte encoding is executed as BSF."
> 
> So BSF leaves the contents of eax undefined, and TZCNT might execute as 
> BSF.  Given the trivial nature of xor eax, eax, this seems a sensible 
> precaution.

Hello David,

Your explanation sounds plausible, but it does not account for the following:

As I stated, "gcc-7 -O -S -march=skylake" generates

my_ctz:
	xorl	%eax, %eax
	tzcntl	%edi, %eax
	ret

But "gcc-7 -O -S -march=barcelona" generates

my_ctz:
	bsfl	%edi, %eax
	ret

AMD Barcelona does not support tzcnt, yet GCC doesn't clear
eax before executing bsf. If the xor were a precaution against
bsf leaving eax undefined, I would expect it here as well.
The mystery remains :-)
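
For anyone who wants to reproduce the comparison, here is a minimal
test file, offered as a sketch rather than anything definitive.
my_ctz is the function from my first message; ctz_reference and
count_mismatches are hypothetical helpers I made up for the check.
The GCC documentation says the result of __builtin_ctz is undefined
for a zero argument, so the check only feeds it nonzero values.

```c
#include <assert.h>
#include <limits.h>

/* The function from the original message. */
int my_ctz(unsigned int arg) { return __builtin_ctz(arg); }

/* Hypothetical reference: count trailing zero bits one at a time.
 * The argument must be nonzero, otherwise the loop never ends. */
int ctz_reference(unsigned int arg)
{
    int n = 0;

    assert(arg != 0);
    while ((arg & 1u) == 0) {
        arg >>= 1;
        n++;
    }
    return n;
}

/* Hypothetical self-check: compare the builtin against the reference
 * for every bit position, and return the number of disagreements. */
int count_mismatches(void)
{
    const int bits = (int)(sizeof(unsigned int) * CHAR_BIT);
    int bad = 0;

    for (int i = 0; i < bits; i++) {
        unsigned int single = 1u << i;   /* only bit i set */
        unsigned int upper  = ~0u << i;  /* bit i and all higher bits set */

        if (my_ctz(single) != ctz_reference(single))
            bad++;
        if (my_ctz(upper) != ctz_reference(upper))
            bad++;
    }
    return bad;
}
```

Compiling this file with the two command lines quoted above
("gcc-7 -O -S -march=skylake" and "gcc-7 -O -S -march=barcelona")
should let you compare the code generated for my_ctz side by side.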

Regards.