Jeroen van Bemmel <jbemmel@xxxxxxxxx> writes:

> I have ported an SSE4 strcmp function from
> http://www.strchr.com/strcmp_and_strlen_using_sse_4.2
> to GCC inline assembly:
>
> long __res;
> __asm__ __volatile__(
>     "sub $16, %4 \n"
>     "1:\n"
>     "add $16, %4 \n"
>     "movdqu (%4), %%xmm0 \n" // Could use any XMM, using register constraint "x"
>     // ".byte 0x48 \n" // REX prefix with REX.w=1, to get result in RCX
>     "pcmpistri $0x18, (%4,%0), %%xmm0 \n" // EQUAL_EACH(0x08) + NEGATIVE_POLARITY(0x10)
>     "ja 1b \n"
>     "jc 2f \n"
>     "xor %0, %0 \n"
>     "jmp 3f \n" // XXX Extra jump could be avoided in pure asm
>     "2:\n"
>     "add %4, %0 \n"
>     "movzxb (%0,%1), %0 \n"
>     "movzxb (%4,%1), %4 \n"
>     "sub %4, %0 \n"
>     "3:\n"
>     : "=a"(__res), "=c"(cs) : "0"(cs-ct), "1"(0L), "r"(ct) : "xmm0" );
>
> return (int) __res;
>
> The problem with this code is that "pcmpistri" returns its result in
> ECX (i.e. the lower 32 bits of RCX), while the "movzxb" instructions
> use the full RCX register.
> One solution is to insert a REX prefix with REX.w bit set ( any gas
> directive for this? )

Normally, writing the low 32 bits of an x86-64 register zeroes out the
upper 32 bits.  Is that not true for pcmpistri?

Otherwise, it sounds like you want the addressing mode (%rax,%ecx).
Does x86 really have that addressing mode?  Why not just zero-extend
%ecx to %rcx?

> However, I'd prefer to have gcc clear RCX at the beginning of the
> function. The above code loads the "c" register with 0, but the
> resulting asm code is
> "xorl ecx, ecx"

That instruction will indeed set %ecx to zero.  Think about it.

Ian
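For anyone who wants to check the zero-extension rule rather than take it on faith, here is a minimal sketch. It is not part of the original code: the function names are made up for illustration, and it assumes GCC or Clang on x86-64 (the %k0 operand modifier selects the 32-bit name of whatever register gcc picks). On other targets it falls back to plain C that mimics the same effect. It only exercises ordinary movl/xorl writes; it says nothing by itself about pcmpistri.

```c
#include <stdint.h>

/* Hypothetical helper: write an immediate into the low 32 bits of a
   register that starts out as `initial`, then return all 64 bits. */
static uint64_t after_movl(uint64_t initial)
{
    uint64_t r = initial;
#if defined(__x86_64__)
    /* %k0 is the 32-bit name of operand 0, e.g. %eax for %rax. */
    __asm__ ("movl $0x12345678, %k0" : "+r"(r));
#else
    r = (uint32_t)0x12345678u;   /* portable stand-in, not a real test */
#endif
    return r;
}

/* Hypothetical helper: the "xorl %ecx, %ecx" idiom on an arbitrary register. */
static uint64_t after_xorl(uint64_t initial)
{
    uint64_t r = initial;
#if defined(__x86_64__)
    __asm__ ("xorl %k0, %k0" : "+r"(r) : : "cc");
#else
    r = 0;                       /* portable stand-in */
#endif
    return r;
}

/* Hypothetical helper: explicit zero-extension by moving the 32-bit
   register onto itself, e.g. "movl %ecx, %ecx". */
static uint64_t zero_extend_low32(uint64_t initial)
{
    uint64_t r = initial;
#if defined(__x86_64__)
    __asm__ ("movl %k0, %k0" : "+r"(r));
#else
    r = (uint32_t)initial;       /* portable stand-in */
#endif
    return r;
}
```

Calling each helper with an all-ones argument and printing the result with "%016llx" shows whether the upper half survived the 32-bit write. The last helper is the cheapest way to get a clean 64-bit index if you ever do not trust what an instruction left in the upper half.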