From: Paolo Bonzini
> Sent: 15 April 2019 09:12
> On 11/04/19 11:06, David Laight wrote:
> > It may be possible to generate shorter code that executes just as
> > fast by generating a single constant and deriving the others from it.
> > - generate 4s - needed first
> > - shift right 2 to get 1s (in parallel with the xor)
> > - use lea to get 6s (in parallel with an lea to do the add)
> > - invert the 1s to get FEs (also in parallel with the add)
> > - xor the FEs with the 6s to get F8s (in parallel with the or)
> > - and/test for the result

That version needs an extra register move I hadn't allowed for.
It is also impossible to stop gcc folding constant expressions
without an asm nop on a register.

> FWIW, here is yet another way to do it:
>
>	/* Change 6/7 to 4/5 */
>	data &= ~((data & 0x0404040404040404ULL) >> 1);
>	/* Only allow 0/1/4/5 now */
>	return !(data & 0xFAFAFAFAFAFAFAFAULL);
>
>	movabs $0x404040404040404, %rcx
>	andq   %rdx, %rcx
>	shrq   %rcx
>	notq   %rcx
>	movabs $0xFAFAFAFAFAFAFAFA, %rax
>	andq   %rcx, %rdx
>	test   %rax, %rdx

Fewer opcode bytes, but 5 dependent instructions (assuming the load of
the first constant can be executed in parallel with an earlier
instruction).
I think my one was only 4 dependent instructions.

All these are far faster than the loop...

	David
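
For anyone wanting to sanity-check the mask version quoted above against a per-byte loop, a minimal sketch follows. It assumes, going by the comments in the quoted C, that each of the 8 bytes must be one of 0, 1, 4, 5, 6 or 7. The loop is only a stand-in for "the loop" mentioned above, not the actual code, and both function names are made up for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical reference loop: check one byte at a time.
 * Assumption: the valid byte values are 0, 1, 4, 5, 6 and 7.
 */
static bool bytes_valid_loop(uint64_t data)
{
	for (int i = 0; i < 8; i++) {
		unsigned int b = (data >> (i * 8)) & 0xff;

		if (b > 7 || b == 2 || b == 3)
			return false;
	}
	return true;
}

/* Branch-free version, as in the quoted C fragment. */
static bool bytes_valid_mask(uint64_t data)
{
	/* Change 6/7 to 4/5: clear bit 1 in every byte that has bit 2 set. */
	data &= ~((data & 0x0404040404040404ULL) >> 1);
	/* Only 0/1/4/5 are left as valid: bit 1 and bits 3..7 must be clear. */
	return !(data & 0xFAFAFAFAFAFAFAFAULL);
}
```

The shift cannot leak between bytes because only bit 2 of each byte survives the first mask, so shifting right by one always lands on bit 1 of the same byte. Running both functions over every byte value in every byte position should give identical answers.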