On Mon, 24 Feb 2025 14:27:03 -0500
Yury Norov <yury.norov@xxxxxxxxx> wrote:

....
> +#define parity(val) \
> +({ \
> +	u64 __v = (val); \
> +	int __ret; \
> +	switch (BITS_PER_TYPE(val)) { \
> +	case 64: \
> +		__v ^= __v >> 32; \
> +		fallthrough; \
> +	case 32: \
> +		__v ^= __v >> 16; \
> +		fallthrough; \
> +	case 16: \
> +		__v ^= __v >> 8; \
> +		fallthrough; \
> +	case 8: \
> +		__v ^= __v >> 4; \
> +		__ret = (0x6996 >> (__v & 0xf)) & 1; \
> +		break; \
> +	default: \
> +		BUILD_BUG(); \
> +	} \
> +	__ret; \
> +})
> +

You really don't want to do that!
gcc makes a right hash of it for x86 (32bit).
See https://www.godbolt.org/z/jG8dv3cvs

You do better using a __v32 after the 64bit xor.

Even the 64bit version is probably sub-optimal (both gcc and clang).
The whole lot ends up being a big single register dependency chain.
You want to do:

	mov %eax, %edx
	shrl $n, %eax
	xor %edx, %eax

so that the 'mov' and 'shrl' can happen in the same clock
(without relying on the register-register move being optimised out).

I dropped in the arm64 for an example of where the magic shift of 6996
just adds an extra instruction.

	David
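
For reference, a minimal userspace sketch of the "__v32 after the 64bit xor" idea. The name parity64_sketch() is hypothetical, and uint32_t/uint64_t stand in for the kernel's u32/u64. It only shows the shape of the suggestion: after the first fold every operation is on a 32bit value, so a 32bit target should not need to carry a register pair through the remaining steps. The generated code has not been checked here and will depend on compiler and flags.

```c
#include <stdint.h>

/*
 * Hypothetical helper, not taken from the patch: fold the two
 * 32bit halves together once, then continue in a 32bit variable.
 * Returns 1 if an odd number of bits are set in val, else 0.
 */
static inline int parity64_sketch(uint64_t val)
{
	/* The only 64bit operation: xor the high half into the low half. */
	uint32_t v32 = (uint32_t)(val ^ (val >> 32));

	v32 ^= v32 >> 16;
	v32 ^= v32 >> 8;
	v32 ^= v32 >> 4;

	/* 0x6996 is a 16-entry lookup table: bit i is the parity of i. */
	return (0x6996 >> (v32 & 0xf)) & 1;
}
```

So parity64_sketch(0x3) gives 0 and parity64_sketch(0x7) gives 1, the same answers as the quoted macro for a 64bit argument.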