On Thu, 27 Feb 2025 15:47:03 -0800
Bill Wendling <morbo@xxxxxxxxxx> wrote:

> For both gcc and clang, crc32 builtins generate better code than the
> inline asm. GCC improves, removing unneeded "mov" instructions. Clang
> does the same and unrolls the loops. GCC has no changes on i386, but
> Clang's code generation is vastly improved, due to Clang's "rm"
> constraint issue.
>
> The number of cycles improved by ~0.1% for GCC and ~1% for Clang, which
> is expected because of the "rm" issue. However, Clang's performance is
> better than GCC's by ~1.5%, most likely due to loop unrolling.

How much does it unroll?
How much you need depends on the latency of the crc32 instruction.
The copy of Agner's tables I have gives it a latency of 3 on
pretty much everything.
If you can only do one chained crc instruction every three clocks
it is hard to see how unrolling the loop will help.
Intel CPUs (since Sandy Bridge) will run a two clock loop.
With three clocks to play with it should be easy (even for a compiler)
to generate a loop with no extra clock stalls.

Clearly if Clang decides to copy arguments to the stack an extra time
that will kill things. But in this case you want the "m" constraint
to directly read from the buffer (with a (reg,reg,8) addressing mode).

	David
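
For illustration only, here is a minimal sketch of the kind of chained crc32 loop being discussed, written with the GCC/Clang builtins. It is not the kernel's code. The names crc32c_sw, crc32c_hw and crc32c_update are made up for this example, and the bitwise fallback is there only so the sketch builds and can be checked on machines without SSE4.2. Each crc32 consumes the result of the previous one, so assuming the latency of 3 quoted above, a single chain handles at most 8 bytes every 3 clocks no matter how far the loop is unrolled.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Portable bitwise CRC32C update (reflected polynomial 0x82F63B78).
 * Hypothetical helper: used as a reference and as a fallback. */
static uint32_t crc32c_sw(uint32_t crc, const unsigned char *p, size_t len)
{
	while (len--) {
		crc ^= *p++;
		for (int k = 0; k < 8; k++)
			crc = (crc >> 1) ^ (0x82F63B78u & (0u - (crc & 1u)));
	}
	return crc;
}

#if defined(__x86_64__) && (defined(__GNUC__) || defined(__clang__))
#define HAVE_HW_CRC32C 1
/* One dependency chain: every crc32 needs the previous value of 'c',
 * so iterations cannot overlap even if the compiler unrolls the loop. */
__attribute__((target("sse4.2")))
static uint32_t crc32c_hw(uint32_t crc, const unsigned char *p, size_t len)
{
	uint64_t c = crc;

	while (len >= 8) {
		uint64_t v;

		memcpy(&v, p, sizeof(v));	/* unaligned-safe 8-byte load */
		c = __builtin_ia32_crc32di(c, v);
		p += 8;
		len -= 8;
	}
	while (len--)				/* byte-at-a-time tail */
		c = __builtin_ia32_crc32qi((uint32_t)c, *p++);
	return (uint32_t)c;
}
#endif

/* Hypothetical dispatcher: use the instruction when the CPU has it. */
uint32_t crc32c_update(uint32_t crc, const void *buf, size_t len)
{
#ifdef HAVE_HW_CRC32C
	if (__builtin_cpu_supports("sse4.2"))
		return crc32c_hw(crc, buf, len);
#endif
	return crc32c_sw(crc, buf, len);
}
```

Like the raw instruction, crc32c_update() does no initial or final inversion; the caller supplies the seed. Getting past the latency bound would take several independent chains over separate blocks plus a combine step, which this sketch deliberately leaves out.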