Re: [PATCH v2] x86/crc32: use builtins to improve code generation

David Laight <david.laight.linux@xxxxxxxxx> · Mon, 3 Mar 2025 20:15:09 +0000

On Thu, 27 Feb 2025 15:47:03 -0800
Bill Wendling <morbo@xxxxxxxxxx> wrote:

> For both gcc and clang, crc32 builtins generate better code than the
> inline asm. GCC improves, removing unneeded "mov" instructions. Clang
> does the same and unrolls the loops. GCC has no changes on i386, but
> Clang's code generation is vastly improved, due to Clang's "rm"
> constraint issue.
> 
> The number of cycles improved by ~0.1% for GCC and ~1% for Clang, which
> is expected because of the "rm" issue. However, Clang's performance is
> better than GCC's by ~1.5%, most likely due to loop unrolling.

How much does it unroll?
How much unrolling you need depends on the latency of the crc32
instruction.
The copy of Agner Fog's instruction tables I have gives it a latency
of 3 clocks on pretty much everything.
If you can only issue one chained crc32 instruction every three clocks,
it is hard to see how unrolling the loop will help.
Intel CPUs (since Sandy Bridge) will run a two-clock loop.
With three clocks to play with, it should be easy (even for a compiler)
to generate a loop with no extra clock stalls.
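
For illustration, a minimal sketch of the chained loop in question. It is
not the code from the patch: the names crc32c_soft(), crc32c_u8(),
crc32c_u64() and crc32c_chain() are made up here, the hardware path
assumes the GCC/Clang builtins __builtin_ia32_crc32qi and
__builtin_ia32_crc32di and a build with -msse4.2, and the bitwise
fallback assumes a little-endian host. The point is only that each step
consumes the previous crc, so unrolling does not shorten the chain.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define CRC32C_POLY 0x82F63B78u	/* reflected CRC32C (Castagnoli) polynomial */

/* Portable fallback: fold the low 'bits' bits of v into crc, bit by bit. */
static uint32_t crc32c_soft(uint32_t crc, uint64_t v, int bits)
{
	uint64_t x = crc ^ v;

	while (bits--)
		x = (x >> 1) ^ ((x & 1) ? CRC32C_POLY : 0);
	return (uint32_t)x;
}

/* One byte step; uses the crc32b instruction when built with -msse4.2. */
static inline uint32_t crc32c_u8(uint32_t crc, uint8_t v)
{
#if defined(__x86_64__) && defined(__SSE4_2__)
	return __builtin_ia32_crc32qi(crc, v);
#else
	return crc32c_soft(crc, v, 8);
#endif
}

/* One 8-byte step; uses the crc32q instruction when built with -msse4.2. */
static inline uint32_t crc32c_u64(uint32_t crc, uint64_t v)
{
#if defined(__x86_64__) && defined(__SSE4_2__)
	return (uint32_t)__builtin_ia32_crc32di(crc, v);
#else
	return crc32c_soft(crc, v, 64);
#endif
}

/*
 * Single dependency chain: every step needs the crc produced by the
 * previous one, so the loop cannot retire more than one 8-byte step
 * per crc32 latency, however far it is unrolled.
 */
uint32_t crc32c_chain(uint32_t crc, const unsigned char *p, size_t len)
{
	uint64_t v;

	for (; len >= 8; p += 8, len -= 8) {
		memcpy(&v, p, 8);		/* unaligned-safe 8-byte load */
		crc = crc32c_u64(crc, v);	/* depends on previous crc */
	}
	while (len--)				/* byte-at-a-time tail */
		crc = crc32c_u8(crc, *p++);
	return crc;
}
```

With the usual CRC32C convention (initial value and final XOR of all
ones), calling crc32c_chain(~0u, "123456789", 9) and inverting the
result should give the standard check value 0xE3069283.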

Clearly, if Clang decides to copy arguments to the stack an extra time,
that will kill things. But in this case you want the "m" constraint
so that the instruction reads directly from the buffer (with a
(reg,reg,8) addressing mode).
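
A rough sketch of what the "m" constraint looks like in an indexed loop.
This is an illustration, not the kernel's code: crc32c_words() is a
hypothetical name, the asm path is only compiled on x86-64 with
-msse4.2, and the other branch is a bitwise fallback so the sketch still
builds elsewhere. Whether the compiler actually picks a (reg,reg,8)
operand or strength-reduces the index to a pointer increment is up to
the compiler.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: CRC32C over nwords 64-bit words, no init/final XOR. */
uint32_t crc32c_words(uint32_t crc, const uint64_t *buf, size_t nwords)
{
	size_t i;
#if defined(__x86_64__) && defined(__SSE4_2__)
	uint64_t c = crc;

	for (i = 0; i < nwords; i++) {
		/*
		 * "m" forces a memory operand, so the load can be folded
		 * into crc32q itself, e.g. crc32q (%rsi,%rax,8), %rdx.
		 * With "rm" the compiler may choose either form.
		 */
		__asm__("crc32q %1, %0" : "+r" (c) : "m" (buf[i]));
	}
	return (uint32_t)c;
#else
	/* Bitwise fallback (same result on little-endian hosts). */
	for (i = 0; i < nwords; i++) {
		uint64_t x = crc ^ buf[i];
		int k;

		for (k = 0; k < 64; k++)
			x = (x >> 1) ^ ((x & 1) ? 0x82F63B78u : 0);
		crc = (uint32_t)x;
	}
	return crc;
#endif
}
```

Comparing the generated assembly for "m" against "rm" (for example with
objdump -d) is the easy way to check whether the extra stack copy
appears.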

	David