On Mon, Mar 3, 2025 at 12:15 PM David Laight <david.laight.linux@xxxxxxxxx> wrote:
> On Thu, 27 Feb 2025 15:47:03 -0800
> Bill Wendling <morbo@xxxxxxxxxx> wrote:
>
> > For both gcc and clang, crc32 builtins generate better code than the
> > inline asm. GCC improves, removing unneeded "mov" instructions. Clang
> > does the same and unrolls the loops. GCC has no changes on i386, but
> > Clang's code generation is vastly improved, due to Clang's "rm"
> > constraint issue.
> >
> > The number of cycles improved by ~0.1% for GCC and ~1% for Clang, which
> > is expected because of the "rm" issue. However, Clang's performance is
> > better than GCC's by ~1.5%, most likely due to loop unrolling.
>
> How much does it unroll?
> How much you need depends on the latency of the crc32 instruction.
> The copy of Agner's tables I have gives it a latency of 3 on
> pretty much everything.
> If you can only do one chained crc instruction every three clocks
> it is hard to see how unrolling the loop will help.
> Intel cpu (since sandy bridge) will run a two clock loop.
> With three clocks to play with it should be easy (even for a compiler)
> to generate a loop with no extra clock stalls.
>
> Clearly if Clang decides to copy arguments to the stack an extra time
> that will kill things. But in this case you want the "m" constraint
> to directly read from the buffer (with a (reg,reg,8) addressing mode).
>
Below is what Clang generates with the builtins. From what Eric said,
this code is only run for sizes <= 512 bytes? So maybe it's not super
important to micro-optimize this. I apologize, but my ability to
measure clock loops for x86 code isn't great. (I'm sure I lack the
requisite benchmarks, etc.)

-bw

.LBB1_9:                                # =>This Inner Loop Header: Depth=1
        movl    %ebx, %ebx
        crc32q  (%rcx), %rbx
        addq    $8, %rcx
        incq    %rdi
        cmpq    %rdi, %rsi
        jne     .LBB1_9
# %bb.10:
        subq    %rdi, %rax
        jmp     .LBB1_11
.LBB1_7:
        movq    %r14, %rcx
.LBB1_11:
        movq    %r15, %rsi
        andq    $-8, %rsi
        cmpq    $7, %rdx
        jb      .LBB1_14
# %bb.12:
        xorl    %edx, %edx
.LBB1_13:                               # =>This Inner Loop Header: Depth=1
        movl    %ebx, %ebx
        crc32q  (%rcx,%rdx,8), %rbx
        crc32q  8(%rcx,%rdx,8), %rbx
        crc32q  16(%rcx,%rdx,8), %rbx
        crc32q  24(%rcx,%rdx,8), %rbx
        crc32q  32(%rcx,%rdx,8), %rbx
        crc32q  40(%rcx,%rdx,8), %rbx
        crc32q  48(%rcx,%rdx,8), %rbx
        crc32q  56(%rcx,%rdx,8), %rbx
        addq    $8, %rdx
        cmpq    %rdx, %rax
        jne     .LBB1_13
.LBB1_14:
        addq    %rsi, %r14
.LBB1_15:
        andq    $7, %r15
        je      .LBB1_23
# %bb.16:
        crc32b  (%r14), %ebx
        cmpl    $1, %r15d
        je      .LBB1_23
# %bb.17:
        crc32b  1(%r14), %ebx
        cmpl    $2, %r15d
        je      .LBB1_23
# %bb.18:
        crc32b  2(%r14), %ebx
        cmpl    $3, %r15d
        je      .LBB1_23
# %bb.19:
        crc32b  3(%r14), %ebx
        cmpl    $4, %r15d
        je      .LBB1_23
# %bb.20:
        crc32b  4(%r14), %ebx
        cmpl    $5, %r15d
        je      .LBB1_23
# %bb.21:
        crc32b  5(%r14), %ebx
        cmpl    $6, %r15d
        je      .LBB1_23
# %bb.22:
        crc32b  6(%r14), %ebx
.LBB1_23:
        movl    %ebx, %eax
.LBB1_24:
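
For anyone who wants to reproduce the codegen above outside the kernel,
here is a minimal userspace sketch of the builtin/intrinsic form being
discussed. The function name, tail handling, and use of <nmmintrin.h>
are mine for illustration only; this is not the actual patch. Build
with gcc or clang at -O2 -msse4.2 and both should emit a crc32q loop
like the one shown:

#include <nmmintrin.h>	/* _mm_crc32_u64 / _mm_crc32_u8 */
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Illustrative only -- not the kernel helper. */
static uint32_t crc32c_intrin(uint32_t crc, const void *buf, size_t len)
{
	const uint8_t *p = buf;
	uint64_t c = crc;
	uint64_t v;

	/* 8 bytes per crc32q; the compiler is free to unroll this loop. */
	while (len >= 8) {
		memcpy(&v, p, sizeof(v));	/* unaligned load */
		c = _mm_crc32_u64(c, v);
		p += 8;
		len -= 8;
	}

	/* Remaining tail bytes, one crc32b each. */
	while (len--)
		c = _mm_crc32_u8((uint32_t)c, *p++);

	return (uint32_t)c;
}

The inline-asm form this gets compared against is roughly
asm("crc32q %1, %0" : "+r" (c) : "rm" (v)), where Clang's handling of
the "rm" constraint (it effectively picks "m" and spills) is what
causes the extra stack traffic mentioned above.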