Re: [PATCH v2] x86/crc32: use builtins to improve code generation

"H. Peter Anvin" <hpa@xxxxxxxxx> · Mon, 03 Mar 2025 16:43:30 -0800



On March 3, 2025 4:16:43 PM PST, Bill Wendling <morbo@xxxxxxxxxx> wrote:
>On Mon, Mar 3, 2025 at 3:58 PM H. Peter Anvin <hpa@xxxxxxxxx> wrote:
>> On March 3, 2025 2:42:16 PM PST, David Laight <david.laight.linux@xxxxxxxxx> wrote:
>> >On Mon, 3 Mar 2025 12:27:21 -0800
>> >Bill Wendling <morbo@xxxxxxxxxx> wrote:
>> >
>> >> On Mon, Mar 3, 2025 at 12:15 PM David Laight
>> >> <david.laight.linux@xxxxxxxxx> wrote:
>> >> > On Thu, 27 Feb 2025 15:47:03 -0800
>> >> > Bill Wendling <morbo@xxxxxxxxxx> wrote:
>> >> >
>> >> > > For both gcc and clang, crc32 builtins generate better code than the
>> >> > > inline asm. GCC improves, removing unneeded "mov" instructions. Clang
>> >> > > does the same and unrolls the loops. GCC has no changes on i386, but
>> >> > > Clang's code generation is vastly improved, due to Clang's "rm"
>> >> > > constraint issue.
>> >> > >
>> >> > > The number of cycles improved by ~0.1% for GCC and ~1% for Clang, which
>> >> > > is expected because of the "rm" issue. However, Clang's performance is
>> >> > > better than GCC's by ~1.5%, most likely due to loop unrolling.
>> >> >
>> >> > How much does it unroll?
>> >> > How much you need depends on the latency of the crc32 instruction.
>> >> > The copy of Agner's tables I have gives it a latency of 3 on
>> >> > pretty much everything.
>> >> > If you can only do one chained crc instruction every three clocks
>> >> > it is hard to see how unrolling the loop will help.
>> >> > Intel cpu (since sandy bridge) will run a two clock loop.
>> >> > With three clocks to play with it should be easy (even for a compiler)
>> >> > to generate a loop with no extra clock stalls.
>> >> >
>> >> > Clearly if Clang decides to copy arguments to the stack an extra time
>> >> > that will kill things. But in this case you want the "m" constraint
>> >> > to directly read from the buffer (with a (reg,reg,8) addressing mode).
>> >> >
>> >> Below is what Clang generates with the builtins. From what Eric said,
>> >> this code is only run for sizes <= 512 bytes? So maybe it's not super
>> >> important to micro-optimize this. I apologize, but my ability to
>> >> measure clock loops for x86 code isn't great. (I'm sure I lack the
>> >> requisite benchmarks, etc.)
>> >
>> >Jeepers - that is trashing the I-cache.
>> >Not to mention all the conditional branches at the bottom.
>> >Consider the basic loop:
>> >1:     crc32q  (%rcx), %rbx
>> >       addq    $8, %rcx
>> >       cmp     %rcx, %rdx
>> >       jne     1b
>> >The crc32 has latency 3 so it must take at least 3 clocks.
>> >Even naively the addq can be issued in the same clock as the crc32
>> >and the cmp and jne in the following ones.
>> >Since the jne is predicted taken, the addq can be assumed to execute
>> >in the same clock as the jne.
>> >(The cmp+jne might also get merged into a single u-op)
>> >(I've done this with adc (for IP checksum), with two adc the loop takes
>> >two clocks even with the extra memory reads.)
>> >
>> >So that loop is likely to run limited by the three clock latency of crc32.
>> >Even the memory reads will happen with all the crc32 just waiting for the
>> >previous crc32 to finish.
>> >You can take an instruction out of the loop:
>> >1:     crc32q  (%rcx,%rdx), %rbx
>> >       addq    $8, %rdx
>> >       jne     1b
>> >but that may not be necessary, and (IIRC) gcc doesn't like letting you
>> >generate it.
>> >
>> >For buffers that aren't multiples of 8 bytes 'remember' that the crc of
>> >a byte depends on how far it is from the end of the buffer, and that initial
>> >zero bytes have no effect.
>> >So (provided the buffer is 8+ bytes long) read the first 8 bytes, shift
>> >right by the number of bytes needed to make the rest of the buffer a multiple
>> >of 8 bytes (the same as reading from across the start of the buffer and masking
>> >the low bytes) then treat exactly the same as a buffer that is a multiple
>> >of 8 bytes long.
>> >Don't worry about misaligned reads, you lose less than one clock per cache
>> >line (that is with adc doing a read every clock).
>> >
>For reference, GCC does much better with code gen, but only with the builtin:
>
>.L39:
>        crc32q  (%rax), %rbx    # MEM[(long unsigned int *)p_40], tmp120
>        addq    $8, %rax        #, p
>        cmpq    %rcx, %rax      # _37, p
>        jne     .L39    #,
>        leaq    (%rsi,%rdi,8), %rsi     #, p
>.L38:
>        andl    $7, %edx        #, len
>        je      .L41    #,
>        addq    %rsi, %rdx      # p, _11
>        movl    %ebx, %eax      # crc, <retval>
>        .p2align 4
>.L40:
>        crc32b  (%rsi), %eax    # MEM[(const u8 *)p_45], <retval>
>        addq    $1, %rsi        #, p
>        cmpq    %rsi, %rdx      # p, _11
>        jne     .L40    #,
>
>> >Actually measuring the performance is hard.
>> >You can't use rdtsc because the clock speed will change when the cpu gets busy.
>> >There is a 'performance counter' that is actual clocks.
>> >While you can use the library functions to set it up, you need to just read the
>> >register - the library overhead is too big.
>> >You also need the odd lfence.
>> >Having done that, and provided the buffer is in the L1 d-cache you can measure
>> >the loop time in clocks and compare against the expected value.
>> >Once you've got 3 clocks per crc32 instruction it won't get any better,
>> >which is why the 'fast' code for big buffers does crc of 3+ buffer sections
>> >in parallel.
>> >
>Thanks for the info! It'll help a lot the next time I need to delve
>deeply into performance.
>
>I tried using rdtsc and another programmatic way of measuring timing.
>Also tried making the task have high priority, restricting to one CPU,
>etc. But the numbers weren't as consistent as I wanted them to be. The
>times I reported were based on the fastest times / clocks /
>whatever from several runs for each build.
>
>> >       David
>> >
>> >>
>> >> -bw
>> >>
>> >> .LBB1_9:                                # =>This Inner Loop Header: Depth=1
>> >>         movl    %ebx, %ebx
>> >>         crc32q  (%rcx), %rbx
>> >>         addq    $8, %rcx
>> >>         incq    %rdi
>> >>         cmpq    %rdi, %rsi
>> >>         jne     .LBB1_9
>> >> # %bb.10:
>> >>         subq    %rdi, %rax
>> >>         jmp     .LBB1_11
>> >> .LBB1_7:
>> >>         movq    %r14, %rcx
>> >> .LBB1_11:
>> >>         movq    %r15, %rsi
>> >>         andq    $-8, %rsi
>> >>         cmpq    $7, %rdx
>> >>         jb      .LBB1_14
>> >> # %bb.12:
>> >>         xorl    %edx, %edx
>> >> .LBB1_13:                               # =>This Inner Loop Header: Depth=1
>> >>         movl    %ebx, %ebx
>> >>         crc32q  (%rcx,%rdx,8), %rbx
>> >>         crc32q  8(%rcx,%rdx,8), %rbx
>> >>         crc32q  16(%rcx,%rdx,8), %rbx
>> >>         crc32q  24(%rcx,%rdx,8), %rbx
>> >>         crc32q  32(%rcx,%rdx,8), %rbx
>> >>         crc32q  40(%rcx,%rdx,8), %rbx
>> >>         crc32q  48(%rcx,%rdx,8), %rbx
>> >>         crc32q  56(%rcx,%rdx,8), %rbx
>> >>         addq    $8, %rdx
>> >>         cmpq    %rdx, %rax
>> >>         jne     .LBB1_13
>> >> .LBB1_14:
>> >>         addq    %rsi, %r14
>> >> .LBB1_15:
>> >>         andq    $7, %r15
>> >>         je      .LBB1_23
>> >> # %bb.16:
>> >>         crc32b  (%r14), %ebx
>> >>         cmpl    $1, %r15d
>> >>         je      .LBB1_23
>> >> # %bb.17:
>> >>         crc32b  1(%r14), %ebx
>> >>         cmpl    $2, %r15d
>> >>         je      .LBB1_23
>> >> # %bb.18:
>> >>         crc32b  2(%r14), %ebx
>> >>         cmpl    $3, %r15d
>> >>         je      .LBB1_23
>> >> # %bb.19:
>> >>         crc32b  3(%r14), %ebx
>> >>         cmpl    $4, %r15d
>> >>         je      .LBB1_23
>> >> # %bb.20:
>> >>         crc32b  4(%r14), %ebx
>> >>         cmpl    $5, %r15d
>> >>         je      .LBB1_23
>> >> # %bb.21:
>> >>         crc32b  5(%r14), %ebx
>> >>         cmpl    $6, %r15d
>> >>         je      .LBB1_23
>> >> # %bb.22:
>> >>         crc32b  6(%r14), %ebx
>> >> .LBB1_23:
>> >>         movl    %ebx, %eax
>> >> .LBB1_24:
>> >
>> >
>>
>> The tail is *weird*. Wouldn't it be better to do a 4-2-1 stepdown?
>
>Definitely on the weird side! I considered hard-coding something like
>that, but thought it might be a bit convoluted, though certainly less
>convoluted than what we generate now. A simple loop is probably all
>that's needed, because it should only need to be done at most seven
>times.
>
>-bw
>

A 4-2-1 stepdown probably makes more sense: 4 bytes, then 2 bytes, then 1 byte, depending on which bits of the remaining length are set.
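
Roughly, as a minimal sketch in C rather than anything from the patch: the names crc32c_u8(), crc32c_bytes() and crc32c_stepdown() are made up for illustration, and the bitwise CRC32C helper is only a portable stand-in for the crc32 instruction (same polynomial, no pre- or post-inversion) so the sketch builds anywhere. Real code would presumably use the crc32 builtin of the matching width at each step.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical stand-in for crc32b: one CRC32C (Castagnoli) byte update,
 * reflected polynomial 0x82F63B78, no inversion of the CRC value.
 */
static uint32_t crc32c_u8(uint32_t crc, uint8_t v)
{
	crc ^= v;
	for (int i = 0; i < 8; i++)
		crc = (crc >> 1) ^ (0x82F63B78u & (0u - (crc & 1u)));
	return crc;
}

/*
 * Hypothetical stand-in for the wider forms (crc32w/crc32l/crc32q): feeding
 * n bytes in memory order gives the same result as one little-endian load
 * of that width. On x86 each call below would be a single instruction.
 */
static uint32_t crc32c_bytes(uint32_t crc, const uint8_t *p, size_t n)
{
	while (n--)
		crc = crc32c_u8(crc, *p++);
	return crc;
}

/* 8 bytes at a time, then a 4-2-1 stepdown keyed on the low bits of len. */
uint32_t crc32c_stepdown(uint32_t crc, const uint8_t *p, size_t len)
{
	assert(p != NULL || len == 0);

	for (; len >= 8; p += 8, len -= 8)
		crc = crc32c_bytes(crc, p, 8);	/* crc32q */

	if (len & 4) {
		crc = crc32c_bytes(crc, p, 4);	/* crc32l */
		p += 4;
	}
	if (len & 2) {
		crc = crc32c_bytes(crc, p, 2);	/* crc32w */
		p += 2;
	}
	if (len & 1)
		crc = crc32c_bytes(crc, p, 1);	/* crc32b */

	return crc;
}
```

That is at most three tests and three crc32 instructions for the tail, instead of up to seven byte-wide steps with a compare and branch after each.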