Re: [PATCH v2] x86/crc32: use builtins to improve code generation

"H. Peter Anvin" <hpa@xxxxxxxxx> · Mon, 03 Mar 2025 15:57:57 -0800

On March 3, 2025 2:42:16 PM PST, David Laight <david.laight.linux@xxxxxxxxx> wrote:
>On Mon, 3 Mar 2025 12:27:21 -0800
>Bill Wendling <morbo@xxxxxxxxxx> wrote:
>
>> On Mon, Mar 3, 2025 at 12:15 PM David Laight
>> <david.laight.linux@xxxxxxxxx> wrote:
>> > On Thu, 27 Feb 2025 15:47:03 -0800
>> > Bill Wendling <morbo@xxxxxxxxxx> wrote:
>> >  
>> > > For both gcc and clang, crc32 builtins generate better code than the
>> > > inline asm. GCC improves, removing unneeded "mov" instructions. Clang
>> > > does the same and unrolls the loops. GCC has no changes on i386, but
>> > > Clang's code generation is vastly improved, due to Clang's "rm"
>> > > constraint issue.
>> > >
>> > > The number of cycles improved by ~0.1% for GCC and ~1% for Clang, which
>> > > is expected because of the "rm" issue. However, Clang's performance is
>> > > better than GCC's by ~1.5%, most likely due to loop unrolling.  
>> >
>> > How much does it unroll?
>> > How much you need depends on the latency of the crc32 instruction.
>> > The copy of Agner's tables I have gives it a latency of 3 on
>> > pretty much everything.
>> > If you can only do one chained crc instruction every three clocks
>> > it is hard to see how unrolling the loop will help.
>> > Intel cpu (since sandy bridge) will run a two clock loop.
>> > With three clocks to play with it should be easy (even for a compiler)
>> > to generate a loop with no extra clock stalls.
>> >
>> > Clearly if Clang decides to copy arguments to the stack an extra time
>> > that will kill things. But in this case you want the "m" constraint
>> > to directly read from the buffer (with a (reg,reg,8) addressing mode).
>> >  
>> Below is what Clang generates with the builtins. From what Eric said,
>> this code is only run for sizes <= 512 bytes? So maybe it's not super
>> important to micro-optimize this. I apologize, but my ability to
>> measure clock loops for x86 code isn't great. (I'm sure I lack the
>> requisite benchmarks, etc.)
>
>Jeepers - that is trashing the I-cache.
>Not to mention all the conditional branches at the bottom.
>Consider the basic loop:
>1:	crc32q	(%rcx), %rbx
>	addq	$8, %rcx
>	cmp	%rcx, %rdx
>	jne	1b
>The crc32 has latency 3 so it must take at least 3 clocks.
>Even naively the addq can be issued in the same clock as the crc32
>and the cmp and jne in the following ones.
>Since the jne is predicted taken, the addq can be assumed to execute
>in the same clock as the jne.
>(The cmp+jne might also get merged into a single u-op)
>(I've done this with adc (for IP checksum), with two adc the loop takes
>two clocks even with the extra memory reads.)
>
>So that loop is likely to run limited by the three clock latency of crc32.
>Even the memory reads will happen with all the crc32 just waiting for the
>previous crc32 to finish.
>You can take an instruction out of the loop:
>1:	crc32q	(%rcx,%rdx), %rbx
>	addq	$8, %rdx
>	jne	1b
>but that may not be necessary, and (IIRC) gcc doesn't like letting you
>generate it.
>
>For buffers that aren't multiples of 8 bytes 'remember' that the crc of
>a byte depends on how far it is from the end of the buffer, and that initial
>zero bytes have no effect.
>So (provided the buffer is 8+ bytes long) read the first 8 bytes, shift
>right by the number of bytes needed to make the rest of the buffer a multiple
>or 8 bytes (the same as reading from across the start of the buffer and masking
>the low bytes) then treat exactly the same as a buffer that is a multiple
>of 8 bytes long.
>Don't worry about misaligned reads, you lose less than one clock per cache
>line (that is with adc doing a read every clock).
>
>Actually measuring the performance is hard.
>You can use rdtsc because the clock speed will change when the cpu gets busy.
>There is a 'performance counter' that is actual clocks.
>While you can use the library functions to set it up, you need to just read the
>register - the library overhead it too big.
>You also need the odd lfence.
>Having done that, and provided the buffer is in the L1 d-cache you can measure
>the loop time in clocks and compare against the expected value.
>Once you've got 3 clocks per crc32 instruction it won't get any better,
>which is why the 'fast' code for big buffers does crc of 3+ buffers sections
>in parallel.
>
>	David
>
>> 
>> -bw
>> 
>> .LBB1_9:                                # =>This Inner Loop Header: Depth=1
>>         movl    %ebx, %ebx
>>         crc32q  (%rcx), %rbx
>>         addq    $8, %rcx
>>         incq    %rdi
>>         cmpq    %rdi, %rsi
>>         jne     .LBB1_9
>> # %bb.10:
>>         subq    %rdi, %rax
>>         jmp     .LBB1_11
>> .LBB1_7:
>>         movq    %r14, %rcx
>> .LBB1_11:
>>         movq    %r15, %rsi
>>         andq    $-8, %rsi
>>         cmpq    $7, %rdx
>>         jb      .LBB1_14
>> # %bb.12:
>>         xorl    %edx, %edx
>> .LBB1_13:                               # =>This Inner Loop Header: Depth=1
>>         movl    %ebx, %ebx
>>         crc32q  (%rcx,%rdx,8), %rbx
>>         crc32q  8(%rcx,%rdx,8), %rbx
>>         crc32q  16(%rcx,%rdx,8), %rbx
>>         crc32q  24(%rcx,%rdx,8), %rbx
>>         crc32q  32(%rcx,%rdx,8), %rbx
>>         crc32q  40(%rcx,%rdx,8), %rbx
>>         crc32q  48(%rcx,%rdx,8), %rbx
>>         crc32q  56(%rcx,%rdx,8), %rbx
>>         addq    $8, %rdx
>>         cmpq    %rdx, %rax
>>         jne     .LBB1_13
>> .LBB1_14:
>>         addq    %rsi, %r14
>> .LBB1_15:
>>         andq    $7, %r15
>>         je      .LBB1_23
>> # %bb.16:
>>         crc32b  (%r14), %ebx
>>         cmpl    $1, %r15d
>>         je      .LBB1_23
>> # %bb.17:
>>         crc32b  1(%r14), %ebx
>>         cmpl    $2, %r15d
>>         je      .LBB1_23
>> # %bb.18:
>>         crc32b  2(%r14), %ebx
>>         cmpl    $3, %r15d
>>         je      .LBB1_23
>> # %bb.19:
>>         crc32b  3(%r14), %ebx
>>         cmpl    $4, %r15d
>>         je      .LBB1_23
>> # %bb.20:
>>         crc32b  4(%r14), %ebx
>>         cmpl    $5, %r15d
>>         je      .LBB1_23
>> # %bb.21:
>>         crc32b  5(%r14), %ebx
>>         cmpl    $6, %r15d
>>         je      .LBB1_23
>> # %bb.22:
>>         crc32b  6(%r14), %ebx
>> .LBB1_23:
>>         movl    %ebx, %eax
>> .LBB1_24:
>
>

The tail is *weird*. Wouldn't it be better to do a 4-2-1 stepdown?