On Mon, Mar 3, 2025 at 12:15 PM David Laight <david.laight.linux@xxxxxxxxx> wrote:
> On Thu, 27 Feb 2025 15:47:03 -0800
> Bill Wendling <morbo@xxxxxxxxxx> wrote:
>
> > For both gcc and clang, crc32 builtins generate better code than the
> > inline asm. GCC improves, removing unneeded "mov" instructions. Clang
> > does the same and unrolls the loops. GCC has no changes on i386, but
> > Clang's code generation is vastly improved, due to Clang's "rm"
> > constraint issue.
> >
> > The number of cycles improved by ~0.1% for GCC and ~1% for Clang, which
> > is expected because of the "rm" issue. However, Clang's performance is
> > better than GCC's by ~1.5%, most likely due to loop unrolling.
>
> How much does it unroll?
> How much you need depends on the latency of the crc32 instruction.
> The copy of Agner's tables I have gives it a latency of 3 on
> pretty much everything.
> If you can only do one chained crc instruction every three clocks
> it is hard to see how unrolling the loop will help.
> Intel cpu (since sandy bridge) will run a two clock loop.
> With three clocks to play with it should be easy (even for a compiler)
> to generate a loop with no extra clock stalls.
>
> Clearly if Clang decides to copy arguments to the stack an extra time
> that will kill things. But in this case you want the "m" constraint
> to directly read from the buffer (with a (reg,reg,8) addressing mode).
>
Below is what Clang generates with the builtins. From what Eric said,
this code is only run for sizes <= 512 bytes? So maybe it's not super
important to micro-optimize this. I apologize, but my ability to
measure clock loops for x86 code isn't great. (I'm sure I lack the
requisite benchmarks, etc.)

-bw

.LBB1_9:                                # =>This Inner Loop Header: Depth=1
        movl    %ebx, %ebx
        crc32q  (%rcx), %rbx
        addq    $8, %rcx
        incq    %rdi
        cmpq    %rdi, %rsi
        jne     .LBB1_9
# %bb.10:
        subq    %rdi, %rax
        jmp     .LBB1_11
.LBB1_7:
        movq    %r14, %rcx
.LBB1_11:
        movq    %r15, %rsi
        andq    $-8, %rsi
        cmpq    $7, %rdx
        jb      .LBB1_14
# %bb.12:
        xorl    %edx, %edx
.LBB1_13:                               # =>This Inner Loop Header: Depth=1
        movl    %ebx, %ebx
        crc32q  (%rcx,%rdx,8), %rbx
        crc32q  8(%rcx,%rdx,8), %rbx
        crc32q  16(%rcx,%rdx,8), %rbx
        crc32q  24(%rcx,%rdx,8), %rbx
        crc32q  32(%rcx,%rdx,8), %rbx
        crc32q  40(%rcx,%rdx,8), %rbx
        crc32q  48(%rcx,%rdx,8), %rbx
        crc32q  56(%rcx,%rdx,8), %rbx
        addq    $8, %rdx
        cmpq    %rdx, %rax
        jne     .LBB1_13
.LBB1_14:
        addq    %rsi, %r14
.LBB1_15:
        andq    $7, %r15
        je      .LBB1_23
# %bb.16:
        crc32b  (%r14), %ebx
        cmpl    $1, %r15d
        je      .LBB1_23
# %bb.17:
        crc32b  1(%r14), %ebx
        cmpl    $2, %r15d
        je      .LBB1_23
# %bb.18:
        crc32b  2(%r14), %ebx
        cmpl    $3, %r15d
        je      .LBB1_23
# %bb.19:
        crc32b  3(%r14), %ebx
        cmpl    $4, %r15d
        je      .LBB1_23
# %bb.20:
        crc32b  4(%r14), %ebx
        cmpl    $5, %r15d
        je      .LBB1_23
# %bb.21:
        crc32b  5(%r14), %ebx
        cmpl    $6, %r15d
        je      .LBB1_23
# %bb.22:
        crc32b  6(%r14), %ebx
.LBB1_23:
        movl    %ebx, %eax
.LBB1_24:
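
For anyone who wants to reproduce the codegen above outside the kernel,
here is a minimal userspace sketch of the builtin/intrinsic form being
discussed. The function name, tail handling, and use of <nmmintrin.h>
are mine for illustration only; this is not the actual patch. Build
with gcc or clang at -O2 -msse4.2 and both should emit a crc32q loop
like the one shown:

#include <nmmintrin.h>	/* _mm_crc32_u64 / _mm_crc32_u8 */
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Illustrative only -- not the kernel helper. */
static uint32_t crc32c_intrin(uint32_t crc, const void *buf, size_t len)
{
	const uint8_t *p = buf;
	uint64_t c = crc;
	uint64_t v;

	/* 8 bytes per crc32q; the compiler is free to unroll this loop. */
	while (len >= 8) {
		memcpy(&v, p, sizeof(v));	/* unaligned load */
		c = _mm_crc32_u64(c, v);
		p += 8;
		len -= 8;
	}

	/* Remaining tail bytes, one crc32b each. */
	while (len--)
		c = _mm_crc32_u8((uint32_t)c, *p++);

	return (uint32_t)c;
}

The inline-asm form this gets compared against is roughly
asm("crc32q %1, %0" : "+r" (c) : "rm" (v)), where Clang's handling of
the "rm" constraint (it effectively picks "m" and spills) is what
causes the extra stack traffic mentioned above.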