On March 3, 2025 4:16:43 PM PST, Bill Wendling <morbo@xxxxxxxxxx> wrote:
>On Mon, Mar 3, 2025 at 3:58 PM H. Peter Anvin <hpa@xxxxxxxxx> wrote:
>> On March 3, 2025 2:42:16 PM PST, David Laight <david.laight.linux@xxxxxxxxx> wrote:
>> >On Mon, 3 Mar 2025 12:27:21 -0800
>> >Bill Wendling <morbo@xxxxxxxxxx> wrote:
>> >
>> >> On Mon, Mar 3, 2025 at 12:15 PM David Laight
>> >> <david.laight.linux@xxxxxxxxx> wrote:
>> >> > On Thu, 27 Feb 2025 15:47:03 -0800
>> >> > Bill Wendling <morbo@xxxxxxxxxx> wrote:
>> >> >
>> >> > > For both gcc and clang, crc32 builtins generate better code than the
>> >> > > inline asm. GCC improves, removing unneeded "mov" instructions. Clang
>> >> > > does the same and unrolls the loops. GCC has no changes on i386, but
>> >> > > Clang's code generation is vastly improved, due to Clang's "rm"
>> >> > > constraint issue.
>> >> > >
>> >> > > The number of cycles improved by ~0.1% for GCC and ~1% for Clang, which
>> >> > > is expected because of the "rm" issue. However, Clang's performance is
>> >> > > better than GCC's by ~1.5%, most likely due to loop unrolling.
>> >> >
>> >> > How much does it unroll?
>> >> > How much you need depends on the latency of the crc32 instruction.
>> >> > The copy of Agner's tables I have gives it a latency of 3 on
>> >> > pretty much everything.
>> >> > If you can only do one chained crc instruction every three clocks
>> >> > it is hard to see how unrolling the loop will help.
>> >> > Intel cpu (since sandy bridge) will run a two clock loop.
>> >> > With three clocks to play with it should be easy (even for a compiler)
>> >> > to generate a loop with no extra clock stalls.
>> >> >
>> >> > Clearly if Clang decides to copy arguments to the stack an extra time
>> >> > that will kill things. But in this case you want the "m" constraint
>> >> > to directly read from the buffer (with a (reg,reg,8) addressing mode).
>> >> >
>> >> Below is what Clang generates with the builtins.
>> >> From what Eric said,
>> >> this code is only run for sizes <= 512 bytes? So maybe it's not super
>> >> important to micro-optimize this. I apologize, but my ability to
>> >> measure clock loops for x86 code isn't great. (I'm sure I lack the
>> >> requisite benchmarks, etc.)
>> >
>> >Jeepers - that is trashing the I-cache.
>> >Not to mention all the conditional branches at the bottom.
>> >Consider the basic loop:
>> >1:      crc32q  (%rcx), %rbx
>> >        addq    $8, %rcx
>> >        cmp     %rcx, %rdx
>> >        jne     1b
>> >The crc32 has latency 3 so it must take at least 3 clocks.
>> >Even naively the addq can be issued in the same clock as the crc32
>> >and the cmp and jne in the following ones.
>> >Since the jne is predicted taken, the addq can be assumed to execute
>> >in the same clock as the jne.
>> >(The cmp+jne might also get merged into a single u-op)
>> >(I've done this with adc (for IP checksum), with two adc the loop takes
>> >two clocks even with the extra memory reads.)
>> >
>> >So that loop is likely to run limited by the three clock latency of crc32.
>> >Even the memory reads will happen with all the crc32 just waiting for the
>> >previous crc32 to finish.
>> >You can take an instruction out of the loop:
>> >1:      crc32q  (%rcx,%rdx), %rbx
>> >        addq    $8, %rdx
>> >        jne     1b
>> >but that may not be necessary, and (IIRC) gcc doesn't like letting you
>> >generate it.
>> >
>> >For buffers that aren't multiples of 8 bytes 'remember' that the crc of
>> >a byte depends on how far it is from the end of the buffer, and that initial
>> >zero bytes have no effect.
>> >So (provided the buffer is 8+ bytes long) read the first 8 bytes, shift
>> >right by the number of bytes needed to make the rest of the buffer a multiple
>> >of 8 bytes (the same as reading from across the start of the buffer and masking
>> >the low bytes) then treat exactly the same as a buffer that is a multiple
>> >of 8 bytes long.
>> >Don't worry about misaligned reads, you lose less than one clock per cache
>> >line (that is with adc doing a read every clock).
>> >
>
>For reference, GCC does much better with code gen, but only with the builtin:
>
>.L39:
>        crc32q  (%rax), %rbx    # MEM[(long unsigned int *)p_40], tmp120
>        addq    $8, %rax        #, p
>        cmpq    %rcx, %rax      # _37, p
>        jne     .L39    #,
>        leaq    (%rsi,%rdi,8), %rsi     #, p
>.L38:
>        andl    $7, %edx        #, len
>        je      .L41    #,
>        addq    %rsi, %rdx      # p, _11
>        movl    %ebx, %eax      # crc, <retval>
>        .p2align 4
>.L40:
>        crc32b  (%rsi), %eax    # MEM[(const u8 *)p_45], <retval>
>        addq    $1, %rsi        #, p
>        cmpq    %rsi, %rdx      # p, _11
>        jne     .L40    #,
>
>> >Actually measuring the performance is hard.
>> >You can't use rdtsc because the clock speed will change when the cpu gets busy.
>> >There is a 'performance counter' that is actual clocks.
>> >While you can use the library functions to set it up, you need to just read the
>> >register - the library overhead is too big.
>> >You also need the odd lfence.
>> >Having done that, and provided the buffer is in the L1 d-cache you can measure
>> >the loop time in clocks and compare against the expected value.
>> >Once you've got 3 clocks per crc32 instruction it won't get any better,
>> >which is why the 'fast' code for big buffers does crc of 3+ buffer sections
>> >in parallel.
>> >
>
>Thanks for the info! It'll help a lot the next time I need to delve
>deeply into performance.
>
>I tried using rdtsc and another programmatic way of measuring timing.
>Also tried making the task have high priority, restricting to one CPU,
>etc. But the numbers weren't as consistent as I wanted them to be. The
>times I reported were based on the fastest times / clocks /
>whatever from several runs for each build.
>
>> >        David
>> >
>> >>
>> >> -bw
>> >>
>> >> .LBB1_9:                                # =>This Inner Loop Header: Depth=1
>> >>         movl    %ebx, %ebx
>> >>         crc32q  (%rcx), %rbx
>> >>         addq    $8, %rcx
>> >>         incq    %rdi
>> >>         cmpq    %rdi, %rsi
>> >>         jne     .LBB1_9
>> >> # %bb.10:
>> >>         subq    %rdi, %rax
>> >>         jmp     .LBB1_11
>> >> .LBB1_7:
>> >>         movq    %r14, %rcx
>> >> .LBB1_11:
>> >>         movq    %r15, %rsi
>> >>         andq    $-8, %rsi
>> >>         cmpq    $7, %rdx
>> >>         jb      .LBB1_14
>> >> # %bb.12:
>> >>         xorl    %edx, %edx
>> >> .LBB1_13:                               # =>This Inner Loop Header: Depth=1
>> >>         movl    %ebx, %ebx
>> >>         crc32q  (%rcx,%rdx,8), %rbx
>> >>         crc32q  8(%rcx,%rdx,8), %rbx
>> >>         crc32q  16(%rcx,%rdx,8), %rbx
>> >>         crc32q  24(%rcx,%rdx,8), %rbx
>> >>         crc32q  32(%rcx,%rdx,8), %rbx
>> >>         crc32q  40(%rcx,%rdx,8), %rbx
>> >>         crc32q  48(%rcx,%rdx,8), %rbx
>> >>         crc32q  56(%rcx,%rdx,8), %rbx
>> >>         addq    $8, %rdx
>> >>         cmpq    %rdx, %rax
>> >>         jne     .LBB1_13
>> >> .LBB1_14:
>> >>         addq    %rsi, %r14
>> >> .LBB1_15:
>> >>         andq    $7, %r15
>> >>         je      .LBB1_23
>> >> # %bb.16:
>> >>         crc32b  (%r14), %ebx
>> >>         cmpl    $1, %r15d
>> >>         je      .LBB1_23
>> >> # %bb.17:
>> >>         crc32b  1(%r14), %ebx
>> >>         cmpl    $2, %r15d
>> >>         je      .LBB1_23
>> >> # %bb.18:
>> >>         crc32b  2(%r14), %ebx
>> >>         cmpl    $3, %r15d
>> >>         je      .LBB1_23
>> >> # %bb.19:
>> >>         crc32b  3(%r14), %ebx
>> >>         cmpl    $4, %r15d
>> >>         je      .LBB1_23
>> >> # %bb.20:
>> >>         crc32b  4(%r14), %ebx
>> >>         cmpl    $5, %r15d
>> >>         je      .LBB1_23
>> >> # %bb.21:
>> >>         crc32b  5(%r14), %ebx
>> >>         cmpl    $6, %r15d
>> >>         je      .LBB1_23
>> >> # %bb.22:
>> >>         crc32b  6(%r14), %ebx
>> >> .LBB1_23:
>> >>         movl    %ebx, %eax
>> >> .LBB1_24:
>> >
>> >
>>
>> The tail is *weird*. Wouldn't it be better to do a 4-2-1 stepdown?
>
>Definitely on the weird side! I considered hard-coding something like
>that, but thought it might be a bit convoluted, though certainly less
>convoluted than what we generate now. A simple loop is probably all
>that's needed, because it should only need to be done at most seven
>times.
>
>-bw
>

4-2-1 probably makes more sense (4 bytes, then 2 bytes, then 1 byte, depending on which bits of the remaining length are set).
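
A minimal sketch of the 4-2-1 stepdown in C, next to the simple byte loop mentioned above. This is an illustration only, not the kernel's code: the function names are hypothetical, and crc32c_step() is a bitwise software stand-in for the crc32b/crc32w/crc32l instructions so that the sketch compiles on any machine.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical software stand-in for the crc32b/crc32w/crc32l instructions
 * (CRC32C, reflected polynomial 0x82F63B78): folds the low `nbytes` bytes
 * of `data` into `crc`, least significant byte first.
 */
static uint32_t crc32c_step(uint32_t crc, uint32_t data, unsigned nbytes)
{
	for (unsigned i = 0; i < nbytes; i++) {
		crc ^= (data >> (8 * i)) & 0xff;
		for (int bit = 0; bit < 8; bit++)
			crc = (crc >> 1) ^ (0x82F63B78u & -(crc & 1u));
	}
	return crc;
}

/* Simple tail: a byte loop that runs at most seven times. */
static uint32_t crc32c_tail_bytes(uint32_t crc, const uint8_t *p, size_t len)
{
	len &= 7;
	while (len--)
		crc = crc32c_step(crc, *p++, 1);
	return crc;
}

/* 4-2-1 stepdown: one step per set bit in the low three bits of len. */
static uint32_t crc32c_tail_421(uint32_t crc, const uint8_t *p, size_t len)
{
	if (len & 4) {
		/* Assemble the bytes in memory order (little-endian view). */
		uint32_t v = (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
			     ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
		crc = crc32c_step(crc, v, 4);
		p += 4;
	}
	if (len & 2) {
		crc = crc32c_step(crc, (uint32_t)p[0] | ((uint32_t)p[1] << 8), 2);
		p += 2;
	}
	if (len & 1)
		crc = crc32c_step(crc, p[0], 1);
	return crc;
}
```

Both tails give the same result for every remainder from 0 to 7. The stepdown needs at most three crc steps and three branches, where the byte loop needs up to seven of each. On x86 each step would presumably become a single crc32l, crc32w or crc32b (or the matching compiler builtin) instead of the software helper.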