From: Eric Biggers
> Sent: 14 October 2024 05:25
>
> crc32c-pcl-intel-asm_64.S has a loop with 1 to 127 iterations fully
> unrolled and uses a jump table to jump into the correct location. This
> optimization is misguided, as it bloats the binary code size and
> introduces an indirect call. x86_64 CPUs can predict loops well, so it
> is fine to just use a loop instead. Loop bookkeeping instructions can
> compete with the crc instructions for the ALUs, but this is easily
> mitigated by unrolling the loop by a smaller amount, such as 4 times.

Do you need to unroll it at all?

...
> +	# Unroll the loop by a factor of 4 to reduce the overhead of the loop
> +	# bookkeeping instructions, which can compete with crc32q for the ALUs.
> +.Lcrc_3lanes_4x_loop:
> +	crc32q	(bufp), crc_init_q
> +	crc32q	(bufp,chunk_bytes_q), crc1
> +	crc32q	(bufp,chunk_bytes_q,2), crc2
> +	crc32q	8(bufp), crc_init_q
> +	crc32q	8(bufp,chunk_bytes_q), crc1
> +	crc32q	8(bufp,chunk_bytes_q,2), crc2
> +	crc32q	16(bufp), crc_init_q
> +	crc32q	16(bufp,chunk_bytes_q), crc1
> +	crc32q	16(bufp,chunk_bytes_q,2), crc2
> +	crc32q	24(bufp), crc_init_q
> +	crc32q	24(bufp,chunk_bytes_q), crc1
> +	crc32q	24(bufp,chunk_bytes_q,2), crc2
> +	add	$32, bufp
> +	sub	$4, %eax
> +	jge	.Lcrc_3lanes_4x_loop

If you are really lucky you'll get two memory reads/clock.
So you won't ever do more than two crc32/clock.
Looking at Agner's instruction latency tables I don't think
any cpu can do more than one per clock, or pipeline them.
I think that means you don't even need two (never mind 3)
buffers.

Most modern x86 can do 4 or 5 (or even more) ALU operations
per clock - depending on the combination of instructions.

Replace the loop termination with a comparison of 'bufp'
against a pre-calculated limit and you get two instructions
(that might get merged into one u-op) for the loop overhead.
They'll run in parallel with the crc32q instructions.

I've never managed to get a 1-clock loop, but two is easy.
You might find that just:
10:
	crc32q	(bufp), crc
	crc32q	8(bufp), crc
	add	$16, bufp
	cmp	bufp, buf_lim
	jne	10b
will run at 8 bytes/clock on a modern Intel cpu.

You can write that in C with a simple asm function for the crc32
instruction itself.

You might need the loop that is more complex to set up:
	offset = -length;
	bufend = buf + length;
10:
	crc32q	(offset, bufend), crc
	crc32q	8(offset, bufend), crc
	add	$16, offset
	jne	10b
which uses negative offsets from the end of the buffer.

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)
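
For illustration, the two loop shapes suggested above could be sketched in C roughly as follows. This is only a sketch under stated assumptions, not code from the patch: the names crc32c_u64, load64, crc32c_ptr_limit, crc32c_neg_offset and crc32c_bitwise are made up for this example, the length is assumed to be a multiple of 16 (no tail handling), a little-endian machine is assumed, and the non-x86 fallback and the bitwise reference exist only so the sketch can be built and checked anywhere.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* One CRC32C step over 8 bytes. On x86-64 this is a single crc32q
 * (needs an SSE4.2 capable CPU at run time); elsewhere a bitwise
 * fallback keeps the sketch buildable. */
static inline uint64_t crc32c_u64(uint64_t crc, uint64_t data)
{
#if defined(__x86_64__) && defined(__GNUC__)
	__asm__("crc32q %1, %0" : "+r" (crc) : "rm" (data));
	return crc;
#else
	crc ^= data;
	for (int i = 0; i < 64; i++)
		crc = (crc >> 1) ^ (0x82F63B78u & -(crc & 1));
	return crc;
#endif
}

/* Unaligned 8-byte load; assumes a little-endian machine. */
static inline uint64_t load64(const unsigned char *p)
{
	uint64_t v;

	memcpy(&v, p, sizeof(v));
	return v;
}

/* Variant 1: terminate by comparing the pointer against a limit. */
static uint32_t crc32c_ptr_limit(uint32_t crc, const unsigned char *buf,
				 size_t len)
{
	const unsigned char *buf_lim = buf + len;
	uint64_t c = crc;

	assert(len % 16 == 0);	/* sketch only: no tail handling */
	while (buf != buf_lim) {
		c = crc32c_u64(c, load64(buf));
		c = crc32c_u64(c, load64(buf + 8));
		buf += 16;
	}
	return (uint32_t)c;
}

/* Variant 2: negative offset from the buffer end, counting up to zero. */
static uint32_t crc32c_neg_offset(uint32_t crc, const unsigned char *buf,
				  size_t len)
{
	const unsigned char *bufend = buf + len;
	ptrdiff_t offset = -(ptrdiff_t)len;
	uint64_t c = crc;

	assert(len % 16 == 0);	/* sketch only: no tail handling */
	while (offset != 0) {
		c = crc32c_u64(c, load64(bufend + offset));
		c = crc32c_u64(c, load64(bufend + offset + 8));
		offset += 16;
	}
	return (uint32_t)c;
}

/* Bit-at-a-time CRC32C (reflected polynomial 0x82F63B78), for checking. */
static uint32_t crc32c_bitwise(uint32_t crc, const unsigned char *buf,
			       size_t len)
{
	while (len--) {
		crc ^= *buf++;
		for (int i = 0; i < 8; i++)
			crc = (crc >> 1) ^ (0x82F63B78u & -(crc & 1));
	}
	return crc;
}
```

Whether the compiler really emits the two-instruction loop overhead (or the add/jne form for the second variant) depends on the compiler and options, so the generated code would need checking before drawing any conclusions about clocks per iteration.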