On Mon, Oct 14, 2024 at 04:30:05PM +0000, David Laight wrote:
> From: Eric Biggers
> > Sent: 14 October 2024 05:25
> >
> > crc32c-pcl-intel-asm_64.S has a loop with 1 to 127 iterations fully
> > unrolled and uses a jump table to jump into the correct location.  This
> > optimization is misguided, as it bloats the binary code size and
> > introduces an indirect call.  x86_64 CPUs can predict loops well, so it
> > is fine to just use a loop instead.  Loop bookkeeping instructions can
> > compete with the crc instructions for the ALUs, but this is easily
> > mitigated by unrolling the loop by a smaller amount, such as 4 times.
>
> Do you need to unroll it at all?

It looks like on most CPUs, no.  On Haswell, Emerald Rapids, Zen 2 it does not
make a significant difference.  However, it helps on Zen 5.

> > +	# Unroll the loop by a factor of 4 to reduce the overhead of the loop
> > +	# bookkeeping instructions, which can compete with crc32q for the ALUs.
> > +.Lcrc_3lanes_4x_loop:
> > +	crc32q	(bufp), crc_init_q
> > +	crc32q	(bufp,chunk_bytes_q), crc1
> > +	crc32q	(bufp,chunk_bytes_q,2), crc2
> > +	crc32q	8(bufp), crc_init_q
> > +	crc32q	8(bufp,chunk_bytes_q), crc1
> > +	crc32q	8(bufp,chunk_bytes_q,2), crc2
> > +	crc32q	16(bufp), crc_init_q
> > +	crc32q	16(bufp,chunk_bytes_q), crc1
> > +	crc32q	16(bufp,chunk_bytes_q,2), crc2
> > +	crc32q	24(bufp), crc_init_q
> > +	crc32q	24(bufp,chunk_bytes_q), crc1
> > +	crc32q	24(bufp,chunk_bytes_q,2), crc2
> > +	add	$32, bufp
> > +	sub	$4, %eax
> > +	jge	.Lcrc_3lanes_4x_loop
>
> If you are really lucky you'll get two memory reads/clock.
> So you won't ever to do than two crc32/clock.
> Looking at Agner's instruction latency tables I don't think
> any cpu can do more that one per clock, or pipeline them.
> I think that means you don't even need two (never mind 3)
> buffers.

On most Intel and AMD CPUs (I tested Haswell for old Intel, Emerald Rapids for
new Intel, and Zen 2 for slightly-old AMD), crc32q has 3 cycle latency and 1 per
cycle throughput.  So you do need at least 3 streams.

AMD Zen 5 has much higher crc32q throughput and seems to want up to 7 streams.
This is not implemented yet.

> Most modern x86 can do 4 or 5 (or even more) ALU operations
> per clock - depending on the combination of instructions.
>
> Replace the loop termination with a comparison of 'bufp'
> against a pre-calculated limit and you get two instructions
> (that might get merged into one u-op) for the loop overhead.
> They'll run in parallel with the crc32q instructions.

That's actually still three instructions: add, cmp, and jne.

I tried it on both Intel and AMD, and it did not help.

> I've never managed to get a 1-clock loop, but two is easy.
> You might find that just:
> 10:
> 	crc32q	(bufp), crc
> 	crc32q	8(bufp), crc
> 	add	$16, bufp
> 	cmp	bufp, buf_lim
> 	jne	10b
> will run at 8 bytes/clock on modern intel cpu.

No, the latency of crc32q is still three cycles.  You need three streams.

> You can write that in C with a simple asm function for the crc32
> instruction itself.

Well, the single-stream CRC32C implementation already does that; see
arch/x86/crypto/crc32c-intel_glue.c.  Things are not as simple for this
multi-stream implementation, which uses pclmulqdq to combine the CRCs.

- Eric
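
For illustration only, the latency/throughput argument above can be put into a
rough back-of-the-envelope model.  This is a sketch, not a measurement and not
part of the patch: the function name crc32q_bytes_per_cycle() and its
parameters are hypothetical, and the model assumes a steady state in which
loads and loop overhead are never the bottleneck.  Each independent stream
(dependency chain) can start one crc32q every `latency` cycles, and the core
can start at most `per_cycle` of them per cycle.

```c
#include <assert.h>

/*
 * Rough steady-state estimate of bytes checksummed per cycle.
 * Assumptions (illustrative only): each crc32q consumes 8 bytes, a single
 * dependency chain issues one crc32q every `latency` cycles, and the core
 * starts at most `per_cycle` crc32q instructions per cycle.
 */
static double crc32q_bytes_per_cycle(int streams, int latency, int per_cycle)
{
	double issue_rate;

	assert(streams > 0 && latency > 0 && per_cycle > 0);

	/* Latency bound: `streams` chains, each advancing every `latency` cycles. */
	issue_rate = (double)streams / latency;

	/* Throughput bound: the execution unit cannot start more than this. */
	if (issue_rate > per_cycle)
		issue_rate = per_cycle;

	return 8.0 * issue_rate;
}
```

With the figures quoted above (latency 3, throughput 1 per cycle), one stream
gives 8/3 bytes per cycle, three streams reach the 8 bytes per cycle ceiling,
and a fourth stream adds nothing.  The quoted two-instruction loop accumulates
into a single `crc` register, so in this model it counts as one stream.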