From: Keith Busch
> Sent: 22 February 2022 16:32
>
> The crc64 table lookup method is inefficient, using a significant number
> of CPU cycles in the block stack per IO. If available on x86, use a
> PCLMULQDQ implementation to accelerate the calculation.
>
> The assembly from this patch was mostly generated by gcc from a C
> program using library functions provided by x86 intrinsics, and measures
> ~20x faster than the table lookup.

I think I'd like to see the C code and compiler options used to
generate the assembler as comments in the committed source file.
Either that or reasonable comments in the assembler.

It is also quite a lot of code.
What is the break-even length for 'cold cache', including the FPU saves?

...
> +.section .rodata
> +.align 32
> +.type shuffleMasks, @object
> +.size shuffleMasks, 32
> +shuffleMasks:
> +	.string ""
> +	.ascii "\001\002\003\004\005\006\007\b\t\n\013\f\r\016\017\217\216\215"
> +	.ascii "\214\213\212\211\210\207\206\205\204\203\202\201\200"

That has to be the worst way to define 32 bytes.

> +.section .rodata.cst16,"aM",@progbits,16
> +.align 16
> +.LC0:
> +	.quad -1523270018343381984
> +	.quad 2443614144669557164
> +	.align 16
> +.LC1:
> +	.quad 2876949357237608311
> +	.quad 3808117099328934763

Not sure what those are, but I bet there are better ways to
define/describe them.

	David
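
To illustrate the point about the data definitions, here is a minimal Python sketch (not part of the patch; `SHUFFLE_MASKS`, `QUADS`, `byte_directives` and `hex_quad` are hypothetical names made up for this example). It assumes the quoted `.string ""` contributes a single NUL byte ahead of the two `.ascii` strings, and it only re-spells the values already in the patch: the 32 mask bytes as `.byte` directives and the `.quad` constants as unsigned hex. It says nothing about what the constants mean.

```python
# Illustrative helper, not part of the patch; all names here are hypothetical.

# The 32 bytes exactly as the quoted patch spells them: `.string ""` emits a
# single NUL byte, then the two `.ascii` strings follow.
SHUFFLE_MASKS = (
    b"\000"
    b"\001\002\003\004\005\006\007\b\t\n\013\f\r\016\017\217\216\215"
    b"\214\213\212\211\210\207\206\205\204\203\202\201\200"
)

# The signed decimal .quad values from .LC0 and .LC1.
QUADS = {
    ".LC0": (-1523270018343381984, 2443614144669557164),
    ".LC1": (2876949357237608311, 3808117099328934763),
}


def byte_directives(data, per_line=8):
    """Return GNU as `.byte` lines, `per_line` hex bytes on each."""
    lines = []
    for start in range(0, len(data), per_line):
        chunk = data[start:start + per_line]
        lines.append("\t.byte " + ", ".join(f"0x{b:02x}" for b in chunk))
    return lines


def hex_quad(value):
    """Format a signed 64-bit value as its unsigned two's-complement hex."""
    return f"0x{value & 0xFFFFFFFFFFFFFFFF:016x}"


if __name__ == "__main__":
    print("shuffleMasks:")
    print("\n".join(byte_directives(SHUFFLE_MASKS)))
    for label, values in QUADS.items():
        print(f"{label}:")
        for value in values:
            print(f"\t.quad {hex_quad(value)}")
```

Decoded this way, the table is just the bytes 0x00 to 0x0f followed by 0x8f counting down to 0x80, which four `.byte` lines and a one-line comment would make obvious.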