Hi David,

> While not part of this change, the unrolled loops look as though
> they just destroy the cpu cache.
> I'd like be convinced that anything does CRC over long enough buffers
> to make it a gain at all.
>
> With modern (not that modern now) superscalar cpus you can often
> get the loop instructions 'for free'.
> Sometimes pipelining the loop is needed to get full throughput.
> Unlike the IP checksum, you don't even have to 'loop carry' the
> cpu carry flag.

Internal testing on an NVMe device with T10DIF enabled on 4k blocks
shows a 20x - 30x improvement. Without these patches,
crc_t10dif_generic uses over 60% of CPU time - with these patches CRC
drops to single digits.

I should probably have led with that, sorry.

FWIW, the original patch showed a 3.7x gain on btrfs as well -
6dd7a82cc54e ("crypto: powerpc - Add POWER8 optimised crc32c")

When Anton wrote the original code he had access to IBM's internal
tooling for looking at how instructions flow through the various stages
of the CPU, so I trust it's pretty much optimal from that point of view.

Regards,
Daniel