From: Daniel Axtens
> Sent: 15 March 2017 22:30
> Hi David,
>
> > While not part of this change, the unrolled loops look as though
> > they just destroy the cpu cache.
> > I'd like to be convinced that anything does CRC over long enough buffers
> > to make it a gain at all.
> >
> > With modern (not that modern now) superscalar cpus you can often
> > get the loop instructions 'for free'.
> > Sometimes pipelining the loop is needed to get full throughput.
> > Unlike the IP checksum, you don't even have to 'loop carry' the
> > cpu carry flag.
>
> Internal testing on a NVMe device with T10DIF enabled on 4k blocks
> shows a 20x - 30x improvement. Without these patches, crc_t10dif_generic
> uses over 60% of CPU time - with these patches CRC drops to single
> digits.
>
> I should probably have led with that, sorry.

I'm not doubting that using the cpu instruction for crcs gives a massive
performance boost.

Just that the heavily unrolled loop is unlikely to help overall.
Some 'cold cache' tests on shorter buffers might be illuminating.

> FWIW, the original patch showed a 3.7x gain on btrfs as well -
> 6dd7a82cc54e ("crypto: powerpc - Add POWER8 optimised crc32c")
>
> When Anton wrote the original code he had access to IBM's internal
> tooling for looking at how instructions flow through the various stages
> of the CPU, so I trust it's pretty much optimal from that point of view.

Doesn't mean he used it :-)

	David
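
For reference, a minimal sketch of what a plain, non-unrolled T10DIF loop
looks like. This is not the kernel's crc_t10dif_generic nor the POWER8 vpmsum
code from the patches; the function names are hypothetical and only the
CRC-16/T10-DIF parameters (polynomial 0x8BB7, initial value 0, no reflection)
are taken as given. It illustrates the point about loop-carried state: the
only value carried between iterations is the 16-bit crc itself, with no carry
flag involved. A small loop like this could serve as a baseline for a
cold-cache comparison on short buffers.

```c
#include <stddef.h>
#include <stdint.h>

/* CRC-16/T10-DIF parameters: polynomial 0x8BB7, init 0, MSB first. */
#define T10DIF_POLY 0x8BB7u

/*
 * Bit-at-a-time reference (hypothetical name). Slow, but it touches no
 * lookup table, so its cache footprint is just the code itself.
 */
static uint16_t crc_t10dif_bitwise(uint16_t crc, const uint8_t *buf, size_t len)
{
    for (size_t i = 0; i < len; i++) {
        crc ^= (uint16_t)((uint16_t)buf[i] << 8);
        for (int bit = 0; bit < 8; bit++) {
            if (crc & 0x8000u)
                crc = (uint16_t)((crc << 1) ^ T10DIF_POLY);
            else
                crc = (uint16_t)(crc << 1);
        }
    }
    return crc;
}

/* Fill a 256-entry table: entry i is the CRC of the single byte i. */
static void crc_t10dif_build_table(uint16_t table[256])
{
    for (unsigned int i = 0; i < 256; i++) {
        uint8_t byte = (uint8_t)i;
        table[i] = crc_t10dif_bitwise(0, &byte, 1);
    }
}

/*
 * Byte-at-a-time table loop (hypothetical name), not unrolled. The only
 * dependency between iterations is crc; the table costs 512 bytes of
 * data cache.
 */
static uint16_t crc_t10dif_table(uint16_t crc, const uint8_t *buf, size_t len,
                                 const uint16_t table[256])
{
    for (size_t i = 0; i < len; i++)
        crc = (uint16_t)((crc << 8) ^ table[((crc >> 8) ^ buf[i]) & 0xFFu]);
    return crc;
}
```

Both functions take the running crc as the first argument, so a buffer can be
fed in pieces. Timing either one on a short buffer after evicting it from the
cache, against the unrolled version, is the kind of comparison meant above.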