On Fri, Aug 18, 2023 at 02:36:34PM +0530, Kamlesh Gurudasani wrote:
> Hi Eric,
>
> We are more interested in offload than performance, with splice system
> call and DMA mode in driver(will be implemented after this series gets
> merged), good amount of cpu cycles will be saved.

So it's for power usage, then? Or freeing up CPU for other tasks?

> There is one more mode(auto mode) in mcrc64 which helps to verify crc64
> values against pre calculated crc64, saving the efforts of comparing in
> userspace.

Is there any path forward to actually support this?

> Current generic implementation of crc64-iso(part of this series)
> gives 173 Mb/s of speed as opposed to mcrc64 which gives speed of 812
> Mb/s when tested with tcrypt.

This doesn't answer my question, which to reiterate was: how does
performance compare to a properly optimized software CRC implementation
on your platform, i.e. an implementation using carryless multiplication
instructions (e.g. ARMv8 CE) if available on your platform, otherwise an
implementation using the slice-by-8 or slice-by-16 method?

The implementation you tested was slice-by-1. Compared to that, it's
common for slice-by-8 to speed up CRCs by about 4 times, and for folding
with carryless multiplication to speed up CRCs by 10-30 times, sometimes
limited only by memory bandwidth.

I don't know what specific results you would get on your specific CPU
and for this specific CRC, and you could certainly see something
different if you e.g. have some low-end embedded CPU. But those are the
typical results I've seen for other CRCs on different CPUs. So, a
software implementation may be more attractive than you realize.

It could very well be the case that a PMULL-based CRC implementation
actually ends up with less CPU load than your "hardware offload", once
syscall, algif_hash, and driver overhead are taken into account...

- Eric
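
P.S. For anyone not familiar with the terminology, below is a minimal,
self-contained sketch of the byte-at-a-time ("slice-by-1") loop I mean:
one table lookup per input byte. The reflected CRC-64/ISO polynomial,
the zero initial value, and the lack of a final XOR are illustrative
assumptions only, not necessarily the exact parameters of the crc64-iso
code in this series. Slice-by-8 keeps the same structure but uses eight
256-entry tables and consumes eight input bytes per iteration, which is
where the ~4x speedup typically comes from.

/*
 * Illustrative slice-by-1 CRC-64 sketch (userspace, standalone).
 * Polynomial: reflected form of x^64 + x^4 + x^3 + x + 1 (CRC-64/ISO).
 * Init value 0 and no final XOR are arbitrary choices for this example.
 */
#include <stdint.h>
#include <stddef.h>
#include <stdio.h>

#define CRC64_ISO_POLY_REFLECTED 0xD800000000000000ULL

static uint64_t crc64_table[256];

static void crc64_build_table(void)
{
	for (int i = 0; i < 256; i++) {
		uint64_t crc = i;

		/* Bit-at-a-time reduction to fill the single lookup table. */
		for (int bit = 0; bit < 8; bit++)
			crc = (crc >> 1) ^ ((crc & 1) ? CRC64_ISO_POLY_REFLECTED : 0);
		crc64_table[i] = crc;
	}
}

/* One table lookup per byte: this is the slice-by-1 inner loop. */
static uint64_t crc64_update(uint64_t crc, const uint8_t *p, size_t len)
{
	while (len--)
		crc = crc64_table[(crc ^ *p++) & 0xff] ^ (crc >> 8);
	return crc;
}

int main(void)
{
	const uint8_t msg[] = "123456789";

	crc64_build_table();
	printf("crc64 = 0x%016llx\n",
	       (unsigned long long)crc64_update(0, msg, sizeof(msg) - 1));
	return 0;
}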