Eric Biggers <ebiggers@xxxxxxxxxx> writes:

> On Fri, Aug 18, 2023 at 02:36:34PM +0530, Kamlesh Gurudasani wrote:
>> Hi Eric,
>>
>> We are more interested in offload than performance, with splice system
>> call and DMA mode in driver(will be implemented after this series gets
>> merged), good amount of cpu cycles will be saved.
>
> So it's for power usage, then? Or freeing up CPU for other tasks?
>

It's for freeing up the CPU for other tasks.

>> There is one more mode(auto mode) in mcrc64 which helps to verify crc64
>> values against pre calculated crc64, saving the efforts of comparing in
>> userspace.
>
> Is there any path forward to actually support this?
>
>>
>> Current generic implementation of crc64-iso(part of this series)
>> gives 173 Mb/s of speed as opposed to mcrc64 which gives speed of 812
>> Mb/s when tested with tcrypt.
>
> This doesn't answer my question, which to reiterate was:
>
> How does performance compare to a properly optimized software CRC
> implementation on your platform, i.e. an implementation using carryless
> multiplication instructions (e.g. ARMv8 CE) if available on your platform,
> otherwise an implementation using the slice-by-8 or slice-by-16 method?
>
> The implementation you tested was slice-by-1. Compared to that, it's common for
> slice-by-8 to speed up CRCs by about 4 times and for folding with carryless
> multiplication to speed up CRCs by 10-30 times, sometimes limited only by memory
> bandwidth. I don't know what specific results you would get on your specific
> CPU and for this specific CRC, and you could certainly see something different
> if you e.g. have some low-end embedded CPU. But those are the typical results
> I've seen for other CRCs on different CPUs. So, a software implementation may
> be more attractive than you realize. It could very well be the case that a
> PMULL based CRC implementation actually ends up with less CPU load than your
> "hardware offload", when taking into syscall, algif_hash, and driver overhead...
>
> - Eric

Hi Eric, thanks for your detailed and valuable inputs.

As per your suggestion, we did some profiling.

The use case is to calculate crc32/crc64 for file input from user space.

Instead of directly implementing a PMULL based CRC64, we first compared
the following three cases:

Case 1.
CRC32 (splice() + kernel space SW driver)
https://gist.github.com/ti-kamlesh/5be75dbde292e122135ddf795fad9f21

Case 2.
CRC32 (mmap() + userspace armv8 crc32 instruction implementation)
(We also tried read() to get the contents of the file, but that lost to
mmap(), so those numbers are not reported here.)
https://gist.github.com/ti-kamlesh/002df094dd522422c6cb62069e15c40d

Case 3.
CRC64 (splice() + MCRC64 HW)
https://gist.github.com/ti-kamlesh/98b1fc36c9a7c3defcc2dced4136b8a0

Overall, the overhead of userspace + af_alg + driver in Case 1 and Case 3
is ~0.025s, which is constant for any file size.
It is calculated as:

real time to calculate crc - driver time (time spent inside init() +
update() + final()) = overhead ~0.025s

Here, if we assume that a crc64 PMULL implementation would give numbers
similar to crc32 (Case 2), we save a good number of CPU cycles using
mcrc64 for files bigger than 5-10mb, as most of the time is spent in the
HW offload.

+----------------+---------------------------+---------------------+---------------------+---------------------+
| File size      | 120mb (ideal size for us) | 20mb                | 15mb                | 5mb                 |
+----------------+---------------------------+---------------------+---------------------+---------------------+
| CRC32 (Case 1) | Driver time 0.155s        | Driver time 0.0325s | Driver time 0.019s  | Driver time 0.0062s |
|                | real time 0.18s           | real time 0.06s     | real time 0.04s     | real time 0.03s     |
|                | overhead 0.025s           | overhead 0.025s     | overhead 0.021s     | overhead ~0.023s    |
+----------------+---------------------------+---------------------+---------------------+---------------------+
| CRC32 (Case 2) | Real time 0.30s           | Real time 0.05s     | Real time 0.04s     | Real time 0.02s     |
+----------------+---------------------------+---------------------+---------------------+---------------------+
| CRC64 (Case 3) | Driver time 0.385s        | Driver time 0.0665s | Driver time 0.0515s | Driver time 0.019s  |
|                | real time 0.41s           | real time 0.09s     | real time 0.08s     | real time 0.04s     |
|                | overhead 0.025s           | overhead 0.025s     | overhead ~0.025s    | overhead ~0.021s    |
+----------------+---------------------------+---------------------+---------------------+---------------------+
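
For anyone who wants to redo the arithmetic behind the table, here is a
minimal Python sketch. It only reuses the timings listed above. The helper
names (overhead(), sizes_where_offload_wins()) are made up for
illustration and are not part of the series. It rests on two assumptions
that are not measured here: (a) the CPU is free during the driver time of
Case 3, which would need the DMA mode that is not implemented yet, and
(b) the real time of Case 2 is entirely CPU time.

```python
# Timings in seconds, copied from the table in this mail.
SIZES_MB = [120, 20, 15, 5]

CASE1 = {"driver": [0.155, 0.0325, 0.019, 0.0062],
         "real":   [0.18, 0.06, 0.04, 0.03]}
CASE2_REAL = [0.30, 0.05, 0.04, 0.02]
CASE3 = {"driver": [0.385, 0.0665, 0.0515, 0.019],
         "real":   [0.41, 0.09, 0.08, 0.04]}

# Approximate constant overhead (userspace + af_alg + driver) quoted above.
ASSUMED_OVERHEAD_S = 0.025


def overhead(case):
    """overhead = real time - driver time, per file size."""
    return [round(real - drv, 4)
            for real, drv in zip(case["real"], case["driver"])]


def sizes_where_offload_wins(sw_real, overhead_s):
    """File sizes where the software CRC costs more CPU time than the
    offload overhead.

    ASSUMPTION: with offload the CPU is busy only for `overhead_s`, and
    the software path keeps the CPU busy for its whole real time.
    """
    return [size for size, sw in zip(SIZES_MB, sw_real) if sw > overhead_s]


if __name__ == "__main__":
    print("Case 1 overhead:", overhead(CASE1))
    print("Case 3 overhead:", overhead(CASE3))
    print("Offload saves CPU for (mb):",
          sizes_where_offload_wins(CASE2_REAL, ASSUMED_OVERHEAD_S))
```

Under those assumptions the offload comes out ahead for the 120mb, 20mb
and 15mb files but not for the 5mb one, which is in line with the 5-10mb
threshold mentioned above. The recomputed overheads differ a little from
the table entries for some sizes, most likely because the real times are
rounded to two decimals.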