On Thu, Jan 23, 2025 at 12:52:30PM -0800, Linus Torvalds wrote:
> On Thu, 23 Jan 2025 at 10:18, Eric Biggers <ebiggers@xxxxxxxxxx> wrote:
> >
> > FWIW, benchmarking the CRC library functions is easy now; just enable
> > CONFIG_CRC_KUNIT_TEST=y and CONFIG_CRC_BENCHMARK=y.
> >
> > But, it's just a traditional benchmark that calls the functions in a loop, and
> > doesn't account for dcache thrashing.
>
> Yeah. I suspect the x86 vector version in particular is just not even
> worth it. If you have the crc instruction, the basic arch-optimized
> case is presumably already pretty good (and *that* code is tiny).

x86 unfortunately only has an instruction for crc32c, i.e. the variant of CRC32
that uses the Castagnoli polynomial.  So it works great for crc32c().  But any
other variant of CRC, such as the regular crc32() or crc_t10dif_update(), needs
carryless multiplication (PCLMULQDQ), which uses the vector registers.  It is
super fast on sufficiently long messages, but it does use the vector registers.

FWIW, arm64 has an instruction for both crc32c and crc32.  And RISC-V has
carryless multiplication using scalar registers.  So things are a bit easier
there.

> Honestly, I took a quick look at the "by-4" and "by-8" cases, and
> considering that you still have to do per-byte lookups of the words
> _anyway_, I would expect that the regular by-1 is presumably not that
> much worse.

The difference I'm seeing on x86_64 (Ryzen 9 9950X) is 690 MB/s for slice-by-1
vs. 3091 MB/s for slice-by-8.  It's significant since the latter gives much more
instruction-level parallelism.

But of course, CPUs on which it matters tend to have *much* faster
arch-optimized implementations anyway.  Currently the x86_64-optimized
crc32c_le() is up to 43607 MB/s on the same CPU, and crc32_le() is up to 22664
MB/s.  (By adding VPCLMULQDQ support we could actually achieve over 80000 MB/s.)
The caveat is that in the [V]PCLMULQDQ case the data length has to be long
enough for it to be worthwhile, but then again having an 8 KiB randomly-accessed
data table just to micro-optimize short messages seems not too worthwhile.

> IOW, maybe we could try to just do the simple by-1 for the generic
> case, and cut the x86 version down to the simple "use crc32b" case.
> And see if anybody even notices...
>
>              Linus

- Eric
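
For readers not familiar with the terminology, the slice-by-1 and slice-by-8
approaches compared above can be sketched roughly as follows.  This is a minimal
userspace illustration under stated assumptions, not the kernel's actual code:
the names crc_table, crc32_init_tables(), crc32_slice1() and crc32_slice8() are
hypothetical, and the polynomial is the reflected one (0xEDB88320) used by the
regular crc32.  The 8 x 256 x 4-byte array corresponds to the 8 KiB table
mentioned above; slice-by-1 only touches the first 1 KiB of it.

```c
#include <stddef.h>
#include <stdint.h>

#define CRC32_POLY_LE 0xEDB88320u	/* reflected CRC-32 polynomial */

/* 8 tables x 256 entries x 4 bytes = 8 KiB; slice-by-1 uses only table 0. */
static uint32_t crc_table[8][256];

/* Fill the lookup tables; call once before using the functions below. */
static void crc32_init_tables(void)
{
	for (uint32_t i = 0; i < 256; i++) {
		uint32_t crc = i;

		for (int bit = 0; bit < 8; bit++)
			crc = (crc >> 1) ^ ((crc & 1) ? CRC32_POLY_LE : 0);
		crc_table[0][i] = crc;
	}
	/* Table t holds the CRC of byte i followed by t zero bytes. */
	for (uint32_t i = 0; i < 256; i++)
		for (int t = 1; t < 8; t++)
			crc_table[t][i] = (crc_table[t - 1][i] >> 8) ^
				crc_table[0][crc_table[t - 1][i] & 0xff];
}

/* Slice-by-1: one table lookup per byte, each depending on the previous. */
static uint32_t crc32_slice1(uint32_t crc, const uint8_t *p, size_t len)
{
	while (len--)
		crc = (crc >> 8) ^ crc_table[0][(crc ^ *p++) & 0xff];
	return crc;
}

/* Slice-by-8: eight mutually independent lookups per 8 input bytes. */
static uint32_t crc32_slice8(uint32_t crc, const uint8_t *p, size_t len)
{
	while (len >= 8) {
		/* Assemble the words bytewise so this is endian-neutral. */
		uint32_t lo = crc ^ ((uint32_t)p[0] | (uint32_t)p[1] << 8 |
				     (uint32_t)p[2] << 16 | (uint32_t)p[3] << 24);
		uint32_t hi = (uint32_t)p[4] | (uint32_t)p[5] << 8 |
			      (uint32_t)p[6] << 16 | (uint32_t)p[7] << 24;

		crc = crc_table[7][lo & 0xff] ^
		      crc_table[6][(lo >> 8) & 0xff] ^
		      crc_table[5][(lo >> 16) & 0xff] ^
		      crc_table[4][lo >> 24] ^
		      crc_table[3][hi & 0xff] ^
		      crc_table[2][(hi >> 8) & 0xff] ^
		      crc_table[1][(hi >> 16) & 0xff] ^
		      crc_table[0][hi >> 24];
		p += 8;
		len -= 8;
	}
	return crc32_slice1(crc, p, len);	/* handle the tail bytes */
}
```

In this sketch the caller supplies the initial value and does any final
inversion itself, e.g. crc32_slice8(0xFFFFFFFF, buf, len) ^ 0xFFFFFFFF for the
common zlib-style CRC-32.  The eight lookups in the slice-by-8 loop do not
depend on each other, which is where the extra instruction-level parallelism
comes from, at the cost of the larger table.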