On Thu, Jan 23, 2025 at 09:07:44AM -0500, Theodore Ts'o wrote:
> On Wed, Jan 22, 2025 at 11:46:18PM -0800, Eric Biggers wrote:
> >
> > Actually, I'm tempted to just provide slice-by-1 (a.k.a. byte-by-byte)
> > as the only generic CRC32 implementation.  The generic code has become
> > increasingly irrelevant due to the arch-optimized code existing.  The
> > arch-optimized code tends to be 10 to 100 times faster on long messages.
>
> Yeah, that's my intuition as well; I would think the CPUs that don't
> have a CRC32 optimization instruction(s) would probably be the most
> sensitive to dcache thrashing.
>
> But given that Geert ran into this on m68k (I assume), maybe we could
> have him benchmark the various generic crc32 implementations to see
> which is the best for him?  That is, assuming that he cares (which he
> might not. :-).

FWIW, benchmarking the CRC library functions is easy now; just enable
CONFIG_CRC_KUNIT_TEST=y and CONFIG_CRC_BENCHMARK=y.  But it's just a
traditional benchmark that calls the functions in a loop, and it doesn't
account for dcache thrashing.  It's exactly the sort of benchmark I
mentioned doesn't tell the whole story about the drawbacks of using a huge
table.  Focusing only on microbenchmarks of slice-by-n generally leads to
some value of n > 1 seeming optimal --- potentially as high as n=16
depending on the CPU, though really old CPUs like m68k should need much
less.

So the rationale for choosing slice-by-1 in the kernel would be to consider
the reduced dcache use and code size, and the fact that the arch-optimized
code is usually what gets used these days anyway, to be more important than
microbenchmark results.  (Also, the other CRC variants in the kernel, such
as CRC64, CRC-T10DIF, and CRC16, already just have slice-by-1, so this
would make CRC32 consistent with them.)

- Eric
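
For anyone who wants to reproduce the numbers on their own hardware, the
config fragment is essentially just the following (the KUNIT dependency
here is from memory, so double-check it against the Kconfig; the benchmark
results should show up in the KUnit test output in the kernel log):

	CONFIG_KUNIT=y
	CONFIG_CRC_KUNIT_TEST=y
	CONFIG_CRC_BENCHMARK=y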
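
For concreteness, slice-by-1 is just the classic byte-at-a-time table
lookup.  A rough standalone sketch (an illustration of the technique, not
the actual lib/crc32.c code; the names here are made up, and the usual ~0
pre/post-inversion is left to the caller, as with the kernel's crc32_le()):

	#include <stdint.h>
	#include <stddef.h>

	/* 256-entry (1 KiB) table for the reflected CRC32 polynomial. */
	static uint32_t crc32_table[256];

	static void crc32_gen_table(void)
	{
		for (uint32_t i = 0; i < 256; i++) {
			uint32_t crc = i;

			for (int j = 0; j < 8; j++)
				crc = (crc >> 1) ^
				      ((crc & 1) ? 0xEDB88320u : 0);
			crc32_table[i] = crc;
		}
	}

	/* Slice-by-1: one table lookup per input byte. */
	static uint32_t crc32_slice_by_1(uint32_t crc, const uint8_t *p,
					 size_t len)
	{
		while (len--)
			crc = crc32_table[(crc ^ *p++) & 0xff] ^ (crc >> 8);
		return crc;
	}

Slice-by-n is the same idea with n tables (n KiB instead of 1 KiB) and n
bytes processed per loop iteration, which is exactly where the dcache
footprint concern comes from.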