On Thu, 23 Jan 2025 09:07:44 -0500 "Theodore Ts'o" <tytso@xxxxxxx> wrote:

> On Wed, Jan 22, 2025 at 11:46:18PM -0800, Eric Biggers wrote:
> >
> > Actually, I'm tempted to just provide slice-by-1 (a.k.a. byte-by-byte)
> > as the only generic CRC32 implementation.  The generic code has become
> > increasingly irrelevant due to the arch-optimized code existing.  The
> > arch-optimized code tends to be 10 to 100 times faster on long messages.
>
> Yeah, that's my intuition as well; I would think the CPUs that don't
> have CRC32 optimization instruction(s) would probably be the most
> sensitive to dcache thrashing.
>
> But given that Geert ran into this on m68k (I assume), maybe we could
> have him benchmark the various generic crc32 implementations to see
> which is the best for him?  That is, assuming that he cares (which he
> might not. :-)

The difference between the clock speed and main memory speed on an m68k
will be a lot smaller than on anything more recent, so I suspect the
effect of cache misses is much less (or, more likely, it is pretty much
always taking a cache miss).

Brain wakes up: does the m68k even have a D-cache?
Checking section 6 of the m68k user manual: it only has an I-cache
(64 32-bit words). So the important thing is probably keeping the loop
small. A CPU board might have an external data cache.

For a small memory footprint it might be worth considering 4 bits at a
time, i.e. a 16-word (64-byte) lookup table.

Thinks.... You can xor a data byte onto the crc 'accumulator' and then
do two separate table lookups, one for each nibble, and xor both onto
it before the rotate. That is probably a reasonable compromise.

	David
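
For what it's worth, a minimal sketch of the 4-bit-at-a-time idea, assuming
the standard reflected CRC-32 polynomial 0xEDB88320 (as used by the kernel's
crc32_le); the name crc32_4bit is made up for illustration. This version does
the two nibble lookups serially against a single 16-word table (low nibble,
then high nibble), which is the common way to realize the 64-byte-table
scheme:

```c
#include <stdint.h>
#include <stddef.h>

/*
 * 16-entry table for the reflected CRC-32 polynomial 0xEDB88320:
 * entry i is the CRC remainder of nibble i run through 4 shift steps.
 * Total footprint: 16 words = 64 bytes.
 */
static const uint32_t crc32_nibble_table[16] = {
	0x00000000, 0x1db71064, 0x3b6e20c8, 0x26d930ac,
	0x76dc4190, 0x6b6b51f4, 0x4db26158, 0x5005713c,
	0xedb88320, 0xf00f9344, 0xd6d6a3e8, 0xcb61b38c,
	0x9b64c2b0, 0x86d3d2d4, 0xa00ae278, 0xbdbdf21c
};

uint32_t crc32_4bit(uint32_t crc, const uint8_t *buf, size_t len)
{
	crc = ~crc;			/* standard pre-inversion */
	while (len--) {
		crc ^= *buf++;		/* xor data byte into low byte of crc */
		/* consume the low nibble, then the high nibble */
		crc = (crc >> 4) ^ crc32_nibble_table[crc & 0xf];
		crc = (crc >> 4) ^ crc32_nibble_table[crc & 0xf];
	}
	return ~crc;			/* standard post-inversion */
}
```

Checked against the usual CRC-32 test vector: crc32_4bit(0, "123456789", 9)
gives 0xCBF43926. The loop body is two dependent shift+xor+load steps per
byte, so it trades roughly half the speed of a byte-at-a-time table for a
1KB-to-64-byte reduction in table size.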