On Thu, Jan 23, 2025 at 08:58:10PM +0000, David Laight wrote:
> On Thu, 23 Jan 2025 09:07:44 -0500
> "Theodore Ts'o" <tytso@xxxxxxx> wrote:
>
> > On Wed, Jan 22, 2025 at 11:46:18PM -0800, Eric Biggers wrote:
> > >
> > > Actually, I'm tempted to just provide slice-by-1 (a.k.a. byte-by-byte) as the
> > > only generic CRC32 implementation.  The generic code has become increasingly
> > > irrelevant due to the arch-optimized code existing.  The arch-optimized code
> > > tends to be 10 to 100 times faster on long messages.
> >
> > Yeah, that's my intuition as well; I would think the CPUs that
> > don't have CRC32 optimization instruction(s) would probably be the
> > most sensitive to dcache thrashing.
> >
> > But given that Geert ran into this on m68k (I assume), maybe we could
> > have him benchmark the various generic crc32 implementations to see
> > which is the best for him?  That is, assuming that he cares (which he
> > might not.  :-).
>
> The difference between the clock speed and main memory speed on an m68k will
> be a lot less than on anything more recent.
> So I suspect the effect of cache misses is much smaller (or more likely it is
> pretty much always getting a cache miss).
> Brain wakes up: does the m68k even have a D-cache?
> Checks the m68k user manual, section 6 - it only has an I-cache (64 32-bit words).
> So the important thing is probably keeping the loop small.
> A CPU board might have an external data cache.
>
> For a small memory footprint it might be worth considering 4 bits at a time,
> i.e. a 16-word (64-byte) lookup table.
> Thinks....
> You can xor a data byte onto the crc 'accumulator' and then do two separate
> table lookups for each of the high nibbles and xor both onto it before the rotate.
> That is probably a reasonable compromise.

Yes, you can do less than a byte at a time (currently one of the choices is
even one *bit* at a time!), but I think byte-at-a-time is small enough
already.

- Eric
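
For anyone curious what the 4-bit-at-a-time variant David describes might
look like, here is a minimal, untested user-space sketch.  This is not
lib/crc32.c; the function and table names are made up for illustration, and
the only thing taken as given is the standard reflected CRC-32 polynomial
0xedb88320 (the value the kernel names CRC32_POLY_LE).  The usual
byte-at-a-time loop is shown only as a comment for comparison.

#include <stdint.h>
#include <stddef.h>

#define CRC32_POLY_LE 0xedb88320u	/* standard reflected CRC-32 polynomial */

static uint32_t crc32_nibble_table[16];

/* table[i] = CRC of the 4-bit value i, computed bit-by-bit */
static void crc32_nibble_table_init(void)
{
	uint32_t crc;
	int i, j;

	for (i = 0; i < 16; i++) {
		crc = i;
		for (j = 0; j < 4; j++)
			crc = (crc >> 1) ^ ((crc & 1) ? CRC32_POLY_LE : 0);
		crc32_nibble_table[i] = crc;
	}
}

/*
 * 4 bits at a time: a 16-entry (64-byte) table and two lookups per byte.
 * The byte-at-a-time form has the same shape but uses a 256-entry (1 KiB)
 * table and a single lookup:
 *
 *	crc = (crc >> 8) ^ byte_table[(crc ^ *p++) & 0xff];
 */
static uint32_t crc32_le_nibble(uint32_t crc, const uint8_t *p, size_t len)
{
	while (len--) {
		crc ^= *p++;
		crc = (crc >> 4) ^ crc32_nibble_table[crc & 0xf];
		crc = (crc >> 4) ^ crc32_nibble_table[crc & 0xf];
	}
	return crc;
}

As a sanity check, crc32_le_nibble(0xffffffff, "123456789", 9) ^ 0xffffffff
should produce the well-known CRC-32 check value 0xcbf43926.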