On Tue, 4 Mar 2025 13:32:16 -0800
Eric Biggers <ebiggers@xxxxxxxxxx> wrote:

> From: Eric Biggers <ebiggers@xxxxxxxxxx>
>
> For handling the 0 <= len < sizeof(unsigned long) bytes left at the end,
> do a 4-2-1 step-down instead of a byte-at-a-time loop.  This allows
> taking advantage of wider CRC instructions.  Note that crc32c-3way.S
> already uses this same optimization too.

An alternative is to add extra zero bytes at the start of the buffer.
They don't affect the crc and just need the first 8 bytes shifted left.

I think any non-zero 'crc-in' just needs to be xor'ed over the first
4 actual data bytes.
(It's over 40 years since I did the maths of CRC.)

You won't notice the misaligned accesses all down the buffer.
When I was testing different ipcsum code, misaligned buffers
cost less than 1 clock per cache line.
I think that was even true for the versions that managed 12 bytes
per clock (including the one Linus committed).

	David
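
The two claims above (leading zero bytes don't change a crc that starts
from zero, and a non-zero 'crc-in' can be xor'ed over the first 4 data
bytes) can be checked with a small userspace sketch. This is not the
kernel code: crc32c_raw() and crc32c_zero_padded() are hypothetical names,
the CRC is a plain bit-at-a-time reflected CRC32C with no pre/post
inversion, and the copy into a local buffer is only there to keep the
check simple.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Reflected CRC32C (Castagnoli) polynomial. */
#define CRC32C_POLY_LE 0x82F63B78u
#define MAX_BUF 64

/*
 * Raw bit-at-a-time CRC32C update with no pre/post inversion:
 * the caller passes the running crc state in and gets the new state back.
 */
static uint32_t crc32c_raw(uint32_t crc, const uint8_t *p, size_t len)
{
	while (len--) {
		crc ^= *p++;
		for (int i = 0; i < 8; i++)
			crc = (crc >> 1) ^ (CRC32C_POLY_LE & -(crc & 1));
	}
	return crc;
}

/*
 * Hypothetical helper (assumption, not from the patch): front-pad the
 * data with 'pad' zero bytes, xor crc_in over the first 4 real data
 * bytes, then run the CRC from a zero state over the whole padded buffer.
 * Requires len >= 4 and pad + len <= MAX_BUF.
 */
static uint32_t crc32c_zero_padded(uint32_t crc_in, const uint8_t *data,
				   size_t len, size_t pad)
{
	uint8_t buf[MAX_BUF] = { 0 };	/* the leading 'pad' bytes stay zero */

	assert(len >= 4 && pad + len <= sizeof(buf));
	memcpy(buf + pad, data, len);

	/* Fold crc_in into the data, least significant byte first. */
	for (int i = 0; i < 4; i++)
		buf[pad + i] ^= (uint8_t)(crc_in >> (8 * i));

	return crc32c_raw(0, buf, pad + len);
}
```

For any pad from 0 to 7, crc32c_zero_padded(crc_in, data, len, pad) should
return the same value as crc32c_raw(crc_in, data, len), provided len >= 4.
The sketch says nothing about the cost of the misaligned accesses; it only
checks the arithmetic.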