Re: [GIT PULL] CRC updates for 6.14

Eric Biggers <ebiggers@xxxxxxxxxx> · Thu, 23 Jan 2025 13:13:17 -0800

On Thu, Jan 23, 2025 at 12:52:30PM -0800, Linus Torvalds wrote:
> On Thu, 23 Jan 2025 at 10:18, Eric Biggers <ebiggers@xxxxxxxxxx> wrote:
> >
> > FWIW, benchmarking the CRC library functions is easy now; just enable
> > CONFIG_CRC_KUNIT_TEST=y and CONFIG_CRC_BENCHMARK=y.
> >
> > But, it's just a traditional benchmark that calls the functions in a loop, and
> > doesn't account for dcache thrashing.
> 
> Yeah. I suspect the x86 vector version in particular is just not even
> worth it. If you have the crc instruction, the basic arch-optimized
> case is presumably already pretty good (and *that* code is tiny).

x86 unfortunately only has an instruction for crc32c, i.e. the variant of CRC32
that uses the Castagnoli polynomial.  So it works great for crc32c().  But any
other variant of CRC, such as the regular crc32() or crc_t10dif_update(), needs
carryless multiplication (PCLMULQDQ), which uses the vector registers.  It is
super fast on sufficiently long messages, but it does use the vector registers.

FWIW, arm64 has instructions for both crc32c and crc32.  And RISC-V has
carryless multiplication using scalar registers.  So things are a bit easier
there.
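
To illustrate why a crc32c instruction does not help with regular crc32(), here
is a minimal bit-at-a-time sketch in portable C.  It is not the kernel's
implementation; the function name crc_bitwise_le() and the macro names are made
up for this example.  The two variants run the same loop and differ only in the
(bit-reflected) generator polynomial, which a hardware instruction fixes:

```c
#include <stddef.h>
#include <stdint.h>

/* Bit-reflected generator polynomials (names are illustrative). */
#define CRC32_POLY_LE  0xEDB88320u  /* "regular" CRC32 (IEEE 802.3) */
#define CRC32C_POLY_LE 0x82F63B78u  /* CRC32C (Castagnoli) */

/*
 * Hypothetical helper: update a least-significant-bit-first CRC one bit
 * at a time.  Slow, but it makes clear that only 'poly' distinguishes
 * the two variants.
 */
static uint32_t crc_bitwise_le(uint32_t crc, const uint8_t *p, size_t len,
			       uint32_t poly)
{
	while (len--) {
		crc ^= *p++;
		for (int i = 0; i < 8; i++)
			crc = (crc >> 1) ^ ((crc & 1) ? poly : 0);
	}
	return crc;
}
```

With the usual convention of inverting the CRC before and after, i.e.
~crc_bitwise_le(~0u, "123456789", 9, poly), this gives the standard check
values 0xCBF43926 for CRC32 and 0xE3069283 for CRC32C.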

> Honestly, I took a quick look at the "by-4" and "by-8" cases, and
> considering that you still have to do per-byte lookups of the words
> _anyway_, I would expect that the regular by-1 is presumably not that
> much worse.

The difference I'm seeing on x86_64 (Ryzen 9 9950X) is 690 MB/s for slice-by-1
vs. 3091 MB/s for slice-by-8.  It's significant since the latter gives much more
instruction-level parallelism.  But of course, CPUs on which it matters tend to
have *much* faster arch-optimized implementations anyway.  Currently the
x86_64-optimized crc32c_le() is up to 43607 MB/s on the same CPU, and crc32_le()
is up to 22664 MB/s.  (By adding VPCLMULQDQ support we could actually achieve
over 80000 MB/s.)  The caveat is that in the [V]PCLMULQDQ case the data length
has to be long enough for it to be worthwhile, but then again having an 8 KiB
randomly-accessed data table just to micro-optimize short messages doesn't seem
worth it either.
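
For reference, a rough sketch of the two table-driven approaches being compared,
again in portable C and not taken from the kernel sources (crc_tab,
crc32_init_tables(), crc32_slice1() and crc32_slice8() are names invented for
this example).  Slice-by-1 needs a single 256-entry table (1 KiB) but each
lookup depends on the previous one.  Slice-by-8 needs eight such tables
(8 * 256 * 4 bytes = 8 KiB), and its eight lookups per iteration are
independent of each other, which is where the extra instruction-level
parallelism comes from:

```c
#include <stddef.h>
#include <stdint.h>

#define CRC32_POLY_LE 0xEDB88320u	/* bit-reflected CRC32 polynomial */

/* crc_tab[k][b] is the CRC of byte b followed by k zero bytes. */
static uint32_t crc_tab[8][256];

static void crc32_init_tables(void)
{
	for (uint32_t i = 0; i < 256; i++) {
		uint32_t crc = i;

		for (int bit = 0; bit < 8; bit++)
			crc = (crc >> 1) ^ ((crc & 1) ? CRC32_POLY_LE : 0);
		crc_tab[0][i] = crc;
	}
	/* Each further table advances the previous one by one zero byte. */
	for (int k = 1; k < 8; k++)
		for (uint32_t i = 0; i < 256; i++)
			crc_tab[k][i] = (crc_tab[k - 1][i] >> 8) ^
					crc_tab[0][crc_tab[k - 1][i] & 0xff];
}

/* Slice-by-1: one dependent table lookup per input byte. */
static uint32_t crc32_slice1(uint32_t crc, const uint8_t *p, size_t len)
{
	while (len--)
		crc = (crc >> 8) ^ crc_tab[0][(crc ^ *p++) & 0xff];
	return crc;
}

/* Slice-by-8: eight independent lookups per 8 input bytes. */
static uint32_t crc32_slice8(uint32_t crc, const uint8_t *p, size_t len)
{
	while (len >= 8) {
		/* Fold the current CRC into the first four bytes. */
		uint32_t lo = crc ^ ((uint32_t)p[0] | (uint32_t)p[1] << 8 |
				     (uint32_t)p[2] << 16 | (uint32_t)p[3] << 24);

		crc = crc_tab[7][lo & 0xff] ^ crc_tab[6][(lo >> 8) & 0xff] ^
		      crc_tab[5][(lo >> 16) & 0xff] ^ crc_tab[4][lo >> 24] ^
		      crc_tab[3][p[4]] ^ crc_tab[2][p[5]] ^
		      crc_tab[1][p[6]] ^ crc_tab[0][p[7]];
		p += 8;
		len -= 8;
	}
	return crc32_slice1(crc, p, len);	/* handle the tail bytes */
}
```

Both functions return the same value for any input once crc32_init_tables()
has been called; only the table footprint and the dependency chain differ.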

> IOW, maybe we could try to just do the simple by-1 for the generic
> case, and cut the x86 version down to the simple "use crc32b" case.
> And see if anybody even notices...
> 
>               Linus

- Eric