On Thu, Jan 23, 2025 at 03:42:09PM -0800, Eric Biggers wrote: > On Thu, Jan 23, 2025 at 05:36:40PM -0500, Kent Overstreet wrote: > > On Wed, Jan 22, 2025 at 11:46:18PM -0800, Eric Biggers wrote: > > > On Wed, Jan 22, 2025 at 09:16:33PM -0800, Eric Biggers wrote: > > > > As you probably noticed, the other problem is that CRC32 has 4 generic > > > > implementations: bit-by-bit, and slice by 1, 4, or 8 bytes. > > > > > > > > Bit-by-bit is useless. Slice by 4 and slice by 8 are too similar to have both. > > > > > > > > It's not straightforward to choose between slice by 1 and slice by 4/8, though. > > > > When benchmarking slice-by-n, a higher n will always be faster in > > > > microbenchmarks (up to about n=16), but the required table size also increases > > > > accordingly. E.g., a slice-by-1 CRC32 uses a 1024-byte table, while slice-by-8 > > > > uses a 8192-byte table. This table is accessed randomly, which is really bad on > > > > the dcache, and can be really bad for performance in real world scenarios where > > > > the system is bottlenecked on memory. > > > > > > > > I'm tentatively planning to just say that slice-by-4 is a good enough compromise > > > > and have that be the only generic CRC32 implementation. > > > > > > > > But I need to try an interleaved implementation too, since it's possible that > > > > could give the best of both worlds. > > > > > > Actually, I'm tempted to just provide slice-by-1 (a.k.a. byte-by-byte) as the > > > only generic CRC32 implementation. The generic code has become increasingly > > > irrelevant due to the arch-optimized code existing. The arch-optimized code > > > tends to be 10 to 100 times faster on long messages. > > > > > > The generic CRC32 code is still needed when the CPU features needed by the arch > > > code are unavailable. But that's rare these days. It's also still needed when > > > the CPU has no scalar instructions to accelerate the CRC (e.g. on x86_64, the > > > "regular" CRC32 as opposed to the Castagnoli CRC32) *and* the message is too > > > short for the overhead of saving and restoring the vector registers to be worth > > > it -- typically < 64 bytes or so. And it's still needed when the CRC is done in > > > a context where vector registers can't be used at all. > > > > > > But those don't feel like very strong motivations for the huge tables anymore. > > > I think the huge tables were really intended for optimizing CRCs of long > > > messages back when CPUs didn't have any better way to do it. > > > > Have you by chance been looking at performance of 64 bit crc algorithms, > > or have any recommendations there? > > > > I've been starting to consider switching to a 64 bit crc for the > > default for bcachefs - and we do want it to be a regular crc so we can > > combine/add them together, not one of the newer fancier add/xor based > > hashes. > > > > I thought we'd gotten a faster PCLMULQDQ based implementation of a > > crc64, but looking again it appears not, hrm. > > Yep! I've written an assembly template that expands into a PCLMULQDQ or > VPCLMULQDQ implementation of any variant of CRC-8, CRC-16, CRC-32, or CRC-64. > The latest work-in-progress version can be found at > https://git.kernel.org/pub/scm/linux/kernel/git/ebiggers/linux.git/log/?h=crc-x86. > We just need to decide which CRC variants to wire it up for. My tentative plan, > which is implemented in that git branch, is crc32_le, crc_t10dif, crc64_be, and > crc64_nvme. (crc64_nvme is currently called crc64_rocksoft, but that name makes > no sense.) For crc32_le and crc_t10dif these would replace the existing > PCLMULQDQ implementations. crc64* on the other hand would gain acceleration for > the first time, improving performance for those by as much as 100x. I'm also > fixing the CRC64 lib to be organized the way I did CRC32 and CRC-T10DIF. > > That work is targeting 6.15, since there was already enough for 6.14. > > Though this approach makes it easy to wire up arbitrary CRC variants for x86, we > do need to make sure that anyone we wire up is actually useful. The users of > the optimized crc64_be would be drivers/md/bcache/ and fs/bcachefs/. So if you > could confirm that that would indeed be useful for you (and in particular you > haven't deprecated the CRC64 support in favor of xxHash), that would be helpful. That's fantastic! CRC64 isn't deprecated, and I suspect that will be our preferred option at some point in the future. xxHash was added only because there is a real desire for 64 bit checksums and crc64 wasn't a good option at the time. (It was also designed as a hash function, not for data protection, and I'd want to see some actual literature evaluating it as such before I'd consider making it the default. And like I mentioned, CRCs have other properties we want).