Hi Linus,

On Wed, Jan 22, 2025 at 08:13:07PM -0800, Linus Torvalds wrote:
> On Sun, 19 Jan 2025 at 14:51, Eric Biggers <ebiggers@xxxxxxxxxx> wrote:
> >
> > - Reorganize the architecture-optimized CRC32 and CRC-T10DIF code to be
> >   directly accessible via the library API, instead of requiring the
> >   crypto API. This is much simpler and more efficient.
>
> I'm not a fan of the crazy crypto interfaces for simple hashes that
> only complicate and slow things down, so I'm all in favor of this and
> have pulled it.
>
> HOWEVER.
>
> I'm also very much not a fan of asking users pointless questions.
>
> What does this patch-set ask users idiotic questions like
>
>     CRC-T10DIF implementation
>     > 1. Architecture-optimized (CRC_T10DIF_IMPL_ARCH) (NEW)
>       2. Generic implementation (CRC_T10DIF_IMPL_GENERIC) (NEW)
>
> and
>
>     CRC32 implementation
>     > 1. Arch-optimized, with fallback to slice-by-8
>          (CRC32_IMPL_ARCH_PLUS_SLICEBY8) (NEW)
>       2. Arch-optimized, with fallback to slice-by-1
>          (CRC32_IMPL_ARCH_PLUS_SLICEBY1) (NEW)
>       3. Slice by 8 bytes (CRC32_IMPL_SLICEBY8) (NEW)
>       4. Slice by 4 bytes (CRC32_IMPL_SLICEBY4) (NEW)
>       5. Slice by 1 byte (Sarwate's algorithm) (CRC32_IMPL_SLICEBY1) (NEW)
>       6. Classic Algorithm (one bit at a time) (CRC32_IMPL_BIT) (NEW)
>
> because *nobody* wants to see that completely pointless noise.
>
> Pick the best one. Don't ask the user to pick the best one.
>
> If you have some really strong argument for why users need to be able
> to override the sane choice, make the question it at *least* depend on
> EXPERT.
>
> And honestly, I don't see how there could possibly ever be any point.
> If there is an arch-optimized version, just use it.
>
> And if the "optimized" version is crap and worse than some generic
> one, it just needs to be removed.
>
> None of this "make the user make the choice because kernel developers
> can't deal with the responsibility of just saying what is best".

Yes, I agree, and the kconfig options are already on my list of things to
clean up. Thanks for giving your thoughts on how to do it.

To clarify, this initial set of changes removed the existing arch-specific
CRC32 and CRC-T10DIF options (on x86 that was CRYPTO_CRC32C_INTEL,
CRYPTO_CRC32_PCLMUL, and CRYPTO_CRCT10DIF_PCLMUL) and added the equivalent
functionality to two choices in lib, one of which already existed. So for
now the changes to the options were just meant to consolidate them, not add
to or remove from them per se.

I do think that to support kernel size minimization efforts we should
continue to allow omitting the arch-specific CRC code. One of the CRC
options, usually CONFIG_CRC32, gets built into almost every kernel. Some
options already group together multiple CRC variants (e.g. there are three
different CRC32's), and each can need multiple implementations targeting
different instruction set extensions (e.g. both PCLMULQDQ and VPCLMULQDQ on
x86). So it does add up. But it makes sense to include the code by default,
and to make the choice to omit it conditional on CONFIG_EXPERT.

I'm also thinking of just doing a single option that affects all enabled
CRC variants, e.g. CRC_OPTIMIZATIONS instead of both CRC32_OPTIMIZATIONS
and CRC_T10DIF_OPTIMIZATIONS. Let me know if you think that would be
reasonable.

As you probably noticed, the other problem is that CRC32 has 4 generic
implementations: bit-by-bit, and slice by 1, 4, or 8 bytes. Bit-by-bit is
useless. Slice by 4 and slice by 8 are too similar to have both. It's not
straightforward to choose between slice by 1 and slice by 4/8, though. When
benchmarking slice-by-n, a higher n will always be faster in
microbenchmarks (up to about n=16), but the required table size also
increases accordingly. E.g., a slice-by-1 CRC32 uses a 1024-byte table,
while slice-by-8 uses an 8192-byte table.
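
For illustration only, here is a minimal standalone sketch of where those
table sizes come from: slice-by-n keeps n tables of 256 32-bit entries, so
n * 1024 bytes. This is not the kernel's lib/crc32.c. The names crc32_tab,
crc32_sliceby1, and crc32_sliceby4 are made up for this example, and it
assumes the reflected (little-endian) CRC-32 polynomial 0xEDB88320.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CRC32_POLY_LE 0xEDB88320u /* reflected CRC-32 polynomial (assumed) */

/* Hypothetical tables: row 0 alone serves slice-by-1, rows 0..3 slice-by-4. */
static uint32_t crc32_tab[4][256];
static int crc32_tab_ready;

/* Each row is 256 * 4 = 1024 bytes, so slice-by-n needs n * 1024 bytes. */
static_assert(sizeof(crc32_tab[0]) == 1024, "slice-by-1 table is 1024 bytes");
static_assert(sizeof(crc32_tab) == 4096, "slice-by-4 tables are 4096 bytes");

static void crc32_tab_init(void)
{
	if (crc32_tab_ready)
		return;
	/* Row 0: CRC of each single byte value, computed bit by bit. */
	for (uint32_t i = 0; i < 256; i++) {
		uint32_t crc = i;

		for (int bit = 0; bit < 8; bit++)
			crc = (crc >> 1) ^ ((crc & 1) ? CRC32_POLY_LE : 0);
		crc32_tab[0][i] = crc;
	}
	/* Row k: row k-1 advanced by one additional zero byte. */
	for (uint32_t i = 0; i < 256; i++) {
		for (int k = 1; k < 4; k++) {
			uint32_t prev = crc32_tab[k - 1][i];

			crc32_tab[k][i] = (prev >> 8) ^ crc32_tab[0][prev & 0xff];
		}
	}
	crc32_tab_ready = 1;
}

/* Raw CRC update, one byte and one table lookup per step. */
static uint32_t crc32_sliceby1(uint32_t crc, const uint8_t *p, size_t len)
{
	crc32_tab_init();
	while (len--)
		crc = (crc >> 8) ^ crc32_tab[0][(crc ^ *p++) & 0xff];
	return crc;
}

/* Raw CRC update, four bytes and four independent lookups per step. */
static uint32_t crc32_sliceby4(uint32_t crc, const uint8_t *p, size_t len)
{
	crc32_tab_init();
	while (len >= 4) {
		/* Fold in the next 4 bytes, assembled as little-endian. */
		crc ^= (uint32_t)p[0] | (uint32_t)p[1] << 8 |
		       (uint32_t)p[2] << 16 | (uint32_t)p[3] << 24;
		crc = crc32_tab[3][crc & 0xff] ^
		      crc32_tab[2][(crc >> 8) & 0xff] ^
		      crc32_tab[1][(crc >> 16) & 0xff] ^
		      crc32_tab[0][crc >> 24];
		p += 4;
		len -= 4;
	}
	/* Handle the remaining 0..3 bytes one at a time. */
	while (len--)
		crc = (crc >> 8) ^ crc32_tab[0][(crc ^ *p++) & 0xff];
	return crc;
}
```

Both functions take and return the raw CRC state, so a caller wanting the
conventional CRC-32 would pass 0xFFFFFFFF as the initial value and invert
the result. The sketch says nothing about relative speed; it only shows
that the wider variant touches four times as much table data.
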
This table is accessed randomly, which is really bad on the dcache, and can
be really bad for performance in real world scenarios where the system is
bottlenecked on memory.

I'm tentatively planning to just say that slice-by-4 is a good enough
compromise and have that be the only generic CRC32 implementation. But I
need to try an interleaved implementation too, since it's possible that
could give the best of both worlds.

- Eric