On Mon, Mar 19, 2018 at 1:30 PM, Eric Biggers <ebiggers3@xxxxxxxxx> wrote:
> On Mon, Mar 19, 2018 at 12:39:55PM -0700, Andiry Xu wrote:
>> On Sun, Mar 11, 2018 at 12:22 PM, Eric Biggers <ebiggers3@xxxxxxxxx> wrote:
>> > On Sun, Mar 11, 2018 at 02:00:13PM +0200, Nikolay Borisov wrote:
>> >> [Adding Herbert Xu to CC since he is the crypto subsys maintainer]
>> >>
>> >> On 10.03.2018 20:17, Andiry Xu wrote:
>> >> <snip>
>> >>
>> >> > +static inline u32 nova_crc32c(u32 crc, const u8 *data, size_t len)
>> >> > +{
>> >> > +	u8 *ptr = (u8 *) data;
>> >> > +	u64 acc = crc; /* accumulator, crc32c value in lower 32b */
>> >> > +	u32 csum;
>> >> > +
>> >> > +	/* x86 instruction crc32 is part of SSE-4.2 */
>> >> > +	if (static_cpu_has(X86_FEATURE_XMM4_2)) {
>> >> > +		/* This inline assembly implementation should be equivalent
>> >> > +		 * to the kernel's crc32c_intel_le_hw() function used by
>> >> > +		 * crc32c(), but this performs better on test machines.
>> >> > +		 */
>> >> > +		while (len > 8) {
>> >> > +			asm volatile(/* 64b quad words */
>> >> > +				"crc32q (%1), %0"
>> >> > +				: "=r" (acc)
>> >> > +				: "r" (ptr), "0" (acc)
>> >> > +			);
>> >> > +			ptr += 8;
>> >> > +			len -= 8;
>> >> > +		}
>> >> > +
>> >> > +		while (len > 0) {
>> >> > +			asm volatile(/* trailing bytes */
>> >> > +				"crc32b (%1), %0"
>> >> > +				: "=r" (acc)
>> >> > +				: "r" (ptr), "0" (acc)
>> >> > +			);
>> >> > +			ptr++;
>> >> > +			len--;
>> >> > +		}
>> >> > +
>> >> > +		csum = (u32) acc;
>> >> > +	} else {
>> >> > +		/* The kernel's crc32c() function should also detect and use the
>> >> > +		 * crc32 instruction of SSE-4.2. But calling in to this function
>> >> > +		 * is about 3x to 5x slower than the inline assembly version on
>> >> > +		 * some test machines.
>> >>
>> >> That is really odd. Did you try to characterize why this is the case? Is
>> >> it purely the overhead of dispatching to the correct backend function?
>> >> That's a rather big performance hit.
>> >>
>> >> > +		 */
>> >> > +		csum = crc32c(crc, data, len);
>> >> > +	}
>> >> > +
>> >> > +	return csum;
>> >> > +}
>> >> > +
>> >
>> > Are you sure that CONFIG_CRYPTO_CRC32C_INTEL was enabled during your tests
>> > and that the accelerated version was being called? Or, perhaps
>> > CRC32C_PCL_BREAKEVEN (defined in arch/x86/crypto/crc32c-intel_glue.c) needs
>> > to be adjusted. Please don't hack around performance problems like this;
>> > if they exist, they need to be fixed for everyone.
>> >
>>
>> I performed the crc32c test on a platform with a Xeon X5647 at 2.93GHz and
>> 14GB of DDR3 memory at 1066MHz.
>> You are right that enabling CONFIG_CRYPTO_CRC32C_INTEL improves the
>> performance significantly. nova_crc32c() is still slightly faster than
>> crc32c() with the option enabled.
>>
>> The results are as follows: data size in bytes, latency in ns. Column 3 is
>> crc32c() with CONFIG_CRYPTO_CRC32C_INTEL enabled, and column 4 with it
>> disabled.
>>
>> data size (bytes)   nova_crc32c()   crc32c() (enabled)   crc32c() (disabled)
>>                64              19                   21                    56
>>               128              28                   29                    99
>>               256              46                   43                   182
>>               512              82                  149                   354
>>              1024             157                  232                   728
>>              2048             305                  415                  1440
>>              4096             603                  725                  2869
>>
>
> Probably CRC32C_PCL_BREAKEVEN needs to be adjusted for that CPU, as I
> suggested may be the case; notice that your measured speeds are about the
> same below 512 (CRC32C_PCL_BREAKEVEN) bytes, but the crypto API version is
> slower at >= 512 bytes. It would be possible to set the breakeven point in
> crc32c_intel_mod_init() depending on the CPU.
> Again, if the performance is not good enough you need to fix it for
> everyone, not hack around it.
>

We verified that with CRC32C_PCL_BREAKEVEN set to 8192, the performance
difference between nova_crc32c() and the kernel's crc32c() is negligible.
Thanks for the comments; I will use the kernel's crc32c() in the next
version. Rough sketches of what I have in mind are below.

Thanks,
Andiry
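On the NOVA side, the helper should reduce to a thin wrapper around the
existing kernel interface. This is only a sketch of the intended change,
not the final patch; it simply keeps the current function name:

#include <linux/crc32c.h>

static inline u32 nova_crc32c(u32 crc, const u8 *data, size_t len)
{
	/* crc32c() dispatches to the SSE-4.2 accelerated backend when
	 * CONFIG_CRYPTO_CRC32C_INTEL is enabled, so no open-coded
	 * assembly is needed here.
	 */
	return crc32c(crc, data, len);
}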
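If per-CPU tuning of the breakeven point turns out to be the right fix on
the crypto side, I imagine it would look roughly like the sketch below.
This is purely illustrative: only CRC32C_PCL_BREAKEVEN and
crc32c_intel_mod_init() are real names from crc32c-intel_glue.c; the
variable, the helper, and the CPU model check are made up for the example.

/* Illustrative sketch: replace the compile-time CRC32C_PCL_BREAKEVEN
 * constant with a variable that can be tuned at module init time.
 */
static unsigned int crc32c_pcl_breakeven __read_mostly = 512;

/* Hypothetical helper, called from crc32c_intel_mod_init(): raise the
 * breakeven on CPUs where the PCLMULQDQ path only wins for large
 * buffers, based on measurements like the table above.
 */
static void __init crc32c_tune_pcl_breakeven(void)
{
	if (boot_cpu_data.x86_vendor == X86_VENDOR_INTEL &&
	    boot_cpu_data.x86 == 6 && boot_cpu_data.x86_model == 0x2c)
		crc32c_pcl_breakeven = 8192;	/* e.g. Westmere-EP (X5647) */
}

The update path would then compare the buffer length against
crc32c_pcl_breakeven instead of the #define before taking the PCLMULQDQ
path.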