I rebuilt my 4.18 kernel with CONFIG_CRYPTO_CRCT10DIF_PCLMUL=y as Martin
recommended and got even better performance than with the CRC slice-by-16
changes. Here's a summary of the results:

FIO Sequential Write, 64K Block Size, Queue Depth 64
  PCLMUL=y Kernel:       bw = 2237 MiB/s
  Slice by 16 CRC Calc:  bw = 1964 MiB/s
  Base Kernel:           bw =  357 MiB/s

FIO Sequential Read, 64K Block Size, Queue Depth 64
  PCLMUL=y Kernel:       bw = 3839 MiB/s
  Slice by 16 CRC Calc:  bw = 2730 MiB/s
  Base Kernel:           bw =  797 MiB/s

So it seems that CONFIG_CRYPTO_CRCT10DIF_PCLMUL=y provides the best
performance. Are there any negative side effects to this config option?
If not, does it make sense to recommend that all the major distros change
their configs to make CONFIG_CRYPTO_CRCT10DIF_PCLMUL=y the default?

Jeff Lien

-----Original Message-----
From: Christoph Hellwig [mailto:hch@xxxxxxxxxxxxx]
Sent: Wednesday, August 22, 2018 1:20 AM
To: Martin K. Petersen <martin.petersen@xxxxxxxxxx>
Cc: Jeffrey Lien <Jeff.Lien@xxxxxxx>; linux-kernel@xxxxxxxxxxxxxxx; linux-crypto@xxxxxxxxxxxxxxx; linux-block@xxxxxxxxxxxxxxx; linux-scsi@xxxxxxxxxxxxxxx; herbert@xxxxxxxxxxxxxxxxxxx; tim.c.chen@xxxxxxxxxxxxxxx; David Darrington <david.darrington@xxxxxxx>; Jeff Furlong <jeff.furlong@xxxxxxx>
Subject: Re: [PATCH] Performance Improvement in CRC16 Calculations.

On Tue, Aug 21, 2018 at 09:40:34PM -0400, Martin K. Petersen wrote:
> When crc-t10dif is initialized, the crypto infrastructure will pick
> the algorithm with the highest priority currently registered. Both
> block and SCSI will cause crc-t10dif to be compiled as a built-in so
> this selection happens very early.

Ouch. This might actually happen in a lot of other users of the crypto
functionality as well.

> However, it seems like a bit of a deficiency in crypto that there is
> no way to upgrade existing transformations if higher priority
> algorithms become available. btrfs and a few others work around this
> issue by not using the generic lib/ CRC functions (which defeats the
> purpose of having these in the first place). Instead they are
> registering their own transformation at a later time where any
> accelerator modules are more likely to be loaded.

If we can't fix this in crypto (which doesn't seem that easy), we
should at least clearly document the issue somewhere, and fix this in
the t10pi code by initializing crct10dif_tfm in a lazy fashion only
once the first block device starts using it.
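
For illustration, here is a minimal user-space C sketch of the lazy
selection suggested above. It is not kernel code and does not use the
crypto API: register_alg(), pick_best(), crc_t10dif() and the priority
values 100 and 200 are hypothetical names and numbers chosen for this
example, and the "pclmul-standin" entry just reuses the bitwise routine
instead of a real PCLMULQDQ implementation. It only shows that deferring
the choice until first use lets a later-registered, higher-priority
algorithm win.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical user-space model; none of these names are kernel APIs. */
typedef uint16_t (*crc_fn)(uint16_t crc, const uint8_t *buf, size_t len);

struct crc_alg {
    const char *name;
    int priority;              /* higher priority wins */
    crc_fn update;
};

#define MAX_ALGS 4
static struct crc_alg registry[MAX_ALGS];
static size_t n_algs;
static const struct crc_alg *chosen;   /* stays NULL until first use */

/* Bitwise CRC16 T10 DIF: polynomial 0x8BB7, MSB first, no reflection. */
static uint16_t crc_t10dif_bitwise(uint16_t crc, const uint8_t *buf, size_t len)
{
    for (size_t i = 0; i < len; i++) {
        crc ^= (uint16_t)(buf[i] << 8);
        for (int bit = 0; bit < 8; bit++)
            crc = (crc & 0x8000) ? (uint16_t)((crc << 1) ^ 0x8BB7)
                                 : (uint16_t)(crc << 1);
    }
    return crc;
}

/* Stand-in for an accelerated version; same result, no real speedup. */
static uint16_t crc_t10dif_fast_standin(uint16_t crc, const uint8_t *buf, size_t len)
{
    return crc_t10dif_bitwise(crc, buf, len);
}

/* Add an implementation to the registry (models a module being loaded). */
static void register_alg(const char *name, int priority, crc_fn update)
{
    assert(n_algs < MAX_ALGS);
    registry[n_algs++] = (struct crc_alg){ name, priority, update };
}

/* Return the highest-priority implementation registered so far. */
static const struct crc_alg *pick_best(void)
{
    const struct crc_alg *best = NULL;
    for (size_t i = 0; i < n_algs; i++)
        if (!best || registry[i].priority > best->priority)
            best = &registry[i];
    return best;
}

/*
 * Lazy selection: the implementation is chosen on the first CRC request
 * rather than at init time. Not thread-safe; a real version would need
 * proper synchronization around the first-use initialization.
 */
static uint16_t crc_t10dif(const uint8_t *buf, size_t len)
{
    if (!chosen)
        chosen = pick_best();
    assert(chosen != NULL);
    return chosen->update(0, buf, len);
}
```

If "generic" is registered first and "pclmul-standin" afterwards, the
first call to crc_t10dif() picks the higher-priority entry. Had the
choice been made right after the first registration, it would have stayed
with "generic", which is the situation described in the quoted text.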