On 24 August 2018 at 16:32, Jeffrey Lien <Jeff.Lien@xxxxxxx> wrote: > I rebuilt my 4.18 kernel with CONFIG_CRYPTO_CRCT10DIF_PCLMUL=y as Martin recommended and got even better performance results vs the CRC Slice by 16 changes. Here's a summary of the results > > FIO Sequential Write, 64K Block Size, Queue Depth 64 > PCLMUL = y Kernel: bw = 2237 MiB/s > Slice by 16 CRC Calc: bw = 1964 MiB/s > Base Kernel: bw = 357 MiB/s > > FIO Sequential Read, 64K Block Size, Queue Depth 64 > PCLMUL = y Kernel: bw = 3839 MiB/s > Slice by 16 CRC Calc: bw = 2730 MiB/s > Base Kernel: bw = 797 MiB/s > > So it seems the CONFIG_CRYPTO_CRCT10DIF_PCLMUL=y provides the best performance. Are there any negative side effect to this config option? If not, does it make sense to recommend all the major distro's change their config options to have CONFIG_CRYPTO_CRCT10DIF_PCLMUL=y as the default option? > I think the way the library version of crc_t10dif() invokes the crypto API should be revised. Would it be possible to allocate the crypto transform upon first use instead of from an initcall? If crc_t10dif() is mostly called from non-process context, that would not really work, but otherwise, we could simply defer it (and occasional calls from non-process context that do occur would use the generic code until the point where another call from process context allocates the transform) > -----Original Message----- > From: Christoph Hellwig [mailto:hch@xxxxxxxxxxxxx] > Sent: Wednesday, August 22, 2018 1:20 AM > To: Martin K. Petersen <martin.petersen@xxxxxxxxxx> > Cc: Jeffrey Lien <Jeff.Lien@xxxxxxx>; linux-kernel@xxxxxxxxxxxxxxx; linux-crypto@xxxxxxxxxxxxxxx; linux-block@xxxxxxxxxxxxxxx; linux-scsi@xxxxxxxxxxxxxxx; herbert@xxxxxxxxxxxxxxxxxxx; tim.c.chen@xxxxxxxxxxxxxxx; David Darrington <david.darrington@xxxxxxx>; Jeff Furlong <jeff.furlong@xxxxxxx> > Subject: Re: [PATCH] Performance Improvement in CRC16 Calculations. > > On Tue, Aug 21, 2018 at 09:40:34PM -0400, Martin K. Petersen wrote: >> When crc-t10dif is initialized, the crypto infrastructure will pick >> the algorithm with the highest priority currently registered. Both >> block and SCSI will cause crc-t10dif to be compiled as a built-in so >> this selection happens very early. > > Ouch. This might actually happen in a lot of other users of the crypto functionality as well. > >> However, it seems like a bit of a deficiency in crypto that there is >> no way to upgrade existing transformations if higher priority >> algorithms become available. btrfs and a few others work around this >> issue by not using the generic lib/ CRC functions (which defeats the >> purpose of having these in the first place). Instead they are >> registering their own transformation at a later time where any >> accelerator modules are more likely to be loaded. > > If we can't fix this in crypto (which doesn't seem that easy), we should at least clearly document the issue somewhere, and fix this in the t10pi code by initializing crct10dif_tfm in a lazy fashion only once the fist block device starts using it.