[Note that many of these comments will apply to the arm64 version too.] On Mon, Jan 24, 2022 at 07:44:21PM -0600, Nathan Huckleberry wrote: > Add hardware accelerated version of POLYVAL for x86-64 CPUs with > PCLMULQDQ support. > > This implementation is accelerated using PCLMULQDQ instructions to > perform the finite field computations. For added efficiency, 8 blocks > of the plaintext are processed simultaneously by precomputing the first plaintext => message > diff --git a/arch/x86/crypto/Makefile b/arch/x86/crypto/Makefile > index ed187fcd0b01..0214c5f22606 100644 > --- a/arch/x86/crypto/Makefile > +++ b/arch/x86/crypto/Makefile > @@ -69,6 +69,9 @@ libblake2s-x86_64-y := blake2s-core.o blake2s-glue.o > obj-$(CONFIG_CRYPTO_GHASH_CLMUL_NI_INTEL) += ghash-clmulni-intel.o > ghash-clmulni-intel-y := ghash-clmulni-intel_asm.o ghash-clmulni-intel_glue.o > > +obj-$(CONFIG_CRYPTO_POLYVAL_CLMUL_NI_INTEL) += polyval-clmulni-intel.o > +polyval-clmulni-intel-y := polyval-clmulni-intel_asm.o polyval-clmulni-intel_glue.o > + IMO this should be named just polyval-clmulni. Including "intel" is a bit gratuituous, given that AMD supports this too, and this is in the x86 directory. I guess that some of the authors of some of the existing files wanted to include their company name. Doesn't actually matter, though; it's up to you. > diff --git a/arch/x86/crypto/polyval-clmulni-intel_asm.S b/arch/x86/crypto/polyval-clmulni-intel_asm.S > new file mode 100644 > index 000000000000..4339b58e610d > --- /dev/null > +++ b/arch/x86/crypto/polyval-clmulni-intel_asm.S > @@ -0,0 +1,319 @@ > +/* SPDX-License-Identifier: GPL-2.0 */ > +/* > + * Copyright 2021 Google LLC > + * > + * Use of this source code is governed by an MIT-style > + * license that can be found in the LICENSE file or at > + * https://opensource.org/licenses/MIT. > + */ > +/* > + * This is an efficient implementation of POLYVAL using intel PCLMULQDQ-NI > + * instructions. It works on 8 blocks at a time, computing the 256 degree > + * polynomial p(x) = h^8m_0 + ... + h^1m_7. It then computes the modular > + * reduction of p(x) and XORs p(x) with the current digest. > + */ What does "256 degree polynomial" mean here? > +/* > + * Accepts operand lists of length b in rdi and rsi. In general the first sentence of a comment describing a function or macro should be a summary of what it does, not some particular detail. > + * Computes the product of > + * each rdi,rsi pair then XORs the products into A, B, C, D. Where are A, B, and rsi used? > + * If first == 1 then XOR the value of SUM into the first block processed. > + * This avoids an extra multication of SUM and h^N. first == 1 on the *last* call per 8 blocks. Perhaps it needs a better name? > + * All other xmm registers clobbered This doesn't appear to be true; the code relies on GSTAR not being clobbered. > +.macro schoolbook1_iteration i first > + .set first, \first > + .set i, \i > + movups (16*i)(OP1), %xmm0 > + .if(i == 0 && first == 1) > + pxor SUM, %xmm0 > + .endif I don't think the ".set" statements are necessary here. You can just use \i and \first directly. > +/* > + * Computes first schoolbook step of values loaded into xmm0 and xmm1. Used to > + * multiply intermediate register values rather than memory stored values. > + * > + * XORs product into C, D, EF > + * Preserves SUM > + * All other xmm registers clobbered > + */ > +.macro schoolbook1_noload > + vpclmulqdq $0x01, %xmm0, %xmm1, %xmm2 > + vpxor %xmm2, EF, EF > + vpclmulqdq $0x00, %xmm0, %xmm1, %xmm3 > + vpxor %xmm3, C, C > + vpclmulqdq $0x11, %xmm0, %xmm1, %xmm4 > + vpxor %xmm4, D, D > + vpclmulqdq $0x10, %xmm0, %xmm1, %xmm5 > + vpxor %xmm5, EF, EF > +.endm So C holds the low part of the product, EF the middle part, and D the high part. How about giving these better names, like LO, MID, and HI? > +/* > + * Computes the 128-bit reduction of PL, PH. Stores the result in PH. > + * > + * PL, PH, Z, T. > + * All other xmm registers are preserved. > + */ > +.macro montgomery_reduction > + movdqa PL, T > + pclmulqdq $0x00, GSTAR, T # T = [X0 * g*(x)] > + pshufd $0b01001110, T, Z # Z = [T0 : T1] > + pxor Z, PL # PL = [X1 ^ T0 : X0 ^ T1] > + pxor PL, PH # PH = [X1 ^ T0 ^ X3 : X0 ^ T1 ^ X2] > + pclmulqdq $0x11, GSTAR, PL # PL = [X1 ^ T0 * g*(x)] > + pxor PL, PH > +.endm This really needs a comment that describes at a high level what is going on -- adding multiples of the reduction polynomial to cancel out the low-order parts. And also how Montgomery multiplication works in this context. The one-line comments don't help much, especially since "X" is never defined. Also, it seems like you've implemented an optimization that avoids a second pshufd instruction, over the simpler approach of folding 64 bits up twice in the same way. Can you add a comment that explains this? Also what do the names T and Z mean? If they're just temporary values, TMP1 and TMP2 might be better names. > +/* > + * Compute schoolbook multiplication for 8 blocks > + * (M_0h + REDUCE(PL, PH))h^8 + ... + M_{7}h^1 (no constant term) Shouldn't M_0h be just M_0? Also, isn't the REDUCE part conditional on \reduce? > +/* > + * Perform polynomial evaluation as specified by POLYVAL. Multiplies the value > + * stored at accumulator by h^k and XORs the evaluated polynomial into it. What is 'k'? > + * > + * Computes h^k*accumulator + h^kM_0 + ... + h^1M_{k-1} (No constant term) > + * > + * rdi (OP1) - pointer to message blocks > + * rsi - pointer to precomputed key struct > + * rdx - number of blocks to hash > + * rcx - location to XOR with evaluated polynomial > + * > + * void clmul_polyval_update(const u8 *in, const struct polyhash_key* keys, > + * size_t nblocks, ble128* accumulator); > + */ struct polyhash_key isn't defined anywhere. > diff --git a/arch/x86/crypto/polyval-clmulni-intel_glue.c b/arch/x86/crypto/polyval-clmulni-intel_glue.c > new file mode 100644 > index 000000000000..64a432b67b49 > --- /dev/null > +++ b/arch/x86/crypto/polyval-clmulni-intel_glue.c > @@ -0,0 +1,165 @@ > +// SPDX-License-Identifier: GPL-2.0-only > +/* > + * Accelerated POLYVAL implementation with Intel PCLMULQDQ-NI > + * instructions. This file contains glue code. > + * > + * Copyright (c) 2007 Nokia Siemens Networks - Mikko Herranen <mh1@xxxxxx> > + * Copyright (c) 2009 Intel Corp. > + * Author: Huang Ying <ying.huang@xxxxxxxxx> > + * Copyright 2021 Google LLC > + */ > +/* > + * Glue code based on ghash-clmulni-intel_glue.c. > + * > + * This implementation of POLYVAL uses montgomery multiplication > + * accelerated by PCLMULQDQ-NI to implement the finite field > + * operations. > + * > + */ > + > +#include <crypto/algapi.h> > +#include <crypto/gf128mul.h> > +#include <crypto/internal/hash.h> > +#include <linux/crypto.h> > +#include <linux/init.h> > +#include <linux/kernel.h> > +#include <linux/module.h> > +#include <asm/simd.h> > + > +#define POLYVAL_BLOCK_SIZE 16 > +#define POLYVAL_DIGEST_SIZE 16 How about including <crypto/polyval.h> (added by an earlier patch) to get these definitions? > +#define NUM_PRECOMPUTE_POWERS 8 > + > +struct polyval_ctx { > + be128 key_powers[NUM_PRECOMPUTE_POWERS]; > +}; There should be a comment that says what order the key_powers are in. Also why is the type be128? These aren't big endian. > +static int polyval_setkey(struct crypto_shash *tfm, > + const u8 *key, unsigned int keylen) > +{ > + struct polyval_ctx *ctx = crypto_shash_ctx(tfm); > + int i; > + > + if (keylen != POLYVAL_BLOCK_SIZE) > + return -EINVAL; This could use a: BUILD_BUG_ON(POLYVAL_BLOCK_SIZE != sizeof(be128)); > + > + memcpy(&ctx->key_powers[NUM_PRECOMPUTE_POWERS-1], key, sizeof(be128)); > + > + for (i = NUM_PRECOMPUTE_POWERS-2; i >= 0; i--) { > + memcpy(&ctx->key_powers[i], key, sizeof(be128)); > + clmul_polyval_mul(&ctx->key_powers[i], &ctx->key_powers[i+1]); > + } It appears this is using the SIMD registers without first executing kernel_fpu_begin(), which isn't valid. > +static int polyval_update(struct shash_desc *desc, > + const u8 *src, unsigned int srclen) > +{ > + struct polyval_desc_ctx *dctx = shash_desc_ctx(desc); > + struct polyval_ctx *ctx = crypto_shash_ctx(desc->tfm); > + u8 *dst = dctx->buffer; > + u8 *pos; > + unsigned int nblocks; > + int n; > + > + kernel_fpu_begin(); > + if (dctx->bytes) { > + n = min(srclen, dctx->bytes); > + pos = dst + POLYVAL_BLOCK_SIZE - dctx->bytes; > + > + dctx->bytes -= n; > + srclen -= n; > + > + while (n--) > + *pos++ ^= *src++; > + > + if (!dctx->bytes) > + clmul_polyval_mul((be128 *)dst, &ctx->key_powers[NUM_PRECOMPUTE_POWERS-1]); Casting u8 to be128 violates alignment rules. Given that clmul_polyval_mul() uses the unaligned load/store instructions on this argument, its type should be a byte pointer. > +static int polyval_final(struct shash_desc *desc, u8 *dst) > +{ > + struct polyval_desc_ctx *dctx = shash_desc_ctx(desc); > + struct polyval_ctx *ctx = crypto_shash_ctx(desc->tfm); > + u8 *buf = dctx->buffer; > + > + if (dctx->bytes) { > + kernel_fpu_begin(); > + clmul_polyval_mul((be128 *)dst, &ctx->key_powers[NUM_PRECOMPUTE_POWERS-1]); > + kernel_fpu_end(); > + } The above call to clmul_polyval_mul() is incorrect as it is reading from *dst before writing to it. Presumably non-block-multiple messages aren't being tested? I don't think that such messages make sense, so how about returning an error in that case instead? > +static struct shash_alg polyval_alg = { > + .digestsize = POLYVAL_DIGEST_SIZE, > + .init = polyval_init, > + .update = polyval_update, > + .final = polyval_final, > + .setkey = polyval_setkey, > + .descsize = sizeof(struct polyval_desc_ctx), > + .base = { > + .cra_name = "polyval", > + .cra_driver_name = "polyval-pclmulqdqni", How about "polyval-clmulni", like "ghash-clmulni"? pclmulqdqni is a mouthful. > + .cra_priority = 200, > + .cra_blocksize = POLYVAL_BLOCK_SIZE, > + .cra_ctxsize = sizeof(struct polyval_ctx), > + .cra_module = THIS_MODULE, > + }, > +}; > + > +static int __init polyval_mod_init(void) > +{ > + return crypto_register_shash(&polyval_alg); > +} > + > +static void __exit polyval_mod_exit(void) > +{ > + crypto_unregister_shash(&polyval_alg); > +} Hmm, so this isn't being wrapped with an ahash like the ghash implementation is. Unfortunately, I don't think that's allowed, since you are assuming that the code is always called in a context where SIMD instructions are usable. I don't think that's the case on x86; the other x86 crypto code goes to some length to avoid this. Unless anyone else has any better idea, I think you'll have to make the shash an internal algorithm, and wrap it with an ahash algorithm, like "ghash-clmulni" does. Ideally you'd refactor the ahash helper code from ghash-clmulni into crypto/simd.c, as otherwise you'll need to copy+paste essentially. > + > +subsys_initcall(polyval_mod_init); > +module_exit(polyval_mod_exit); > + > +MODULE_LICENSE("GPL"); > +MODULE_DESCRIPTION("POLYVAL hash function accelerated by PCLMULQDQ-NI"); > +MODULE_ALIAS_CRYPTO("polyval"); A MODULE_ALIAS_CRYPTO for the cra_driver_name should be added too. - Eric