On Tue, Mar 15, 2022 at 11:00:34PM +0000, Nathan Huckleberry wrote:
> Add hardware accelerated version of POLYVAL for ARM64 CPUs with
> Crypto Extension support.

Nit: It's "Crypto Extensions", not "Crypto Extension".

> +config CRYPTO_POLYVAL_ARM64_CE
> +	tristate "POLYVAL using ARMv8 Crypto Extensions (for HCTR2)"
> +	depends on KERNEL_MODE_NEON
> +	select CRYPTO_CRYPTD
> +	select CRYPTO_HASH
> +	select CRYPTO_POLYVAL

CRYPTO_POLYVAL selects CRYPTO_HASH already, so there's no need to select it
here.

> +/*
> + * Perform polynomial evaluation as specified by POLYVAL.  This computes:
> + *	h^n * accumulator + h^n * m_0 + ... + h^1 * m_{n-1}
> + * where n=nblocks, h is the hash key, and m_i are the message blocks.
> + *
> + * x0 - pointer to message blocks
> + * x1 - pointer to precomputed key powers h^8 ... h^1
> + * x2 - number of blocks to hash
> + * x3 - pointer to accumulator
> + *
> + * void pmull_polyval_update(const u8 *in, const struct polyval_ctx *ctx,
> + *			     size_t nblocks, u8 *accumulator);
> + */
> +SYM_FUNC_START(pmull_polyval_update)
> +	adr		TMP, .Lgstar
> +	ld1		{GSTAR.2d}, [TMP]
> +	ld1		{SUM.16b}, [x3]
> +	ands		PARTIAL_LEFT, BLOCKS_LEFT, #7
> +	beq		.LskipPartial
> +	partial_stride
> +.LskipPartial:
> +	subs		BLOCKS_LEFT, BLOCKS_LEFT, #NUM_PRECOMPUTE_POWERS
> +	blt		.LstrideLoopExit
> +	ld1		{KEY8.16b, KEY7.16b, KEY6.16b, KEY5.16b}, [x1], #64
> +	ld1		{KEY4.16b, KEY3.16b, KEY2.16b, KEY1.16b}, [x1], #64
> +	full_stride 0
> +	subs		BLOCKS_LEFT, BLOCKS_LEFT, #NUM_PRECOMPUTE_POWERS
> +	blt		.LstrideLoopExitReduce
> +.LstrideLoop:
> +	full_stride 1
> +	subs		BLOCKS_LEFT, BLOCKS_LEFT, #NUM_PRECOMPUTE_POWERS
> +	bge		.LstrideLoop
> +.LstrideLoopExitReduce:
> +	montgomery_reduction
> +	mov		SUM.16b, PH.16b
> +.LstrideLoopExit:
> +	st1		{SUM.16b}, [x3]
> +	ret
> +SYM_FUNC_END(pmull_polyval_update)

Is there a reason why partial_stride is done first in the arm64 implementation,
but last in the x86 implementation?  It would be nice if the implementations
worked the same way.
Probably last would be better?  What is the advantage of doing it first?

Besides that, many of the comments I made on the x86 implementation apply to the
arm64 implementation too.

- Eric
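
For reference on the ordering question above, here is a minimal Python sketch, not taken from the patch, showing that handling the partial stride first or last yields the same accumulator. It is a toy model under stated assumptions: `gf_mul`, `stride`, `update`, and `horner` are hypothetical names, and `gf_mul` is a plain GF(2^128) multiply modulo the POLYVAL polynomial that omits the extra x^-128 factor of the real POLYVAL multiplication. That omission does not affect the ordering argument, which relies only on the multiplication being associative and distributive.

```python
# Toy model only: plain GF(2^128) multiplication modulo the POLYVAL polynomial
# x^128 + x^127 + x^126 + x^121 + 1. The x^-128 factor used by real POLYVAL
# is deliberately omitted (assumption made for brevity).
POLY = (1 << 128) | (1 << 127) | (1 << 126) | (1 << 121) | 1
STRIDE = 8  # mirrors NUM_PRECOMPUTE_POWERS in the quoted patch


def gf_mul(a, b):
    """Carry-less multiply of a and b, reduced modulo POLY."""
    product = 0
    while b:
        if b & 1:
            product ^= a
        b >>= 1
        a <<= 1
        if a >> 128:  # degree reached 128, fold it back down
            a ^= POLY
    return product


def horner(acc, blocks, h):
    """Reference: one block at a time, acc = (acc + m_i) * h."""
    for m in blocks:
        acc = gf_mul(acc ^ m, h)
    return acc


def stride(acc, blocks, powers):
    """Absorb len(blocks) <= STRIDE blocks at once; powers[k - 1] holds h^k."""
    n = len(blocks)
    total = gf_mul(acc ^ blocks[0], powers[n - 1])
    for i in range(1, n):
        total ^= gf_mul(blocks[i], powers[n - 1 - i])
    return total


def update(acc, blocks, h, partial_first):
    """Strided update; the leftover blocks go first or last."""
    powers = [h]
    for _ in range(STRIDE - 1):
        powers.append(gf_mul(powers[-1], h))
    rem = len(blocks) % STRIDE
    if partial_first and rem:
        acc = stride(acc, blocks[:rem], powers)
        blocks = blocks[rem:]
    full = len(blocks) - len(blocks) % STRIDE
    for i in range(0, full, STRIDE):
        acc = stride(acc, blocks[i:i + STRIDE], powers)
    if len(blocks) > full:  # only reached when the partial stride goes last
        acc = stride(acc, blocks[full:], powers)
    return acc
```

Since both orderings compute the same value in this model, the choice between them would come down to code structure and consistency between the arm64 and x86 implementations rather than correctness.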