On Wed, Mar 23, 2022 at 8:37 PM Eric Biggers <ebiggers@xxxxxxxxxx> wrote:
>
> On Tue, Mar 15, 2022 at 11:00:34PM +0000, Nathan Huckleberry wrote:
> > Add hardware accelerated version of POLYVAL for ARM64 CPUs with
> > Crypto Extension support.
>
> Nit: It's "Crypto Extensions", not "Crypto Extension".
>
> > +config CRYPTO_POLYVAL_ARM64_CE
> > +	tristate "POLYVAL using ARMv8 Crypto Extensions (for HCTR2)"
> > +	depends on KERNEL_MODE_NEON
> > +	select CRYPTO_CRYPTD
> > +	select CRYPTO_HASH
> > +	select CRYPTO_POLYVAL
>
> CRYPTO_POLYVAL selects CRYPTO_HASH already, so there's no need to select it
> here.
>
> > +/*
> > + * Perform polynomial evaluation as specified by POLYVAL. This computes:
> > + *	h^n * accumulator + h^n * m_0 + ... + h^1 * m_{n-1}
> > + * where n=nblocks, h is the hash key, and m_i are the message blocks.
> > + *
> > + * x0 - pointer to message blocks
> > + * x1 - pointer to precomputed key powers h^8 ... h^1
> > + * x2 - number of blocks to hash
> > + * x3 - pointer to accumulator
> > + *
> > + * void pmull_polyval_update(const u8 *in, const struct polyval_ctx *ctx,
> > + *			     size_t nblocks, u8 *accumulator);
> > + */
> > +SYM_FUNC_START(pmull_polyval_update)
> > +	adr	TMP, .Lgstar
> > +	ld1	{GSTAR.2d}, [TMP]
> > +	ld1	{SUM.16b}, [x3]
> > +	ands	PARTIAL_LEFT, BLOCKS_LEFT, #7
> > +	beq	.LskipPartial
> > +	partial_stride
> > +.LskipPartial:
> > +	subs	BLOCKS_LEFT, BLOCKS_LEFT, #NUM_PRECOMPUTE_POWERS
> > +	blt	.LstrideLoopExit
> > +	ld1	{KEY8.16b, KEY7.16b, KEY6.16b, KEY5.16b}, [x1], #64
> > +	ld1	{KEY4.16b, KEY3.16b, KEY2.16b, KEY1.16b}, [x1], #64
> > +	full_stride 0
> > +	subs	BLOCKS_LEFT, BLOCKS_LEFT, #NUM_PRECOMPUTE_POWERS
> > +	blt	.LstrideLoopExitReduce
> > +.LstrideLoop:
> > +	full_stride 1
> > +	subs	BLOCKS_LEFT, BLOCKS_LEFT, #NUM_PRECOMPUTE_POWERS
> > +	bge	.LstrideLoop
> > +.LstrideLoopExitReduce:
> > +	montgomery_reduction
> > +	mov	SUM.16b, PH.16b
> > +.LstrideLoopExit:
> > +	st1	{SUM.16b}, [x3]
> > +	ret
> > +SYM_FUNC_END(pmull_polyval_update)
>
> Is there a reason why partial_stride is done first in the arm64 implementation,
> but last in the x86 implementation?  It would be nice if the implementations
> worked the same way.  Probably last would be better?  What is the advantage of
> doing it first?

It was so I could return early without loading keys into registers, since
I only need them if there's a full stride. I was able to rewrite it in the
same way that the x86 implementation works (a rough sketch of the reordered
flow is at the end of this mail).

>
> Besides that, many of the comments I made on the x86 implementation apply to
> the arm64 implementation too.
>
> - Eric
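
For anyone following along, here is a rough C-level sketch of the reordered
control flow (full 8-block strides first, then the 1-7 leftover blocks),
mirroring the x86 structure. The helper names, struct layout, and constants
below are illustrative stand-ins for the PMULL assembly macros, not the
actual kernel code:

#include <stddef.h>
#include <stdint.h>

#define POLYVAL_BLOCK_SIZE	16
#define NUM_KEY_POWERS		8

struct polyval_key {
	/* Assumed layout: key_powers[0] = h^8, ..., key_powers[7] = h^1 */
	uint8_t key_powers[NUM_KEY_POWERS][POLYVAL_BLOCK_SIZE];
};

/* Hypothetical helpers standing in for the full_stride/partial_stride macros. */
void polyval_full_stride(uint8_t acc[POLYVAL_BLOCK_SIZE],
			 const struct polyval_key *key, const uint8_t *blocks);
void polyval_partial_stride(uint8_t acc[POLYVAL_BLOCK_SIZE],
			    const struct polyval_key *key, const uint8_t *blocks,
			    size_t nblocks);

void polyval_update_sketch(uint8_t acc[POLYVAL_BLOCK_SIZE],
			   const struct polyval_key *key,
			   const uint8_t *in, size_t nblocks)
{
	/* Process full 8-block strides first; key powers are loaded up front. */
	while (nblocks >= NUM_KEY_POWERS) {
		polyval_full_stride(acc, key, in);
		in += NUM_KEY_POWERS * POLYVAL_BLOCK_SIZE;
		nblocks -= NUM_KEY_POWERS;
	}

	/* Handle any remaining 1-7 blocks last, as the x86 version does. */
	if (nblocks)
		polyval_partial_stride(acc, key, in, nblocks);
}

Doing the partial stride last loses the early return I originally had, but it
keeps the two implementations structured the same way, which seems worth it.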