On Fri, Nov 06, 2020 at 05:39:38PM +0100, Ard Biesheuvel wrote:
> Based on lessons learnt from optimizing the 32-bit version of this driver,
> we can simplify the arm64 version considerably, by reordering the final
> two stores when the last block is not a multiple of 64 bytes. This removes
> the need to use permutation instructions to calculate the elements that are
> clobbered by the final overlapping store, given that the store of the
> penultimate block now follows it, and that one carries the correct values
> for those elements already.
>
> While at it, simplify the overlapping loads as well, by calculating the
> address of the final overlapping load upfront, and switching to this
> address for every load that would otherwise extend past the end of the
> source buffer.
>
> There is no impact on performance, but the resulting code is substantially
> smaller and easier to follow.
>
> Cc: Eric Biggers <ebiggers@xxxxxxxxxx>
> Cc: "Jason A . Donenfeld" <Jason@xxxxxxxxx>
> Signed-off-by: Ard Biesheuvel <ardb@xxxxxxxxxx>
> ---
>  arch/arm64/crypto/chacha-neon-core.S | 193 +++++++-------------
>  1 file changed, 69 insertions(+), 124 deletions(-)

Patch applied.  Thanks.
-- 
Email: Herbert Xu <herbert@xxxxxxxxxxxxxxxxxxx>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
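
For readers following the thread, the store reordering described in the quoted commit message can be illustrated with a small scalar C sketch. This is not the code from chacha-neon-core.S: the function name xor_last_two, the BLK constant, and the flat keystream buffer ks are assumptions made for illustration only, and the sketch covers just the case of one full block followed by one partial block. The final block is loaded and stored through a 64-byte window that ends exactly at the end of the buffer, so its store clobbers part of the penultimate block; the penultimate block is stored afterwards and puts the correct bytes back.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define BLK 64 /* ChaCha block size in bytes */

/*
 * Hypothetical helper: XOR src with keystream ks into dst for
 * BLK < len < 2 * BLK, i.e. one full block plus a partial one.
 * ks must hold at least len bytes of keystream.
 */
static void xor_last_two(uint8_t *dst, const uint8_t *src,
                         const uint8_t *ks, size_t len)
{
    assert(len > BLK && len < 2 * BLK);

    size_t tail = len - BLK; /* bytes in the final partial block */
    uint8_t pen[BLK], fin[BLK], shifted[BLK];

    /* Penultimate block: an ordinary full-block XOR. */
    for (size_t i = 0; i < BLK; i++)
        pen[i] = src[i] ^ ks[i];

    /*
     * Final block: load the 64 bytes that END at src + len, so the load
     * never runs past the buffer. The keystream of the final block is
     * shifted to line up with the end of that window. The leading
     * BLK - tail bytes of the window get a zero keystream here, so the
     * values computed for them are wrong on purpose.
     */
    memset(shifted, 0, BLK);
    memcpy(shifted + (BLK - tail), ks + BLK, tail);
    for (size_t i = 0; i < BLK; i++)
        fin[i] = src[tail + i] ^ shifted[i];

    /*
     * All loads are done, so dst may alias src. Store the overlapping
     * final window first, then the penultimate block: the second store
     * overwrites the clobbered bytes dst[tail..BLK-1] with correct data.
     */
    memcpy(dst + tail, fin, BLK);
    memcpy(dst, pen, BLK);
}
```

With the stores in the opposite order, the leading bytes of the overlapping window would have to be computed correctly before being written, which is the extra permutation work the commit message says is no longer needed.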