Re: [PATCH] crypto: arm/chacha-neon - optimize for non-block size multiples

Ard Biesheuvel <ardb@xxxxxxxxxx> · Mon, 2 Nov 2020 07:44:09 +0100

On Mon, 2 Nov 2020 at 01:30, Jason A. Donenfeld <Jason@xxxxxxxxx> wrote:
>
> Cool patch! I look forward to getting out the old arm32 rig and
> benching this. One question:
>
> On Sun, Nov 1, 2020 at 5:33 PM Ard Biesheuvel <ardb@xxxxxxxxxx> wrote:
> > On out-of-order microarchitectures such as Cortex-A57, this results in
> > a speedup for 1420 byte blocks of about 21%, without any signficant
> > performance impact of the power-of-2 block sizes. On lower end cores
> > such as Cortex-A53, the speedup for 1420 byte blocks is only about 2%,
> > but also without impacting other input sizes.
>
> A57 and A53 are 64-bit, but this is code for 32-bit arm, right? So the
> comparison is more like A15 vs A5? Or are you running 32-bit kernels
> on armv8 hardware?

The latter. The only 32-bit hardware I have in my drawer is Cortex-A8,
which I expect to benefit from this change, but the way its
micro-architecture integrates the NEON stages into the pipeline is a
bit odd, and therefore, you cannot really extrapolate from those
results for other cores.

Cortex-A57 and Cortex-A15 should be fairly similar, so that is really
the target for this optimization. Cortex-A5 and A7 already omit the
NEON code path entirely, so they are not affected in the first place.
Cortex-A53 is significant because this is what the Raspberry Pi3 uses
(and it ships with a 32-bit kernel)