On Mon, 2 Nov 2020 at 01:30, Jason A. Donenfeld <Jason@xxxxxxxxx> wrote:
>
> Cool patch! I look forward to getting out the old arm32 rig and
> benching this. One question:
>
> On Sun, Nov 1, 2020 at 5:33 PM Ard Biesheuvel <ardb@xxxxxxxxxx> wrote:
> > On out-of-order microarchitectures such as Cortex-A57, this results in
> > a speedup for 1420 byte blocks of about 21%, without any significant
> > performance impact on the power-of-2 block sizes. On lower end cores
> > such as Cortex-A53, the speedup for 1420 byte blocks is only about 2%,
> > but also without impacting other input sizes.
>
> A57 and A53 are 64-bit, but this is code for 32-bit arm, right? So the
> comparison is more like A15 vs A5? Or are you running 32-bit kernels
> on armv8 hardware?

The latter.

The only 32-bit hardware I have in my drawer is Cortex-A8, which I
expect to benefit from this change, but the way its micro-architecture
integrates the NEON stages into the pipeline is a bit odd, so you
cannot really extrapolate from those results to other cores.

Cortex-A57 and Cortex-A15 should be fairly similar, so that is really
the target for this optimization.

Cortex-A5 and A7 already omit the NEON code path entirely, so they are
not affected in the first place.

Cortex-A53 is significant because this is what the Raspberry Pi 3 uses
(and it ships with a 32-bit kernel).