Cool patch! I look forward to getting out the old arm32 rig and
benching this. One question:

On Sun, Nov 1, 2020 at 5:33 PM Ard Biesheuvel <ardb@xxxxxxxxxx> wrote:
> On out-of-order microarchitectures such as Cortex-A57, this results in
> a speedup for 1420 byte blocks of about 21%, without any significant
> performance impact of the power-of-2 block sizes. On lower end cores
> such as Cortex-A53, the speedup for 1420 byte blocks is only about 2%,
> but also without impacting other input sizes.

A57 and A53 are 64-bit, but this is code for 32-bit arm, right? So the
comparison is more like A15 vs A5? Or are you running 32-bit kernels
on armv8 hardware?