On Fri, 4 Oct 2019 at 15:53, Jason A. Donenfeld <Jason@xxxxxxxxx> wrote:
>
> On Wed, Oct 2, 2019 at 4:17 PM Ard Biesheuvel <ard.biesheuvel@xxxxxxxxxx> wrote:
> > Expose the accelerated NEON ChaCha routine directly as a symbol
> > export so that users of the ChaCha library can use it directly.
>
> Eric had some nice code for ChaCha for certain ARM cores that lived in
> Zinc as chacha20-unrolled-arm.S. This code became active for certain
> cores where NEON was bad and for cores with no NEON. The condition for
> it was:
>
>         switch (read_cpuid_part()) {
>         case ARM_CPU_PART_CORTEX_A7:
>         case ARM_CPU_PART_CORTEX_A5:
>                 /* The Cortex-A7 and Cortex-A5 do not perform well with the NEON
>                  * implementation but do incredibly with the scalar one and use
>                  * less power.
>                  */
>                 break;
>         default:
>                 chacha20_use_neon = elf_hwcap & HWCAP_NEON;
>         }
>
> ...

How is it relevant whether the boot CPU is A5 or A7? These are the
little cores of big.LITTLE systems, which only implement NEON for
feature parity with their big counterparts, but CPU-intensive tasks
are scheduled on the big cores, where NEON performance is much better
than scalar.

If we need a policy for this in the kernel, I'd prefer it to be one at
the arch/arm level where we disable kernel mode NEON entirely, either
via a command line option, or via a policy based on the types of all
CPUs.

>
>         for (;;) {
>                 if (IS_ENABLED(CONFIG_KERNEL_MODE_NEON) && chacha20_use_neon &&
>                     len >= CHACHA20_BLOCK_SIZE * 3 && simd_use(simd_context)) {
>                         const size_t bytes = min_t(size_t, len, PAGE_SIZE);
>
>                         chacha20_neon(dst, src, bytes, ctx->key, ctx->counter);
>                         ctx->counter[0] += (bytes + 63) / 64;
>                         len -= bytes;
>                         if (!len)
>                                 break;
>                         dst += bytes;
>                         src += bytes;
>                         simd_relax(simd_context);
>                 } else {
>                         chacha20_arm(dst, src, len, ctx->key, ctx->counter);
>                         ctx->counter[0] += (len + 63) / 64;
>                         break;
>                 }
>         }
>
> It's another instance in which the generic code was totally optimized
> out of Zinc builds.
>
> Did these changes make it into the existing tree?

I'd like to keep Eric's code, but if it is really that much faster, we
might drop it in arch/arm/lib so it supersedes the builtin code that
/dev/random uses as well.
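
For anyone following the counter arithmetic in the quoted loop, here is a
minimal userspace sketch of just the dispatch and block accounting. It is not
the Zinc code: the cipher calls are left out, CHUNK_SIZE stands in for
PAGE_SIZE, the use_simd flag stands in for the combined
IS_ENABLED()/chacha20_use_neon/simd_use() check (which the original
re-evaluates on every iteration), and struct walk_stats, blocks_for() and
walk() are hypothetical names made up for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define BLOCK_SIZE 64u    /* ChaCha block size in bytes */
#define CHUNK_SIZE 4096u  /* assumption: stand-in for PAGE_SIZE */

/* Hypothetical bookkeeping for this sketch, not a kernel structure. */
struct walk_stats {
	uint32_t counter;      /* plays the role of ctx->counter[0] */
	unsigned simd_calls;   /* chunks that would go to the NEON routine */
	unsigned scalar_calls; /* calls that would go to the scalar routine */
};

/* Number of 64-byte blocks needed to cover `bytes`, rounding up. */
static uint32_t blocks_for(size_t bytes)
{
	return (uint32_t)((bytes + BLOCK_SIZE - 1) / BLOCK_SIZE);
}

/*
 * Mirror the control flow of the quoted loop: while SIMD is usable and
 * at least three blocks remain, consume at most one chunk per iteration;
 * otherwise hand the whole remainder to the scalar path and stop.
 */
static void walk(struct walk_stats *st, size_t len, int use_simd)
{
	assert(st != NULL);

	for (;;) {
		if (use_simd && len >= BLOCK_SIZE * 3) {
			size_t bytes = len < CHUNK_SIZE ? len : CHUNK_SIZE;

			st->simd_calls++;
			st->counter += blocks_for(bytes);
			len -= bytes;
			if (!len)
				break;
		} else {
			st->scalar_calls++;
			st->counter += blocks_for(len);
			break;
		}
	}
}
```

Because every SIMD chunk except possibly the last is a multiple of 64 bytes,
the per-chunk round-up adds up to the same block count as a single scalar
pass over the whole buffer, so the two paths leave the counter in the same
state.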