Re: [PATCH v2 04/20] crypto: arm/chacha - expose ARM ChaCha routine as library function

Ard Biesheuvel <ard.biesheuvel@xxxxxxxxxx> · Fri, 4 Oct 2019 16:23:32 +0200

On Fri, 4 Oct 2019 at 15:53, Jason A. Donenfeld <Jason@xxxxxxxxx> wrote:
>
> On Wed, Oct 2, 2019 at 4:17 PM Ard Biesheuvel <ard.biesheuvel@xxxxxxxxxx> wrote:
> > Expose the accelerated NEON ChaCha routine directly as a symbol
> > export so that users of the ChaCha library can use it directly.
>
> Eric had some nice code for ChaCha for certain ARM cores that lived in
> Zinc as chacha20-unrolled-arm.S. This code became active for certain
> cores where NEON was bad and for cores with no NEON. The condition for
> it was:
>
>        switch (read_cpuid_part()) {
>        case ARM_CPU_PART_CORTEX_A7:
>        case ARM_CPU_PART_CORTEX_A5:
>                /* The Cortex-A7 and Cortex-A5 do not perform well with the NEON
>                 * implementation but do incredibly with the scalar one and use
>                 * less power.
>                 */
>                break;
>        default:
>                chacha20_use_neon = elf_hwcap & HWCAP_NEON;
>        }
>
> ...

How is it relevant whether the boot CPU is A5 or A7? These are bL
little cores that only implement NEON for feature parity with their bL
big counterparts, but CPU-intensive tasks are scheduled on big cores,
where NEON performance is much better than scalar.

If we need a policy for this in the kernel, I'd prefer it to be one at
the arch/arm level where we disable kernel mode NEON entirely, either
via a command line option, or via a policy based on the types of
all CPUs.
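
To make the idea concrete, here is a rough sketch of what such a
system-wide policy could look like. This is plain C for illustration
only, not existing arch/arm code: the function names, the
cmdline_disable flag and the local PART_* macros are all hypothetical
stand-ins (the macros mirror the kernel's ARM_CPU_PART_* values), and
the list of cores that prefer scalar code is an assumption taken from
the quoted snippet above.

```c
#include <stdbool.h>
#include <stddef.h>

/* Illustrative stand-ins for the kernel's ARM_CPU_PART_* constants. */
#define PART_CORTEX_A5  0x4100c050u
#define PART_CORTEX_A7  0x4100c070u
#define PART_CORTEX_A15 0x4100c0f0u

/* Hypothetical: cores assumed to run scalar code better than NEON. */
static bool part_prefers_scalar(unsigned int part)
{
	return part == PART_CORTEX_A5 || part == PART_CORTEX_A7;
}

/*
 * Hypothetical policy: kernel mode NEON stays enabled as long as at
 * least one CPU in the system benefits from it. It is disabled when
 * NEON is absent, when the user asked for that on the command line,
 * or when every CPU is a core that prefers scalar code.
 */
static bool kernel_neon_allowed(const unsigned int *parts, size_t ncpus,
				bool has_neon, bool cmdline_disable)
{
	size_t i;

	if (!has_neon || cmdline_disable)
		return false;

	for (i = 0; i < ncpus; i++) {
		if (!part_prefers_scalar(parts[i]))
			return true;	/* a big core is present */
	}

	/* Only little cores (or no CPUs listed): prefer scalar code. */
	return false;
}
```

The point of looking at all CPUs rather than read_cpuid_part() is that
the result no longer depends on which core happened to boot first: an
A7+A15 system keeps NEON, an all-A7 system does not.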

>
>        for (;;) {
>                if (IS_ENABLED(CONFIG_KERNEL_MODE_NEON) && chacha20_use_neon &&
>                    len >= CHACHA20_BLOCK_SIZE * 3 && simd_use(simd_context)) {
>                        const size_t bytes = min_t(size_t, len, PAGE_SIZE);
>
>                        chacha20_neon(dst, src, bytes, ctx->key, ctx->counter);
>                        ctx->counter[0] += (bytes + 63) / 64;
>                        len -= bytes;
>                        if (!len)
>                                break;
>                        dst += bytes;
>                        src += bytes;
>                        simd_relax(simd_context);
>                } else {
>                        chacha20_arm(dst, src, len, ctx->key, ctx->counter);
>                        ctx->counter[0] += (len + 63) / 64;
>                        break;
>                }
>        }
>
> It's another instance in which the generic code was totally optimized
> out of Zinc builds.
>
> Did these changes make it into the existing tree?

I'd like to keep Eric's code, but if it is really that much faster, we
might drop it in arch/arm/lib so it supersedes the builtin code that
/dev/random uses as well.