On Mon, Jun 24, 2019 at 9:38 AM Ard Biesheuvel <ard.biesheuvel@xxxxxxxxxx> wrote:
> The generic AES code provides four sets of lookup tables, where each
> set consists of four tables containing the same 32-bit values, but
> rotated by 0, 8, 16 and 24 bits, respectively. This makes sense for
> CISC architectures such as x86, which support memory operands, but
> for other architectures the rotates are quite cheap, and using all
> four tables needlessly thrashes the D-cache, and actually hurts rather
> than helps performance.
>
> Since x86 already has its own implementation of AEGIS based on AES-NI
> instructions, let's tweak the generic implementation towards other
> architectures: avoid the prerotated tables and perform the rotations
> inline. On ARM Cortex-A53, this results in a ~8% speedup.
>
> Signed-off-by: Ard Biesheuvel <ard.biesheuvel@xxxxxxxxxx>

I'm not an expert on low-level performance, but the rationale sounds
reasonable.

Acked-by: Ondrej Mosnacek <omosnace@xxxxxxxxxx>

> ---
>  crypto/aegis.h | 14 ++++++--------
>  1 file changed, 6 insertions(+), 8 deletions(-)
>
> diff --git a/crypto/aegis.h b/crypto/aegis.h
> index 41a3090cda8e..3308066ddde0 100644
> --- a/crypto/aegis.h
> +++ b/crypto/aegis.h
> @@ -10,6 +10,7 @@
>  #define _CRYPTO_AEGIS_H
>
>  #include <crypto/aes.h>
> +#include <linux/bitops.h>
>  #include <linux/types.h>
>
>  #define AEGIS_BLOCK_SIZE 16
> @@ -53,16 +54,13 @@ static void crypto_aegis_aesenc(union aegis_block *dst,
>  				const union aegis_block *key)
>  {
>  	const u8 *s = src->bytes;
> -	const u32 *t0 = crypto_ft_tab[0];
> -	const u32 *t1 = crypto_ft_tab[1];
> -	const u32 *t2 = crypto_ft_tab[2];
> -	const u32 *t3 = crypto_ft_tab[3];
> +	const u32 *t = crypto_ft_tab[0];
>  	u32 d0, d1, d2, d3;
>
> -	d0 = t0[s[ 0]] ^ t1[s[ 5]] ^ t2[s[10]] ^ t3[s[15]];
> -	d1 = t0[s[ 4]] ^ t1[s[ 9]] ^ t2[s[14]] ^ t3[s[ 3]];
> -	d2 = t0[s[ 8]] ^ t1[s[13]] ^ t2[s[ 2]] ^ t3[s[ 7]];
> -	d3 = t0[s[12]] ^ t1[s[ 1]] ^ t2[s[ 6]] ^ t3[s[11]];
> +	d0 = t[s[ 0]] ^ rol32(t[s[ 5]], 8) ^ rol32(t[s[10]], 16) ^ rol32(t[s[15]], 24);
> +	d1 = t[s[ 4]] ^ rol32(t[s[ 9]], 8) ^ rol32(t[s[14]], 16) ^ rol32(t[s[ 3]], 24);
> +	d2 = t[s[ 8]] ^ rol32(t[s[13]], 8) ^ rol32(t[s[ 2]], 16) ^ rol32(t[s[ 7]], 24);
> +	d3 = t[s[12]] ^ rol32(t[s[ 1]], 8) ^ rol32(t[s[ 6]], 16) ^ rol32(t[s[11]], 24);
>
>  	dst->words32[0] = cpu_to_le32(d0) ^ key->words32[0];
>  	dst->words32[1] = cpu_to_le32(d1) ^ key->words32[1];
> --
> 2.20.1

--
Ondrej Mosnacek <omosnace at redhat dot com>
Software Engineer, Security Technologies
Red Hat, Inc.
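
P.S. For anyone skimming the archive, the equivalence the patch relies
on is easy to sanity-check outside the kernel. Below is a minimal
user-space sketch, not kernel code: the table values are arbitrary
stand-ins for the crypto_ft_tab contents, and the two lookup helpers are
hypothetical names of mine. It compares the old four-table lookup
against the new one-table-plus-rol32 lookup, and depends only on the
property the patch itself assumes, namely
crypto_ft_tab[i][x] == rol32(crypto_ft_tab[0][x], 8 * i).

#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/* User-space stand-in for the kernel's rol32() from <linux/bitops.h>. */
static inline uint32_t rol32(uint32_t w, unsigned int s)
{
	return (w << s) | (w >> ((32 - s) & 31));
}

static uint32_t t0[256], t1[256], t2[256], t3[256];

/* Old style: four separate tables, one load per term. */
static uint32_t lookup_four_tables(const uint8_t b[4])
{
	return t0[b[0]] ^ t1[b[1]] ^ t2[b[2]] ^ t3[b[3]];
}

/* New style: one table, with the rotations done in registers. */
static uint32_t lookup_one_table(const uint8_t b[4])
{
	return t0[b[0]] ^ rol32(t0[b[1]], 8) ^
	       rol32(t0[b[2]], 16) ^ rol32(t0[b[3]], 24);
}

int main(void)
{
	uint8_t b[4];
	int i;

	/*
	 * Fill t0 with arbitrary values and derive t1..t3 by rotation,
	 * mirroring how the pre-rotated AES tables relate to each other.
	 */
	srand(1);
	for (i = 0; i < 256; i++) {
		t0[i] = (uint32_t)rand() * 2654435761u;
		t1[i] = rol32(t0[i], 8);
		t2[i] = rol32(t0[i], 16);
		t3[i] = rol32(t0[i], 24);
	}

	/* The two lookup styles must agree for any input bytes. */
	for (i = 0; i < 100000; i++) {
		b[0] = (uint8_t)rand();
		b[1] = (uint8_t)rand();
		b[2] = (uint8_t)rand();
		b[3] = (uint8_t)rand();
		assert(lookup_four_tables(b) == lookup_one_table(b));
	}
	printf("four-table and one-table lookups agree\n");
	return 0;
}

Build with something like "gcc -O2 demo.c && ./a.out" (the file name is
just an example). Of course this only checks the algebra, not the
D-cache-footprint argument, which is what the Cortex-A53 numbers speak
to.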