Hi Eric,

On Sat, 8 Feb 2025 at 04:52, Eric Biggers <ebiggers@xxxxxxxxxx> wrote:
>
> From: Eric Biggers <ebiggers@xxxxxxxxxx>
>
> Delete aes_ctrby8_avx-x86_64.S and add a new assembly file
> aes-ctr-avx-x86_64.S which follows a similar approach to
> aes-xts-avx-x86_64.S in that it uses a "template" to provide AESNI+AVX,
> VAES+AVX2, VAES+AVX10/256, and VAES+AVX10/512 code, instead of just
> AESNI+AVX. Wire it up to the crypto API accordingly.
>
> This greatly improves the performance of AES-CTR and AES-XCTR on
> VAES-capable CPUs, with the best case being AMD Zen 5 where an over 230%
> increase in throughput is seen on long messages. Performance on
> non-VAES-capable CPUs remains about the same, and the non-AVX AES-CTR
> code (aesni_ctr_enc) is also kept as-is for now. There are some slight
> regressions (less than 10%) on some short message lengths on some CPUs;
> these are difficult to avoid, given how the previous code was so heavily
> unrolled by message length, and they are not particularly important.
> Detailed performance results are given in the tables below.
>
> Both CTR and XCTR support is retained. The main loop remains
> 8-vector-wide, which differs from the 4-vector-wide main loops that are
> used in the XTS and GCM code. A wider loop is appropriate for CTR and
> XCTR since they have fewer other instructions (such as vpclmulqdq) to
> interleave with the AES instructions.
>
> Similar to what was the case for AES-GCM, the new assembly code also has
> a much smaller binary size, as it fixes the excessive unrolling by data
> length and key length present in the old code. Specifically, the new
> assembly file compiles to about 9 KB of text vs. 28 KB for the old file.
> This is despite 4x as many implementations being included.
>
> The tables below show the detailed performance results. The tables show
> percentage improvement in single-threaded throughput for repeated
> encryption of the given message length; an increase from 6000 MB/s to
> 12000 MB/s would be listed as 100%. They were collected by directly
> measuring the Linux crypto API performance using a custom kernel module.
> The tested CPUs were all server processors from Google Compute Engine
> except for Zen 5 which was a Ryzen 9 9950X desktop processor.
>
> Table 1: AES-256-CTR throughput improvement,
>          CPU microarchitecture vs. message length in bytes:
>
>                      | 16384 |  4096 |  4095 |  1420 |   512 |   500 |
> ---------------------+-------+-------+-------+-------+-------+-------+
> AMD Zen 5            |  232% |  203% |  212% |  143% |   71% |   95% |
> Intel Emerald Rapids |  116% |  116% |  117% |   91% |   78% |   79% |
> Intel Ice Lake       |  109% |  103% |  107% |   81% |   54% |   56% |
> AMD Zen 4            |  109% |   91% |  100% |   70% |   43% |   59% |
> AMD Zen 3            |   92% |   78% |   87% |   57% |   32% |   43% |
> AMD Zen 2            |    9% |    8% |   14% |   12% |    8% |   21% |
> Intel Skylake        |    7% |    7% |    8% |    5% |    3% |    8% |
>
>                      |   300 |   200 |    64 |    63 |    16 |
> ---------------------+-------+-------+-------+-------+-------+
> AMD Zen 5            |   57% |   39% |   -9% |    7% |   -7% |
> Intel Emerald Rapids |   37% |   42% |   -0% |   13% |   -8% |
> Intel Ice Lake       |   39% |   30% |   -1% |   14% |   -9% |
> AMD Zen 4            |   42% |   38% |   -0% |   18% |   -3% |
> AMD Zen 3            |   38% |   35% |    6% |   31% |    5% |
> AMD Zen 2            |   24% |   23% |    5% |   30% |    3% |
> Intel Skylake        |    9% |    1% |   -4% |   10% |   -7% |
>
> Table 2: AES-256-XCTR throughput improvement,
>          CPU microarchitecture vs. message length in bytes:
>
>                      | 16384 |  4096 |  4095 |  1420 |   512 |   500 |
> ---------------------+-------+-------+-------+-------+-------+-------+
> AMD Zen 5            |  240% |  201% |  216% |  151% |   75% |  108% |
> Intel Emerald Rapids |  100% |   99% |  102% |   91% |   94% |  104% |
> Intel Ice Lake       |   93% |   89% |   92% |   74% |   50% |   64% |
> AMD Zen 4            |   86% |   75% |   83% |   60% |   41% |   52% |
> AMD Zen 3            |   73% |   63% |   69% |   45% |   21% |   33% |
> AMD Zen 2            |   -2% |   -2% |    2% |    3% |   -1% |   11% |
> Intel Skylake        |   -1% |   -1% |    1% |    2% |   -1% |    9% |
>
>                      |   300 |   200 |    64 |    63 |    16 |
> ---------------------+-------+-------+-------+-------+-------+
> AMD Zen 5            |   78% |   56% |   -4% |   38% |   -2% |
> Intel Emerald Rapids |   61% |   55% |    4% |   32% |   -5% |
> Intel Ice Lake       |   57% |   42% |    3% |   44% |   -4% |
> AMD Zen 4            |   35% |   28% |   -1% |   17% |   -3% |
> AMD Zen 3            |   26% |   23% |   -3% |   11% |   -6% |
> AMD Zen 2            |   13% |   24% |   -1% |   14% |   -3% |
> Intel Skylake        |   16% |    8% |   -4% |   35% |   -3% |
>
> Signed-off-by: Eric Biggers <ebiggers@xxxxxxxxxx>
> ---
>

Very nice results! One remark below.

...

> diff --git a/arch/x86/crypto/aes-ctr-avx-x86_64.S b/arch/x86/crypto/aes-ctr-avx-x86_64.S
> new file mode 100644
> index 0000000000000..25cab1d8e63f9
> --- /dev/null
> +++ b/arch/x86/crypto/aes-ctr-avx-x86_64.S
> @@ -0,0 +1,592 @@
> +/* SPDX-License-Identifier: Apache-2.0 OR BSD-2-Clause */
> +//
> +// Copyright 2025 Google LLC
> +//
> +// Author: Eric Biggers <ebiggers@xxxxxxxxxx>
> +//
> +// This file is dual-licensed, meaning that you can use it under your choice of
> +// either of the following two licenses:
> +//
> +// Licensed under the Apache License 2.0 (the "License"). You may obtain a copy
> +// of the License at
> +//
> +//	http://www.apache.org/licenses/LICENSE-2.0
> +//
> +// Unless required by applicable law or agreed to in writing, software
> +// distributed under the License is distributed on an "AS IS" BASIS,
> +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
> +// See the License for the specific language governing permissions and
> +// limitations under the License.
> +//
> +// or
> +//
> +// Redistribution and use in source and binary forms, with or without
> +// modification, are permitted provided that the following conditions are met:
> +//
> +// 1. Redistributions of source code must retain the above copyright notice,
> +//    this list of conditions and the following disclaimer.
> +//
> +// 2. Redistributions in binary form must reproduce the above copyright
> +//    notice, this list of conditions and the following disclaimer in the
> +//    documentation and/or other materials provided with the distribution.
> +//
> +// THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
> +// AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
> +// IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
> +// ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE
> +// LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR
> +// CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF
> +// SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS
> +// INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN
> +// CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE)
> +// ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE
> +// POSSIBILITY OF SUCH DAMAGE.
> +//
> +//------------------------------------------------------------------------------
> +//
> +// This file contains x86_64 assembly implementations of AES-CTR and AES-XCTR
> +// using the following sets of CPU features:
> +//	- AES-NI && AVX
> +//	- VAES && AVX2
> +//	- VAES && (AVX10/256 || (AVX512BW && AVX512VL)) && BMI2
> +//	- VAES && (AVX10/512 || (AVX512BW && AVX512VL)) && BMI2
> +//
> +// See the function definitions at the bottom of the file for more information.
> +
> +#include <linux/linkage.h>
> +#include <linux/cfi_types.h>
> +
> +.section .rodata
> +.p2align 4
> +
> +.Lbswap_mask:
> +	.octa	0x000102030405060708090a0b0c0d0e0f
> +
> +.Lctr_pattern:
> +	.quad	0, 0
> +.Lone:
> +	.quad	1, 0
> +.Ltwo:
> +	.quad	2, 0
> +	.quad	3, 0
> +
> +.Lfour:
> +	.quad	4, 0
> +
> +.text
> +
> +// Move a vector between memory and a register.
> +// The register operand must be in the first 16 vector registers.
> +.macro	_vmovdqu	src, dst
> +.if VL < 64
> +	vmovdqu		\src, \dst
> +.else
> +	vmovdqu8	\src, \dst
> +.endif
> +.endm
> +
> +// Move a vector between registers.
> +// The registers must be in the first 16 vector registers.
> +.macro	_vmovdqa	src, dst
> +.if VL < 64
> +	vmovdqa		\src, \dst
> +.else
> +	vmovdqa64	\src, \dst
> +.endif
> +.endm
> +
> +// Broadcast a 128-bit value from memory to all 128-bit lanes of a vector
> +// register. The register operand must be in the first 16 vector registers.
> +.macro	_vbroadcast128	src, dst
> +.if VL == 16
> +	vmovdqu		\src, \dst
> +.elseif VL == 32
> +	vbroadcasti128	\src, \dst
> +.else
> +	vbroadcasti32x4	\src, \dst
> +.endif
> +.endm
> +
> +// XOR two vectors together.
> +// Any register operands must be in the first 16 vector registers.
> +.macro	_vpxor	src1, src2, dst
> +.if VL < 64
> +	vpxor		\src1, \src2, \dst
> +.else
> +	vpxord		\src1, \src2, \dst
> +.endif
> +.endm
> +
> +// Load 1 <= %ecx <= 15 bytes from the pointer \src into the xmm register \dst
> +// and zeroize any remaining bytes. Clobbers %rax, %rcx, and \tmp{64,32}.
> +.macro	_load_partial_block	src, dst, tmp64, tmp32
> +	sub		$8, %ecx		// LEN - 8
> +	jle		.Lle8\@
> +
> +	// Load 9 <= LEN <= 15 bytes.
> +	vmovq		(\src), \dst		// Load first 8 bytes
> +	mov		(\src, %rcx), %rax	// Load last 8 bytes
> +	neg		%ecx
> +	shl		$3, %ecx
> +	shr		%cl, %rax		// Discard overlapping bytes
> +	vpinsrq		$1, %rax, \dst, \dst
> +	jmp		.Ldone\@
> +
> +.Lle8\@:
> +	add		$4, %ecx		// LEN - 4
> +	jl		.Llt4\@
> +
> +	// Load 4 <= LEN <= 8 bytes.
> +	mov		(\src), %eax		// Load first 4 bytes
> +	mov		(\src, %rcx), \tmp32	// Load last 4 bytes
> +	jmp		.Lcombine\@
> +
> +.Llt4\@:
> +	// Load 1 <= LEN <= 3 bytes.
> +	add		$2, %ecx		// LEN - 2
> +	movzbl		(\src), %eax		// Load first byte
> +	jl		.Lmovq\@
> +	movzwl		(\src, %rcx), \tmp32	// Load last 2 bytes
> +.Lcombine\@:
> +	shl		$3, %ecx
> +	shl		%cl, \tmp64
> +	or		\tmp64, %rax		// Combine the two parts
> +.Lmovq\@:
> +	vmovq		%rax, \dst
> +.Ldone\@:
> +.endm
> +
> +// Store 1 <= %ecx <= 15 bytes from the xmm register \src to the pointer \dst.
> +// Clobbers %rax, %rcx, and \tmp{64,32}.
> +.macro	_store_partial_block	src, dst, tmp64, tmp32
> +	sub		$8, %ecx		// LEN - 8
> +	jl		.Llt8\@
> +
> +	// Store 8 <= LEN <= 15 bytes.
> +	vpextrq		$1, \src, %rax
> +	mov		%ecx, \tmp32
> +	shl		$3, %ecx
> +	ror		%cl, %rax
> +	mov		%rax, (\dst, \tmp64)	// Store last LEN - 8 bytes
> +	vmovq		\src, (\dst)		// Store first 8 bytes
> +	jmp		.Ldone\@
> +
> +.Llt8\@:
> +	add		$4, %ecx		// LEN - 4
> +	jl		.Llt4\@
> +
> +	// Store 4 <= LEN <= 7 bytes.
> +	vpextrd		$1, \src, %eax
> +	mov		%ecx, \tmp32
> +	shl		$3, %ecx
> +	ror		%cl, %eax
> +	mov		%eax, (\dst, \tmp64)	// Store last LEN - 4 bytes
> +	vmovd		\src, (\dst)		// Store first 4 bytes
> +	jmp		.Ldone\@
> +
> +.Llt4\@:
> +	// Store 1 <= LEN <= 3 bytes.
> +	vpextrb		$0, \src, 0(\dst)
> +	cmp		$-2, %ecx		// LEN - 4 == -2, i.e. LEN == 2?
> +	jl		.Ldone\@
> +	vpextrb		$1, \src, 1(\dst)
> +	je		.Ldone\@
> +	vpextrb		$2, \src, 2(\dst)
> +.Ldone\@:
> +.endm
> +
> +// Prepare the next two vectors of AES inputs in AESDATA\i0 and AESDATA\i1, and
> +// XOR each with the zero-th round key. Also update LE_CTR if !\final.
> +.macro	_prepare_2_ctr_vecs	is_xctr, i0, i1, final=0
> +.if \is_xctr
> +  .if USE_AVX10
> +	_vmovdqa	LE_CTR, AESDATA\i0
> +	vpternlogd	$0x96, XCTR_IV, RNDKEY0, AESDATA\i0
> +  .else
> +	vpxor		XCTR_IV, LE_CTR, AESDATA\i0
> +	vpxor		RNDKEY0, AESDATA\i0, AESDATA\i0
> +  .endif
> +	vpaddq		LE_CTR_INC1, LE_CTR, AESDATA\i1
> +
> +  .if USE_AVX10
> +	vpternlogd	$0x96, XCTR_IV, RNDKEY0, AESDATA\i1
> +  .else
> +	vpxor		XCTR_IV, AESDATA\i1, AESDATA\i1
> +	vpxor		RNDKEY0, AESDATA\i1, AESDATA\i1
> +  .endif
> +.else
> +	vpshufb		BSWAP_MASK, LE_CTR, AESDATA\i0
> +	_vpxor		RNDKEY0, AESDATA\i0, AESDATA\i0
> +	vpaddq		LE_CTR_INC1, LE_CTR, AESDATA\i1
> +	vpshufb		BSWAP_MASK, AESDATA\i1, AESDATA\i1
> +	_vpxor		RNDKEY0, AESDATA\i1, AESDATA\i1
> +.endif
> +.if !\final
> +	vpaddq		LE_CTR_INC2, LE_CTR, LE_CTR
> +.endif
> +.endm
> +
> +// Do all AES rounds on the data in the given AESDATA vectors, excluding the
> +// zero-th and last rounds.
> +.macro	_aesenc_loop	vecs

If you make this vecs:vararg, you can drop the "" around the arguments
in the callers (see the sketch at the end of this mail).

> +	mov		KEY, %rax
> +1:
> +	_vbroadcast128	(%rax), RNDKEY
> +.irp i, \vecs
> +	vaesenc		RNDKEY, AESDATA\i, AESDATA\i
> +.endr
> +	add		$16, %rax
> +	cmp		%rax, RNDKEYLAST_PTR
> +	jne		1b
> +.endm
> +
> +// Finalize the keystream blocks in the given AESDATA vectors by doing the last
> +// AES round, then XOR those keystream blocks with the corresponding data.
> +// Reduce latency by doing the XOR before the vaesenclast, utilizing the
> +// property vaesenclast(key, a) ^ b == vaesenclast(key ^ b, a).
> +.macro	_aesenclast_and_xor	vecs

Same here.

> +.irp i, \vecs
> +	_vpxor		\i*VL(SRC), RNDKEYLAST, RNDKEY
> +	vaesenclast	RNDKEY, AESDATA\i, AESDATA\i
> +.endr
> +.irp i, \vecs
> +	_vmovdqu	AESDATA\i, \i*VL(DST)
> +.endr
> +.endm
> +

...
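To make the vecs:vararg remark concrete, here is a rough sketch against
the quoted _aesenc_loop. The ":vararg" qualifier makes the parameter
absorb the rest of the argument list, commas included, so the quotes at
the call sites become unnecessary; the call-site lines below are only
illustrative (the real callers are in the part of the patch I snipped),
and the same change would apply to _aesenclast_and_xor:

// With a plain parameter, callers must quote the vector list so the
// commas are not treated as macro argument separators:
//	_aesenc_loop	"0, 1, 2, 3"
// With a :vararg parameter, the quotes can be dropped:
//	_aesenc_loop	0, 1, 2, 3
.macro	_aesenc_loop	vecs:vararg
	mov		KEY, %rax
1:
	_vbroadcast128	(%rax), RNDKEY
.irp i, \vecs
	vaesenc		RNDKEY, AESDATA\i, AESDATA\i
.endr
	add		$16, %rax
	cmp		%rax, RNDKEYLAST_PTR
	jne		1b
.endm

The .irp loop itself needs no change, since .irp already takes a
comma-separated list.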