Hello,

On 26.8.2022 8.31, Taehee Yoo wrote:
> The purpose of this patchset is to support the implementation of
> ARIA-AVX. Many of the ideas in this implementation are from
> Camellia-avx, especially byte slicing. Like Camellia, ARIA also uses
> a 16-way strategy.
>
> The ARIA cipher algorithm is similar to AES. There are four s-boxes
> in the ARIA spec, and the first and second s-boxes are the same as
> AES's s-boxes. Most functions are based on the aria-generic code,
> except for the s-box-related functions. The aria-avx doesn't
> implement the key expansion function; it only supports encrypt() and
> decrypt().
>
> The encryption and decryption logic is actually the same, but it uses
> separate keys (an encryption key and a decryption key). The
> en/decryption steps are as follows:
> 1. Add-Round-Key
> 2. S-box
> 3. Diffusion Layer
>
> There is nothing special in the Add-Round-Key step, but there are
> some notable things in the s-box step. Like Camellia, it doesn't use
> a lookup table; instead, it uses AES-NI.
>
> To calculate the first s-box, it just uses aesenclast and then undoes
> shift_row. No further processing is needed because the first s-box is
> the same as the AES encryption s-box. To calculate the second s-box
> (the inverse of s1), it just uses aesdeclast and then undoes
> shift_row. No further processing is needed because the second s-box
> is the same as the AES decryption s-box. To calculate the third and
> fourth s-boxes, it uses aesenclast, then undoes shift_row, and
> applies an affine transformation.
>
> The aria-generic implementation is based on a 32-bit implementation,
> not an 8-bit one. The aria-avx Diffusion Layer implementation is
> based on the aria-generic implementation, because the 8-bit form is
> not well suited to a parallel implementation while the 32-bit form is.
>
> The first patch in this series exports functions for aria-avx, which
> uses existing functions from the aria-generic code. The second patch
> implements aria-avx. The last patch adds an async test for aria.
>
> Benchmarks:
> tcrypt is used.
> cpu: i3-12100
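As an aside on the quoted description: the first-s-box trick (aesenclast
with an all-zero round key, then undoing ShiftRows) can be sketched with
C intrinsics as below. This is only a minimal single-block illustration,
not the byte-sliced 16-way code from the patch; the function name and
the standalone form are made up for the example (build with e.g.
-mssse3 -maes):

/////////////////////////////////////////////////////////
#include <immintrin.h>

/*
 * AESENCLAST performs ShiftRows, SubBytes and AddRoundKey (no
 * MixColumns).  With an all-zero round key and a byte shuffle that
 * undoes ShiftRows afterwards, only SubBytes remains, which is
 * ARIA's first s-box applied to all 16 bytes of the vector.
 */
static inline __m128i aria_s1_16bytes(__m128i x)
{
        /* Permutation that undoes AES ShiftRows (column-major layout). */
        const __m128i inv_shift_row =
                _mm_setr_epi8(0x00, 0x0d, 0x0a, 0x07, 0x04, 0x01, 0x0e, 0x0b,
                              0x08, 0x05, 0x02, 0x0f, 0x0c, 0x09, 0x06, 0x03);

        x = _mm_aesenclast_si128(x, _mm_setzero_si128()); /* SubBytes(ShiftRows(x)) */
        return _mm_shuffle_epi8(x, inv_shift_row);        /* = SubBytes(x) = ARIA s1 */
}
/////////////////////////////////////////////////////////

Following the cover letter, the second s-box would use
_mm_aesdeclast_si128() plus a shuffle that undoes the inverse ShiftRows,
and the third and fourth s-boxes add an affine transformation on top of
the aesenclast path.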
This CPU (the i3-12100 above) also supports Galois Field New
Instructions (GFNI), which are even better suited for accelerating
ciphers that use the same building blocks as AES. For example, I've
recently implemented camellia using GFNI for libgcrypt [1].

I quickly hacked GFNI into your implementation and it gives a nice
extra bit of performance (~55% faster on Intel tiger-lake). Here's the
GFNI version of 'aria_sbox_8way' that I used:

/////////////////////////////////////////////////////////
#define aria_sbox_8way(x0, x1, x2, x3,                  \
                       x4, x5, x6, x7,                  \
                       t0, t1, t2, t3,                  \
                       t4, t5, t6, t7)                  \
        vpbroadcastq .Ltf_s2_bitmatrix, t0;             \
        vpbroadcastq .Ltf_inv_bitmatrix, t1;            \
        vpbroadcastq .Ltf_id_bitmatrix, t2;             \
        vpbroadcastq .Ltf_aff_bitmatrix, t3;            \
        vpbroadcastq .Ltf_x2_bitmatrix, t4;             \
        vgf2p8affineinvqb $(tf_s2_const), t0, x1, x1;   \
        vgf2p8affineinvqb $(tf_s2_const), t0, x5, x5;   \
        vgf2p8affineqb $(tf_inv_const), t1, x2, x2;     \
        vgf2p8affineqb $(tf_inv_const), t1, x6, x6;     \
        vgf2p8affineinvqb $0, t2, x2, x2;               \
        vgf2p8affineinvqb $0, t2, x6, x6;               \
        vgf2p8affineinvqb $(tf_aff_const), t3, x0, x0;  \
        vgf2p8affineinvqb $(tf_aff_const), t3, x4, x4;  \
        vgf2p8affineqb $(tf_x2_const), t4, x3, x3;      \
        vgf2p8affineqb $(tf_x2_const), t4, x7, x7;      \
        vgf2p8affineinvqb $0, t2, x3, x3;               \
        vgf2p8affineinvqb $0, t2, x7, x7;

#define BV8(a0, a1, a2, a3, a4, a5, a6, a7)             \
        ( (((a0) & 1) << 0) |                           \
          (((a1) & 1) << 1) |                           \
          (((a2) & 1) << 2) |                           \
          (((a3) & 1) << 3) |                           \
          (((a4) & 1) << 4) |                           \
          (((a5) & 1) << 5) |                           \
          (((a6) & 1) << 6) |                           \
          (((a7) & 1) << 7) )

#define BM8X8(l0, l1, l2, l3, l4, l5, l6, l7)           \
        ( ((l7) << (0 * 8)) |                           \
          ((l6) << (1 * 8)) |                           \
          ((l5) << (2 * 8)) |                           \
          ((l4) << (3 * 8)) |                           \
          ((l3) << (4 * 8)) |                           \
          ((l2) << (5 * 8)) |                           \
          ((l1) << (6 * 8)) |                           \
          ((l0) << (7 * 8)) )

/* AES affine: */
#define tf_aff_const BV8(1, 1, 0, 0, 0, 1, 1, 0)
.Ltf_aff_bitmatrix:
        .quad BM8X8(BV8(1, 0, 0, 0, 1, 1, 1, 1),
                    BV8(1, 1, 0, 0, 0, 1, 1, 1),
                    BV8(1, 1, 1, 0, 0, 0, 1, 1),
                    BV8(1, 1, 1, 1, 0, 0, 0, 1),
                    BV8(1, 1, 1, 1, 1, 0, 0, 0),
                    BV8(0, 1, 1, 1, 1, 1, 0, 0),
                    BV8(0, 0, 1, 1, 1, 1, 1, 0),
                    BV8(0, 0, 0, 1, 1, 1, 1, 1))

/* AES inverse affine: */
#define tf_inv_const BV8(1, 0, 1, 0, 0, 0, 0, 0)
.Ltf_inv_bitmatrix:
        .quad BM8X8(BV8(0, 0, 1, 0, 0, 1, 0, 1),
                    BV8(1, 0, 0, 1, 0, 0, 1, 0),
                    BV8(0, 1, 0, 0, 1, 0, 0, 1),
                    BV8(1, 0, 1, 0, 0, 1, 0, 0),
                    BV8(0, 1, 0, 1, 0, 0, 1, 0),
                    BV8(0, 0, 1, 0, 1, 0, 0, 1),
                    BV8(1, 0, 0, 1, 0, 1, 0, 0),
                    BV8(0, 1, 0, 0, 1, 0, 1, 0))

/* S2: */
#define tf_s2_const BV8(0, 1, 0, 0, 0, 1, 1, 1)
.Ltf_s2_bitmatrix:
        .quad BM8X8(BV8(0, 1, 0, 1, 0, 1, 1, 1),
                    BV8(0, 0, 1, 1, 1, 1, 1, 1),
                    BV8(1, 1, 1, 0, 1, 1, 0, 1),
                    BV8(1, 1, 0, 0, 0, 0, 1, 1),
                    BV8(0, 1, 0, 0, 0, 0, 1, 1),
                    BV8(1, 1, 0, 0, 1, 1, 1, 0),
                    BV8(0, 1, 1, 0, 0, 0, 1, 1),
                    BV8(1, 1, 1, 1, 0, 1, 1, 0))

/* X2: */
#define tf_x2_const BV8(0, 0, 1, 1, 0, 1, 0, 0)
.Ltf_x2_bitmatrix:
        .quad BM8X8(BV8(0, 0, 0, 1, 1, 0, 0, 0),
                    BV8(0, 0, 1, 0, 0, 1, 1, 0),
                    BV8(0, 0, 0, 0, 1, 0, 1, 0),
                    BV8(1, 1, 1, 0, 0, 0, 1, 1),
                    BV8(1, 1, 1, 0, 1, 1, 0, 0),
                    BV8(0, 1, 1, 0, 1, 0, 1, 1),
                    BV8(1, 0, 1, 1, 1, 1, 0, 1),
                    BV8(1, 0, 0, 1, 0, 0, 1, 1))

/* Identity matrix: */
.Ltf_id_bitmatrix:
        .quad BM8X8(BV8(1, 0, 0, 0, 0, 0, 0, 0),
                    BV8(0, 1, 0, 0, 0, 0, 0, 0),
                    BV8(0, 0, 1, 0, 0, 0, 0, 0),
                    BV8(0, 0, 0, 1, 0, 0, 0, 0),
                    BV8(0, 0, 0, 0, 1, 0, 0, 0),
                    BV8(0, 0, 0, 0, 0, 1, 0, 0),
                    BV8(0, 0, 0, 0, 0, 0, 1, 0),
                    BV8(0, 0, 0, 0, 0, 0, 0, 1))
/////////////////////////////////////////////////////////

GFNI also allows easy use of 256-bit vector registers, so there is a
way to get an additional 2x speed increase (but it requires doubling
the number of parallel processed blocks).
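To make the 256-bit point concrete: the VEX-encoded GFNI instructions
are also exposed as C intrinsics, so the S2 step could, for instance,
process 32 bytes per instruction. The sketch below is not from the
patch or from libgcrypt; the helper name is invented, and the
BV8/BM8X8 helpers are re-expressed in C with 64-bit casts so the
shifts are well defined (build with e.g. -mgfni -mavx2):

/////////////////////////////////////////////////////////
#include <immintrin.h>
#include <stdint.h>

/* C versions of the BV8/BM8X8 helpers above (casts added for C). */
#define BV8(a0, a1, a2, a3, a4, a5, a6, a7)                             \
        ((((a0) & 1) << 0) | (((a1) & 1) << 1) | (((a2) & 1) << 2) |    \
         (((a3) & 1) << 3) | (((a4) & 1) << 4) | (((a5) & 1) << 5) |    \
         (((a6) & 1) << 6) | (((a7) & 1) << 7))

#define BM8X8(l0, l1, l2, l3, l4, l5, l6, l7)                           \
        (((uint64_t)(l7) << (0 * 8)) | ((uint64_t)(l6) << (1 * 8)) |    \
         ((uint64_t)(l5) << (2 * 8)) | ((uint64_t)(l4) << (3 * 8)) |    \
         ((uint64_t)(l3) << (4 * 8)) | ((uint64_t)(l2) << (5 * 8)) |    \
         ((uint64_t)(l1) << (6 * 8)) | ((uint64_t)(l0) << (7 * 8)))

/* Same S2 bitmatrix and constant as in the assembly above. */
#define TF_S2_CONST BV8(0, 1, 0, 0, 0, 1, 1, 1)
static const uint64_t tf_s2_bitmatrix =
        BM8X8(BV8(0, 1, 0, 1, 0, 1, 1, 1), BV8(0, 0, 1, 1, 1, 1, 1, 1),
              BV8(1, 1, 1, 0, 1, 1, 0, 1), BV8(1, 1, 0, 0, 0, 0, 1, 1),
              BV8(0, 1, 0, 0, 0, 0, 1, 1), BV8(1, 1, 0, 0, 1, 1, 1, 0),
              BV8(0, 1, 1, 0, 0, 0, 1, 1), BV8(1, 1, 1, 1, 0, 1, 1, 0));

/* S2 for 32 bytes at once: A * inverse(x) + c in GF(2^8), per byte. */
static inline __m256i aria_s2_32bytes(__m256i x)
{
        const __m256i mat = _mm256_set1_epi64x((long long)tf_s2_bitmatrix);

        return _mm256_gf2p8affineinv_epi64_epi8(x, mat, TF_S2_CONST);
}
/////////////////////////////////////////////////////////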
-Jussi

[1] https://git.gnupg.org/cgi-bin/gitweb.cgi?p=libgcrypt.git;a=blob;f=cipher/camellia-aesni-avx2-amd64.h#l80