> An SSSE3 implementation of single-block HChaCha20 is also added so
> that XChaCha20 can use it rather than the generic
> implementation. This required refactoring the ChaCha permutation
> into its own function.

> [...]

> +ENTRY(chacha20_block_xor_ssse3)
> +	# %rdi: Input state matrix, s
> +	# %rsi: up to 1 data block output, o
> +	# %rdx: up to 1 data block input, i
> +	# %rcx: input/output length in bytes
> +
> +	# x0..3 = s0..3
> +	movdqa		0x00(%rdi),%xmm0
> +	movdqa		0x10(%rdi),%xmm1
> +	movdqa		0x20(%rdi),%xmm2
> +	movdqa		0x30(%rdi),%xmm3
> +	movdqa		%xmm0,%xmm8
> +	movdqa		%xmm1,%xmm9
> +	movdqa		%xmm2,%xmm10
> +	movdqa		%xmm3,%xmm11
> +
> +	mov		%rcx,%rax
> +	call		chacha20_permute
> +
>  	# o0 = i0 ^ (x0 + s0)
>  	paddd		%xmm8,%xmm0
>  	cmp		$0x10,%rax
> @@ -189,6 +198,23 @@ ENTRY(chacha20_block_xor_ssse3)
>  
>  ENDPROC(chacha20_block_xor_ssse3)
>  
> +ENTRY(hchacha20_block_ssse3)
> +	# %rdi: Input state matrix, s
> +	# %rsi: output (8 32-bit words)
> +
> +	movdqa		0x00(%rdi),%xmm0
> +	movdqa		0x10(%rdi),%xmm1
> +	movdqa		0x20(%rdi),%xmm2
> +	movdqa		0x30(%rdi),%xmm3
> +
> +	call		chacha20_permute

AFAIK, the general convention is to create proper stack frames using
FRAME_BEGIN/END for non-leaf functions. Should chacha20_permute()
callers do so?

For the other parts:

Reviewed-by: Martin Willi <martin@xxxxxxxxxxxxxx>
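
For reference, the stack-frame convention asked about above could be
sketched roughly as follows for hchacha20_block_ssse3. This is only an
illustration, not the patch author's code. It assumes the file includes
<asm/frame.h>, where FRAME_BEGIN/FRAME_END set up and tear down a frame
pointer when CONFIG_FRAME_POINTER is enabled and expand to nothing
otherwise. The output stores after the call are not part of the quoted
hunk, so they are left as a placeholder comment.

```asm
#include <asm/frame.h>

ENTRY(hchacha20_block_ssse3)
	# %rdi: Input state matrix, s
	# %rsi: output (8 32-bit words)

	# Set up a frame before calling another function
	# (roughly: push %rbp; mov %rsp,%rbp with CONFIG_FRAME_POINTER)
	FRAME_BEGIN

	movdqa		0x00(%rdi),%xmm0
	movdqa		0x10(%rdi),%xmm1
	movdqa		0x20(%rdi),%xmm2
	movdqa		0x30(%rdi),%xmm3

	call		chacha20_permute

	# (store of the output words to %rsi goes here; not shown in the quoted hunk)

	# Tear the frame down again (roughly: pop %rbp)
	FRAME_END
	ret
ENDPROC(hchacha20_block_ssse3)
```

The same FRAME_BEGIN/FRAME_END pair would wrap chacha20_block_xor_ssse3,
with FRAME_END placed on every path that reaches ret.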