Hi Martin,

On Sat, Dec 01, 2018 at 05:40:40PM +0100, Martin Willi wrote:
> 
> > An SSSE3 implementation of single-block HChaCha20 is also added so
> > that XChaCha20 can use it rather than the generic
> > implementation.  This required refactoring the ChaCha permutation
> > into its own function.
> 
> > [...]
> 
> > +ENTRY(chacha20_block_xor_ssse3)
> > +	# %rdi: Input state matrix, s
> > +	# %rsi: up to 1 data block output, o
> > +	# %rdx: up to 1 data block input, i
> > +	# %rcx: input/output length in bytes
> > +
> > +	# x0..3 = s0..3
> > +	movdqa		0x00(%rdi),%xmm0
> > +	movdqa		0x10(%rdi),%xmm1
> > +	movdqa		0x20(%rdi),%xmm2
> > +	movdqa		0x30(%rdi),%xmm3
> > +	movdqa		%xmm0,%xmm8
> > +	movdqa		%xmm1,%xmm9
> > +	movdqa		%xmm2,%xmm10
> > +	movdqa		%xmm3,%xmm11
> > +
> > +	mov		%rcx,%rax
> > +	call		chacha20_permute
> > +
> >  	# o0 = i0 ^ (x0 + s0)
> >  	paddd		%xmm8,%xmm0
> >  	cmp		$0x10,%rax
> > @@ -189,6 +198,23 @@ ENTRY(chacha20_block_xor_ssse3)
> >  
> >  ENDPROC(chacha20_block_xor_ssse3)
> >  
> > +ENTRY(hchacha20_block_ssse3)
> > +	# %rdi: Input state matrix, s
> > +	# %rsi: output (8 32-bit words)
> > +
> > +	movdqa		0x00(%rdi),%xmm0
> > +	movdqa		0x10(%rdi),%xmm1
> > +	movdqa		0x20(%rdi),%xmm2
> > +	movdqa		0x30(%rdi),%xmm3
> > +
> > +	call		chacha20_permute
> 
> AFAIK, the general convention is to create proper stack frames using
> FRAME_BEGIN/END for non leaf-functions. Should chacha20_permute()
> callers do so?
> 

Yes, I'll do that.  (Ard suggested similarly in the arm64 version too.)

- Eric