On Thu, May 30, 2024 at 08:38:16PM -0700, Eric Biggers wrote:
> On Tue, May 28, 2024 at 02:19:54PM +0200, Jason A. Donenfeld wrote:
> > diff --git a/arch/x86/entry/vdso/vgetrandom-chacha.S b/arch/x86/entry/vdso/vgetrandom-chacha.S
> > new file mode 100644
> > index 000000000000..d79e2bd97598
> > --- /dev/null
> > +++ b/arch/x86/entry/vdso/vgetrandom-chacha.S
> > @@ -0,0 +1,178 @@
> > +// SPDX-License-Identifier: GPL-2.0
> > +/*
> > + * Copyright (C) 2022 Jason A. Donenfeld <Jason@xxxxxxxxx>. All Rights Reserved.
> > + */
> > +
> > +#include <linux/linkage.h>
> > +#include <asm/frame.h>
> > +
> > +.section .rodata, "a"
> > +.align 16
> > +CONSTANTS: .octa 0x6b20657479622d323320646e61707865
> > +.text
> > +
> > +/*
> > + * Very basic SSE2 implementation of ChaCha20. Produces a given positive number
> > + * of blocks of output with a nonce of 0, taking an input key and 8-byte
> > + * counter. Importantly does not spill to the stack. Its arguments are:
> > + *
> > + * rdi: output bytes
> > + * rsi: 32-byte key input
> > + * rdx: 8-byte counter input/output
> > + * rcx: number of 64-byte blocks to write to output
> > + */
> > +SYM_FUNC_START(__arch_chacha20_blocks_nostack)
> > +
> > +.set output, %rdi
> > +.set key, %rsi
> > +.set counter, %rdx
> > +.set nblocks, %rcx
> > +.set i, %al
> > +/* xmm registers are *not* callee-save. */
> > +.set state0, %xmm0
> > +.set state1, %xmm1
> > +.set state2, %xmm2
> > +.set state3, %xmm3
> > +.set copy0, %xmm4
> > +.set copy1, %xmm5
> > +.set copy2, %xmm6
> > +.set copy3, %xmm7
> > +.set temp, %xmm8
> > +.set one, %xmm9
>
> An "interesting" x86_64 quirk: in SSE instructions, registers xmm0-xmm7 take
> fewer bytes to encode than xmm8-xmm15.
>
> Since 'temp' is used frequently, moving it into the lower range (and moving one
> of the 'copy' registers, which isn't used as frequently, into the higher range)
> decreases the code size of __arch_chacha20_blocks_nostack() by 5%.

That's a nice trick. Thank you very much for it.

Jason
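
For readers who want to see where the extra byte comes from, here is a minimal sketch of the encoding rule behind the quirk Eric describes. It is not part of the patch. The helper name encode_sse2_rr is hypothetical, and it only models the register-to-register form of SSE2 instructions encoded as 66 0F <opcode> /r (pxor, opcode 0xEF, is used as the example). The idea is that the ModRM byte has only 3 bits per register, so xmm8-xmm15 need a REX prefix to carry the fourth bit.

```python
def encode_sse2_rr(opcode: int, dst: int, src: int) -> bytes:
    """Encode `66 0F <opcode> /r` with two xmm register operands.

    Hypothetical helper for illustration only. In AT&T syntax this
    corresponds to e.g. `pxor %xmm<src>, %xmm<dst>`.
    """
    if not (0 <= dst <= 15 and 0 <= src <= 15):
        raise ValueError("xmm register index must be in 0..15")

    # ModRM: mod=11 (register direct), reg=dst low 3 bits, rm=src low 3 bits.
    modrm = 0xC0 | ((dst & 7) << 3) | (src & 7)

    # REX: 0100WRXB. R extends the ModRM reg field, B extends the rm field.
    rex = 0x40 | ((dst >> 3) << 2) | (src >> 3)

    out = bytearray([0x66])      # mandatory operand-size prefix for SSE2 integer ops
    if rex != 0x40:              # prefix is only emitted when xmm8-xmm15 is involved
        out.append(rex)
    out += bytes([0x0F, opcode, modrm])
    return bytes(out)


PXOR = 0xEF

low = encode_sse2_rr(PXOR, dst=0, src=4)   # pxor %xmm4, %xmm0
high = encode_sse2_rr(PXOR, dst=0, src=8)  # pxor %xmm8, %xmm0

print(low.hex(" "), len(low))    # 66 0f ef c4 4
print(high.hex(" "), len(high))  # 66 41 0f ef c0 5
```

Every instruction that touches a register in the upper range pays that one-byte prefix, which is why putting the most frequently used register in xmm0-xmm7 shrinks the function.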