On Thu, 2024-09-19 at 09:08 +0200, Christophe Leroy wrote:
> I know nothing about Loongarch assembly and execution performance, but I
> see that GCC groups operations by 4 when building
> reference_chacha20_blocks() from vdso_test_chacha, see below.
>
> Shouldn't you do the same and group ROUNDs by 4 just like I did on
> powerpc ?
> (https://github.com/torvalds/linux/blob/master/arch/powerpc/kernel/vdso/vgetrandom-chacha.S)

Maybe. In theory, grouping the ROUNDs that way should give better
instruction scheduling and improve performance. I'll measure whether it
makes an observable difference.

-- 
Xi Ruoyao <xry111@xxxxxxxxxxx>
School of Aerospace Science and Technology, Xidian University