On Thu, Sep 19, 2024 at 05:13:59PM +0800, Xi Ruoyao wrote:
> As Christophe pointed out, tuning the chacha20 implementation by
> scheduling the instructions like what GCC does can improve the
> performance.
>
> The tuning does not introduce too much complexity (basically it's just
> reordering some instructions). And the tuning does not hurt readability
> too much: actually the tuned code looks even more similar to a
> textbook-style implementation based on 128-bit vectors. So overall it's
> a good deal to me.
>
> Tested with vdso_test_getchacha and benched with vdso_test_getrandom.
> On a LA664 the speedup is 5%, and I expect a larger speedup on LA[2-4]64
> with a lower issue rate.
>
> Suggested-by: Christophe Leroy <christophe.leroy@xxxxxxxxxx>
> Link: https://lore.kernel.org/all/77655d9e-fc05-4300-8f0d-7b2ad840d091@xxxxxxxxxx/
> Signed-off-by: Xi Ruoyao <xry111@xxxxxxxxxxx>

That seems like a reasonable optimization to me. I'll queue it up in
random.git and send it in my pull next week.

Thanks.

Jason