On Mon, 2024-08-19 at 23:22 +0800, Xi Ruoyao wrote:
> On Mon, 2024-08-19 at 13:01 +0000, Jason A. Donenfeld wrote:
> > > I don't see significant improvements about LSX here, so I prefer to
> > > just use the generic version to avoid complexity (I remember Linus
> > > said the whole of __vdso_getrandom is not very useful).
> >
> > I'm inclined to feel the same way, at least for now. Let's just go with
> > one implementation -- the generic one -- and then we can see if
> > optimization really makes sense later. I suspect the large speedup we're
> > already getting from being in the vDSO is already sufficient for
> > purposes.
>
> Ok I'll drop the 2nd and 3rd patches in the next version. But I'm
> puzzled why the LSX implementation isn't much faster, maybe I made some
> mistake in it?

After some thinking this seems to make sense: the LoongArch desktop
processors have 4 ALUs able to perform the scalar add/rot/xor
operations, and the throughput is already maximized for ChaCha20 due to
the data dependency. The only advantage of LSX seems to be that it
avoids reloading the key from memory (because the register file is
large enough to hold a copy of it).

Perhaps LSX will be much better on those embedded processors with 2 ALUs
and 1 SIMD unit (if they don't downclock under heavy SIMD load), but I
don't have one for testing.

--
Xi Ruoyao <xry111@xxxxxxxxxxx>
School of Aerospace Science and Technology, Xidian University
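
For illustration only, the data-dependency argument above can be sketched in plain C. This is the standard ChaCha quarter round written as a scalar sketch, not the code from the patch series; the names rotl32, quarter_round and column_round are made up for this example.

```c
#include <stdint.h>

/* Rotate a 32-bit word left by n bits (0 < n < 32). */
static inline uint32_t rotl32(uint32_t x, unsigned n)
{
	return (x << n) | (x >> (32 - n));
}

/*
 * One ChaCha quarter round (hypothetical helper name).
 * Every add/xor/rot consumes the result of the operation just before
 * it, so the 12 operations form a single serial chain: extra ALUs
 * cannot speed up one quarter round on its own.
 */
static void quarter_round(uint32_t *a, uint32_t *b, uint32_t *c, uint32_t *d)
{
	*a += *b; *d ^= *a; *d = rotl32(*d, 16);
	*c += *d; *b ^= *c; *b = rotl32(*b, 12);
	*a += *b; *d ^= *a; *d = rotl32(*d, 8);
	*c += *d; *b ^= *c; *b = rotl32(*b, 7);
}

/*
 * Column half of a ChaCha double round (hypothetical helper name).
 * The four quarter rounds touch disjoint words of the state, so they
 * are independent of each other and an out-of-order core can
 * interleave them across its scalar ALUs.
 */
static void column_round(uint32_t x[16])
{
	quarter_round(&x[0], &x[4], &x[8],  &x[12]);
	quarter_round(&x[1], &x[5], &x[9],  &x[13]);
	quarter_round(&x[2], &x[6], &x[10], &x[14]);
	quarter_round(&x[3], &x[7], &x[11], &x[15]);
}
```

Assuming the core can issue one step from each of the four independent chains per cycle, 4 scalar ALUs do roughly the same work per cycle as one 4-lane vector add/xor/rot, which would be consistent with LSX showing little gain on such a core.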