Re:

Xi Ruoyao <xry111@xxxxxxxxxxx> · Mon, 19 Aug 2024 23:54:29 +0800

On Mon, 2024-08-19 at 23:22 +0800, Xi Ruoyao wrote:
> On Mon, 2024-08-19 at 13:01 +0000, Jason A. Donenfeld wrote:
> > > I don't see significant improvements about LSX here, so I prefer to
> > > just use the generic version to avoid complexity (I remember Linus
> > > said the whole of __vdso_getrandom is not very useful).
> > 
> > I'm inclined to feel the same way, at least for now. Let's just go with
> > one implementation -- the generic one -- and then we can see if
> > optimization really makes sense later. I suspect the large speedup we're
> > already getting from being in the vDSO is already sufficient for
> > purposes.
> 
> Ok I'll drop the 2nd and 3rd patches in the next version.  But I'm
> puzzled why the LSX implementation isn't much faster, maybe I made some
> mistake in it?

After some thinking this seems making sense: the LoongArch desktop
processors have 4 ALUs able to perform the scalar add/rot/xor
operations, and the throughput is already maximized for ChaCha20 due to
the data dependency.  The advantage of LSX seems just to avoid reloading
key from the memory (because the register file is large enough to hold a
copy of it).

Perhaps LSX will be much better on those embedded processors with 2 ALUs
and 1 SIMD unit (if they don't downclock with heavy SIMD load), but I
don't have one for testing.

-- 
Xi Ruoyao <xry111@xxxxxxxxxxx>
School of Aerospace Science and Technology, Xidian University