On Fri, Aug 7, 2020 at 12:08 PM Andy Lutomirski <luto@xxxxxxxxxxxxxx> wrote:
>
> > On Aug 7, 2020, at 11:10 AM, Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx> wrote:
> >
> > I tried something very much like that in user space to just see how
> > many cycles it ended up being.
> >
> > I made a "just raw ChaCha20", and it was already much too slow for
> > what some of the networking people claim to want.
>
> Do you remember the numbers?

Sorry, no. I wrote a hacky thing in user space, and threw it away.

> Certainly a full ChaCha20 per random number is too much, but AFAICT
> the network folks want 16 or 32 bits at a time, which is 1/16 or 1/8
> of a ChaCha20.

That's what I did (well, I did just the 32-bit one), basically
emulating percpu accesses for incrementing the offset (I didn't
actually *do* percpu accesses, I just did a single-threaded run and
used globals, but wrote it with wrappers so that it would look like it
might work).

> DJB claims 4 cycles per byte on Core 2

I took the reference C implementation as-is, and just compiled it with
O2, so my numbers may not be what some heavily optimized case does.

But it was way more than that, even when amortizing for "only need to
do it every 8 cases".

I think the 4 cycles/byte might be some "zero branch mispredicts" case
when you've fully unrolled the thing, but then you'll be taking I$
misses out the wazoo, since by definition this won't be in your L1 I$
at all (only called every 8 times).

Sure, it might look ok on microbenchmarks where it does stay hot in
the cache all the time, but that's not realistic.

            Linus
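
[Editor's note: the original test program was thrown away, so the following is
only a minimal sketch of the kind of user-space harness described above: a
ChaCha20 block function written out inline (rather than DJB's chacha-ref.c),
a global structure standing in for percpu state behind a small accessor, and
32 bits handed out per call with one 64-byte block generated per 16 draws.
The names (get_random_u32, cpu_state), the zero key/nonce, and the timing loop
are illustrative assumptions, not taken from the discarded program.]

/*
 * Sketch of a user-space benchmark: buffered 32-bit draws from ChaCha20,
 * single-threaded, with globals faking percpu state.
 */
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <time.h>

#define ROTL32(v, n) (((v) << (n)) | ((v) >> (32 - (n))))

#define QR(a, b, c, d)						\
	do {							\
		a += b; d ^= a; d = ROTL32(d, 16);		\
		c += d; b ^= c; b = ROTL32(b, 12);		\
		a += b; d ^= a; d = ROTL32(d, 8);		\
		c += d; b ^= c; b = ROTL32(b, 7);		\
	} while (0)

/* One 20-round ChaCha block: 16 words of state in, 16 words of keystream out. */
static void chacha20_block(const uint32_t in[16], uint32_t out[16])
{
	uint32_t x[16];
	int i;

	memcpy(x, in, sizeof(x));
	for (i = 0; i < 10; i++) {
		/* column rounds */
		QR(x[0], x[4], x[8],  x[12]);
		QR(x[1], x[5], x[9],  x[13]);
		QR(x[2], x[6], x[10], x[14]);
		QR(x[3], x[7], x[11], x[15]);
		/* diagonal rounds */
		QR(x[0], x[5], x[10], x[15]);
		QR(x[1], x[6], x[11], x[12]);
		QR(x[2], x[7], x[8],  x[13]);
		QR(x[3], x[4], x[9],  x[14]);
	}
	for (i = 0; i < 16; i++)
		out[i] = x[i] + in[i];
}

/*
 * "percpu" state, faked with a global and a single thread, behind an
 * accessor so it at least looks like it could be made percpu.
 */
static struct {
	uint32_t state[16];	/* ChaCha constants + key + counter + nonce */
	uint32_t buf[16];	/* one 64-byte block = 16 x u32 */
	unsigned int offset;	/* next unused word in buf */
} cpu_state = {
	.state = { 0x61707865, 0x3320646e, 0x79622d32, 0x6b206574 },
	.offset = 16,		/* force a refill on first use */
};

static uint32_t get_random_u32(void)
{
	if (cpu_state.offset >= 16) {
		cpu_state.state[12]++;	/* bump the block counter */
		chacha20_block(cpu_state.state, cpu_state.buf);
		cpu_state.offset = 0;
	}
	return cpu_state.buf[cpu_state.offset++];
}

int main(void)
{
	const unsigned long n = 100UL * 1000 * 1000;
	unsigned long i;
	uint32_t sink = 0;
	clock_t t0, t1;

	t0 = clock();
	for (i = 0; i < n; i++)
		sink ^= get_random_u32();
	t1 = clock();

	printf("%lu u32s, %.2f ns each (sink=%08x)\n", n,
	       (double)(t1 - t0) / CLOCKS_PER_SEC * 1e9 / n, (unsigned)sink);
	return 0;
}

Compiled with -O2 and run single-threaded, this mirrors the shape of the test
being discussed; absolute numbers will obviously depend on the CPU and on how
aggressively the block function gets unrolled, which is exactly the I$-cost
caveat raised above.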