> On Aug 7, 2020, at 12:21 PM, Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx> wrote:
>
> On Fri, Aug 7, 2020 at 12:08 PM Andy Lutomirski <luto@xxxxxxxxxxxxxx> wrote:
>> 4 cycles per byte on Core 2
>
> I took the reference C implementation as-is, and just compiled it with
> O2, so my numbers may not be what some heavily optimized case does.
>
> But it was way more than that, even when amortizing for "only need to
> do it every 8 cases". I think the 4 cycles/byte might be some "zero
> branch mispredicts" case when you've fully unrolled the thing, but
> then you'll be taking I$ misses out of the wazoo, since by definition
> this won't be in your L1 I$ at all (only called every 8 times).
>
> Sure, it might look ok on microbenchmarks where it does stay hot in the
> cache all the time, but that's not realistic. I

No one said we have to do only one ChaCha20 block per slow path hit. In fact, the more we reduce the number of rounds, the larger the share of time we spend on I$ misses, branch mispredictions, etc., so reducing rounds may be barking up the wrong tree entirely. We probably don’t want to have more than one page

I wonder if AES-NI adds any value here. AES-CTR is almost a drop-in replacement for ChaCha20, and maybe the performance for a cache-cold short run is better.
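
To make the "more than one block per slow path hit" idea concrete, here is a rough userspace sketch, not a proposed patch. The names (`struct rand_buf`, `rand_buf_get`, `BLOCKS_PER_REFILL`) and the 8-block batch size are made up for illustration, the output is in native byte order, and there is no rekeying or backtrack protection, so treat it as a picture of the amortization only.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define ROTL32(v, n) (((v) << (n)) | ((v) >> (32 - (n))))
#define QR(a, b, c, d) do {                     \
        a += b; d ^= a; d = ROTL32(d, 16);      \
        c += d; b ^= c; b = ROTL32(b, 12);      \
        a += b; d ^= a; d = ROTL32(d, 8);       \
        c += d; b ^= c; b = ROTL32(b, 7);       \
} while (0)

/* One ChaCha block; 'rounds' must be even (20 for ChaCha20). */
static void chacha_block(const uint32_t in[16], uint32_t out[16], int rounds)
{
        uint32_t x[16];
        int i;

        memcpy(x, in, sizeof(x));
        for (i = 0; i < rounds; i += 2) {
                /* column round */
                QR(x[0], x[4], x[8],  x[12]);
                QR(x[1], x[5], x[9],  x[13]);
                QR(x[2], x[6], x[10], x[14]);
                QR(x[3], x[7], x[11], x[15]);
                /* diagonal round */
                QR(x[0], x[5], x[10], x[15]);
                QR(x[1], x[6], x[11], x[12]);
                QR(x[2], x[7], x[8],  x[13]);
                QR(x[3], x[4], x[9],  x[14]);
        }
        for (i = 0; i < 16; i++)
                out[i] = x[i] + in[i];
}

#define BLOCK_BYTES       64
#define BLOCKS_PER_REFILL 8     /* hypothetical batch size, not a tuned value */

struct rand_buf {
        uint32_t state[16];
        uint8_t buf[BLOCKS_PER_REFILL * BLOCK_BYTES];
        size_t pos;             /* next unread byte; sizeof(buf) means empty */
        unsigned long refills;  /* slow path hits, kept only for illustration */
};

static void rand_buf_init(struct rand_buf *rb, const uint32_t key[8])
{
        /* "expand 32-byte k" constants */
        static const uint32_t sigma[4] = {
                0x61707865, 0x3320646e, 0x79622d32, 0x6b206574
        };

        memcpy(&rb->state[0], sigma, sizeof(sigma));
        memcpy(&rb->state[4], key, 8 * sizeof(uint32_t));
        memset(&rb->state[12], 0, 4 * sizeof(uint32_t)); /* counter + nonce */
        rb->pos = sizeof(rb->buf);
        rb->refills = 0;
}

/* Slow path: generate a whole batch of blocks while the code is hot. */
static void rand_buf_refill(struct rand_buf *rb)
{
        uint32_t out[16];
        size_t i;

        for (i = 0; i < BLOCKS_PER_REFILL; i++) {
                chacha_block(rb->state, out, 20);
                memcpy(rb->buf + i * BLOCK_BYTES, out, BLOCK_BYTES);
                rb->state[12]++;        /* block counter */
        }
        rb->pos = 0;
        rb->refills++;
}

/* Fast path: copy buffered bytes, refilling only when the buffer is empty. */
static void rand_buf_get(struct rand_buf *rb, void *dst, size_t len)
{
        uint8_t *p = dst;

        while (len) {
                size_t n;

                if (rb->pos == sizeof(rb->buf))
                        rand_buf_refill(rb);
                assert(rb->pos < sizeof(rb->buf));
                n = sizeof(rb->buf) - rb->pos;
                if (n > len)
                        n = len;
                memcpy(p, rb->buf + rb->pos, n);
                rb->pos += n;
                p += n;
                len -= n;
        }
}
```

With 8 blocks per refill, 64 consecutive 8-byte requests cost a single trip through the slow path, so the cold-cache penalty is paid once per 512 bytes instead of once per 64. Whether that wins in practice would have to be measured cache-cold, not in a hot loop.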