Re: Flaw in "random32: update the net random state on interrupt and activity"

Andy Lutomirski <luto@xxxxxxxxxxxxxx> · Fri, 7 Aug 2020 12:33:33 -0700

> On Aug 7, 2020, at 12:21 PM, Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx> wrote:
> 
> On Fri, Aug 7, 2020 at 12:08 PM Andy Lutomirski <luto@xxxxxxxxxxxxxx> wrote:
>> 4 cycles per byte on Core 2
> 
> I took the reference C implementation as-is, and just compiled it with
> O2, so my numbers may not be what some heavily optimized case does.
> 
> But it was way more than that, even when amortizing for "only need to
> do it every 8 cases". I think the 4 cycles/byte might be some "zero
> branch mispredicts" case when you've fully unrolled the thing, but
> then you'll be taking I$ misses out of the wazoo, since by definition
> this won't be in your L1 I$ at all (only called every 8 times).
> 
> Sure, it might look ok on microbenchmarks where it does stay hot the
> cache all the time, but that's not realistic. I

No one said we have to do only one ChaCha20 block per slow path hit.  In fact, the more we reduce the number of rounds, the more time we spend on I$ misses, branch mispredictions, etc, so reducing rounds may be barking up the wrong tree entirely.  We probably don’t want to have more than one page 

I wonder if AES-NI adds any value here.  AES-CTR is almost a drop-in replacement for ChaCha20, and maybe the performance for a cache-cold short run is better.