On Fri, Aug 7, 2020 at 12:33 PM Andy Lutomirski <luto@xxxxxxxxxxxxxx> wrote:
>
> No one said we have to do only one ChaCha20 block per slow path hit.

Sure, doing more might be better for amortizing the cost.

But you have to be very careful about latency spikes. I would be
*really* nervous about doing a whole page at a time, when this is
called from routines that literally expect it to be less than 50
cycles.

So I would seriously suggest you look at a much smaller buffer. Maybe
not a single block, but definitely not multiple kB either.

Maybe something like 2 cachelines might be ok, but there's a reason
the current code only works with 16 bytes (or whatever) and only does
simple operations with no looping.

That's why I think you might look at a single double-round ChaCha20
instead. Maybe do it for two blocks - by the time you wrap around,
you'll have done more than a full ChaCha20.

That would imnsho be *much* better than doing some big block, and
having huge latency spikes and flushing a large portion of your L1
when they happen. Nasty nasty behavior.

I really think the whole "we can amortize it with bigger blocks" is
complete and utter garbage. It's classic "benchmarketing" crap.

             Linus
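
For illustration only, here is a rough sketch of what "a single
double-round" of ChaCha on one 64-byte block looks like. This is not
kernel code and not a proposed patch: the function names
(rotl32, chacha_quarter_round, chacha_double_round) are made up for
this example, and only the round function itself is the standard
ChaCha one. Full ChaCha20 runs ten of these double rounds per block;
the point of the sketch is that one double round is a fixed, loop-free
amount of work on 16 words, and two such blocks are 128 bytes, which
is 2 cachelines assuming 64-byte lines.

```c
#include <stdint.h>

/* Rotate a 32-bit word left by n bits (0 < n < 32). */
static inline uint32_t rotl32(uint32_t v, unsigned int n)
{
    return (v << n) | (v >> (32 - n));
}

/* Standard ChaCha quarter round on four state words. */
static inline void chacha_quarter_round(uint32_t *a, uint32_t *b,
                                        uint32_t *c, uint32_t *d)
{
    *a += *b; *d ^= *a; *d = rotl32(*d, 16);
    *c += *d; *b ^= *c; *b = rotl32(*b, 12);
    *a += *b; *d ^= *a; *d = rotl32(*d, 8);
    *c += *d; *b ^= *c; *b = rotl32(*b, 7);
}

/*
 * One double round over a 16-word (64-byte) block: four column
 * rounds followed by four diagonal rounds. Deliberately unrolled,
 * so the cost per call is constant. (Hypothetical helper name.)
 */
static inline void chacha_double_round(uint32_t x[16])
{
    /* Column rounds. */
    chacha_quarter_round(&x[0], &x[4], &x[8],  &x[12]);
    chacha_quarter_round(&x[1], &x[5], &x[9],  &x[13]);
    chacha_quarter_round(&x[2], &x[6], &x[10], &x[14]);
    chacha_quarter_round(&x[3], &x[7], &x[11], &x[15]);

    /* Diagonal rounds. */
    chacha_quarter_round(&x[0], &x[5], &x[10], &x[15]);
    chacha_quarter_round(&x[1], &x[6], &x[11], &x[12]);
    chacha_quarter_round(&x[2], &x[7], &x[8],  &x[13]);
    chacha_quarter_round(&x[3], &x[4], &x[9],  &x[14]);
}
```

How the state is seeded, how output words are handed out, and when
the double round gets applied are all left open here; those are
exactly the design questions the thread is arguing about.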