Hi Eric,

On Tue, Aug 14, 2018 at 2:12 PM Eric Biggers <ebiggers@xxxxxxxxxx> wrote:
> On ARM Cortex-A7, OpenSSL's ChaCha20 implementation is 13.9 cpb (cycles per
> byte), whereas Linux's is faster: 11.9 cpb.
>
> The reason Linux's ChaCha20 NEON implementation is faster than OpenSSL's
>
> I understand there are tradeoffs, and different implementations can be faster on
> different CPUs.
>
> So if your proposal goes in, I'd likely need to write a patch
> to get the old performance back, at least on Cortex-A7...

Yes, absolutely. Different CPUs behave differently indeed, but if you
have improvements for hardware that matters to you, we should
certainly incorporate these, and also loop Andy Polyakov in (I've
added him to the CC for the WIP v2). ChaCha is generally pretty
obvious, but for big integer algorithms -- like Poly1305 and
Curve25519 -- I think it's all the more important to involve Andy and
the rest of the world in general, so that Linux benefits from bug
research and fuzzing in places that are typically and classically
prone to nasty issues.

In other words, let's definitely incorporate your improvements after
the patchset goes in, and at the same time we'll try to bring Andy and
others into the fold, where our improvements can generally track each
other's.

> Also, I don't know whether Andy P. considered the 4xNEON implementation
> technique. It could even be fastest on other ARM CPUs too, I don't know.

After v2, when he's CC'd in, let's plan to start discussing this with him.

Jason