Hi,

On 10/7/19 11:02 PM, René van Dorst wrote:
> Quoting Ard Biesheuvel <ard.biesheuvel@xxxxxxxxxx>:
>
>> This is a straight import of the OpenSSL/CRYPTOGAMS Poly1305
>> implementation
>> for MIPS authored by Andy Polyakov, and contributed by him to the OpenSSL
>> project.

Formally speaking, this is a slightly misleading statement. The
Cryptogams poly1305-mips module implements both 64- and 32-bit code
paths, while what you'll find in OpenSSL is a 64-bit-only
implementation. But in either case...

>> <snip>
>
> Hi Ard,
>
> Is it also an option to include my mip32r2 optimized poly1305 version?
>
> Below the results which shows a good improvement over the Andy Polyakov
> version.
> I swapped the poly1305 assembly file and rename the function to
> <func_name>_mips
> Full WireGuard source with the changes [0]
>
> bytes |  RvD | openssl | delta | delta / openssl
> ...
>  4096 | 9160 |   11755 | -2595 | -22,08%

I assume that the presented results depict a regression after the
switch to the cryptogams module. Right?

The RvD implementation distinguishes itself in two ways:

1. some of the additions in the inner loop are replaced with
   multiply-by-1-n-add;
2. the carry chain at the end of the inner loop is effectively fused
   with the beginning of said loop/taken out of the loop.

I recall attempting 1. and choosing not to do it, with the following
rationale. On the processor I have access to, Octeon II, it made no
significant difference. It was better, but only marginally. And that's
understandable, because Octeon II should have less difficulty pairing
those additions with multiply-n-add instructions. But since
multiplication is an expensive operation and can be pretty slow, I
reckoned that on a processor less potent than Octeon II it might be
more appropriate to minimize the number of multiply-n-add
instructions. In other words, the idea is not (and never has been) to
get fixated on the specific processor at hand, but to try to find a
sensible compromise that would produce reasonable performance on a
range of processors.
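
To make item 1 concrete, here is a minimal C sketch of what
"multiply-by-1-n-add" amounts to arithmetically. It is not taken from
either implementation. It assumes the accumulator is a 64-bit value
held as two 32-bit halves (as with the MIPS HI/LO pair), and all names
(acc64, add_with_carry, madd_by_one, accumulators_agree) are
hypothetical. The point is only that adding x to a wide accumulator
gives the same result as multiply-accumulating x * 1, which lets a
single multiply-and-add stand in for an add plus explicit carry
handling.

```c
#include <stdint.h>

/* 64-bit accumulator kept as two 32-bit halves (illustrative layout). */
typedef struct { uint32_t hi, lo; } acc64;

/* Plain addition: add the low word, detect the carry-out with a
 * compare, then propagate it into the high word. */
acc64 add_with_carry(acc64 a, uint32_t x)
{
    a.lo += x;
    a.hi += (a.lo < x);          /* 1 if the low-word add wrapped */
    return a;
}

/* Multiply-by-1-and-add: models acc += x * one on the full 64-bit
 * accumulator, so the carry is absorbed by the wide add. */
acc64 madd_by_one(acc64 a, uint32_t x)
{
    const uint32_t one = 1;
    uint64_t wide = ((uint64_t)a.hi << 32) | a.lo;
    wide += (uint64_t)x * one;
    a.hi = (uint32_t)(wide >> 32);
    a.lo = (uint32_t)wide;
    return a;
}

/* Returns 1 when both forms produce the same accumulator value. */
int accumulators_agree(uint32_t hi, uint32_t lo, uint32_t x)
{
    acc64 start = { hi, lo };
    acc64 p = add_with_carry(start, x);
    acc64 q = madd_by_one(start, x);
    return p.hi == q.hi && p.lo == q.lo;
}
```

Both forms compute the same value; which one is faster depends on how
the particular processor schedules multiplies against plain adds,
which is the trade-off discussed here.
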
Of course, the problem is that it's just an assumption I made here,
and it could turn out wrong in practice :-) So I wonder which
processor you run on, René? For reference, I measure >70MB/sec for 1KB
blocks for chacha20poly1305 on a 1GHz Octeon II. You report ~34MB/sec,
so it ought to be something different. Given a second data point, it
might be appropriate to reconsider and settle for multiply-by-1-n-add.

As for 2., I haven't considered it. Since it's a back-to-back
dependency chain, if fused with the top of the loop, it actually has
more promising potential than 1. And it would improve all results, not
only MIPS32R2. Would you trust me with adopting it into my module?
Naturally with due credit.

Cheers.