Hi,

On 10/8/19 1:38 PM, Andy Polyakov wrote:
>>> <snip>
>>
>> Hi Ard,
>>
>> Is it also an option to include my mip32r2 optimized poly1305 version?
>>
>> Below the results which shows a good improvement over the Andy Polyakov
>> version.
>> I swapped the poly1305 assembly file and rename the function to
>> <func_name>_mips
>> Full WireGuard source with the changes [0]
>>
>> bytes | RvD | openssl | delta | delta / openssl
>> ...
>> 4096 | 9160 | 11755 | -2595 | -22,08%

Update is pushed to cryptogams. Thanks to René for ideas, feedback and
testing! There is even a question about supporting DSP ASE, let's
discuss details off-list first. As for multiply-by-1-n-add:

> I assume that the presented results depict regression after switch to
> cryptogams module. Right? RvD implementation distinguishes itself in two
> ways:
>
> 1. some of additions in inner loop are replaced with multiply-by-1-n-add;
> ...
>
> I recall attempting 1. and chosen not to do it with following rationale.
> On processor I have access to, Octeon II, it made no significant
> difference. It was better, but only marginally. And it's understandable,
> because Octeon II should have lesser difficulty pairing those additions
> with multiply-n-add instructions. But since multiplication is an
> expensive operation, it can be pretty slow, I reckoned that on processor
> less potent than Octeon II it might be more appropriate to minimize
> amount of multiplication-n-add instructions.

As an example, the MIPS 1004K manual discusses two options for the
multiplier in this core, proper and poor-man's. The proper multiplier
unit can issue a multiplication or multiplication-n-add each cycle, with
multiplication latency apparently being 4. The poor-man's unit, on the
other hand, can issue a multiplication only every 32nd[!] cycle, with
corresponding latency. This means that a core with the poor-man's unit
would perform ~13% worse than it could have.
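
For readers unfamiliar with the trick, here is a minimal C sketch of
what "multiply-by-1-n-add" means. It is not the cryptogams code; the
names acc_t, add_with_carry, maddu and same_result are made up for
illustration. It only models the semantics of the MIPS32 maddu
instruction (hi:lo += rs * rt, unsigned) and shows that multiplying by 1
gives the same result as a 32-bit addition followed by explicit carry
propagation.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of the MIPS32 hi:lo accumulator pair. */
typedef struct { uint32_t hi, lo; } acc_t;

/* Plain additions: add x into lo, then propagate the carry into hi.
 * On MIPS32 this takes an addu, an sltu to recover the carry, and
 * another addu. */
static acc_t add_with_carry(acc_t a, uint32_t x)
{
    a.lo += x;
    a.hi += (a.lo < x);   /* 1 if the low word wrapped around */
    return a;
}

/* Model of maddu: hi:lo += x * y, with an unsigned 32x32->64 product. */
static acc_t maddu(acc_t a, uint32_t x, uint32_t y)
{
    uint64_t v = ((uint64_t)a.hi << 32) | a.lo;
    v += (uint64_t)x * y;
    a.hi = (uint32_t)(v >> 32);
    a.lo = (uint32_t)v;
    return a;
}

/* Returns 1 if "multiply by 1 and add" matches the explicit additions. */
static int same_result(acc_t a, uint32_t x)
{
    acc_t p = add_with_carry(a, x);
    acc_t q = maddu(a, x, 1);
    return p.hi == q.hi && p.lo == q.lo;
}
```

Calling same_result() with a = {1, 0xfffffff0} and x = 0x20 exercises
the carry path. The trade-off discussed here is that the maddu form
saves instructions but occupies the multiplier, which only pays off if
the multiplier is fast.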

The updated module does use multiply-by-1-n-add, so this note is
effectively for reference, in case a "poor man" wonders.

Cheers.