Hi Andy,
Quoting Andy Polyakov <appro@xxxxxxxxxxxxxx>:
Hi,
On 10/8/19 1:38 PM, Andy Polyakov wrote:
<snip>
Hi Ard,
Is it also an option to include my mips32r2-optimized poly1305 version?
Below are the results, which show a good improvement over Andy Polyakov's
version.
I swapped the poly1305 assembly file and renamed the function to
<func_name>_mips
Full WireGuard source with the changes [0]
bytes | RvD | openssl | delta | delta / openssl
...
4096 | 9160 | 11755 | -2595 | -22,08%
Update is pushed to cryptogams. Thanks to René for ideas, feedback and
testing! There is even a question about supporting DSP ASE, let's
discuss details off-list first.
Thanks!
I see that you have found another spot to save 1 cycle.
Last results: poly1305: 4096 bytes, 188.671 MB/sec, 9066 cycles
I also wonder whether we can replace the "li $x, -4" / "and $x"
combination with "sll $x" in other places as well, like [0] and
line 1169?
Replacing it on line 1169 as follows works on my device:
- li $in0,-4
srl $ctx,$tmp4,2
- and $in0,$in0,$tmp4
andi $tmp4,$tmp4,3
+ sll $in0, $ctx, 2
addu $ctx,$ctx,$in0
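
To make the equivalence explicit, here is a minimal sketch in Python that emulates both instruction sequences on 32-bit values. The function names original() and proposed() are mine and purely illustrative, and register semantics are modelled with a 32-bit mask. Since "srl" has already cleared the low two bits' worth of information, shifting its result left by 2 gives the same value as masking $tmp4 with -4, so both variants end with $ctx = 5 * ($tmp4 >> 2) and $tmp4 reduced to its low two bits. This looks like the usual poly1305 fold of the high bits multiplied by 5, but that reading is my assumption.

```python
MASK32 = 0xFFFFFFFF  # emulate 32-bit MIPS registers


def original(tmp4):
    """Sequence before the change (names are illustrative only)."""
    in0 = (-4) & MASK32          # li   $in0,-4
    ctx = tmp4 >> 2              # srl  $ctx,$tmp4,2
    in0 = in0 & tmp4             # and  $in0,$in0,$tmp4
    tmp4 = tmp4 & 3              # andi $tmp4,$tmp4,3
    ctx = (ctx + in0) & MASK32   # addu $ctx,$ctx,$in0
    return ctx, tmp4


def proposed(tmp4):
    """Sequence after the change: one instruction fewer."""
    ctx = tmp4 >> 2              # srl  $ctx,$tmp4,2
    tmp4 = tmp4 & 3              # andi $tmp4,$tmp4,3
    in0 = (ctx << 2) & MASK32    # sll  $in0,$ctx,2
    ctx = (ctx + in0) & MASK32   # addu $ctx,$ctx,$in0
    return ctx, tmp4


def check(samples):
    """Return True if both sequences agree on every sample value."""
    return all(original(t) == proposed(t) for t in samples)
```

Calling check(range(1 << 16)) plus a few large values such as 0xFFFFFFFF is a cheap sanity test. Note that the "sll" has to read $ctx and not $tmp4, because $tmp4 has already been overwritten by the "andi" at that point.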
As for multiply-by-1-n-add.
I assume that the presented results depict regression after switch to
cryptogams module. Right? RvD implementation distinguishes itself in two
ways:
1. some of additions in inner loop are replaced with multiply-by-1-n-add;
...
I recall attempting 1. and choosing not to do it, with the following
rationale. On the processor I have access to, Octeon II, it made no
significant difference. It was better, but only marginally. And it's
understandable, because Octeon II should have less difficulty pairing
those additions with multiply-n-add instructions. But since
multiplication is an expensive operation that can be pretty slow, I
reckoned that on processors less potent than Octeon II it might be more
appropriate to minimize the number of multiply-n-add instructions.
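
For readers unfamiliar with the trick, a minimal sketch of what "multiply-by-1-n-add" means, in Python. It assumes the MIPS32 "maddu" semantics, (HI,LO) += rs * rt unsigned, and compares it with a plain addu/sltu carry chain. The helper names are hypothetical, and the thread does not say which additions in the inner loop were actually replaced, so this only illustrates the general idea.

```python
MASK32 = 0xFFFFFFFF
MASK64 = 0xFFFFFFFFFFFFFFFF


def maddu(acc, rs, rt):
    """Model of MIPS32 maddu: 64-bit (HI,LO) accumulator += rs * rt."""
    return (acc + (rs & MASK32) * (rt & MASK32)) & MASK64


def add_with_carry(hi, lo, x):
    """Add 32-bit x into the (hi, lo) pair using three ALU operations."""
    lo = (lo + x) & MASK32        # addu lo, lo, x
    carry = 1 if lo < x else 0    # sltu carry, lo, x
    hi = (hi + carry) & MASK32    # addu hi, hi, carry
    return hi, lo


def add_via_maddu(hi, lo, x):
    """Same addition expressed as one multiply-by-1-and-add."""
    acc = maddu((hi << 32) | lo, x, 1)
    return acc >> 32, acc & MASK32
```

Both helpers return the same (hi, lo) pair for any 32-bit inputs. The maddu form needs a single instruction when the sum already lives in HI/LO, but it occupies the multiplier, which is the trade-off discussed here.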
As an example, the MIPS 1004K manual discusses two options for this
core's multiplier, proper and poor-man's. The proper multiplier unit can
issue a multiplication or multiplication-n-add each cycle, with
multiplication latency apparently being 4. The poor-man's unit, on the
other hand, can issue a multiplication only every 32nd[!] cycle, with
corresponding latency. This means that a core with the poor-man's unit
would perform ~13% worse than it could have. The updated module does use
multiply-by-1-n-add, so this note is effectively for reference in case
a "poor man" wonders.
Cheers.
Thanks for this information.
I wonder how many devices exist with the "poor man" version.
Greetings,
René
[0]:
https://github.com/dot-asm/cryptogams/blob/d22ade312a7af958ec955620b0d241cf42c37feb/mips/poly1305-mips.pl#L461