Re: [PATCH v3 19/29] crypto: mips/poly1305 - incorporate OpenSSL/CRYPTOGAMS optimized implementation

Andy Polyakov <appro@xxxxxxxxxxxxxx> · Fri, 11 Oct 2019 16:14:58 +0200

Hi,

On 10/8/19 1:38 PM, Andy Polyakov wrote:
>>> <snip>
>>
>> Hi Ard,
>>
>> Is it also an option to include my mips32r2 optimized poly1305 version?
>>
>> Below are the results, which show a good improvement over the Andy
>> Polyakov version.
>> I swapped the poly1305 assembly file and renamed the functions to
>> <func_name>_mips
>> Full WireGuard source with the changes [0]
>>
>> bytes |  RvD | openssl | delta | delta / openssl
>>  ...
>>  4096 | 9160 | 11755   | -2595 | -22,08%

The update is pushed to cryptogams. Thanks to René for ideas, feedback and
testing! There is even a question about supporting DSP ASE; let's
discuss the details off-list first.

As for multiply-by-1-n-add.

> I assume that the presented results depict regression after switch to
> cryptogams module. Right? RvD implementation distinguishes itself in two
> ways:
>
> 1. some of additions in inner loop are replaced with multiply-by-1-n-add;
> ...
>
> I recall attempting 1. and choosing not to do it, with the following
> rationale. On the processor I have access to, Octeon II, it made no
> significant difference. It was better, but only marginally. And it's
> understandable, because Octeon II should have less difficulty pairing
> those additions with multiply-n-add instructions. But since
> multiplication is an expensive operation and can be pretty slow, I
> reckoned that on a processor less potent than Octeon II it might be more
> appropriate to minimize the number of multiplication-n-add instructions.
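
For readers not familiar with the trick, here is a minimal C sketch of
what multiply-by-1-n-add replaces. It is not code from the cryptogams
module: acc64, add_with_carry, maddu_model and check_same are
hypothetical names, and the struct merely models a 64-bit value split
across two 32-bit registers (as with the HI/LO pair on MIPS32).

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of a 64-bit accumulator held in two 32-bit halves. */
typedef struct { uint32_t hi, lo; } acc64;

/* Plain additions: add a 32-bit word to the low half and recover the
 * carry with an unsigned compare, then fold it into the high half.
 * This is the multi-instruction sequence the trick tries to avoid. */
static acc64 add_with_carry(acc64 a, uint32_t x)
{
    a.lo += x;               /* wraps modulo 2^32 */
    a.hi += (a.lo < x);      /* carry out of the low half */
    return a;
}

/* Model of an unsigned multiply-n-add: acc += x * y over 64 bits.
 * Calling it with y == 1 is "multiply-by-1-n-add": one multiplier
 * operation that performs the addition and the carry propagation. */
static acc64 maddu_model(acc64 a, uint32_t x, uint32_t y)
{
    uint64_t v = ((uint64_t)a.hi << 32) | a.lo;
    v += (uint64_t)x * y;
    a.hi = (uint32_t)(v >> 32);
    a.lo = (uint32_t)v;
    return a;
}

/* Check that both ways of adding x to a give the same accumulator. */
static void check_same(acc64 a, uint32_t x)
{
    acc64 p = add_with_carry(a, x);
    acc64 q = maddu_model(a, x, 1u);
    assert(p.hi == q.hi && p.lo == q.lo);
}
```

Calling check_same() with, say, an accumulator of {1, 0xfffffff0} and
x = 0x20 exercises the carry case. Both forms compute the same sum, so
the choice between them is purely a question of what the multiplier
unit costs on a given core.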

As an example, the MIPS 1004K manual notes that there are two multiplier
options for this core, proper and poor-man's. The proper multiplier unit
can issue a multiplication or multiplication-n-add each cycle, with
multiplication latency apparently being 4. The poor-man's unit, on the
other hand, can issue a multiplication only every 32nd[!] cycle, with
corresponding latency. This means that a core with the poor-man's unit
would perform ~13% worse than it could have. The updated module does use
multiply-by-1-n-add, so this note is effectively for reference in case
"poor man" wonders.

Cheers.