Re: [PATCH v3 19/29] crypto: mips/poly1305 - incorporate OpenSSL/CRYPTOGAMS optimized implementation

Andy Polyakov <appro@xxxxxxxxxxxxxx> · Tue, 8 Oct 2019 13:38:39 +0200

Hi,

On 10/7/19 11:02 PM, René van Dorst wrote:
> Quoting Ard Biesheuvel <ard.biesheuvel@xxxxxxxxxx>:
> 
>> This is a straight import of the OpenSSL/CRYPTOGAMS Poly1305
>> implementation
>> for MIPS authored by Andy Polyakov, and contributed by him to the OpenSSL
>> project.

Formally speaking this is a little bit misleading statement. Cryptogams
poly1305-mips module implements both 64- and 32-bit code paths, while
what you'll find in OpenSSL is 64-only implementation. But in either case...

>> <snip>
> 
> Hi Ard,
> 
> Is it also an option to include my mip32r2 optimized poly1305 version?
> 
> Below the results which shows a good improvement over the Andy Polyakov
> version.
> I swapped the poly1305 assembly file and rename the function to
> <func_name>_mips
> Full WireGuard source with the changes [0]
> 
> bytes |  RvD | openssl | delta | delta / openssl
>  ...
>  4096 | 9160 | 11755   | -2595 | -22,08%

I assume that the presented results depict regression after switch to
cryptogams module. Right? RvD implementation distinguishes itself in two
ways:

1. some of additions in inner loop are replaced with multiply-by-1-n-add;
2. carry chain at the end of the inner loop is effectively fused with
beginning of the said loop/taken out of the loop.

I recall attempting 1. and chosen not to do it with following rationale.
On processor I have access to, Octeon II, it made no significant
difference. It was better, but only marginally. And it's understandable,
because Octeon II should have lesser difficulty pairing those additions
with multiply-n-add instructions. But since multiplication is an
expensive operation, it can be pretty slow, I reckoned that on processor
less potent than Octeon II it might be more appropriate to minimize
amount of multiplication-n-add instructions. In other words idea is not
(and never has been) to get fixated on specific processor at hand, but
try to find a sensible compromise that would produce reasonable
performance on a range of processors. Of course problem is that it's
just an assumption I made here, and it could turn wrong in practice:-)
So I wonder which processor do you run on, René? For reference I measure
>70MB/sec for 1KB blocks for chacha20poly1305 on 1GHz Octeon II. You
report ~34MB/sec, so it ought to be something different. Given second
data point it might be appropriate to reconsider and settle for
multiply-by-1-n-add.

As for 2. I haven't considered it. Since it's a back-to-back dependency
chain, if fused with top of the loop, it actually has more promising
potential than 1. And it would improve all results, not only MISP32R2.
Would you trust me with adopting it to my module? Naturally with due credit.

Cheers.