Is big-number Montgomery multiplication in the latest OpenSSL from GitHub as well optimized for ARM64 as it is for x86-64?
We are not seeing any vmull (or pmull/pmull2) instructions in armv8-mont.pl.
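For context, the operation we are asking about can be sketched as follows. This is only a minimal Python illustration of word-serial Montgomery multiplication, assuming 64-bit limbs and an odd modulus; the names (mont_mul, num_limbs, n0_inv) are ours and are not taken from the OpenSSL sources. In an assembly implementation each a_i * b and m * n step becomes a chain of 64x64->128-bit multiply-accumulates; on AArch64 the two halves of such a product come from the scalar mul and umulh instructions.

```python
# Minimal sketch, not OpenSSL's code. Requires Python 3.8+ for pow(x, -1, m).
W = 64                      # limb width in bits (assumption: 64-bit words)
MASK = (1 << W) - 1


def mont_mul(a, b, n, num_limbs):
    """Return a * b * R^-1 mod n, where R = 2^(W * num_limbs).

    Assumes n is odd and 0 <= a, b < n < R.
    """
    # Per-modulus constant: -n^-1 mod 2^W.
    n0_inv = (-pow(n, -1, 1 << W)) & MASK
    t = 0
    for i in range(num_limbs):
        a_i = (a >> (W * i)) & MASK        # i-th limb of a
        t += a_i * b                       # multiply-accumulate one limb
        m = ((t & MASK) * n0_inv) & MASK   # makes t + m*n divisible by 2^W
        t = (t + m * n) >> W               # exact shift: low limb is zero
    # t < 2n at this point, so one conditional subtraction is enough.
    return t - n if t >= n else t


if __name__ == "__main__":
    n = (1 << 127) - 1                     # an odd 128-bit-sized modulus
    R = 1 << (W * 2)
    a, b = 0x1234567890ABCDEF1122334455667788 % n, 0xFEDCBA9876543210 % n
    # Check against the definition a * b * R^-1 mod n.
    assert mont_mul(a, b, n, 2) == (a * b * pow(R, -1, n)) % n
```

A 2048-bit modulus corresponds to num_limbs = 32 here, so the cost is dominated by the limb-by-limb multiplies in the loop.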
Comparing an ARM Cortex-A72 (1 GHz) with an E5-2620 (2.1 GHz), we are seeing roughly a factor of 10 difference in RSA signing performance for 2048-bit keys.
We ran
openssl speed rsa2048
on both machines. Here are the numbers.
x86-64:
[root@nuosrv2 openssl]# ./apps/openssl speed rsa2048
Doing 2048 bit private rsa's for 10s: 13134 2048 bit private RSA's in 9.97s
Doing 2048 bit public rsa's for 10s: 379019 2048 bit public RSA's in 9.98s
OpenSSL 1.1.1-dev xx XXX xxxx
built on: reproducible build, date unspecified
options:bn(64,64) rc4(16x,int) des(int) aes(partial) idea(int) blowfish(ptr)
compiler: gcc -DDSO_DLFCN -DHAVE_DLFCN_H -DNDEBUG -DOPENSSL_THREADS -DOPENSSL_NO_STATIC_ENGINE -DOPENSSL_PIC -DOPENSSL_IA32_SSE2 -DOPENSSL_BN_ASM_MONT -DOPENSSL_BN_ASM_MONT5 -DOPENSSL_BN_ASM_GF2m -DSHA1_ASM -DSHA256_ASM -DSHA512_ASM -DRC4_ASM -DMD5_ASM -DAES_ASM -DVPAES_ASM -DBSAES_ASM -DGHASH_ASM -DECP_NISTZ256_ASM -DPADLOCK_ASM -DPOLY1305_ASM -DOPENSSLDIR="\"/usr/local/ssl\"" -DENGINESDIR="\"/usr/local/lib64/engines-1.1\"" -Wa,--noexecstack
                  sign    verify    sign/s verify/s
rsa 2048 bits 0.000759s 0.000026s   1317.4  37977.9
arm64:
[root@juno openssl]# ./apps/openssl speed rsa2048
Doing 2048 bit private rsa's for 10s: 1319 2048 bit private RSA's in 9.92s
Doing 2048 bit public rsa's for 10s: 49209 2048 bit public RSA's in 9.93s
OpenSSL 1.1.1-dev xx XXX xxxx
built on: reproducible build, date unspecified
options:bn(64,64) rc4(char) des(int) aes(partial) idea(int) blowfish(ptr)
compiler: gcc -DDSO_DLFCN -DHAVE_DLFCN_H -DNDEBUG -DOPENSSL_THREADS -DOPENSSL_NO_STATIC_ENGINE -DOPENSSL_PIC -DOPENSSL_BN_ASM_MONT -DSHA1_ASM -DSHA256_ASM -DSHA512_ASM -DVPAES_ASM -DECP_NISTZ256_ASM -DPOLY1305_ASM -DOPENSSLDIR="\"/usr/local/ssl\"" -DENGINESDIR="\"/usr/local/lib/engines-1.1\"" -Wa,--noexecstack
                  sign    verify    sign/s verify/s
rsa 2048 bits 0.007521s 0.000202s    133.0   4955.6
ARM64 heavy hitters
69.70% openssl libcrypto.so.1.1 [.] __bn_sqr8x_mont
18.64% openssl libcrypto.so.1.1 [.] __bn_mul4x_mont
4.92% openssl libcrypto.so.1.1 [.] MOD_EXP_CTIME_COPY_FROM_PREBUF
1.50% openssl libcrypto.so.1.1 [.] bn_mul_add_words
x86-64 heavy hitters
30.93% openssl libcrypto.so.1.1 [.] __bn_sqrx8x_reduction
17.65% openssl libcrypto.so.1.1 [.] bn_sqrx8x_internal
12.65% openssl libcrypto.so.1.1 [.] mulx4x_internal
8.91% openssl libcrypto.so.1.1 [.] bn_mul_add_words
7.14% openssl libcrypto.so.1.1 [.] bn_mulx4x_mont
The code looks quite different between x86-64 and ARM64. Is that due to the ISA, or has the ARM64 code
not yet caught up with the highly optimized x86-64 code?
Basically, are we stuck with a 1:5 ratio (if we extrapolate the A72 to 2 GHz), or is there more optimized
code that we should pick up for ARM64? I compiled OpenSSL from the latest GitHub sources.
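The 1:10 and 1:5 figures come from the sign/s numbers above. A minimal sketch of the arithmetic, assuming throughput scales linearly with core clock (a simplification; the helper name scaled_rate is ours):

```python
# Figures copied from the "openssl speed rsa2048" runs above.
X86_SIGN_PER_S = 1317.4   # E5-2620 at 2.1 GHz
ARM_SIGN_PER_S = 133.0    # Cortex-A72 at 1.0 GHz
ARM_GHZ = 1.0


def scaled_rate(ops_per_s, from_ghz, to_ghz):
    # Assumption: throughput scales linearly with core clock.
    return ops_per_s * to_ghz / from_ghz


raw_ratio = X86_SIGN_PER_S / ARM_SIGN_PER_S
arm_at_2ghz = scaled_rate(ARM_SIGN_PER_S, ARM_GHZ, 2.0)
ratio_at_2ghz = X86_SIGN_PER_S / arm_at_2ghz

print(f"measured x86-64 : ARM64 ratio:  {raw_ratio:.1f}")
print(f"ratio with A72 scaled to 2 GHz: {ratio_at_2ghz:.1f}")
```

This gives a measured ratio of about 9.9 and just under 5 once the A72 is scaled to 2 GHz.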
Any pointers will be extremely helpful.
Thanks,
-vijay