Re: x86 SHA1: Faster than OpenSSL

Artur Skawina <art.08.09@xxxxxxxxx> · Thu, 06 Aug 2009 07:19:01 +0200




Linus Torvalds wrote:
> 
> On Thu, 6 Aug 2009, Artur Skawina wrote:
>> #             TIME[s] SPEED[MB/s]
>> rfc3174         1.357       44.99
>> rfc3174         1.352       45.13
>> mozilla         1.509       40.44
>> mozillaas       1.133       53.87
>> linus          0.5818       104.9
>>
>> so it's more than twice as fast as the mozilla implementation.
> 
> So that's some general SHA1 benchmark you have?
> 
> I hope it tests correctness too. 

Yep, sort of: I just check that all versions return the same result
when hashing some pseudorandom data.
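
Roughly, the check looks like the sketch below. This is not the actual
benchmark code: the two FNV-1a functions are hypothetical stand-ins for the
SHA1 variants, and the names (hash_fn, fill_pseudorandom, all_agree) are
made up for illustration. The idea is just: fill a buffer from a seeded
generator, run every implementation over it, and compare against the first.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for the implementations under test: both compute
 * 32-bit FNV-1a, one byte at a time and one unrolled by four. A real
 * harness would plug the SHA1 variants in here instead. */
typedef uint32_t (*hash_fn)(const unsigned char *buf, size_t len);

uint32_t fnv1a_simple(const unsigned char *buf, size_t len)
{
	uint32_t h = 2166136261u;
	for (size_t i = 0; i < len; i++)
		h = (h ^ buf[i]) * 16777619u;
	return h;
}

uint32_t fnv1a_unrolled(const unsigned char *buf, size_t len)
{
	uint32_t h = 2166136261u;
	size_t i = 0;
	for (; i + 4 <= len; i += 4) {
		h = (h ^ buf[i]) * 16777619u;
		h = (h ^ buf[i + 1]) * 16777619u;
		h = (h ^ buf[i + 2]) * 16777619u;
		h = (h ^ buf[i + 3]) * 16777619u;
	}
	for (; i < len; i++)	/* leftover tail bytes */
		h = (h ^ buf[i]) * 16777619u;
	return h;
}

/* Fill buf with reproducible pseudorandom bytes (xorshift32). */
void fill_pseudorandom(unsigned char *buf, size_t len, uint32_t seed)
{
	uint32_t x = seed ? seed : 1;	/* xorshift must not start at 0 */
	for (size_t i = 0; i < len; i++) {
		x ^= x << 13;
		x ^= x >> 17;
		x ^= x << 5;
		buf[i] = (unsigned char)x;
	}
}

/* Return 1 if every implementation agrees with the first one, else 0. */
int all_agree(const hash_fn *impls, size_t n,
	      const unsigned char *buf, size_t len)
{
	assert(n > 0);
	uint32_t ref = impls[0](buf, len);
	for (size_t i = 1; i < n; i++)
		if (impls[i](buf, len) != ref)
			return 0;
	return 1;
}
```

Note that this only shows the versions agree with each other, not that any
of them is right; comparing against a known test vector would cover that.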

> As to my atom testing: my poor little atom is a sad little thing, and 
> it's almost painful to benchmark that thing. But it's worth it to look at 
> how the 32-bit code compares to the openssl asm code too:
> 
>  - BLK_SHA1:
> 	real	2m27.160s
>  - OpenSSL:
> 	real	2m12.580s
>  - Mozilla-SHA1:
> 	real	3m21.836s
> 
> As expected, the hand-tuned assembly does better (and by a bigger margin). 
> Probably partly because scheduling is important when in-order, and partly 
> because gcc will have a harder time with the small register set.
> 
> But it's still a big improvement over the mozilla one.
> 
> (This is, as always, 'git fsck --full'. It spends about 50% on that SHA1 
> calculation, so the SHA1 speedup is larger than you see from just the
> numbers)

I'll start looking at other CPUs once I integrate the asm versions into
my benchmark.

P4s really are "special". Even something as simple as this on top of your
version:

@@ -129,8 +133,8 @@
 
 #define T_20_39(t) \
        SHA_XOR(t); \
-       TEMP += SHA_ROL(A,5) + (B^C^D) + E + 0x6ed9eba1; \
-       E = D; D = C; C = SHA_ROR(B, 2); B = A; A = TEMP;
+       TEMP += SHA_ROL(A,5) + (B^C^D) + E; \
+       E = D; D = C; C = SHA_ROR(B, 2); B = A; A = TEMP + 0x6ed9eba1;
 
        T_20_39(20); T_20_39(21); T_20_39(22); T_20_39(23); T_20_39(24);
        T_20_39(25); T_20_39(26); T_20_39(27); T_20_39(28); T_20_39(29);
@@ -139,8 +143,8 @@
 
 #define T_40_59(t) \
        SHA_XOR(t); \
-       TEMP += SHA_ROL(A,5) + ((B&C)|(D&(B|C))) + E + 0x8f1bbcdc; \
-       E = D; D = C; C = SHA_ROR(B, 2); B = A; A = TEMP;
+       TEMP += SHA_ROL(A,5) + ((B&C)|(D&(B|C))) + E; \
+       E = D; D = C; C = SHA_ROR(B, 2); B = A; A = TEMP + 0x8f1bbcdc;
 
        T_40_59(40); T_40_59(41); T_40_59(42); T_40_59(43); T_40_59(44);
        T_40_59(45); T_40_59(46); T_40_59(47); T_40_59(48); T_40_59(49);

saves another 10% or so:

#Initializing... Rounds: 1000000, size: 62500K, time: 1.421s, speed: 42.97MB/s
#             TIME[s] SPEED[MB/s]
rfc3174         1.403        43.5
# New hash result: b747042d9f4f1fdabd2ac53076f8f830dea7fe0f
rfc3174         1.403       43.51
linus          0.5891       103.6
linusas        0.5337       114.4
mozilla         1.535       39.76
mozillaas       1.128       54.13


artur