Re: [PATCH 0/7] block-sha1: improved SHA1 hashing

Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx> · Thu, 6 Aug 2009 19:23:21 -0700 (PDT)

On Thu, 6 Aug 2009, Linus Torvalds wrote:

> 
> 
> On Thu, 6 Aug 2009, Artur Skawina wrote:
> > 
> > Does this make any difference for you? For me it's the best one so far
> > (the linusas2 number clearly shows that for me the register renaming does
> > nothing; other than that the functions should be very similar)
> 
> Nope. If anything, it's a bit slower, but it might be in the noise. I 
> generally got 330MB/s with my "cpp renaming" on Nehalem (32-bit - the 
> 64-bit numbers are ~400MB/s), but with this I got 325MB/s twice in a row, 
> which matches the linusas2 numbers pretty exactly.

I actually found a P4 I have access to, except that one is a Prescott.

And I can't run it in 32-bit mode, because I only have a regular user 
login, and it only has the 64-bit development environment.

But I can do the hacked-for-64bit sha1bench runs, and I tested your patch.

It's horrible.

Here's the plain "linus" baseline (ie the "Do register rotation in cpp" 
thing, with the fixed "E += TEMP .." thing):

	#             TIME[s] SPEED[MB/s]
	rfc3174         1.648       37.03
	rfc3174         1.677        36.4
	linus          0.4018       151.9
	linusas        0.4439       137.5
	linusas2       0.4381       139.3
	mozilla        0.9587       63.66
	mozillaas      0.9434        64.7

and here it is with your patch:

	#             TIME[s] SPEED[MB/s]
	rfc3174         1.667       36.61
	rfc3174         1.644       37.12
	linus          0.4653       131.2
	linusas        0.4412       138.3
	linusas2       0.4388       139.1
	mozilla        0.9466       64.48
	mozillaas      0.9449       64.59

(ok, so the numbers aren't horribly stable, but the "plain linus" thing 
consistently outperforms here - and underperforms with your patch).
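
For anybody following along without the patches in front of them, here is a minimal sketch of what "register rotation in cpp" means. It is an illustration only, not the actual block-sha1 code: the names ROUND, rounds_renamed, round_shuffle and renaming_matches_shuffle are made up for this example, and only the first SHA-1 round function (rounds 0..19) is shown. Instead of moving values between the five state variables after every round, the macro arguments are permuted at each call site, so each round only writes two variables.

```c
#include <stdint.h>

#define ROL(x, n) (((x) << (n)) | ((x) >> (32 - (n))))
#define K1 0x5a827999u	/* SHA-1 constant for rounds 0..19 */

/* "cpp renaming": only E and B are written; the roles of A..E rotate
 * by permuting the macro arguments at each call site. */
#define ROUND(A, B, C, D, E, w) do { \
	E += ROL(A, 5) + ((((C) ^ (D)) & (B)) ^ (D)) + (w) + K1; \
	B = ROL(B, 30); \
} while (0)

/* Five rounds with renaming; after five rounds the names line up again. */
static void rounds_renamed(uint32_t s[5], const uint32_t w[5])
{
	uint32_t A = s[0], B = s[1], C = s[2], D = s[3], E = s[4];

	ROUND(A, B, C, D, E, w[0]);
	ROUND(E, A, B, C, D, w[1]);
	ROUND(D, E, A, B, C, w[2]);
	ROUND(C, D, E, A, B, w[3]);
	ROUND(B, C, D, E, A, w[4]);
	s[0] = A; s[1] = B; s[2] = C; s[3] = D; s[4] = E;
}

/* Reference: the same round, but shuffling values between variables. */
static void round_shuffle(uint32_t s[5], uint32_t w)
{
	uint32_t temp = ROL(s[0], 5) + (((s[2] ^ s[3]) & s[1]) ^ s[3])
			+ s[4] + w + K1;

	s[4] = s[3];
	s[3] = s[2];
	s[2] = ROL(s[1], 30);
	s[1] = s[0];
	s[0] = temp;
}

/* Returns 1 if five renamed rounds give the same state as five shuffled ones. */
static int renaming_matches_shuffle(void)
{
	const uint32_t w[5] = { 0x61626380u, 0, 0x12345678u, 0xdeadbeefu, 0x18 };
	uint32_t a[5] = { 0x67452301u, 0xefcdab89u, 0x98badcfeu,
			  0x10325476u, 0xc3d2e1f0u };
	uint32_t b[5];
	int i;

	for (i = 0; i < 5; i++)
		b[i] = a[i];
	rounds_renamed(a, w);
	for (i = 0; i < 5; i++)
		round_shuffle(b, w[i]);
	for (i = 0; i < 5; i++)
		if (a[i] != b[i])
			return 0;
	return 1;
}
```

Both versions compute the same thing; the only difference is what the compiler gets to see, which is exactly the kind of thing that ends up scheduled differently from one gcc version or CPU to the next.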

However, note that since this is the 64-bit thing, there likely aren't any 
spill issues, but it's simply an issue of "just how did the array[] 
accesses get scheduled" etc. And since this is a Prescott (or rather 
"Xeon") P4, the shifter isn't quite as horrible as yours is. _And_ this is 
a different gcc version (4.0.3).

So the numbers aren't really all that comparable. It's more an example of 
"optimizing for P4 is futile, because you're just playing with total 
randomness". That's like a 20MB/s difference, just from moving a few ALU 
ops around a bit.
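
Spelling out the arithmetic behind that "20MB/s", using only the two "linus" rows from the tables above (the helper names here are made up for illustration): the drop from 151.9 to 131.2 MB/s is roughly 13.6%, and TIME times SPEED comes out at about 61 MB for both rows, so both runs appear to hash the same amount of data.

```c
/* Relative slowdown in percent when going from `before` to `after` MB/s. */
static double slowdown_pct(double before, double after)
{
	return 100.0 * (before - after) / before;
}

/* Amount of data implied by one table row: TIME[s] * SPEED[MB/s]. */
static double implied_mb(double time_s, double speed_mbs)
{
	return time_s * speed_mbs;
}
```

For example, slowdown_pct(151.9, 131.2) covers the "linus" row before and after the patch, and implied_mb(0.4018, 151.9) recovers the data size from the baseline row.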

And it's entirely possible that if I had gcc-4.4 on that machine, your 
patch would magically do the right thing ;)

Sadly, that machine is just an ssh gateway, so there are no real development 
tools on it at all - no way to get good profiles etc. So I can't really 
say exactly what the problem pattern is :(

		Linus