Re: block-sha1: improve code on large-register-set machines

Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx> · Tue, 11 Aug 2009 08:43:21 -0700 (PDT)

On Tue, 11 Aug 2009, Nicolas Pitre wrote:
> 
> BLK_SHA1:	 5.280s		[original]
> BLK_SHA1:	 7.410s		[with SMALL_REGISTER_SET defined]
> BLK_SHA1:	 7.480s		[with 'W(x)=(val);asm("":"+m" (W(x)))']
> BLK_SHA1:	 4.980s		[with 'W(x)=(val);asm("":::"memory")']
> 
> At this point the generated assembly is pretty slick.  I bet the full 
> memory barrier might help on x86 as well.

No, I had tested that earlier - single-word memory barrier for some reason 
gets _much_ better numbers at least on x86-64. We're talking

	linus            1.46       418.2
vs
	linus           2.004       304.6

kind of differences. With the "+m" it outperforms openssl (375-380MB/s).

The "volatile unsigned int *" cast looks pretty much like the "+m" version 
to me, but Arthur got a speedup from whatever gcc code generation 
differences on his P4.

The really fundamental and basic problem with gcc on this code is that gcc 
does not see _any_ difference what-so-ever between the five variables 
declared with

	unsigned int A, B, C, D, E;

and the sixteen variables declared with

	unsigned int array[16];

and considers those all to be 21 local variables. It really seems to think 
that they are all 100% equivalent, and gcc totally ignores me doing things 
like adding "register" to the A-E ones etc.

And if you are a compiler, and think that the routine has 21 equal 
register variables, you're going to do crazy reload sh*t when you have 
only 7 (or 15) GP registers. So doing that full memory barrier seems to 
just take that random situation, and force some random variable to be 
spilled (this is all from looking at the generated code, not from looking 
at gcc).

In contrast, with the _targeted_ thing ("you'd better write back into 
array[]") we force gcc to spill the array[16] values, and not the A-E 
ones, and that's why it seems to make such a big difference.

And no, I'm not sure why ARM apparently doesn't show the same behavior. Or 
maybe it does, but with an in-order core it doesn't matter as much which 
registers you keep reloading - you'll be serialized all the time _anyway_. 

			Linus
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html