Re: block-sha1: improve code on large-register-set machines

On Tue, 11 Aug 2009, Linus Torvalds wrote:

> On Tue, 11 Aug 2009, Nicolas Pitre wrote:
> > 
> > BLK_SHA1:	 5.280s		[original]
> > BLK_SHA1:	 7.410s		[with SMALL_REGISTER_SET defined]
> > BLK_SHA1:	 7.480s		[with 'W(x)=(val);asm("":"+m" (W(x)))']
> > BLK_SHA1:	 4.980s		[with 'W(x)=(val);asm("":::"memory")']
> > 
> > At this point the generated assembly is pretty slick.  I bet the full 
> > memory barrier might help on x86 as well.
> 
> No, I had tested that earlier - single-word memory barrier for some reason 
> gets _much_ better numbers at least on x86-64. We're talking
> 
> 	linus            1.46       418.2
> vs
> 	linus           2.004       304.6
> 
> kind of differences. With the "+m" it outperforms openssl (375-380MB/s).
> 
> The "volatile unsigned int *" cast looks pretty much like the "+m" version 
> to me, but Arthur got a speedup from whatever gcc code generation 
> differences on his P4.

The volatile pointer forces a write to memory but the cached value in 
the processor's register remains valid, whereas the "+m" forces gcc to 
assume the register copy is not valid anymore.  That certainly gives the 
compiler a different clue about register availability, etc.
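
For reference, here is a minimal sketch of the three store flavors being
compared; W(), the loader and the macro names are illustrative, not
necessarily the identifiers used in the actual block-sha1 code:

	static unsigned int array[16];		/* the 16-word circular buffer */
	#define W(x)	(array[(x) & 15])

	/* 1. volatile store: forces the write to memory, but gcc may keep
	 *    using the value it already holds in a register afterwards */
	#define setW_volatile(x, val) \
		(*(volatile unsigned int *)&W(x) = (val))

	/* 2. targeted "+m" barrier: the write happens and gcc must treat
	 *    its register copy of W(x) as stale, reloading it on next use */
	#define setW_plus_m(x, val) \
		do { W(x) = (val); __asm__("" : "+m" (W(x))); } while (0)

	/* 3. full memory clobber: every memory-resident value is treated
	 *    as clobbered, which nudges gcc into spilling the array rather
	 *    than the A-E work variables */
	#define setW_memory(x, val) \
		do { W(x) = (val); __asm__("" : : : "memory"); } while (0)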

> The really fundamental and basic problem with gcc on this code is that gcc 
> does not see _any_ difference what-so-ever between the five variables 
> declared with
> 
> 	unsigned int A, B, C, D, E;
> 
> and the sixteen variables declared with
> 
> 	unsigned int array[16];
> 
> and considers those all to be 21 local variables. It really seems to think 
> that they are all 100% equivalent, and gcc totally ignores me doing things 
> like adding "register" to the A-E ones etc.
> 
> And if you are a compiler, and think that the routine has 21 equal 
> register variables, you're going to do crazy reload sh*t when you have 
> only 7 (or 15) GP registers. So doing that full memory barrier seems to 
> just take that random situation, and force some random variable to be 
> spilled (this is all from looking at the generated code, not from looking 
> at gcc).
> 
> In contrast, with the _targeted_ thing ("you'd better write back into 
> array[]") we force gcc to spill the array[16] values, and not the A-E 
> ones, and that's why it seems to make such a big difference.
> 
> And no, I'm not sure why ARM apparently doesn't show the same behavior. Or 
> maybe it does, but with an in-order core it doesn't matter as much which 
> registers you keep reloading - you'll be serialized all the time _anyway_. 

Well... gcc is really strange in this case (and other similar ones) with 
ARM compilation.  A good indicator of code quality is the size of the 
stack frame.  When using "+m", gcc creates an 816-byte stack frame, the 
generated binary grows by roughly 3000 bytes, and performance is almost 
halved (7.600s).  Looking at the assembly output I just can't figure out 
all the crazy moves taking place.  Even the version with no barrier 
whatsoever produces better assembly, with a 560-byte stack frame.

The volatile version is second worst, with an 808-byte stack frame and 
similarly bad performance.

With the full "memory" the stack frame shrinks to 280 bytes and best 
performances so far is obtained.  And none of the A B C D E or data 
variables are ever spilled onto the stack either, only the array[16] 
gets allocated stack slots, and the TEMP variable.
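
To illustrate why that works out, here is a hedged sketch of what one of 
the round 0..15 steps looks like in the rotating-variable style, reusing 
the setW_memory() and W() names from the sketch above (the load helper 
is likewise illustrative, not git's actual identifier):

	#define SHA_ROT(x, n)	(((x) << (n)) | ((x) >> (32 - (n))))

	/* One round 0..15 step: the five state words are passed in a
	 * different order on each round, so A-E never need to be shuffled
	 * through memory.  Only the freshly loaded input word (TEMP) is
	 * stored into the circular buffer, via the full-clobber store. */
	#define T_0_15(t, A, B, C, D, E) do {				\
		unsigned int TEMP = load_be32(data, t);	/* illustrative */ \
		setW_memory(t, TEMP);					\
		(E) += TEMP + SHA_ROT(A, 5)				\
			+ ((((C) ^ (D)) & (B)) ^ (D)) + 0x5a827999;	\
		(B) = SHA_ROT(B, 30);					\
	} while (0)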


Nicolas
