On Tue, 11 Aug 2009, Linus Torvalds wrote:
> 
> On Tue, 11 Aug 2009, Nicolas Pitre wrote:
> > 
> > Well... gcc is really strange in this case (and similar other ones) with
> > ARM compilation.  A good indicator of the quality of the code is the
> > size of the stack frame.  When using the "+m" then gcc creates an 816
> > byte stack frame, the generated binary grows by approx 3000 bytes, and
> > performance is almost halved (7.600s).  Looking at the assembly result
> > I just can't figure out all the crazy moves taking place.  Even the
> > version with no barrier whatsoever produces better assembly with a
> > stack frame of 560 bytes.
> 
> Ok, that's just crazy.  That function has a required stack size of exactly
> 64 bytes, and anything more than that is just spilling.  And if you end up
> with a stack frame of 560 bytes, that means that gcc is doing some _crazy_
> spilling.

Btw, what I think happens is:

 - gcc turns all those array accesses into pseudos.  So now the
   'array[16]' is seen as just another 16 variables rather than an array.

 - gcc then turns it into SSA, where each assignment basically creates a
   new variable.  So the 16 array variables (and 5 regular variables) are
   now expanded to 80 SSA assignments (one array assignment per SHA1
   round) plus an additional 2 assignments to the "regular" variables per
   round (B and E are changed each round).

   So in SSA form, you actually end up having 240 pseudos associated with
   the actual variables.  Plus all the temporaries, of course.

 - the thing then goes crazy and tries to generate great code from that
   internal SSA model.  And since there are never more than ~25 things
   _live_ at any particular point, it works fine with lots of registers,
   but on small-register machines gcc just goes crazy and has to spill.

   And it doesn't spill 'array[x]' entries - it spills the _pseudos_ it
   has created - hundreds of them.

 - End result: if the spill code doesn't share slots, it's going to
   create a totally unholy mess of crap.
That's what the whole 'volatile unsigned int *' game tried to avoid.  But
it really sounds like it's not working too well for you.

And the _big_ memory barrier ends up helping just because with that in
place, you end up being almost entirely unable to schedule _anything_
between the different SHA rounds, so you end up with only six or seven
variables "live" in between those barriers, and the stupid register
allocator/spill logic doesn't break down too badly.

The thing is, if you do full memory barriers, then you're probably better
off making both the loads and the stores be "volatile".  That should have
similar effects.  The downside with that is that it really limits the
loads.  So (like the full memory barrier) it's a big hammer approach.
But it probably generates better code for you, because it avoids the
mental breakdown of gcc spilling its pseudos.

		Linus