On Tue, 11 Aug 2009, Linus Torvalds wrote:
>
> On Tue, 11 Aug 2009, Linus Torvalds wrote:
> >
> > On Tue, 11 Aug 2009, Nicolas Pitre wrote:
> > >
> > > Well... gcc is really strange in this case (and similar other ones)
> > > with ARM compilation.  A good indicator of the quality of the code
> > > is the size of the stack frame.  When using the "+m" then gcc
> > > creates an 816-byte stack frame, the generated binary grows by
> > > approx 3000 bytes, and performance is almost halved (7.600s).
> > > Looking at the assembly result I just can't figure out all the
> > > crazy moves taking place.  Even the version with no barrier
> > > whatsoever produces better assembly, with a 560-byte stack frame.
> >
> > Ok, that's just crazy.  That function has a required stack size of
> > exactly 64 bytes, and anything more than that is just spilling.  And
> > if you end up with a stack frame of 560 bytes, that means that gcc
> > is doing some _crazy_ spilling.
>
> Btw, what I think happens is:
>
>  - gcc turns all those array accesses into pseudo's
>
>    So now the 'array[16]' is seen as just another 16 variables rather
>    than an array.
>
>  - gcc then turns it into SSA, where each assignment basically creates
>    a new variable.  So the 16 array variables (and 5 regular variables)
>    are now expanded to 80 SSA assignments (one array assignment per
>    SHA1 round) plus an additional 2 assignments to the "regular"
>    variables per round (B and E are changed each round).  So in SSA
>    form, you actually end up having 240 pseudo's associated with the
>    actual variables.  Plus all the temporaries, of course.
>
>  - the thing then goes crazy and tries to generate great code from
>    that internal SSA model.  And since there are never more than ~25
>    things _live_ at any particular point, it works fine with lots of
>    registers, but on small-register machines gcc just goes crazy and
>    has to spill.  And it doesn't spill 'array[x]' entries - it spills
>    the _pseudo's_ it has created - hundreds of them.
>
>  - End result: if the spill code doesn't share slots, it's going to
>    create a totally unholy mess of crap.
>
> That's what the whole 'volatile unsigned int *' game tried to avoid.
> But it really sounds like it's not working too well for you.  And the
> _big_ memory barrier ends up helping just because with that in place,
> you end up being almost entirely unable to schedule _anything_ between
> the different SHA rounds, so you end up with only six or seven
> variables "live" in between those barriers, and the stupid register
> allocator/spill logic doesn't break down too badly.
>
> The thing is, if you do full memory barriers, then you're probably
> better off making both the loads and the stores be "volatile".  That
> should have similar effects: volatile loads give gcc much less freedom
> to schedule them.
>
> The downside with that is that it really limits the loads.  So (like
> the full memory barrier) it's a big hammer approach.  But it probably
> generates better code for you, because it avoids the mental breakdown
> of gcc spilling its pseudo's.

Actually, all my previous tests were done with gcc-4.3.2.  I've now
installed Fedora 11, which has gcc-4.4.0.  And now the stack frame is a
nice 64 bytes. ;-)

That's with the "memory" barrier, though.  With the volatile pointer,
the stack frame goes up to 224 bytes and performance, although not as
bad as before, is 5.160s instead of 4.410s.  The "+m" version is not
much better: a 208-byte stack frame and similar performance.
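For reference, the three barrier variants being compared look roughly
like this - a sketch only, with illustrative macro names rather than
the exact ones from the git block-sha1 code:

	/* 1. Full compiler barrier: tells gcc that all of memory may
	 *    have changed, so no value can be cached in a register or
	 *    scheduled across it. */
	#define BARRIER()	__asm__ __volatile__("" : : : "memory")

	/* 2. Per-object "+m" constraint: only the named object is
	 *    forced out to memory; gcc may still reorder everything
	 *    else around the asm. */
	#define TOUCH(x)	__asm__("" : "+m" (x))

	/* 3. Volatile pointer store: the store itself can't be
	 *    eliminated or reordered against other volatile accesses,
	 *    but plain loads remain fully schedulable. */
	#define SETW(W, i, v)	(*(volatile unsigned int *)&(W)[i] = (v))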
The version with no barrier whatsoever runs in 4.580s and uses an
88-byte stack frame.  The generated assembly still contains some stupid
things, but this is the second best version overall, better than both
the "+m" and the volatile pointer ones.

Conclusion: the full "memory" barrier remains the best choice on ARM.


Nicolas
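For readers following along, the winning arrangement amounts to placing
the full barrier at the end of each round, something like the sketch
below.  The round shape is assumed for illustration and is not the
actual block-sha1 source; f(), K(), and W[] stand in for the usual SHA1
round function, round constant, and message schedule.

	static inline unsigned int rol32(unsigned int x, int n)
	{
		return (x << n) | (x >> (32 - n));
	}

	/* One barrier per round keeps the live set down to the five
	 * working variables plus a temporary, so gcc can't merge
	 * rounds and then spill hundreds of pseudo's. */
	for (i = 0; i < 80; i++) {
		unsigned int t = rol32(a, 5) + f(i, b, c, d)
				 + e + W[i & 15] + K(i);
		e = d; d = c; c = rol32(b, 30); b = a; a = t;
		__asm__ __volatile__("" : : : "memory");
	}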