On Tue, 11 Aug 2009, Linus Torvalds wrote:
> 
> On Tue, 11 Aug 2009, Nicolas Pitre wrote:
> > 
> > Well... gcc is really strange in this case (and similar other ones) with
> > ARM compilation.  A good indicator of the quality of the code is the
> > size of the stack frame.  When using the "+m" then gcc creates an 816
> > byte stack frame, the generated binary grows by approx 3000 bytes, and
> > performance is almost halved (7.600s).  Looking at the assembly result
> > I just can't figure out all the crazy moves taking place.  Even the
> > version with no barrier whatsoever produces better assembly with a
> > stack frame of 560 bytes.
> 
> Ok, that's just crazy.  That function has a required stack size of exactly
> 64 bytes, and anything more than that is just spilling.  And if you end up
> with a stack frame of 560 bytes, that means that gcc is doing some _crazy_
> spilling.

Btw, what I think happens is:

 - gcc turns all those array accesses into pseudos.  So now the
   'array[16]' is seen as just another 16 variables rather than an array.

 - gcc then turns it into SSA, where each assignment basically creates a
   new variable.  So the 16 array variables (and 5 regular variables) are
   now expanded to 80 SSA assignments (one array assignment per SHA1
   round) plus an additional 2 assignments to the "regular" variables per
   round (B and E are changed each round).

   So in SSA form, you actually end up having 240 pseudos associated with
   the actual variables.  Plus all the temporaries, of course.

 - the thing then goes crazy and tries to generate great code from that
   internal SSA model.  And since there are never more than ~25 things
   _live_ at any particular point, it works fine with lots of registers,
   but on small-register machines gcc just goes crazy and has to spill.

   And it doesn't spill 'array[x]' entries - it spills the _pseudos_ it
   has created - hundreds of them.

 - End result: if the spill code doesn't share slots, it's going to
   create a totally unholy mess of crap.
That's what the whole 'volatile unsigned int *' game tried to avoid.  But
it really sounds like it's not working too well for you.

And the _big_ memory barrier ends up helping just because with that in
place, you end up being almost entirely unable to schedule _anything_
between the different SHA rounds, so you end up with only six or seven
variables "live" in between those barriers, and the stupid register
allocator/spill logic doesn't break down too badly.

The thing is, if you do full memory barriers, then you're probably better
off making both the loads and the stores be "volatile".  That should have
similar effects.  The downside with that is that it really limits the
loads.  So (like the full memory barrier) it's a big hammer approach.
But it probably generates better code for you, because it avoids the
mental breakdown of gcc spilling its pseudos.

		Linus