On Tue, 11 Aug 2009, Linus Torvalds wrote:
>
> On Tue, 11 Aug 2009, Linus Torvalds wrote:
> >
> > On Tue, 11 Aug 2009, Nicolas Pitre wrote:
> > >
> > > Well... gcc is really strange in this case (and similar other ones)
> > > with ARM compilation.  A good indicator of the quality of the code
> > > is the size of the stack frame.  When using the "+m" then gcc
> > > creates an 816-byte stack frame, the generated binary grows by
> > > approx 3000 bytes, and performance is almost halved (7.600s).
> > > Looking at the assembly result I just can't figure out all the
> > > crazy moves taking place.  Even the version with no barrier
> > > whatsoever produces better assembly, with a 560-byte stack frame.
> >
> > Ok, that's just crazy.  That function has a required stack size of
> > exactly 64 bytes, and anything more than that is just spilling.  And
> > if you end up with a stack frame of 560 bytes, that means that gcc
> > is doing some _crazy_ spilling.
>
> Btw, what I think happens is:
>
>  - gcc turns all those array accesses into pseudo's
>
>    So now the 'array[16]' is seen as just another 16 variables rather
>    than an array.
>
>  - gcc then turns it into SSA, where each assignment basically creates
>    a new variable.  So the 16 array variables (and 5 regular variables)
>    are now expanded to 80 SSA assignments (one array assignment per
>    SHA1 round) plus an additional 2 assignments to the "regular"
>    variables per round (B and E are changed each round).  So in SSA
>    form, you actually end up having 240 pseudo's associated with the
>    actual variables.  Plus all the temporaries, of course.
>
>  - the thing then goes crazy and tries to generate great code from
>    that internal SSA model.  And since there are never more than ~25
>    things _live_ at any particular point, it works fine with lots of
>    registers, but on small-register machines gcc just goes crazy and
>    has to spill.  And it doesn't spill 'array[x]' entries - it spills
>    the _pseudo's_ it has created - hundreds of them.
>
>  - End result: if the spill code doesn't share slots, it's going to
>    create a totally unholy mess of crap.
>
> That's what the whole 'volatile unsigned int *' game tried to avoid.
> But it really sounds like it's not working too well for you.  And the
> _big_ memory barrier ends up helping just because with that in place,
> you end up being almost entirely unable to schedule _anything_ between
> the different SHA rounds, so you end up with only six or seven
> variables "live" in between those barriers, and the stupid register
> allocator/spill logic doesn't break down too badly.
>
> The thing is, if you do full memory barriers, then you're probably
> better off making both the loads and the stores be "volatile".  That
> should have similar effects: volatile loads give gcc much less freedom
> to schedule them.
>
> The downside with that is that it really limits the loads.  So (like
> the full memory barrier) it's a big hammer approach.  But it probably
> generates better code for you, because it avoids the mental breakdown
> of gcc spilling its pseudo's.

Actually, all my previous tests were done with gcc-4.3.2.  I've now
installed Fedora 11, which has gcc-4.4.0.  And now the stack frame is a
nice 64 bytes. ;-)

That's with the "memory" barrier, though.  With the volatile pointer,
the stack frame goes up to 224 bytes and performance, although not as
bad as before, is 5.160s instead of 4.410s.  The "+m" version is not
much better: a 208-byte stack frame and similar performance.
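For reference, the three barrier variants being compared look roughly
like this - a sketch only, with illustrative macro names rather than
the exact ones from the git block-sha1 code:

	/* 1. Full compiler barrier: tells gcc that all of memory may
	 *    have changed, so no value can be cached in a register or
	 *    scheduled across it. */
	#define BARRIER()	__asm__ __volatile__("" : : : "memory")

	/* 2. Per-object "+m" constraint: only the named object is
	 *    forced out to memory; gcc may still reorder everything
	 *    else around the asm. */
	#define TOUCH(x)	__asm__("" : "+m" (x))

	/* 3. Volatile pointer store: the store itself can't be
	 *    eliminated or reordered against other volatile accesses,
	 *    but plain loads remain fully schedulable. */
	#define SETW(W, i, v)	(*(volatile unsigned int *)&(W)[i] = (v))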
The version with no barrier whatsoever runs in 4.580s and uses an
88-byte stack frame.  The generated assembly still contains some stupid
things, but this is the second best version overall, better than both
the "+m" and the volatile pointer ones.

Conclusion: the full "memory" barrier remains the best choice on ARM.


Nicolas
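For readers following along, the winning arrangement amounts to placing
the full barrier at the end of each round, something like the sketch
below.  The round shape is assumed for illustration and is not the
actual block-sha1 source; f(), K(), and W[] stand in for the usual SHA1
round function, round constant, and message schedule.

	static inline unsigned int rol32(unsigned int x, int n)
	{
		return (x << n) | (x >> (32 - n));
	}

	/* One barrier per round keeps the live set down to the five
	 * working variables plus a temporary, so gcc can't merge
	 * rounds and then spill hundreds of pseudo's. */
	for (i = 0; i < 80; i++) {
		unsigned int t = rol32(a, 5) + f(i, b, c, d)
				 + e + W[i & 15] + K(i);
		e = d; d = c; c = rol32(b, 30); b = a; a = t;
		__asm__ __volatile__("" : : : "memory");
	}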