Thanks for the hints.  I really appreciate the advice, and this access
to the secrets of the GCC initiate.  Hope you don't mind if I ask some
follow-up questions.

> $ for i in *.s; do echo -n "${i}: "; grep -F -e memcpy ${i} | wc --lines; done

Yup: we had also noticed the zillions of calls to memcpy with static
arguments.  This is part of what I meant by "unnecessary data
shuffling".  Is there some way to tell GCC that it isn't worth calling
memcpy to copy such short structures?  If GCC did the copying using
explicit assembly code, it would probably be able to notice a host of
shuffling reduction opportunities using things like peephole
optimization.  At least, that was our impression from looking at the
generated code.

> The way you are using structures forces GCC to copy data around.
> ... Change structures into scalar variables ... GCC has more
> freedom to place scalar variables than structures.

Some copying is of course unavoidable, especially at procedure call
boundaries.  But none of the structures are on the heap (no malloc)
and we never take any addresses, so in theory they could be held in
registers and kept in fragmented representations and that sort of
thing.  I do not know if GCC can do that, or if there's any way to
tell it to.  We could re-jigger our back end to generate FORTRAN
instead of C and use GCC's FORTRAN stuff; maybe that would help?

> Unless you somehow manage to inline the whole program into main(), I
> don't see how it can be any different.

That would certainly be ideal!  (Modulo cycles in the call graph that
contain a non-tail-recursive link; I do not believe there is any such
cycle in this particular hunk of C code.)  Perhaps there is some way
to tell GCC that when I declare a procedure "inline", I really mean
it?  As you can see, our compiler goes to some trouble to mark which
procedures it thinks should always be inlined versus which should be
the C compiler's judgement call.

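To make the inlining question concrete, here is a minimal sketch of
the kind of code involved.  Everything in it is hypothetical: the
`dual` structure, the function names, and the two marker macros are
stand-ins made up for illustration, not our compiler's actual output.
GCC does document an `always_inline` function attribute, and the
sketch assumes that is the sort of "I really mean it" marking one
would want; whether it makes a difference on code of this size is
untested here.

```c
/* Hypothetical two-field structure, standing in for the short
   structures that get passed around by value.  Nothing is on the
   heap and no address is ever taken. */
typedef struct {
  double primal;
  double tangent;
} dual;

/* Hypothetical marker macros (assumptions, not from the real code).
   ALWAYS_INLINE asks GCC to inline unconditionally via its documented
   always_inline attribute; MAYBE_INLINE leaves the decision to the
   C compiler's own heuristics. */
#if defined(__GNUC__)
#define ALWAYS_INLINE static inline __attribute__((always_inline))
#else
#define ALWAYS_INLINE static inline
#endif
#define MAYBE_INLINE static inline

/* Build a structure from two scalars and return it by value. */
ALWAYS_INLINE dual dual_make(double p, double t) {
  dual d;
  d.primal = p;
  d.tangent = t;
  return d;
}

/* Product rule on the two fields; takes and returns structures by
   value, which is where the copying shows up. */
ALWAYS_INLINE dual dual_mul(dual a, dual b) {
  return dual_make(a.primal * b.primal,
                   a.primal * b.tangent + a.tangent * b.primal);
}

/* A procedure whose inlining is left as the compiler's judgement call. */
MAYBE_INLINE dual dual_square(dual a) {
  return dual_mul(a, a);
}
```

Calling `dual_square(dual_make(3.0, 1.0))` yields a structure with
`primal` 9 and `tangent` 6.  Compiling a file like this with `-S` and
grepping the assembly for memcpy, as in the loop quoted above, is one
way to see whether the by-value copies survive.
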
> You will definitely want a lot of inlining for this sort of code, so
> at least use -O3, but perhaps play with the inlining parameters too.

Right; -O3 didn't make any qualitative difference.  (I certainly tried
that before posting.)  I do see a whole bunch of inline-related
parameters in the GCC documentation, but it is not clear which I
should tweak.  I tried -O3 -finline-limit=60000 (default 600) but even
that doesn't make any qualitative difference.

Cheers & Thanks,

--Barak.