Ben Cheng wrote:
Well, I guess the real question is how to make gcc schedule code better
when loop unrolling is enabled?
My original code is actually:
for (i = 0; i < 4096; i++) {
    g[i] = h[i] + 10;
}
After gcc unrolls the loop, the loop bodies from different iterations
don't overlap, because loads from later iterations are not scheduled
across stores from earlier ones. I thought this might be a
phase-ordering problem between optimization passes, so I unrolled the
loop manually. Unfortunately, I still cannot get gcc to schedule the
loads and stores more aggressively.
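For comparison, here is a minimal sketch of the schedule being asked for, written by hand: the loop is unrolled by 4 and the four loads are grouped ahead of the four stores through local temporaries. The function name add10_unrolled, the int element type and the N_ELEMS constant are assumptions for illustration. This is only equivalent to the original loop when g and h do not overlap, which is exactly the fact the compiler cannot prove on its own.

```c
#include <stddef.h>

#define N_ELEMS 4096 /* assumed trip count from the original loop; divisible by 4 */

/* Hypothetical helper: manually unrolled by 4, with all loads issued
   before any store. Only valid if g and h do not overlap. */
void add10_unrolled(int *g, const int *h)
{
    for (size_t i = 0; i < N_ELEMS; i += 4) {
        /* loads for four iterations */
        int t0 = h[i];
        int t1 = h[i + 1];
        int t2 = h[i + 2];
        int t3 = h[i + 3];
        /* stores for the same four iterations */
        g[i]     = t0 + 10;
        g[i + 1] = t1 + 10;
        g[i + 2] = t2 + 10;
        g[i + 3] = t3 + 10;
    }
}
```

This is the shape that requires temporaries, which is why the poster would rather have gcc's own unroller and scheduler produce it.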
Since I want gcc to unroll the loop for me, I cannot create temporaries
for h[i]. Therefore I am still hoping for some magic command-line
option that makes gcc produce better schedules.
There is no such magic option. The problem is not in the scheduler
itself: the loads could be moved across the stores when/if we have more
accurate aliasing info at the RTL level.
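One way to state the missing aliasing fact at the source level is the C99 restrict qualifier, sketched below under assumptions: the function name add10_restrict and the int element type are hypothetical, and whether a particular gcc version carries this no-alias information down to the RTL scheduler is not guaranteed (the point above is precisely that RTL-level aliasing info is the weak spot).

```c
#include <stddef.h>

/* Hypothetical function. 'restrict' promises the compiler that g and h
   never overlap, so a store to g[i] cannot change any h[j]. Whether the
   scheduler exploits this depends on the gcc version. */
void add10_restrict(int *restrict g, const int *restrict h, size_t n)
{
    for (size_t i = 0; i < n; i++)
        g[i] = h[i] + 10;
}
```

Unlike the hand-unrolled version, this keeps the simple loop and leaves unrolling to gcc; it must be compiled as C99 or later (e.g. -std=c99).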
Another problem is that even with more accurate alias analysis, it
might still be impossible to move loads/stores once register allocation
(RA) has run. Insn scheduling before RA is switched off for x86 and
x86_64 because of a bug that ultimately shows up in reload, when reload
cannot find a hard register for an insn operand. To get rid of this bug,
the insn scheduler should be register-pressure sensitive.
Also, it would be better to use software pipelining for this loop. You
can try -fmodulo-sched and see what happens; it might work.
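To show what software pipelining does to this loop, here is a minimal hand-written two-stage pipeline: the load for iteration i+1 is issued before the store for iteration i, with a prologue and an epilogue around the kernel. The name add10_pipelined and the int element type are assumptions, and this illustrates the transformation rather than the code -fmodulo-sched actually emits. As before, it matches the original loop only when g and h do not overlap.

```c
#include <stddef.h>

/* Hypothetical hand-pipelined loop (two stages: load, then add+store).
   Only equivalent to the original loop if g and h do not overlap. */
void add10_pipelined(int *g, const int *h, size_t n)
{
    if (n == 0)
        return;
    int cur = h[0];                  /* prologue: load for iteration 0 */
    for (size_t i = 0; i + 1 < n; i++) {
        int next = h[i + 1];         /* kernel: load for iteration i+1 */
        g[i] = cur + 10;             /* kernel: store for iteration i */
        cur = next;
    }
    g[n - 1] = cur + 10;             /* epilogue: store for last iteration */
}
```

With gcc, the automatic version would be requested with something like `gcc -O2 -fmodulo-sched file.c`; whether the modulo scheduler succeeds on a given target is not guaranteed.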