Question about first scheduling pass

Bradley Lucier <lucier@xxxxxxxxxxxxxxx> · Mon, 09 Nov 2009 22:26:33 -0500

I find that enabling scheduling before register allocation on x86-64 on
my codes often results in about a 10% increase in performance, so I'd
like to use it more often.

The pre-register-allocation scheduling pass often takes much longer
than the post-register-allocation pass on some of my program-generated C
codes.  For example:

 scheduling            :  69.33 (49%) usr   0.07 ( 3%) sys  85.07 (51%) wall    1954 kB ( 1%) ggc
 scheduling 2          :   0.63 ( 0%) usr   0.00 ( 0%) sys   0.63 ( 0%) wall     357 kB ( 0%) ggc
 TOTAL                 : 140.61             2.76           166.98             238286 kB

This code was compiled with

/pkgs/gcc-mainline-mem-stats/bin/gcc -march=core2 -msse4 -O3 -fschedule-insns -fmem-report -ftime-report -Wno-unused -fno-math-errno -fschedule-insns2 -fno-trapping-math -fno-strict-aliasing -fwrapv -fomit-frame-pointer -fPIC -fno-common -mieee-fp   -I"../include" -c -o "_io.o" -I. -DHAVE_CONFIG_H -D___GAMBCDIR="\"/usr/local/Gambit-C/v4.5.3\"" -D___SYS_TYPE_CPU="\"x86_64\"" -D___SYS_TYPE_VENDOR="\"unknown\"" -D___SYS_TYPE_OS="\"linux-gnu\"" -D___CONFIGURE_COMMAND="\"./configure CC=/pkgs/gcc-mainline/bin/gcc -march=core2 -msse4 -O3 -fschedule-insns --enable-multiple-versions --enable-single-host\"" -D___OBJ_EXTENSION="\".o\"" -D___EXE_EXTENSION="\"\"" -D___PRIMAL _io.c -D___LIBRARY 2> _io.out

with the compiler:

heine:~/programs/gcc/mainline/gcc> /pkgs/gcc-mainline-mem-stats/bin/gcc -v
Using built-in specs.
COLLECT_GCC=/pkgs/gcc-mainline-mem-stats/bin/gcc
COLLECT_LTO_WRAPPER=/pkgs/gcc-mainline-mem-stats/libexec/gcc/x86_64-unknown-linux-gnu/4.5.0/lto-wrapper
Target: x86_64-unknown-linux-gnu
Configured with: ../../mainline/configure --enable-checking=release --prefix=/pkgs/gcc-mainline-mem-stats --enable-languages=c,c++ --enable-gather-detailed-mem-stats -enable-stage1-languages=c,c++
Thread model: posix
gcc version 4.5.0 20091109 (experimental) [trunk revision 154037] (GCC) 

So the pre-register-allocation scheduling pass takes about half the CPU
time of the entire compile.
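
As a rough check of that fraction, here is a minimal C sketch that pulls
the user-time figure out of two -ftime-report lines and computes the
share.  It assumes the report layout shown above (pass name, a colon,
then user seconds as the first number); the function names are made up
for illustration and are not part of GCC.

```c
#include <assert.h>
#include <stdio.h>

/* Extract the first figure after the colon (user CPU seconds) from one
   -ftime-report line.  Returns 1 on success, 0 if the line does not
   match the assumed "name : seconds ..." layout. */
int report_usr_seconds(const char *line, double *usr)
{
    /* Skip the pass name up to the colon, then read the first number. */
    return sscanf(line, " %*[^:]: %lf", usr) == 1;
}

/* Share of total user time spent in one pass, as a percentage. */
double usr_share_percent(const char *pass_line, const char *total_line)
{
    double pass_usr = 0.0, total_usr = 0.0;
    int ok = report_usr_seconds(pass_line, &pass_usr)
             && report_usr_seconds(total_line, &total_usr);

    /* Both lines must parse, and the total must be positive. */
    assert(ok && total_usr > 0.0);
    return 100.0 * pass_usr / total_usr;
}
```

Given the "scheduling" and "TOTAL" lines from the report above, this
yields roughly 49%, in line with the percentage GCC prints.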

I've been trying to figure out why the first scheduling pass takes so
much longer than the second.  (In fact, I've asked this question in one
PR, but I can't find that PR right now.)  In the file sched-rgn.c I
found:

        /* This pass implements list scheduling within basic blocks.  It is
           run twice: (1) after flow analysis, but before register allocation,
           and (2) after register allocation.

           The first run performs interblock scheduling, moving insns between
           different blocks in the same "region", and the second runs only
           basic block scheduling.

From this I understand that the two scheduling passes do different
things, so it makes sense that they take dramatically different amounts
of time.
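
To make the distinction concrete, here is a source-level analogy in C.
It is only a sketch: the real pass works on RTL insns, not on C
statements, and the function names are hypothetical.  The first function
has an independent load in the block after the branch; the second shows
the kind of motion across block boundaries that, as I read the comment,
interblock scheduling may do and basic-block scheduling may not.

```c
/* As written: the load of b[i] sits in the join block after the branch,
   even though it does not depend on anything computed in the branch. */
long load_after_branch(const long *a, const long *b, int i, int flag)
{
    long x;
    if (flag)
        x = a[i] * 3;
    else
        x = a[i] + 7;
    long y = b[i];          /* independent of the branch outcome */
    return x + y;
}

/* Same computation with the load moved up into the entry block, so it
   can overlap with the compare and the branch.  A scheduler confined to
   single basic blocks could only reorder insns inside each block and
   would leave the load where it was. */
long load_hoisted(const long *a, const long *b, int i, int flag)
{
    long y = b[i];          /* moved across the block boundary */
    long x;
    if (flag)
        x = a[i] * 3;
    else
        x = a[i] + 7;
    return x + y;
}
```

Both functions return the same value for the same inputs; only the
position of the load relative to the branch differs.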

What I'd like to know is whether there's a way to modify the first
scheduling pass to be more like the second and then see whether I get
similar speedups to what I'm getting now.  Perhaps the interblock
scheduling is really what's giving me the speedup, and perhaps not.

As a hack, could I just change

      NEXT_PASS (pass_sched);

in passes.c to

      NEXT_PASS (pass_sched2);

Or should I change the definitions of pass_sched and pass_sched2 in
sched-rgn.c?

Also, there are a number of sched*.c files; are there types of
scheduling other than basic-block scheduling and inter-block scheduling
that I could try?

I suppose that if simple basic-block scheduling works well in the first
scheduling pass for certain types of codes, perhaps there could be a
compiler option that allows people to choose it.

Brad