I find that enabling scheduling before register allocation on x86-64 on my codes often results in about a 10% increase in performance, so I'd like to use it more often.

On some of my program-generated C codes, the pre-register-allocation scheduling pass takes a lot longer than the post-register-allocation pass, for example

 scheduling   :  69.33 (49%) usr   0.07 ( 3%) sys  85.07 (51%) wall    1954 kB ( 1%) ggc
 scheduling 2 :   0.63 ( 0%) usr   0.00 ( 0%) sys   0.63 ( 0%) wall     357 kB ( 0%) ggc
 TOTAL        : 140.61             2.76            166.98             238286 kB

This code was compiled with

/pkgs/gcc-mainline-mem-stats/bin/gcc -march=core2 -msse4 -O3 -fschedule-insns -fmem-report -ftime-report -Wno-unused -fno-math-errno -fschedule-insns2 -fno-trapping-math -fno-strict-aliasing -fwrapv -fomit-frame-pointer -fPIC -fno-common -mieee-fp -I"../include" -c -o "_io.o" -I. -DHAVE_CONFIG_H -D___GAMBCDIR="\"/usr/local/Gambit-C/v4.5.3\"" -D___SYS_TYPE_CPU="\"x86_64\"" -D___SYS_TYPE_VENDOR="\"unknown\"" -D___SYS_TYPE_OS="\"linux-gnu\"" -D___CONFIGURE_COMMAND="\"./configure CC=/pkgs/gcc-mainline/bin/gcc -march=core2 -msse4 -O3 -fschedule-insns --enable-multiple-versions --enable-single-host\"" -D___OBJ_EXTENSION="\".o\"" -D___EXE_EXTENSION="\"\"" -D___PRIMAL _io.c -D___LIBRARY 2> _io.out

with the compiler:

heine:~/programs/gcc/mainline/gcc> /pkgs/gcc-mainline-mem-stats/bin/gcc -v
Using built-in specs.
COLLECT_GCC=/pkgs/gcc-mainline-mem-stats/bin/gcc
COLLECT_LTO_WRAPPER=/pkgs/gcc-mainline-mem-stats/libexec/gcc/x86_64-unknown-linux-gnu/4.5.0/lto-wrapper
Target: x86_64-unknown-linux-gnu
Configured with: ../../mainline/configure --enable-checking=release --prefix=/pkgs/gcc-mainline-mem-stats --enable-languages=c,c++ --enable-gather-detailed-mem-stats -enable-stage1-languages=c,c++
Thread model: posix
gcc version 4.5.0 20091109 (experimental) [trunk revision 154037] (GCC)

So, pre-register-allocation scheduling takes about 1/2 the CPU time of the entire compile.
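
The "about 1/2" figure can be checked directly from the usr column of the time report. Below is a minimal C sketch of that arithmetic; the function names pass_share and print_sched_shares are made up for illustration, and the numbers are simply copied from the -ftime-report output above.

```c
#include <stdio.h>

/* Share of the total, in percent, that one pass accounts for.
   Returns 0 if the total is not positive. (Hypothetical helper.) */
double pass_share(double pass_seconds, double total_seconds)
{
    if (total_seconds <= 0.0)
        return 0.0;
    return 100.0 * pass_seconds / total_seconds;
}

/* Print the usr-time share of both scheduling passes, using the
   figures quoted from the time report. (Hypothetical helper.) */
void print_sched_shares(void)
{
    const double total_usr  = 140.61; /* TOTAL, usr seconds        */
    const double sched1_usr = 69.33;  /* "scheduling", usr seconds */
    const double sched2_usr = 0.63;   /* "scheduling 2", usr secs  */

    printf("scheduling   : %.1f%% of usr time\n",
           pass_share(sched1_usr, total_usr));
    printf("scheduling 2 : %.1f%% of usr time\n",
           pass_share(sched2_usr, total_usr));
    printf("first pass / second pass : %.0fx\n",
           sched1_usr / sched2_usr);
}
```

Calling print_sched_shares() from a main() prints the two percentages and the ratio between the passes, which makes it easy to compare several compilation units once their numbers are substituted in.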

I've been trying to figure out why the first scheduling pass takes so much longer than the second. (In fact, I've asked this question in one PR, but I can't find that PR right now.)

In the file sched-rgn.c I found

/* This pass implements list scheduling within basic blocks. It is
   run twice: (1) after flow analysis, but before register allocation,
   and (2) after register allocation.

   The first run performs interblock scheduling, moving insns between
   different blocks in the same "region", and the second runs only
   basic block scheduling.

So I understand from this that the two scheduling passes are doing two different things, so it makes sense that they take dramatically different amounts of time.

What I'd like to know is whether there's a way to modify the first scheduling pass to be more like the second, and then see whether I get speedups similar to what I'm getting now. Perhaps the interblock scheduling is really what's giving me the speedup, and perhaps not.

As a hack, could I just change

  NEXT_PASS (pass_sched);

in passes.c to

  NEXT_PASS (pass_sched2);

Or should I change the definitions of pass_sched and pass_sched2 in sched-rgn.c?

Also, there are a number of sched*.c files; are there types of scheduling other than basic-block scheduling and inter-block scheduling that I could try?

I suppose that if simple basic-block scheduling works well in the first scheduling pass for certain types of codes, perhaps there could be a compiler option that allows people to choose it.

Brad