function order optimization

alex margolin <margolin.alex@xxxxxxxxx> · Sat, 3 Aug 2013 13:47:14 +0300

Hi,

I'm optimizing an HPC application, and I'm getting a high rate of L1
instruction cache misses. I assume it's because the executable spans
quite a few memory pages and execution keeps jumping from one to
another. I need a way to place the critical functions on the same few
pages. I couldn't find anything like this in GCC - am I missing
something?

Option #1 is using attributes to mark code as hot or cold. This feels
like manual labor... there is quite a lot of code to cover and I'm not
sure how to set the threshold. Even if this were done automatically, I
suspect the result would still not be optimal, since a function can
only be marked hot or cold, with nothing in between.
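
For reference, Option #1 looks roughly like the sketch below. The
function names and bodies are hypothetical, made up for illustration;
only the hot and cold attributes themselves are GCC's. According to the
GCC documentation, on many targets such functions are placed into
special subsections of the text section, so hot functions end up close
together. Whether that is enough to fix the cache misses here is
untested.

```c
#include <stddef.h>

/* Hypothetical compute kernel, marked hot: GCC optimizes it more
   aggressively and, on many targets, groups it with other hot
   functions in a dedicated text subsection. */
__attribute__((hot))
double dot_product(const double *a, const double *b, size_t n)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; i++)
        sum += a[i] * b[i];
    return sum;
}

/* Hypothetical rarely taken error path, marked cold: GCC optimizes
   it for size and keeps it away from the hot code. */
__attribute__((cold))
int report_bad_input(size_t n)
{
    return n == 0 ? -1 : 0;
}
```

Note the binary choice: each function is either hot, cold, or left
unmarked, which is exactly the limitation described above.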

Option #2 is function ordering. I read this page on the Intel compiler:
http://software.intel.com/sites/products/documentation/hpc/composerxe/en-us/2011Update/cpp/lin/optaps/common/optaps_pgo_funcdata_order.htm
- I followed the instructions, and compared icc before and after this
optimization to gcc performance (both in terms of instruction cache
misses and in terms of overall runtime):
Before the optimization icc was worse than gcc by an order of magnitude
in cache misses, but after the optimization it improved by two orders
of magnitude and surpassed gcc. In terms of runtime icc went from much
worse to slightly better than gcc, but I'm pretty sure I can do better
with gcc with some effort.

The way this works is that you build the application with icc with
profiling enabled, run it to gather the profiling data, and then build
again based on this data. The optimized build reorders the functions to
be roughly in the order they are called in the code, which puts most
critical functions on the same page as their callers, so that a few
pages hold all the performance-critical code.
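
To make the ordering idea concrete, here is a toy sketch. It is an
assumption about the general approach, not what icc actually does: the
trace format, the MAX_FUNCS bound and the function name are all
hypothetical. Given a recorded trace of function ids in call order, it
emits each function once, in order of first call, which could then
serve as a layout order.

```c
#include <stddef.h>
#include <string.h>

#define MAX_FUNCS 64  /* assumed upper bound on function ids */

/* Hypothetical helper: writes each function id from `trace` to
   `layout` once, in order of first appearance. `layout` must have
   room for MAX_FUNCS entries. Returns the number of ids written. */
size_t layout_by_first_call(const int *trace, size_t trace_len,
                            int *layout)
{
    unsigned char seen[MAX_FUNCS];
    size_t count = 0;

    memset(seen, 0, sizeof seen);
    for (size_t i = 0; i < trace_len; i++) {
        int id = trace[i];
        if (id < 0 || id >= MAX_FUNCS || seen[id])
            continue;           /* skip invalid ids and repeats */
        seen[id] = 1;
        layout[count++] = id;   /* first call fixes the position */
    }
    return count;
}
```

A real implementation would presumably also weigh call counts and
caller/callee relationships rather than first-call order alone.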

Is there a way to do the same (or similar) with gcc?
I'm ready to do some coding for gcc (plugin?) if it's feasible...

Thanks,
Alex