Hi,

I'm optimizing an HPC application, and I'm getting a high rate of L1 instruction cache misses. I assume it's because the executable spans quite a few memory pages and execution keeps jumping from one to another. I need a way to place the critical functions on the same few pages. I couldn't find anything like this in GCC. Am I missing something?

Option #1 is using attributes to mark code as hot or cold. This feels like manual labor: there is quite a lot of code to cover, and I'm not sure how to set the threshold. Even if this were done automatically, I suspect the result would still not be optimal, since code can only be either hot or cold, with nothing in between.

Option #2 is function ordering. I read this page on the Intel compiler:
http://software.intel.com/sites/products/documentation/hpc/composerxe/en-us/2011Update/cpp/lin/optaps/common/optaps_pgo_funcdata_order.htm

I followed the instructions and compared icc before and after this optimization against gcc, both in terms of instruction cache misses and in terms of overall runtime. Before the optimization, icc was worse than gcc by an order of magnitude in cache misses; after the optimization it improved by two orders of magnitude and surpassed gcc. In terms of runtime, icc went from much worse to slightly better than gcc, but I'm pretty sure I can do better with gcc with some effort.

The way this works is that you build with icc with profiling enabled, run the program to gather profiling data, and then build again based on this data. The optimized build reorganizes the functions to be roughly in the order in which they are called, so that most critical functions end up on the same page as their callers and a few pages hold all the performance-critical code.

Is there a way to do the same (or something similar) with gcc? I'm ready to do some coding for gcc (a plugin?) if it's feasible...

Thanks,
Alex
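
P.S. To make option #1 concrete, here is a minimal sketch of the hot/cold attribute approach. The function names are made up for illustration and are not from my application. As far as I understand the GCC documentation, on many targets functions marked this way are grouped into separate subsections of the text section (.text.hot and .text.unlikely when -freorder-functions is in effect), which is exactly the two-level split I described above.

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical rarely-taken error path. The cold attribute asks GCC to
   optimize it for size and, on many targets, to group it with other
   cold functions away from the hot code. */
__attribute__((cold, noinline))
void report_bad_input(const char *what)
{
    fprintf(stderr, "bad input: %s\n", what);
}

/* Hypothetical inner kernel. The hot attribute asks GCC to optimize it
   more aggressively and, on many targets, to group it with other hot
   functions so they sit close together in the text section. */
__attribute__((hot))
double sum_squares(const double *v, size_t n)
{
    if (v == NULL) {
        report_bad_input("null vector");
        return 0.0;
    }

    double acc = 0.0;
    for (size_t i = 0; i < n; ++i)
        acc += v[i] * v[i];
    return acc;
}
```

Compiling this with something like `gcc -O2 -c` and looking at the section headers with `objdump -h` should show whether the two functions really land in different subsections on a given target. It still only gives two buckets, though, not an ordering by call sequence.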