I have a routine that normally completes in just under 3 µs. The first time through, however, it takes over 18 µs. I've traced this to the first calls of a few math library functions: tanhf, atan2f, hypotf, and fmod. Subsequent calls are virtually instant.

I've tried putting __attribute__((optimize("prefetch-loop-arrays"))) on the outer function, but that isn't much help (which stands to reason, since the issue isn't caching the data but caching the code of the functions themselves).

Is there a magic option or builtin that pre-caches the few library functions I use? It's important for my application to reduce this first-cycle penalty.
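
For reference, the only workaround I can think of is a manual warm-up call at init time, along these lines (a minimal sketch; the function name, argument values, and the volatile sinks are just illustrative, chosen so the compiler can't optimize the calls away):

    #include <math.h>

    /* Call each libm function once at startup so the lazy PLT
     * resolution and the instruction-cache miss are paid here,
     * not on the first real iteration of the routine. */
    static void warm_up_libm(void)
    {
        volatile float  fsink;
        volatile double dsink;

        fsink = tanhf(0.5f);
        fsink = atan2f(1.0f, 2.0f);
        fsink = hypotf(3.0f, 4.0f);
        dsink = fmod(5.0, 3.0);

        (void)fsink;
        (void)dsink;
    }

I'd prefer something less ad hoc than this if a compiler or linker option exists for it.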