Well that was certainly a significant help, and super easy to do. And
your assumption that I'd rather trade startup slowness for runtime
efficiency is 100% correct, so thanks!

First cycle without LD_BIND_NOW: 18us
First cycle with LD_BIND_NOW: 12us
First cycle with those 3 math calls commented out, LD_BIND_NOW on or off: 6us
Second and subsequent cycles regardless of the above settings: 3us

(So in other words, even in the worst-case first cycle, the second and
subsequent cycles are always 3us)

Are there any more tricks like that to get a more uniform distribution
on that first cycle?

On Mon, Apr 11, 2016 at 2:09 AM, Matthias Pfaller <leo@xxxxxxxx> wrote:
> On 04/11/2016 05:17 AM, NightStrike wrote:
>> I have a routine that normally completes in just under 3 us. The
>> first time through, however, it takes over 18 us. I have found that
>> this is due to calling a few math library functions: tanhf, atan2f,
>> hypotf, and fmod. Subsequent calls are virtually instant.
>>
>> I've tried putting __attribute__((optimize("prefetch-loop-arrays")))
>> on the outer function, but this isn't much help (which would stand to
>> reason, since it's not an issue of caching the data, but of caching the
>> function.) Is it at all possible to use a magic option or builtin
>> that pre-caches the few library functions that I use? It's important
>> for my application to reduce the gap of the first cycle time.
>
> I'm not quite sure if I correctly understand your problem. But if you
> are talking about the time it takes to resolve the math functions from
> libm, you might try setting the environment variable "LD_BIND_NOW" to 1
> (see man ld.so). That way all external symbols get resolved at startup
> (which will be slower) instead of on demand when the unresolved function
> is first called.
>
> Matthias
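
P.S. One related trick, in the spirit of the "commented out" measurement
above, is to warm those libm entries up explicitly during initialization,
so that the one-time cost lands before the first real cycle. Below is a
minimal sketch, assuming the remaining first-cycle cost really is the
first-call resolution and cold caches for those specific functions;
warmup() is just an illustrative name, and the volatile qualifiers are
there so the compiler can't constant-fold the calls (their arguments are
compile-time constants otherwise) or discard the results:

#include <math.h>

/* Call each libm function once at startup so that symbol resolution
 * and instruction-cache fills happen here rather than in the first
 * time-critical cycle. */
static void warmup(void)
{
    volatile float  xf = 0.5f, yf = 2.0f;
    volatile double xd = 5.0,  yd = 3.0;
    volatile float  fsink;
    volatile double dsink;

    fsink = tanhf(xf);
    fsink = atan2f(xf, yf);
    fsink = hypotf(xf, yf);
    dsink = fmod(xd, yd);
}

int main(void)
{
    warmup();
    /* ... enter the real-time loop here ... */
    return 0;
}

And if the LD_BIND_NOW=1 behavior is wanted unconditionally, linking the
program with -Wl,-z,now bakes the same eager symbol resolution into the
binary, so the environment variable no longer has to be set.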