In typical applications the execution time is concentrated in critical
inner loops that represent a tiny fraction of total code size. Even
with the expansion in code size from -O3 such loops are likely to fit in
L1 cache and thus get faster as a result of fairly aggressive time over
space decisions in the optimizer.
Because of cache effects, almost all the code in a project will get
slower when the optimizer trades space for time. But making the
critical inner loops faster may improve total execution speed by more
than enough to offset slowing everything else down.
I'd be much happier if the optimizer had some options to be more cache
conscious (not exactly choosing space over speed, but choosing speed
with the understanding that instruction-cache misses are likely, so
smaller code will execute faster). Even with such options, the coder
(or profile-guided optimization, if you believe in that) must somehow
tag the critical loops where cache misses won't dominate the
performance.
Rainer Gerhards wrote:
I am wondering which optimization options offer me the best
runtime performance (speed of execution) on modern hardware.
The traditional thinking of time vs. space optimization is no longer
true due to CPU caches. Often, smaller code is more runtime efficient,
because the cache hit rates are much higher and that outweighs the
negative effects of jumps.