Hi there,

I am wondering which optimization options offer the best runtime performance (speed of execution) on modern hardware. The traditional time vs. space trade-off no longer holds in its old form because of CPU caches: smaller code is often faster at run time, because the higher cache hit rates outweigh the negative effects of jumps.

I am working on rsyslog[1], a GPLed enhanced syslogd replacement. As part of the system infrastructure, I would like to offer maximum speed, so I have begun to dig a bit into the docs (I am by far NO gcc expert user).

One of my project's users suggested using -Os, because the size reduction would probably result in faster execution (due to the caches). However, I see that -Os does not properly align some objects, so I think it can actually result in far worse cache performance. On the other hand, -O3 does things like loop unrolling, which I would consider counter-productive on today's CPU/cache subsystems. I have come to the conclusion that -O2 actually offers the best runtime performance, to be improved only by hand-picking individual optimization options.

I would appreciate feedback on this issue. Is my conclusion correct? Am I overlooking something (maybe something obvious)?

My ultimate goal is to modify the rsyslog build process so that we get the best runtime performance, and I also intend to write some documentation telling expert users how best to tweak it for those high-end systems. As such, I would really appreciate any help in gaining an in-depth understanding. Referrals to links are quite welcome; my googling unfortunately did not turn up anything I considered relevant (in the gcc context).

Many thanks,
Rainer Gerhards

[1] http://www.rsyslog.com
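
P.S.: In case it helps, here is a minimal sketch of the kind of build tweak I have in mind, assuming the usual autoconf-style CFLAGS override; the flags themselves are purely illustrative, not a recommendation:

    # illustrative only; assumes autoconf 2.50+ and a gcc recent
    # enough to know -march=native (4.2 or later)
    ./configure CFLAGS="-O2 -march=native -fomit-frame-pointer"
    make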