(I mistakenly sent this to gcc@xxxxxxx earlier, apologies for any inconvenience)

Hi list,

I'm experiencing strange behaviour with GCC 4.4.1. Basically, I have some C++ mathematical code which suffers a ~2x performance drop if I *remove* the following debug line from the code:

---------------
std::cout << "Block size: " << block_size << '\n';
---------------

where block_size is a std::size_t variable. It took a while to bisect this issue, but now I can reproduce it consistently by removing just that line. To get the expected performance, both the strings and the variable must be printed.

The architecture is x86_64, 64-bit Gentoo Linux on an Intel Core2 Q6600 CPU. The problem does not appear on another machine, a 64-bit Ubuntu Linux box with GCC 4.3.3 and Core2-based Intel Xeons (8 cores total).

Unfortunately the offending portion of code is buried quite deep in templated code, so it is a bit difficult for me to reduce the test case to a minimum. However, some background may be helpful in isolating the possible causes.

That portion of the code is conceptually quite simple. It is a polynomial multiplication routine which deals with two vectors of coefficients (in this specific case, double-precision coefficients) and two vectors of long ints representing the exponents (it's a kind of sparse representation of two univariate polynomials). The coefficients are multiplied one by one, and the corresponding exponents are added one by one, so that the resulting integers indicate the positions of the coefficient products in a third coefficient vector (which represents the result of the multiplication). To achieve the best performance, cache-blocking is employed to promote spatial and temporal locality. (A minimal sketch of this kind of kernel is appended at the end of this message.)

Since this portion of the code is quite critical, I've consistently tried to make sure its performance stays optimal. In fact, when the code is as fast as expected, the processor is fully utilizing its computing power, averaging around 4-5 clock cycles per coefficient multiplication on the three different Core2 processors I've tested the code with. This performance figure had been maintained consistently for at least a year, through various releases of GCC, until this problem arose.

I've tried playing around a bit with the optimization and -falign* switches, but I could not identify any concrete lead. The only possibly relevant clue is that the problem can be mitigated a bit by choosing the -Os optimization level instead of -O2.

To my non-expert eyes this looks like a case of a missed optimization (which is maybe re-enabled by the print to screen?), but at this point I am really at a loss. I would like to open a bug report, but first I wanted to understand whether there is something I'm completely missing.

Any help or comment would be really appreciated!

Thanks,
Francesco.

PS: if you reply, please CC me, as I'm not subscribed to the list.
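
PS2: for concreteness, here is a minimal, self-contained sketch of the kind of kernel I described above. All names and the exact blocking scheme are illustrative only, not the actual code from my project (which is templated and more involved), but the memory access pattern is essentially this:

---------------
#include <algorithm>
#include <cstddef>
#include <vector>

// Sparse univariate polynomial multiplication, blocked for cache locality.
// cf1/exp1 and cf2/exp2 are the coefficient/exponent vectors of the two
// input polynomials; retval must be zero-initialised and large enough to
// hold a coefficient at every exponent that can be produced.
void poly_multiply(const std::vector<double> &cf1,
                   const std::vector<long>   &exp1,
                   const std::vector<double> &cf2,
                   const std::vector<long>   &exp2,
                   std::vector<double>       &retval,
                   std::size_t               block_size)
{
    const std::size_t size1 = cf1.size(), size2 = cf2.size();
    // Iterate over blocks of the two input polynomials, so that the
    // region of retval being written to stays hot in cache.
    for (std::size_t i1 = 0; i1 < size1; i1 += block_size) {
        const std::size_t i1_end = std::min(size1, i1 + block_size);
        for (std::size_t i2 = 0; i2 < size2; i2 += block_size) {
            const std::size_t i2_end = std::min(size2, i2 + block_size);
            for (std::size_t i = i1; i < i1_end; ++i) {
                for (std::size_t j = i2; j < i2_end; ++j) {
                    // The sum of the exponents gives the position of the
                    // coefficient product in the result vector.
                    retval[static_cast<std::size_t>(exp1[i] + exp2[j])]
                        += cf1[i] * cf2[j];
                }
            }
        }
    }
}
---------------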