(I mistakenly sent this to gcc@xxxxxxx earlier, apologies for any inconvenience)

Hi list,

I'm experiencing strange behaviour with GCC 4.4.1. Basically, I have some C++ mathematical code which suffers a ~2x performance drop if I *remove* the following debug line from the code:

---------------
std::cout << "Block size: " << block_size << '\n';
---------------

where block_size is a std::size_t variable. It took a while to bisect this issue, but now I can reproduce it consistently by removing just that line. To get the expected performance, both the strings and the variable must be printed.

The architecture is x86_64, 64-bit Gentoo Linux on an Intel Core2 Q6600 CPU. The problem does not appear on another machine, a 64-bit Ubuntu Linux box with GCC 4.3.3 and Core2-based Intel Xeons (8 cores total).

Unfortunately the offending portion of code is buried quite deep in templated code, so it is a bit difficult for me to reduce the test case to a minimum. However, some background may be helpful in isolating the possible causes.

That portion of the code is conceptually quite simple. It is a polynomial multiplication routine which deals with two vectors of coefficients (in this specific case, double-precision coefficients) and two vectors of long ints representing the exponents (it's a kind of sparse representation of two univariate polynomials). The coefficients are multiplied one by one, and the corresponding exponents are added one by one, so that the resulting integers indicate the positions of the coefficient products in a third coefficient vector (which represents the result of the multiplication). To achieve the best performance, cache-blocking is employed to promote spatial and temporal locality. (A minimal sketch of this kind of kernel is appended at the end of this message.)

Since this portion of the code is quite critical, I've consistently tried to make sure its performance stays optimal. In fact, when the code is as fast as expected, the processor is fully utilizing its computing power, averaging around 4-5 clock cycles per coefficient multiplication on the three different Core2 processors I've tested the code with. This performance figure had been maintained consistently for at least a year, through various releases of GCC, until this problem arose.

I've tried playing around a bit with the optimization and -falign* switches, but I could not identify any concrete lead. The only possibly relevant clue is that the problem can be mitigated a bit by choosing the -Os optimization level instead of -O2.

To my non-expert eyes this looks like a case of a missed optimization (which is maybe re-enabled by the print to screen?), but at this point I am really at a loss. I would like to open a bug report, but first I wanted to understand whether there is something I'm completely missing.

Any help or comment would be really appreciated!

Thanks,
Francesco.

PS: if you reply, please CC me, as I'm not subscribed to the list.
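
PS2: for concreteness, here is a minimal, self-contained sketch of the kind of kernel I described above. All names and the exact blocking scheme are illustrative only, not the actual code from my project (which is templated and more involved), but the memory access pattern is essentially this:

---------------
#include <algorithm>
#include <cstddef>
#include <vector>

// Sparse univariate polynomial multiplication, blocked for cache locality.
// cf1/exp1 and cf2/exp2 are the coefficient/exponent vectors of the two
// input polynomials; retval must be zero-initialised and large enough to
// hold a coefficient at every exponent that can be produced.
void poly_multiply(const std::vector<double> &cf1,
                   const std::vector<long>   &exp1,
                   const std::vector<double> &cf2,
                   const std::vector<long>   &exp2,
                   std::vector<double>       &retval,
                   std::size_t               block_size)
{
    const std::size_t size1 = cf1.size(), size2 = cf2.size();
    // Iterate over blocks of the two input polynomials, so that the
    // region of retval being written to stays hot in cache.
    for (std::size_t i1 = 0; i1 < size1; i1 += block_size) {
        const std::size_t i1_end = std::min(size1, i1 + block_size);
        for (std::size_t i2 = 0; i2 < size2; i2 += block_size) {
            const std::size_t i2_end = std::min(size2, i2 + block_size);
            for (std::size_t i = i1; i < i1_end; ++i) {
                for (std::size_t j = i2; j < i2_end; ++j) {
                    // The sum of the exponents gives the position of the
                    // coefficient product in the result vector.
                    retval[static_cast<std::size_t>(exp1[i] + exp2[j])]
                        += cf1[i] * cf2[j];
                }
            }
        }
    }
}
---------------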