On Sat, 16 Jan 2010, Thomas Witzel wrote:
I'm stuck on some silly issue and I'm hoping there is a simple solution to it. I have a piece of code that does nothing but perform a very large number of products between std::complex<float> values and some float values in a loop. Using gcc-4.1.2 and gcc-4.2.4 my standard test case runs for about 7:25 minutes and 6:50 minutes on 3.0 GHz Penryn CPUs (single-threaded); however, when using gcc-4.3.4 or gcc-4.4.2 or even the svn version, my run-time is > 40 minutes, which is a serious drop in performance. For this test I reduced all compiler options down to -O3 only. Now, I looked a bit at the assembly code produced, and two things are apparent. First, gcc-4.3 and newer versions produce assembly code about twice as long as the older gcc versions. Second, gcc-4.1 and 4.2 write out all the multiplications in SSE code, while 4.3 and newer call a routine named __mulsc3. Has anybody ever encountered such a performance drop, and does anyone know whether there is a compiler flag or something to get my performance back?
Try -ffast-math (there may be less aggressive flags but that's the direction to look into). To perfectly respect the standard definition of complex multiplication, one has to jump through hoops...
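To illustrate the "hoops": C99 Annex G treats a complex value with an infinite part as an infinity, and expects the product of an infinity and a finite nonzero value to be an infinity again. The textbook formula can return NaN in both parts instead, which is why a checking routine such as __mulsc3 gets called. Here is a minimal sketch of that textbook formula; naive_mul is a hypothetical name used only for illustration, not something from libstdc++ or libgcc.

```cpp
#include <cmath>
#include <complex>
#include <limits>

// Hypothetical helper (illustration only): the textbook product
// (a+bi)(c+di) = (ac - bd) + (ad + bc)i, with no NaN/Inf recovery step.
inline std::complex<float> naive_mul(std::complex<float> x,
                                     std::complex<float> y) {
    const float a = x.real(), b = x.imag();
    const float c = y.real(), d = y.imag();
    return std::complex<float>(a * c - b * d, a * d + b * c);
}

// For x = (inf, inf) and y = (1, 0) the terms inf*0 are NaN, so both parts
// of the naive result are NaN, where Annex G expects an infinite result.
```

As far as I know, the less aggressive flag in this direction is -fcx-limited-range, which -ffast-math also turns on; it tells gcc to skip the NaN check after complex multiplication. Whether it is enough to restore the old timings for this particular loop would need to be measured.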
Now even with -ffast-math, I am surprised to see that float*complex generates 4 multiplications. You could look through Bugzilla to see whether anything has been filed about what looks like a missed optimization.
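Until that is sorted out, one possible workaround is to scale the real and imaginary parts yourself, which needs only 2 multiplications per product. This is a rough sketch, not a tested fix for your code: scale and weighted_sum are hypothetical names, and the accumulation loop is only my guess at what a loop like yours might look like.

```cpp
#include <cassert>
#include <complex>
#include <cstddef>
#include <vector>

// Hypothetical helper: multiply a complex value by a real scalar using
// exactly two real multiplications (one per component).
inline std::complex<float> scale(const std::complex<float>& z, float s) {
    return std::complex<float>(z.real() * s, z.imag() * s);
}

// Assumed loop shape (illustration only): accumulate z[i] * w[i].
inline std::complex<float> weighted_sum(
        const std::vector<std::complex<float> >& z,
        const std::vector<float>& w) {
    assert(z.size() == w.size());
    std::complex<float> acc(0.0f, 0.0f);
    for (std::size_t i = 0; i < z.size(); ++i) {
        acc += scale(z[i], w[i]);  // component-wise, no full complex product
    }
    return acc;
}
```

Note that this differs from promoting the float to a complex with zero imaginary part when infinities or NaNs are involved, so it is only appropriate if your data is finite.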
-- Marc Glisse