Hi Tim, thanks for the reply.

On Wed, Oct 14, 2009 at 12:27 AM, Tim Prince <n8tm@xxxxxxx> wrote:
> Ian Lance Taylor wrote:
>>
>> In my experience, a performance drop in a tight loop when you remove a
>> line of code means that your loop is extremely sensitive to cache line
>> boundaries.  It can be difficult to find the optimal code other than
>> by testing various command line options.  Options to particularly test
>> are -falign-loops, -falign-labels, and -falign-jumps.
>
> That seems useful advice.  The align- options could help the hot loops fit
> the Loop Stream Detector criteria.  If you set -funroll-loops, you may
> exceed the loop size which fits the LSD on older CPUs, but you would often
> make the LSD unnecessary.

Blast it! -funroll-loops did the trick; the speed is now again within 5% of
the optimal performance. Just for the record, the flags I'm using right now
are:

-O2 -march=core2 -funroll-loops -fomit-frame-pointer

\o/

>>
>> Also, be sure that you are using a -mtune option appropriate for the
>> processor on which you are running.  E.g., you mention Core2, so you
>> should be using -mtune=core2.
>
> For the 64-bit compiler, the default may be better than core2, but for
> 32-bit you should be using at least -march=pentium-m.  If you are using
> the vectorizer, -mtune=barcelona could make a difference either way.
> How are you controlling which threads run on which cache, in case there
> are cache sharing considerations?

I've played a bit with the options, and -mtune=barcelona does seem to make
a small difference.

At the moment the code is single-threaded. I've been trying various
approaches to parallelize it but, the algorithm being so constrained by
memory bandwidth, I've yet to find a solution that gives a reasonable
speedup while keeping the overhead low.

But, are there portable ways of controlling which threads run on which
cache?

Thanks again very much!
Francesco.