Hi Tim, thanks for the reply.

On Wed, Oct 14, 2009 at 12:27 AM, Tim Prince <n8tm@xxxxxxx> wrote:
> Ian Lance Taylor wrote:
>>
>> In my experience, a performance drop in a tight loop when you remove a
>> line of code means that your loop is extremely sensitive to cache line
>> boundaries.  It can be difficult to find the optimal code other than
>> by testing various command line options.  Options to particularly test
>> are -falign-loops, -falign-labels, and -falign-jumps.
>
> That seems useful advice.  The align- options could help the hot loops fit
> the Loop Stream Detector criteria.  If you set -funroll-loops, you may
> exceed the loop size which fits the LSD on older CPUs, but you would often
> make the LSD unnecessary.

Blast it! -funroll-loops did the trick; the speed is now again within 5% of
the optimal performance. Just for the record, the flags I'm using right now
are:

-O2 -march=core2 -funroll-loops -fomit-frame-pointer

\o/

>>
>> Also, be sure that you are using a -mtune option appropriate for the
>> processor on which you are running.  E.g., you mention Core2, so you
>> should be using -mtune=core2.
>
> For the 64-bit compiler, the default may be better than core2, but for
> 32-bit you should be using at least -march=pentium-m.  If you are using
> the vectorizer, -mtune=barcelona could make a difference either way.
> How are you controlling which threads run on which cache, in case there
> are cache sharing considerations?

I've played a bit with the options, and -mtune=barcelona does seem to make
a small difference.

At the moment the code is single-threaded. I've been trying various
approaches to parallelize it but, the algorithm being so constrained by
memory bandwidth, I've yet to find a solution that gives a reasonable
speedup while keeping the overhead low.

But, are there portable ways of controlling which threads run on which
cache?

Thanks again very much!
Francesco.