Hi Yury,

> So for N multiple of 512 there is very strong drop of performance.
> The question is - why and how to avoid it ?

I presume the 'why' is because the cache is thrashing. You can avoid it by using a cache-friendly calculation.

> In fact, there is a cache friendly solution, namely, to interchange the
> inner loops as:
>
> for ( i = 0; i < N; i++ )
>   for ( k = 0; k < N; k++ )
>     for ( j = 0; j < N; j++ )
>       c [ i * N + j ] += a [ i * N + k ] * b [ k * N + j ];

That solution worked for me: a 5x improvement in the general case, and a 40x improvement in the cache-thrashing multiple-of-512 case.

Even more improvement is possible by walking memory in such a way as to keep the cache as fresh as possible for as long as possible. (I believe the ACM has published many articles on how to walk memory for optimal cache use.)

> but a question remains -
> why an unoptimized code works fine, say, for N = 2100 ,
> but doesn't work for N = 2048, or, in general, for N multiply of 512?
> What is a magic number ?

The unoptimized code did not work fine for me: even in the general case it ran at one-fifth the speed of the cache-friendly solution. (Dual 2.7 GHz PowerPC G5.)

The take-away is that compiler optimization at -O3 is no substitute for an optimal algorithm. (And an optimal algorithm cannot compensate for a bad architecture... but that's a story for another day.) Profiling is also important, for identifying the routines that can benefit from being hand-optimized, or from being massaged to help the optimizer better do its job (even if it results in somewhat ugly code... please comment heavily for the poor maintenance programmer).

Sincerely,
--Eljay
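
P.S. For completeness, here is a minimal sketch of the two loop orders as self-contained C functions, so the comparison is easy to reproduce. The function names (matmul_ijk, matmul_ikj) and the zero-initialization of c are my own additions; the i-k-j interchange itself is exactly the one quoted above.

```c
#include <stddef.h>

/* Naive i-j-k order: the inner loop walks b down a column, striding
   N doubles per step. For large N each access is likely a cache miss,
   and when rows of b alias to the same cache sets (e.g. N a multiple
   of 512 with 8-byte doubles) the misses become pathological. */
void matmul_ijk(const double *a, const double *b, double *c, size_t n)
{
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n; j++) {
            double sum = 0.0;
            for (size_t k = 0; k < n; k++)
                sum += a[i * n + k] * b[k * n + j];
            c[i * n + j] = sum;
        }
}

/* Interchanged i-k-j order from the post: the inner loop walks both
   b and c row-wise with unit stride, so every fetched cache line is
   fully consumed before it is evicted. c must start zeroed because
   each element is accumulated across several k iterations. */
void matmul_ikj(const double *a, const double *b, double *c, size_t n)
{
    for (size_t i = 0; i < n * n; i++)
        c[i] = 0.0;
    for (size_t i = 0; i < n; i++)
        for (size_t k = 0; k < n; k++) {
            double aik = a[i * n + k];
            for (size_t j = 0; j < n; j++)
                c[i * n + j] += aik * b[k * n + j];
        }
}
```

Both versions compute the same product; only the memory access pattern differs, which is the whole point of the exercise.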