Hi Yury,

> So for N multiple of 512 there is very strong drop of performance.
> The question is - why and how to avoid it ?

I presume the 'why' is because the cache is thrashing. You can avoid it by using a cache-friendly calculation.

> In fact, there is a cache friendly solution, namely, to interchange the
> inner loops as:
>
> for ( i = 0; i < N; i++ )
>   for ( k = 0; k < N; k++ )
>     for ( j = 0; j < N; j++ )
>       c [ i * N + j ] += a [ i * N + k ] * b [ k * N + j ];

That solution worked for me: a 5x improvement in the general case, and a 40x improvement in the cache-thrashing multiple-of-512 case.

Even more improvement is possible by walking memory in such a way as to keep the cache as fresh as possible for as long as possible. (I believe the ACM has published many articles on how to walk memory for optimal cache use.)

> but a question remains -
> why an unoptimized code works fine, say, for N = 2100 ,
> but doesn't work for N = 2048, or, in general, for N multiply of 512?
> What is a magic number ?

The unoptimized code did not work fine for me: even in the general case it ran at one-fifth the speed of the cache-friendly solution. (Dual 2.7 GHz PowerPC G5.)

The take-away is that compiler optimization at -O3 is no substitute for an optimal algorithm. (And an optimal algorithm cannot compensate for a bad architecture... but that's a story for another day.) Profiling is also important, for identifying the routines that can benefit from being hand-optimized, or from being massaged to help the optimizer better do its job (even if it results in somewhat ugly code... please comment heavily for the poor maintenance programmer).

Sincerely,
--Eljay
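
P.S. For completeness, here is a minimal sketch of the two loop orders as self-contained C functions, so the comparison is easy to reproduce. The function names (matmul_ijk, matmul_ikj) and the zero-initialization of c are my own additions; the i-k-j interchange itself is exactly the one quoted above.

```c
#include <stddef.h>

/* Naive i-j-k order: the inner loop walks b down a column, striding
   N doubles per step. For large N each access is likely a cache miss,
   and when rows of b alias to the same cache sets (e.g. N a multiple
   of 512 with 8-byte doubles) the misses become pathological. */
void matmul_ijk(const double *a, const double *b, double *c, size_t n)
{
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n; j++) {
            double sum = 0.0;
            for (size_t k = 0; k < n; k++)
                sum += a[i * n + k] * b[k * n + j];
            c[i * n + j] = sum;
        }
}

/* Interchanged i-k-j order from the post: the inner loop walks both
   b and c row-wise with unit stride, so every fetched cache line is
   fully consumed before it is evicted. c must start zeroed because
   each element is accumulated across several k iterations. */
void matmul_ikj(const double *a, const double *b, double *c, size_t n)
{
    for (size_t i = 0; i < n * n; i++)
        c[i] = 0.0;
    for (size_t i = 0; i < n; i++)
        for (size_t k = 0; k < n; k++) {
            double aik = a[i * n + k];
            for (size_t j = 0; j < n; j++)
                c[i * n + j] += aik * b[k * n + j];
        }
}
```

Both versions compute the same product; only the memory access pattern differs, which is the whole point of the exercise.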