Re: cache optimization

Łukasz wrote:

--- On Thu, 11/26/09, Tim Prince <n8tm@xxxxxxx> wrote:

From: Tim Prince <n8tm@xxxxxxx>
Subject: Re: cache optimization
To: "Łukasz" <blurrpp@xxxxxxxxx>
Cc: gcc-help@xxxxxxxxxxx
Date: Thursday, November 26, 2009, 4:38 PM
Łukasz wrote:
Hi, I want to learn how to optimize cache usage with gcc.
I found the builtin function __builtin_prefetch, which should
prefetch data into the cache, so I used the canonical :) example
of vector addition.
for (i = 0; i < n; i++)
   {
     a[i] = a[i] + b[i];
     __builtin_prefetch (&a[i+1], 1, 1);
     __builtin_prefetch (&b[i+1], 0, 1);
     /* ... */
   }

and compiled it with gcc without special options ....
but it's slower than
for (i = 0; i < n; i++)
   {
     a[i] = a[i] + b[i];
     /* ... */
   }

so maybe I should compile it with some extra options
to take advantage of cache prefetching?
(-fprefetch-loop-arrays doesn't work)


Under normal settings, on CPUs of the last 6 years or so,
you are prefetching what has already been prefetched by
the hardware prefetcher.  If your search engine doesn't
find you many success stories about the use of this feature,
that might be a clue that it involves some serious
investigation. You would look for slow spots in your code
which don't fall in the usual hardware supported prefetch
patterns (linear access with not too large a stride, or
pairs of cache lines), and experiment with fetching the data
sufficiently far in advance for it to do some good, without
exceeding your cache capacity.
I do see a "success story" about prefetching for a reversed
loop. As the author doesn't divulge the CPU in use, one
suspects it might be something like the old Athlon32 which
supported hardware prefetch only in the forward direction.
Don't you like advice which assumes no one will ever use a
CPU different (e.g. more up to date) than the author's
favorite?


You are completely right; in this example the gcc compiler turns the code into a branch in assembly, which of course is already "predicted" (meaning forward NOT TAKEN, backward TAKEN), but I'm looking for a nice example for modern processors which would really work, i.e. speed up the program (I'm searching the net currently). In the Intel Optimization Reference Manual they advise using PREFETCH for any predictable memory access pattern, but as you already mentioned, some patterns the processor can predict by itself.


Prefetch intrinsics might be effective in a case such as

for(i=0; i<n; ++i)
   a[indx[i]] += b[indx[i]];

but, as another responder said, you must recognize that you will be fetching whole cache lines, or pairs of cache lines, and may need to issue them half a dozen iterations in advance. If you are running on an Atom (non-out-of-order), you might expect entirely different results, and a different optimum strategy, from a CPU with a large out-of-order queue.
