Łukasz wrote:
Hi, I want to learn how to optimize cache usage in gcc. I found the builtin function __builtin_prefetch, which should prefetch data into the cache, so I used the canonical :) example of vector addition:
for (i = 0; i < n; i++)
{
    a[i] = a[i] + b[i];
    __builtin_prefetch (&a[i+1], 1, 1);
    __builtin_prefetch (&b[i+1], 0, 1);
    /* ... */
}
and compiled it with gcc without any special options, but it's slower than
for (i = 0; i < n; i++)
{
    a[i] = a[i] + b[i];
    /* ... */
}
So maybe I should compile it with some extra options to take advantage of cache prefetching? (-fprefetch-loop-arrays doesn't work.)
Under normal settings, on CPUs of the last 6 years or so, you are
prefetching what has already been prefetched by the hardware prefetcher. If
your search engine doesn't find you many success stories about the use
of this feature, that might be a clue that it involves some serious
investigation. You would look for slow spots in your code which don't
fall in the usual hardware-supported prefetch patterns (linear access
with not too large a stride, or pairs of cache lines), and experiment
with fetching the data sufficiently far in advance for it to do some
good, without exceeding your cache capacity.
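
To make that concrete, here is a rough sketch of the kind of loop where a
software prefetch at least has a chance: an indirect (gather) access, which
does not follow the linear patterns described above. The function name
gather_sum and the PREFETCH_DISTANCE value of 16 are made up for
illustration; the distance in particular is a guess that would have to be
tuned by measurement on the actual CPU, and this may still turn out no
faster than the plain loop.

```c
#include <stddef.h>

/* Hypothetical prefetch distance, in loop iterations. This is an assumed
   starting point only; the useful value depends on the CPU and the data. */
#define PREFETCH_DISTANCE 16

/* Sum values[idx[i]] for i in [0, n). The accesses to values[] are driven
   by idx[], so they need not be linear in memory. */
double gather_sum(const double *values, const size_t *idx, size_t n)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; i++) {
        /* Guard so that idx[] is never read past its end. */
        if (i + PREFETCH_DISTANCE < n) {
            /* Second argument 0 = prefetch for read;
               third argument 1 = low temporal locality. */
            __builtin_prefetch(&values[idx[i + PREFETCH_DISTANCE]], 0, 1);
        }
        sum += values[idx[i]];
    }
    return sum;
}
```

Time it against the same loop with the __builtin_prefetch line removed,
at the optimization level you actually build with, before deciding whether
to keep it.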
I do see a "success story" about prefetching for a reversed loop. As the
author doesn't divulge the CPU in use, one suspects it might be
something like the old Athlon32 which supported hardware prefetch only
in the forward direction. Don't you like advice which assumes no one
will ever use a CPU different (e.g. more up to date) than the author's
favorite?