Re: prefetching on pentium 4

Tim Prince <timothyprince@xxxxxxxxxxxxx> · Tue, 05 Dec 2006 12:34:46 -0800

ranjith kumar wrote:
--- Tim Prince <timothyprince@xxxxxxxxxxxxx> wrote:

ranjith kumar wrote:
Hi,

   1) Will "gcc" insert prefetch instructions automatically on a
"pentium 4" processor? Which flags should be enabled while compiling
so that gcc automatically inserts prefetch instructions?

   2) Or does the programmer have to call some functions?
If so, what is the syntax of those functions?

P4 isn't suitable for automatic compiler-generated prefetch.  Default
hardware prefetch (stride-based and cache line pairs) is quite
effective.

Prefetch intrinsics are available with #include <xmmintrin.h>.  Details
on what works vary with steppings.  The earliest P4 models could
accelerate hardware prefetch by the program issuing 3 cache lines of
prefetch prior to entering a loop.  Since Northwood, that doesn't work.
Since Prescott, prefetch hints are ignored on P4, with prefetch going to
L2 regardless of hints.  Effect of prefetch on DTLB misses also is model
dependent.
Contrary to what certain Windows related docs say, _mm_prefetch() works
the same on all compilers which implement it.
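
For illustration only, here is a minimal sketch of calling the intrinsic
from a simple loop. The function name sum_with_prefetch and the distance
PREFETCH_AHEAD are hypothetical and not from this thread; _mm_prefetch
and _MM_HINT_T0 come from <xmmintrin.h>. Using the named hint constant
avoids depending on a particular header's numeric encoding. Whether this
helps at all depends on the P4 model, as noted above, so treat it as a
syntax example rather than a recommended optimization.

```c
#include <stddef.h>
#include <xmmintrin.h>  /* _mm_prefetch, _MM_HINT_T0, _MM_HINT_NTA */

/* Hypothetical prefetch distance, in elements; tune by measurement. */
#define PREFETCH_AHEAD 16

/* Sums n longs, hinting the element PREFETCH_AHEAD positions ahead.
 * The hint is only a request; the hardware is free to ignore it. */
long sum_with_prefetch(const long *data, size_t n)
{
    long sum = 0;
    for (size_t i = 0; i < n; i++) {
        /* Stay inside the array when forming the prefetch address. */
        if (i + PREFETCH_AHEAD < n)
            _mm_prefetch((const char *)&data[i + PREFETCH_AHEAD],
                         _MM_HINT_T0);
        sum += data[i];
    }
    return sum;
}
```

Swapping _MM_HINT_T0 for _MM_HINT_NTA selects the non-temporal variant;
the result of the function is the same either way, only the timing may
differ.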

Hi,
   1) What is the difference between the "prefetchnta" and
"prefetchT1" instructions in the case of the pentium 4 processor?
   In the IA-32 optimization manual it is stated that prefetchT0,
prefetchT1 and prefetchT2 are identical in the case of the pentium 4
processor. Also, prefetchnta fetches the data into the second level
cache "minimizing cache pollution". What does "minimizing cache
pollution" mean?
Schemes for fetching directly to L1 have generally been abandoned in 
favor of waiting until the hardware requests data.  L1 isn't big enough 
to handle extra data brought in "speculatively" with enough advance 
notice to handle L2 misses.

When I compared two programs, first one prefetching
data using "prefetchnta" and the second one using
"prefetchT0", I observed that second program was
executed faster. What could be the reason?
I don't have the experience to comment on that, and it may well depend 
on which type of P4 you have.  Maybe your data are resident in L2 and 
you have a P4 model which benefits from prefetching them into L1.
You don't even say which compiler you are using or why you wouldn't try 
vectorization if you were serious about "real-world" performance. 
Integer multiply on P4 (at least the early ones) is so slow it's hard to 
imagine much value for other optimizations.  You might do better on the 
caching side by reorganizing your data than by using prefetch.
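
As a rough sketch of what reorganizing the data could look like, assume
the loop only reads the five integer fields of the posted struct. Moving
those into their own compact array means each cache line holds more
useful data and fewer lines are touched per iteration. The names hot,
cold and sum_products are hypothetical, and the long double fields are
collapsed into arrays purely for brevity.

```c
#include <stddef.h>

/* Hypothetical hot/cold split; names are illustrative only. */
struct hot  { long a, b, c, d, e; };             /* fields the loop reads */
struct cold { long double w[10], e[4], b[6]; };  /* everything else */

/* Accumulates the product of the five hot fields of each element.
 * Only the compact hot array is walked, so the large long double
 * fields never enter the cache on behalf of this loop. */
long sum_products(const struct hot *h, size_t n)
{
    long total = 0;
    for (size_t i = 0; i < n; i++)
        total += h[i].a * h[i].b * h[i].c * h[i].d * h[i].e;
    return total;
}
```

With the two arrays kept in parallel (element i of each describing the
same record), code that needs the long double fields can still reach
them by index.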

p.s: Below is the program which uses "prefetchT0". To get the program
which uses "prefetchnta", pass 0 as the second argument to the
_mm_prefetch() call. I ran them on a pentium4 processor with the fedora
core operating system.

#include <stdio.h>
#include <xmmintrin.h>

int main()
{
        int i, j, k, h;
        struct list
        {
                long double w, w1, u, u1, x, x1, y, y1, z, z1;
                long double e1, e2, e3, e4;
                long double b1, b2, b3, b4, b5, b6;
                long int a, b, c, d, e;
        } l[5000];      /* note: the array is never initialized */

        int total = 0;  /* was uninitialized in the original post */

        for (h = 0; h < 9; h++)
        for (j = 0; j < 99999; j++)
        for (i = 0; i < 1000; i++)
        {
                //k=rand()%500;
                /* With GCC's xmmintrin.h, 3 is _MM_HINT_T0 (prefetcht0)
                   and 0 is _MM_HINT_NTA (prefetchnta). */
                _mm_prefetch((&l[(i+2)].a), 3);

                total += (l[i*5].a) * (l[i*5].b) * (l[i*5].c)
                       * (l[i*5].d) * (l[i*5].e);

                //printf("\n %d ",total);
        }

        printf("\n %d ", total);
        return 0;
}
