In the program attached at the end, each element of a large double array (N ~ 0.4M) is multiplied by each element of a small double array (n = 6), and the results are stored in a third, larger array (N*n elements). To make the problem clearly visible, this process is repeated 1000 times.

It looks like the storing part (the "=" operation) takes much longer than the multiplying part (the "*" operation). You can see this from the running times when the program contains only "line 1", only "line 2", or only "line 3". The corresponding time is given in the comment at the end of each line. I suspect this may be related to the L1 or L2 cache. Is there any way to speed up the storing part?

Thanks!

    /*
     * gcc -O2 -o TESTSPD speedtestOut.c
     */
    #include <stdio.h>
    #include <stdlib.h>

    #define N_LONG  (444752)
    #define N_SHORT 6
    #define NXI3    (N_LONG*N_SHORT)

    void testtime(void){
        long i, j, k, l_arr;
        double tmp;   /* used only when "line 1" is enabled */
        double vecL[N_LONG], vecS[N_SHORT];
        double *sum;

        sum = (double *) malloc(sizeof(double)*NXI3);
        if (sum == NULL) {
            return;   /* allocation failed */
        }

        for(i = 0; i < N_LONG; i++){
            vecL[i] = 1.0;
        }
        for(i = 0; i < N_SHORT; i++){
            vecS[i] = 2.0;
        }
        for(i = 0; i < NXI3; i++){
            sum[i] = 0.0;
        }

        /* Enable exactly one of line 1, line 2, line 3 per timing run. */
        for(k = 0; k < 1000; k++){
            for(j = 0; j < N_LONG; j++){
                l_arr = j*N_SHORT;
                for(i = 0; i < N_SHORT; i++){
                    /* line 1 */ //tmp = vecS[i]*vecL[j];            //2.730u 0.058s 0:02.79 99.6% 0+0k 0+0io 69pf+0w
                    /* line 2 */ sum[l_arr+i] += vecS[i]*vecL[j];    //28.652u 0.156s 0:28.88 99.7% 0+0k 0+0io 69pf+0w
                    /* line 3 */ //sum[l_arr+i] = vecS[i];           //25.613u 0.232s 0:25.84 100.0% 0+0k 0+0io 69pf+0w
                }
            }
        }/*k*/

        free(sum);
    }

    int main(void)
    {
        testtime();
        return 0;
    }
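For reference, the sum array holds 444752 * 6 = 2,668,512 doubles, which is 21,348,096 bytes (about 21 MB), so every pass of the k loop reads and writes about 21 MB. If the cache suspicion is right, one thing that might be tried is cache blocking: run all 1000 repetitions on a small block of rows before moving to the next block. The sketch below is only an illustration under that assumption. The function names accumulate_naive and accumulate_blocked and the block_rows parameter are hypothetical and not part of the original program, and reordering the loops like this only makes sense if the repetitions really are independent, as they are in this test.

```c
#include <assert.h>
#include <stddef.h>

/* Reference version: same loop order as the original test program.
 * Every repetition walks the whole sum array. */
void accumulate_naive(double *sum, const double *vecL, const double *vecS,
                      size_t n_long, size_t n_short, int reps)
{
    for (int k = 0; k < reps; k++)
        for (size_t j = 0; j < n_long; j++)
            for (size_t i = 0; i < n_short; i++)
                sum[j * n_short + i] += vecS[i] * vecL[j];
}

/* Blocked version (hypothetical): all repetitions are applied to one
 * block of block_rows rows before the next block is touched, so the
 * part of sum being updated stays small. Each element still receives
 * the same additions in the same order as in accumulate_naive. */
void accumulate_blocked(double *sum, const double *vecL, const double *vecS,
                        size_t n_long, size_t n_short, int reps,
                        size_t block_rows)
{
    assert(block_rows > 0);

    for (size_t j0 = 0; j0 < n_long; j0 += block_rows) {
        /* Clamp the last block to the end of the array. */
        size_t j1 = j0 + block_rows;
        if (j1 > n_long)
            j1 = n_long;

        for (int k = 0; k < reps; k++)
            for (size_t j = j0; j < j1; j++)
                for (size_t i = 0; i < n_short; i++)
                    sum[j * n_short + i] += vecS[i] * vecL[j];
    }
}
```

With the arrays from the program above, a call could look like accumulate_blocked(sum, vecL, vecS, N_LONG, N_SHORT, 1000, 512). With block_rows = 512, one block of sum is 512 * 6 * 8 = 24,576 bytes. Whether that fits in L1 or L2 depends on the CPU, so the block size would need to be tuned by measurement.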