Hi,

On 10-2-12 2:31 AM, Brian Budge wrote:
> Hi Zheng -
>
> Your code can be sped up significantly even without SSE. Most
> important is removing the modulo operators, which (unless gcc's
> optimizer is REALLY good) are definitely taking a lot of your time.
> (Note that doing things with SSE, or SIMD in general, might require
> a little bit of thinking in a different way.)
>
> You started with this:
>
>   for (v1 = 0; v1 < MATRIX_Y; v1++) {
>       for (v2 = 0; v2 < MATRIX_X; v2++) {
>           double v;
>
>           v = in[(v1 + startp1 + MATRIX_Y) % MATRIX_Y * MATRIX_X + v2];
>           v += in[(v1 + startp2 + MATRIX_Y) % MATRIX_Y * MATRIX_X + v2];
>           v += in[v1 * MATRIX_X + (v2 + startp3 + MATRIX_X) % MATRIX_X];
>           v += in[v1 * MATRIX_X + (v2 + startp4 + MATRIX_X) % MATRIX_X];
>           v *= (bits[(v1 * MATRIX_X + v2) / 32]
>                 >> (31 - (v1 * MATRIX_X + v2) % 32)) & 1;
>           v *= 0.25;
>           v += in2[v1 * MATRIX_X + v2];
>           out[v1 * MATRIX_X + v2] = v;
>           out2[v1 * MATRIX_X + v2] = fabs(in[v1 * MATRIX_X + v2] - v);
>       }
>   }
>
> The key is to realize that the code is very simple except at the
> boundary conditions (all around the outside of the matrix). This
> allows us to remove the modulos, and yields something like this:
>
>   inline void filterRow(double *oy, double *oy2, double *in2,
>                         double *ym1, double *y, double *yp1, int *b) {
>       double y_0 = *y;
>       double *yLoopEnd = ym1 + MATRIX_X - 1;
>
>       // first column: the left neighbour wraps to the end of the row
>       double l = *(y + MATRIX_X - 1);
>       double r = *(y + 1);
>       double d = *ym1++;
>       double u = *yp1++;
>
>       double v = l + r + d + u;
>       v *= 0.25 * *b++;
>       v += *in2++;
>       *oy++ = v;
>       *oy2++ = fabs(*y++ - v);
>
>       // interior columns: no wrap-around, hence no modulo
>       for (; ym1 != yLoopEnd; ++ym1) {
>           l = *(y - 1);
>           r = *(y + 1);
>           d = *ym1;     // incremented in the loop header
>           u = *yp1++;
>
>           v = l + r + d + u;
>           v *= 0.25 * *b++;
>           v += *in2++;
>           *oy++ = v;
>           *oy2++ = fabs(*y++ - v);
>       }
>
>       // last column: the right neighbour wraps back to the start of
>       // the row, which we saved in y_0
>       l = *(y - 1);
>       d = *ym1;
>       u = *yp1;
>
>       v = l + y_0 + d + u;
>       v *= 0.25 * *b;
>       v += *in2;
>       *oy = v;
>       *oy2 = fabs(*y - v);
>   }
>
>   double *y_0 = in;                             // saved: bottom row wraps here
>   double *ym1 = in + (MATRIX_Y - 1) * MATRIX_X; // top row wraps to the bottom
>   double *yp1 = in + MATRIX_X;
>
>   // note that I'm now assuming that "b" is bits precomputed and ready
>   // to use at each index... exercise left to reader
>   filterRow(out, out2, in2, ym1, in, yp1, b);
>
>   for (v1 = 0; v1 < MATRIX_Y - 2; ++v1) {
>       out += MATRIX_X;
>       out2 += MATRIX_X;
>       in2 += MATRIX_X;
>       ym1 = in;
>       in = yp1;
>       yp1 += MATRIX_X;
>       b += MATRIX_X;
>       filterRow(out, out2, in2, ym1, in, yp1, b);
>   }
>
>   // bottom row: its lower neighbour wraps back to the original top row
>   filterRow(out + MATRIX_X, out2 + MATRIX_X, in2 + MATRIX_X,
>             in, yp1, y_0, b + MATRIX_X);
>
> This code (which is untested and hasn't even been compiled, so don't
> expect it to work without some effort) should blow the socks off the
> previous version. Note with the new layout how easy it would be to
> load two adjacent elements in the inner loop. In fact, the compiler
> may be able to vectorize this code without any effort from you.
>
> This should get you started, but I don't want to give the whole
> answer away. Play with it for a while. Look at _mm_loadu_pd for
> unaligned loads. If you wanted to be clever, you could unroll the
> loop yourself and do half aligned loads and half unaligned.

I actually eliminated some of the % operations in the inner loop by
splitting it into three loops, quite similar to what you did in
filterRow() (my code is supposed to be generated by a compiler, so I
don't expect it to be as efficient as hand-written code). I didn't
paste it because I thought it would make the logic of my code harder
to follow; the sketch below gives the idea.
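Roughly, the split looks like this. This is an illustrative sketch,
not my actual generated code: it drops into the same context as the
original loop (same in, in2, out, out2, bits, v1, v2), and it assumes
startp3 == -1 and startp4 == +1, i.e. the left and right neighbours:

    for (v1 = 0; v1 < MATRIX_Y; v1++) {
        /* the row wrap-around depends only on v1, so it can be
           hoisted out of the inner loop entirely */
        int up   = (v1 + startp1 + MATRIX_Y) % MATRIX_Y * MATRIX_X;
        int down = (v1 + startp2 + MATRIX_Y) % MATRIX_Y * MATRIX_X;
        int row  = v1 * MATRIX_X;
        int last = row + MATRIX_X - 1;
        double v;

        /* v2 == 0: only here does the left neighbour wrap around */
        v = in[up] + in[down] + in[last] + in[row + 1];
        v *= (bits[row / 32] >> (31 - row % 32)) & 1;
        v = v * 0.25 + in2[row];
        out[row] = v;
        out2[row] = fabs(in[row] - v);

        /* 0 < v2 < MATRIX_X-1: the hot loop, no wrap-around at all */
        for (v2 = 1; v2 < MATRIX_X - 1; v2++) {
            int i = row + v2;
            v = in[up + v2] + in[down + v2] + in[i - 1] + in[i + 1];
            v *= (bits[i / 32] >> (31 - i % 32)) & 1;
            v = v * 0.25 + in2[i];
            out[i] = v;
            out2[i] = fabs(in[i] - v);
        }

        /* v2 == MATRIX_X-1: only here does the right neighbour wrap */
        v = in[up + MATRIX_X - 1] + in[down + MATRIX_X - 1]
            + in[last - 1] + in[row];
        v *= (bits[last / 32] >> (31 - last % 32)) & 1;
        v = v * 0.25 + in2[last];
        out[last] = v;
        out2[last] = fabs(in[last] - v);
    }

The remaining / 32 and % 32 on the bits index are by powers of two,
so gcc turns them into shifts and masks anyway; only the wrap-around
% on the row and column indices was expensive, and it is now hoisted
per row or pushed out to the first and last columns.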
What's kind of strange is that I don't see much speedup on an Intel
dual core, but there is a big speedup on AMD chips. I didn't expect
Intel and AMD chips to differ so much.

I also finally understand why there are no built-in functions for
aligned loads. The code

    v2df *vp1 = (v2df *) ((double *) v25.valp);
    v2df v = vp1[v2 / STRIDE_LEN];

already does the work. I tried that at the beginning, but I hadn't
aligned the data and got a segmentation fault, so I thought I had to
use some built-in function to load the data. I use aligned loads now,
and do see some speedup.

Best regards,
Zheng Da
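P.S. In case it helps anyone else who hits the same segmentation
fault: the missing piece was aligning the allocation itself. A
minimal, self-contained sketch (not my actual code; it assumes gcc's
vector_size extension, and posix_memalign is just one way to get the
alignment; the buffer name and sizes are made up for illustration):

    #include <stdlib.h>

    #define MATRIX_X 1024
    #define MATRIX_Y 1024

    /* gcc vector extension: a pair of doubles, 16 bytes wide */
    typedef double v2df __attribute__ ((vector_size (16)));

    int main(void)
    {
        double *buf;

        /* the backing store must be 16-byte aligned, otherwise the
           aligned load (movapd) that gcc emits for vp1[i] faults */
        if (posix_memalign((void **) &buf, 16,
                           MATRIX_X * MATRIX_Y * sizeof(double)) != 0)
            return 1;
        buf[0] = buf[1] = 0.0;

        v2df *vp1 = (v2df *) buf;
        v2df v = vp1[0];   /* plain indexing is already the aligned
                              load; no built-in function needed */
        (void) v;

        free(buf);
        return 0;
    }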