Hi,

On 10-2-12 2:31 AM, Brian Budge wrote:
> Hi Zheng -
>
> Your code can be sped up significantly even without SSE. Most
> important is removing the modulo operators, which (unless gcc's
> optimizer is REALLY good) are definitely taking a lot of your time.
> (Note that doing things with SSE, or SIMD in general, might require
> a little bit of thinking in a different way.)
>
> You started with this:
>
>   for (v1 = 0; v1 < MATRIX_Y; v1++) {
>       for (v2 = 0; v2 < MATRIX_X; v2++) {
>           double v;
>
>           v = in[(v1 + startp1 + MATRIX_Y) % MATRIX_Y * MATRIX_X + v2];
>           v += in[(v1 + startp2 + MATRIX_Y) % MATRIX_Y * MATRIX_X + v2];
>           v += in[v1 * MATRIX_X + (v2 + startp3 + MATRIX_X) % MATRIX_X];
>           v += in[v1 * MATRIX_X + (v2 + startp4 + MATRIX_X) % MATRIX_X];
>           v *= (bits[(v1 * MATRIX_X + v2) / 32]
>                 >> (31 - (v1 * MATRIX_X + v2) % 32)) & 1;
>           v *= 0.25;
>           v += in2[v1 * MATRIX_X + v2];
>           out[v1 * MATRIX_X + v2] = v;
>           out2[v1 * MATRIX_X + v2] = fabs(in[v1 * MATRIX_X + v2] - v);
>       }
>   }
>
> The key is to realize that the code is very simple except at the
> boundary conditions (all around the outside of the matrix). This
> allows us to remove the modulos, and yields something like this:
>
>   inline void filterRow(double *oy, double *oy2, double *in2,
>                         double *ym1, double *y, double *yp1, int *b) {
>       double y_0 = *y;
>       double *yLoopEnd = ym1 + MATRIX_X - 1;
>
>       // first column: the left neighbour wraps to the end of the row
>       double l = *(y + MATRIX_X - 1);
>       double r = *(y + 1);
>       double d = *ym1++;
>       double u = *yp1++;
>
>       double v = l + r + d + u;
>       v *= 0.25 * *b++;
>       v += *in2++;
>       *oy++ = v;
>       *oy2++ = fabs(*y++ - v);
>
>       // interior columns: no wrap-around, hence no modulo
>       for (; ym1 != yLoopEnd; ++ym1) {
>           l = *(y - 1);
>           r = *(y + 1);
>           d = *ym1;     // incremented in the loop header
>           u = *yp1++;
>
>           v = l + r + d + u;
>           v *= 0.25 * *b++;
>           v += *in2++;
>           *oy++ = v;
>           *oy2++ = fabs(*y++ - v);
>       }
>
>       // last column: the right neighbour wraps back to the start of
>       // the row, which we saved in y_0
>       l = *(y - 1);
>       d = *ym1;
>       u = *yp1;
>
>       v = l + y_0 + d + u;
>       v *= 0.25 * *b;
>       v += *in2;
>       *oy = v;
>       *oy2 = fabs(*y - v);
>   }
>
>   double *y_0 = in;                             // saved: bottom row wraps here
>   double *ym1 = in + (MATRIX_Y - 1) * MATRIX_X; // top row wraps to the bottom
>   double *yp1 = in + MATRIX_X;
>
>   // note that I'm now assuming that "b" is bits precomputed and ready
>   // to use at each index... exercise left to reader
>   filterRow(out, out2, in2, ym1, in, yp1, b);
>
>   for (v1 = 0; v1 < MATRIX_Y - 2; ++v1) {
>       out += MATRIX_X;
>       out2 += MATRIX_X;
>       in2 += MATRIX_X;
>       ym1 = in;
>       in = yp1;
>       yp1 += MATRIX_X;
>       b += MATRIX_X;
>       filterRow(out, out2, in2, ym1, in, yp1, b);
>   }
>
>   // bottom row: its lower neighbour wraps back to the original top row
>   filterRow(out + MATRIX_X, out2 + MATRIX_X, in2 + MATRIX_X,
>             in, yp1, y_0, b + MATRIX_X);
>
> This code (which is untested and hasn't even been compiled, so don't
> expect it to work without some effort) should blow the socks off the
> previous version. Note with the new layout how easy it would be to
> load two adjacent elements in the inner loop. In fact, the compiler
> may be able to vectorize this code without any effort from you.
>
> This should get you started, but I don't want to give the whole
> answer away. Play with it for a while. Look at _mm_loadu_pd for
> unaligned loads. If you wanted to be clever, you could unroll the
> loop yourself and do half aligned loads and half unaligned.

I actually eliminated some of the % operations in the inner loop by
splitting it into three loops, quite similar to what you did in
filterRow() (my code is supposed to be generated by a compiler, so I
don't expect it to be as efficient as hand-written code). I didn't
paste it because I thought it would make the logic of my code harder
to follow; the sketch below gives the idea.
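Roughly, the split looks like this. This is an illustrative sketch,
not my actual generated code: it drops into the same context as the
original loop (same in, in2, out, out2, bits, v1, v2), and it assumes
startp3 == -1 and startp4 == +1, i.e. the left and right neighbours:

    for (v1 = 0; v1 < MATRIX_Y; v1++) {
        /* the row wrap-around depends only on v1, so it can be
           hoisted out of the inner loop entirely */
        int up   = (v1 + startp1 + MATRIX_Y) % MATRIX_Y * MATRIX_X;
        int down = (v1 + startp2 + MATRIX_Y) % MATRIX_Y * MATRIX_X;
        int row  = v1 * MATRIX_X;
        int last = row + MATRIX_X - 1;
        double v;

        /* v2 == 0: only here does the left neighbour wrap around */
        v = in[up] + in[down] + in[last] + in[row + 1];
        v *= (bits[row / 32] >> (31 - row % 32)) & 1;
        v = v * 0.25 + in2[row];
        out[row] = v;
        out2[row] = fabs(in[row] - v);

        /* 0 < v2 < MATRIX_X-1: the hot loop, no wrap-around at all */
        for (v2 = 1; v2 < MATRIX_X - 1; v2++) {
            int i = row + v2;
            v = in[up + v2] + in[down + v2] + in[i - 1] + in[i + 1];
            v *= (bits[i / 32] >> (31 - i % 32)) & 1;
            v = v * 0.25 + in2[i];
            out[i] = v;
            out2[i] = fabs(in[i] - v);
        }

        /* v2 == MATRIX_X-1: only here does the right neighbour wrap */
        v = in[up + MATRIX_X - 1] + in[down + MATRIX_X - 1]
            + in[last - 1] + in[row];
        v *= (bits[last / 32] >> (31 - last % 32)) & 1;
        v = v * 0.25 + in2[last];
        out[last] = v;
        out2[last] = fabs(in[last] - v);
    }

The remaining / 32 and % 32 on the bits index are by powers of two,
so gcc turns them into shifts and masks anyway; only the wrap-around
% on the row and column indices was expensive, and it is now hoisted
per row or pushed out to the first and last columns.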
What's kind of strange is that I don't see much speedup on an Intel
dual core, but there is a big speedup on AMD chips. I didn't expect
Intel and AMD chips to differ so much.

I also finally understand why there are no built-in functions for
aligned loads. The code

    v2df *vp1 = (v2df *) ((double *) v25.valp);
    v2df v = vp1[v2 / STRIDE_LEN];

already does the work. I tried that at the beginning, but I hadn't
aligned the data and got a segmentation fault, so I thought I had to
use some built-in function to load the data. I use aligned loads now,
and do see some speedup.

Best regards,
Zheng Da
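P.S. In case it helps anyone else who hits the same segmentation
fault: the missing piece was aligning the allocation itself. A
minimal, self-contained sketch (not my actual code; it assumes gcc's
vector_size extension, and posix_memalign is just one way to get the
alignment; the buffer name and sizes are made up for illustration):

    #include <stdlib.h>

    #define MATRIX_X 1024
    #define MATRIX_Y 1024

    /* gcc vector extension: a pair of doubles, 16 bytes wide */
    typedef double v2df __attribute__ ((vector_size (16)));

    int main(void)
    {
        double *buf;

        /* the backing store must be 16-byte aligned, otherwise the
           aligned load (movapd) that gcc emits for vp1[i] faults */
        if (posix_memalign((void **) &buf, 16,
                           MATRIX_X * MATRIX_Y * sizeof(double)) != 0)
            return 1;
        buf[0] = buf[1] = 0.0;

        v2df *vp1 = (v2df *) buf;
        v2df v = vp1[0];   /* plain indexing is already the aligned
                              load; no built-in function needed */
        (void) v;

        free(buf);
        return 0;
    }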