Re: using vector extension in gcc slows down my code

Da Zheng <zhengda1936@xxxxxxxxx> · Thu, 11 Feb 2010 12:33:21 +0800



On 10-2-11 上午12:13, Brian Budge wrote:
> This makes a difference because the SSE unit can do two single loads,
> an add, and a store, and it can be easily pipelined.  The ratio of
> load/store to math is not ideal, but if you consider the amount of
> work to do 2 doubles instead (4 loads, 2 adds, and 2 stores), it's
> still beneficial.  You're also using unaligned loads and stores, which
> for some architectures is very bad, and is usually less good than
> aligned loads and stores.  Moreover, in your case, it's not just the
I wanted to use unaligned loads and stores, but I cannot find the corresponding
built-in functions for them.
> loads and stores, but all the integer math to calculate array indices,
It is necessary to calculate array indices in my code because the first 4 loads
in one iteration load data from the same array. In this case, I can reduce a lot
of cache miss.
> etc... as well as using unions, which doesn't allow the results to
> remain in registers, which makes for a not-very-optimal result.  Note
I tried not to use unions, but it seems the result is even a little worse. I
don't know why.
> that if you are running on 64-bit, you are likely using SSE in the
I'm not sure of it. I think it's still 32-bit. How can I see it?
> first version of your code, but its using the scalar path (only the
> first entry of each register).
> 
> The code is pretty confusing.  If I could understand what it's doing,
> I'd write you a version using the intel SSE intrinsics (see
> emmintrin.h and friends), that has a more appropriate data layout.
> Note that I'm simply assuming that this is possible, but there may be
> some valid reason why you cannot lay your data out in a SIMD-friendly
> way.
I rewrite it to simulate what I really want to do (see the code below) and hope
it can help you understand the logic of the code. The first 4 loads are from the
same array in order to save direct memory access (by doing so, it is more likely
that the data needed is already in the cache).

#define MATRIX_X 1000
#define MATRIX_Y 1000
        double *in, *in2, *out, *out2;
        int *bits;
        int v1, v2;
        struct timeval start_time, end_time;
        int startp1, startp2, startp3, startp4;

        startp1 = 1;
        startp2 = -1;
        startp3 = 1;
        startp4 = -1;
        in = malloc (MATRIX_X * MATRIX_Y * sizeof (double));
        in2 = malloc (MATRIX_X * MATRIX_Y * sizeof (double));
        out = malloc (MATRIX_X * MATRIX_Y * sizeof (double));
        out2 = malloc (MATRIX_X * MATRIX_Y * sizeof (double));
        bits = malloc ((MATRIX_X * MATRIX_Y / 32 + 1) * sizeof (int));
        for (v1 = 0; v1 < MATRIX_Y; v1++)
        {
                for (v2 = 0; v2 < MATRIX_X; v2++)
                {
                        double v;

                        v = in[(v1 + startp1 + MATRIX_Y) % MATRIX_Y * MATRIX_X +
v2];
                        v += in[(v1 + startp2 + MATRIX_Y) % MATRIX_Y * MATRIX_X
+ v2];
                        v += in[v1 * MATRIX_X + (v2 + startp3 + MATRIX_X) %
MATRIX_X];
                        v += in[v1 * MATRIX_X + (v2 + startp4 + MATRIX_X) %
MATRIX_X];
                        v *= (bits[(v1 * MATRIX_X + v2) / 32] >> (31 - (v1 *
MATRIX_X + v2) % 32)) & 1;
                        v *= 0.25;
                        v += in2[v1 * MATRIX_X + v2];
                        out[v1 * MATRIX_X + v2] = v;
                        out2[v1 * MATRIX_X + v2] = fabs(in[v1 * MATRIX_X + v2] - v);
                }
        }
I'll really appreciate it if you could write a version using SSE and teach me
how to have better data layout.

Best regards,
Zheng Da