On 10-2-11 12:13 AM, Brian Budge wrote:
> This makes a difference because the SSE unit can do two single loads,
> an add, and a store, and it can be easily pipelined. The ratio of
> load/store to math is not ideal, but if you consider the amount of
> work to do 2 doubles instead (4 loads, 2 adds, and 2 stores), it's
> still beneficial. You're also using unaligned loads and stores, which
> for some architectures is very bad, and is usually less good than
> aligned loads and stores. Moreover, in your case, it's not just the

I wanted to use unaligned loads and stores, but I cannot find the corresponding built-in functions for them.

> loads and stores, but all the integer math to calculate array indices,

The array index calculations are necessary in my code because the first 4 loads in each iteration read from the same array. This way I can avoid a lot of cache misses.

> etc... as well as using unions, which doesn't allow the results to
> remain in registers, which makes for a not-very-optimal result. Note

I tried not using unions, but the result seems to be even a little worse. I don't know why.

> that if you are running on 64-bit, you are likely using SSE in the

I'm not sure about that. I think it's still 32-bit. How can I check?

> first version of your code, but it's using the scalar path (only the
> first entry of each register).
>
> The code is pretty confusing. If I could understand what it's doing,
> I'd write you a version using the intel SSE intrinsics (see
> emmintrin.h and friends), that has a more appropriate data layout.
> Note that I'm simply assuming that this is possible, but there may be
> some valid reason why you cannot lay your data out in a SIMD-friendly
> way.

I have rewritten it to simulate what I really want to do (see the code below), and I hope it helps you understand the logic. The first 4 loads are from the same array in order to reduce accesses to main memory (this makes it more likely that the data needed is already in the cache).
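For reference on the two questions above: emmintrin.h exposes unaligned double-precision loads and stores as _mm_loadu_pd and _mm_storeu_pd (the aligned variants, _mm_load_pd and _mm_store_pd, require 16-byte-aligned addresses), and the pointer size tells whether a build is 32-bit or 64-bit. The following is only a minimal sketch, assuming GCC on x86 with SSE2 enabled (-msse2 on 32-bit builds). The names add_arrays_unaligned and pointer_bits are made up for illustration and do not come from the original code.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics: __m128d, _mm_loadu_pd, _mm_storeu_pd */
#include <limits.h>     /* CHAR_BIT */
#include <stddef.h>

/* Hypothetical helper: dst[i] = a[i] + b[i] for n doubles, two at a time.
   _mm_loadu_pd and _mm_storeu_pd accept any address, aligned or not. */
static void add_arrays_unaligned(const double *a, const double *b,
                                 double *dst, size_t n)
{
    size_t i = 0;

    assert(a != NULL && b != NULL && dst != NULL);
    for (; i + 2 <= n; i += 2) {
        __m128d va = _mm_loadu_pd(a + i);
        __m128d vb = _mm_loadu_pd(b + i);
        _mm_storeu_pd(dst + i, _mm_add_pd(va, vb));
    }
    for (; i < n; i++)          /* scalar tail when n is odd */
        dst[i] = a[i] + b[i];
}

/* Hypothetical helper: pointer width of this build in bits.
   With GCC on x86, the macro __x86_64__ is also defined only in 64-bit builds. */
static int pointer_bits(void)
{
    return (int)(sizeof(void *) * CHAR_BIT);
}
```

Printing pointer_bits() from the benchmark program should show whether it was compiled as a 32-bit or a 64-bit binary.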

    #include <math.h>       /* fabs */
    #include <stdlib.h>     /* malloc */
    #include <sys/time.h>   /* struct timeval */

    #define MATRIX_X 1000
    #define MATRIX_Y 1000

    int main (void)
    {
      double *in, *in2, *out, *out2;
      int *bits;
      int v1, v2;
      struct timeval start_time, end_time;  /* for timing; unused in this excerpt */
      int startp1, startp2, startp3, startp4;

      /* Offsets of the four neighbours: row +1, row -1, column +1, column -1. */
      startp1 = 1;
      startp2 = -1;
      startp3 = 1;
      startp4 = -1;

      /* Note: the buffers are not initialized in this excerpt. */
      in = malloc (MATRIX_X * MATRIX_Y * sizeof (double));
      in2 = malloc (MATRIX_X * MATRIX_Y * sizeof (double));
      out = malloc (MATRIX_X * MATRIX_Y * sizeof (double));
      out2 = malloc (MATRIX_X * MATRIX_Y * sizeof (double));
      /* One bit per matrix element, packed 32 to an int. */
      bits = malloc ((MATRIX_X * MATRIX_Y / 32 + 1) * sizeof (int));

      for (v1 = 0; v1 < MATRIX_Y; v1++)
        {
          for (v2 = 0; v2 < MATRIX_X; v2++)
            {
              double v;

              /* Sum the four neighbours of (v1, v2), wrapping at the edges. */
              v = in[(v1 + startp1 + MATRIX_Y) % MATRIX_Y * MATRIX_X + v2];
              v += in[(v1 + startp2 + MATRIX_Y) % MATRIX_Y * MATRIX_X + v2];
              v += in[v1 * MATRIX_X + (v2 + startp3 + MATRIX_X) % MATRIX_X];
              v += in[v1 * MATRIX_X + (v2 + startp4 + MATRIX_X) % MATRIX_X];
              /* Multiply by this element's bit (0 or 1), most significant bit first. */
              v *= (bits[(v1 * MATRIX_X + v2) / 32]
                    >> (31 - (v1 * MATRIX_X + v2) % 32)) & 1;
              v *= 0.25;
              v += in2[v1 * MATRIX_X + v2];
              out[v1 * MATRIX_X + v2] = v;
              out2[v1 * MATRIX_X + v2] = fabs (in[v1 * MATRIX_X + v2] - v);
            }
        }
      return 0;
    }

I would really appreciate it if you could write a version using SSE and show me how to lay out the data better.

Best regards,
Zheng Da
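
As an illustration of the kind of SSE version being asked for, here is a minimal sketch using the SSE2 intrinsics from emmintrin.h. It rests on several assumptions that are not part of the original code: the bit array is replaced by a hypothetical array mask of doubles holding 0.0 or 1.0 (one per element, which costs far more memory than the packed bits), the matrix size is passed as nx and ny, and the function names stencil_point and stencil_row_sse2 are invented. Columns whose neighbours wrap around are handled by the scalar path, and interior columns are processed two doubles at a time with unaligned loads, because the left and right neighbours cannot both be 16-byte aligned.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <math.h>       /* fabs */

/* Scalar path for one element, same arithmetic order as the original loop.
   mask[] is an assumed array of 0.0/1.0 values standing in for bits[]. */
static void stencil_point(const double *in, const double *in2,
                          const double *mask, double *out, double *out2,
                          int nx, int ny, int y, int x)
{
    int i = y * nx + x;
    double v = in[(y + 1) % ny * nx + x];     /* row + 1    */
    v += in[(y - 1 + ny) % ny * nx + x];      /* row - 1    */
    v += in[y * nx + (x + 1) % nx];           /* column + 1 */
    v += in[y * nx + (x - 1 + nx) % nx];      /* column - 1 */
    v *= mask[i];
    v *= 0.25;
    v += in2[i];
    out[i] = v;
    out2[i] = fabs(in[i] - v);
}

/* One row: scalar at the wrapping columns, SSE2 for the interior. */
static void stencil_row_sse2(const double *in, const double *in2,
                             const double *mask, double *out, double *out2,
                             int nx, int ny, int y)
{
    assert(y >= 0 && y < ny);

    /* Row pointers are computed once per row, not once per element. */
    const double *up = in + (y + 1) % ny * nx;
    const double *down = in + (y - 1 + ny) % ny * nx;
    const double *row = in + y * nx;
    const __m128d quarter = _mm_set1_pd(0.25);
    const __m128d sign = _mm_set1_pd(-0.0);   /* only the sign bits set */
    int x = 1;

    stencil_point(in, in2, mask, out, out2, nx, ny, y, 0);

    /* Elements x and x + 1 must both be interior: x + 1 <= nx - 2. */
    for (; x + 1 <= nx - 2; x += 2) {
        int i = y * nx + x;
        __m128d v = _mm_loadu_pd(up + x);
        v = _mm_add_pd(v, _mm_loadu_pd(down + x));
        v = _mm_add_pd(v, _mm_loadu_pd(row + x + 1));
        v = _mm_add_pd(v, _mm_loadu_pd(row + x - 1));
        v = _mm_mul_pd(v, _mm_loadu_pd(mask + i));
        v = _mm_mul_pd(v, quarter);
        v = _mm_add_pd(v, _mm_loadu_pd(in2 + i));
        _mm_storeu_pd(out + i, v);
        /* fabs(): clear the sign bit of each lane, (~sign) & diff. */
        __m128d diff = _mm_sub_pd(_mm_loadu_pd(row + x), v);
        _mm_storeu_pd(out2 + i, _mm_andnot_pd(sign, diff));
    }

    /* Leftover interior element (if any) and the last column. */
    for (; x < nx; x++)
        stencil_point(in, in2, mask, out, out2, nx, ny, y, x);
}
```

Calling stencil_row_sse2 once for every row y from 0 to ny - 1 covers the whole matrix. Whether this is actually faster than the scalar loop would have to be measured, since the kernel does little arithmetic per load.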