On Sat, 23 Jun 2012, Dag Lem wrote:
Using gcc 4.7.0, I am trying to use vector extensions and AVX builtins to sum 8 complex numbers, where the real and imaginary parts are stored in separate YMM registers, and store the result in m64. In other words, I'd like to achieve the following:

    v2sf* z = ...;
    v8sf re = ...;
    v8sf im = ...;
    *z = { sum(re[0..7]), sum(im[0..7]) };

My first attempt at this was:

    v8sf a = __builtin_ia32_haddps256(re, im);         // iirr iirr
    v4sf b = __builtin_ia32_vextractf128_ps256(a, 1);
    v4sf c = (v4sf)a + b;                              // iirr
    v4sf d = __builtin_ia32_haddps(c, c);              // irir
    *z = (v2sf)d;                                      // ir

However, gcc does not seem to allow casting between vectors of different lengths (why?!).
Ugly.
Hence, my second attempt is as follows:

    v8sf a = __builtin_ia32_haddps256(re, im);         // iirr iirr
    v4sf b = __builtin_ia32_vextractf128_ps256(a, 1);
    v4sf c = *((v4sf*)&a) + b;                         // iirr
    v4sf d = __builtin_ia32_haddps(c, c);              // irir
    __builtin_ia32_storelps(z, d);                     // ir

The problem with this is that in the calculation of "c = a + b", gcc generates an intermediate instruction (vmovdqa) to store the contents of the YMM register holding "a" to memory, instead of directly accessing the low 128 bits via the corresponding XMM register.
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=53101
I have also looked into using inline assembly to avoid the overhead, however as far as I can tell it is not possible to use YMM registers as parameters.
Yes it is, you may want to re-read the doc remembering that the YMM registers are the XMM registers (you just use them more fully).
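To make that concrete, here is a rough sketch of what I have in mind, not something I have benchmarked. The function name hadd_low_asm and the array-based interface are made up for illustration. It assumes GCC on x86, that the "x" constraint accepts a 256-bit operand (which is then printed as %ymmN), and that the "x" operand modifier (%x1) prints the XMM name of the same register.

```c
#include <immintrin.h>

/* Hypothetical helper (name and interface are illustrative only):
 * computes hadd(re, im) in a YMM register with inline asm, then
 * reads its low 128 bits through the aliased XMM register. */
__attribute__((target("avx")))
void hadd_low_asm(const float re[8], const float im[8], float out[4])
{
    __m256 r = _mm256_loadu_ps(re);
    __m256 i = _mm256_loadu_ps(im);
    __m256 a;
    __m128 lo;

    /* "x" means "any SSE register"; for a 256-bit operand GCC prints
     * the YMM name. AT&T operand order: second source, first source,
     * destination. */
    __asm__("vhaddps %2, %1, %0" : "=x"(a) : "x"(r), "x"(i));

    /* %x1 is assumed to print the XMM name of the register holding a,
     * so this copies the low half without a trip through memory. */
    __asm__("vmovaps %x1, %0" : "=x"(lo) : "x"(a));

    /* out = { re0+re1, re2+re3, im0+im1, im2+im3 } */
    _mm_storeu_ps(out, lo);
}
```

Whether this beats what the compiler generates from intrinsics is a separate question; the asm statements are opaque to the optimizer.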
Is there any way to cast from a YMM to an XMM register without incurring any performance penalty?
ISTR that _mm256_extractf128_ps(*,0) is properly optimized to nothing when possible (I may misremember).
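Something along these lines is what I would try with the intrinsics from immintrin.h instead of the builtins. This is only a sketch: the function name sum8_complex and the array-based interface are invented for the example, and I have not checked the generated code for any particular gcc version.

```c
#include <immintrin.h>

/* Hypothetical helper (name and interface are illustrative only):
 * z[0] = sum(re[0..7]), z[1] = sum(im[0..7]). */
__attribute__((target("avx")))
void sum8_complex(const float re[8], const float im[8], float z[2])
{
    /* Per 128-bit lane, low to high: r r i i */
    __m256 a = _mm256_hadd_ps(_mm256_loadu_ps(re), _mm256_loadu_ps(im));

    __m128 hi = _mm256_extractf128_ps(a, 1);
    /* Index 0 selects the low half; the hope is that the compiler
     * emits no instruction for it. _mm256_castps256_ps128(a) is an
     * alternative spelling of the same thing. */
    __m128 lo = _mm256_extractf128_ps(a, 0);

    __m128 c = _mm_add_ps(lo, hi);   /* r r i i */
    __m128 d = _mm_hadd_ps(c, c);    /* r i r i */

    /* Store the low two floats (movlps, no alignment requirement). */
    _mm_storel_pi((__m64 *)z, d);
}
```

With re = 1..8 and im = 10, 20, ..., 80 this should give z = { 36, 360 }.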
-- Marc Glisse