Marc Glisse <marc.glisse@xxxxxxxx> writes:

> On Sat, 23 Jun 2012, Dag Lem wrote:

[...]

> > However, gcc does not seem to allow casting between vectors of
> > different lengths (why?!).
>
> Ugly.

So in your opinion,

    v4sf c = (v4sf)a + b;
    v4sf d = __builtin_ia32_haddps(c, c);
    *z = (v2sf)d;

is uglier than

    v4sf c = *((v4sf*)&a) + b;
    v4sf d = __builtin_ia32_haddps(c, c);
    __builtin_ia32_storelps(z, d);

???

I suspect that you must be referring to something other than apparent
aesthetics; can you please tell me where the ugliness is hidden in the
first example?

> > The problem with this is that in the calculation of "c = a + b", gcc
> > generates an intermediate instruction (vmovdqa) to store the contents
> > of the YMM register holding "a" to memory, instead of directly
> > accessing the low 128 bits via the corresponding XMM register.
>
> http://gcc.gnu.org/bugzilla/show_bug.cgi?id=53101

OK, so *((v4sf*)&a) is supposed to be a no-op, but presently isn't?

> > I have also looked into using inline assembly to avoid the overhead,
> > however as far as I can tell it is not possible to use YMM registers
> > as parameters.
>
> Yes it is, you may want to re-read the doc remembering that the YMM
> registers are the XMM registers (you just use them more fully).

Hmm, I have both read and re-read the documentation, and I am still at
a loss on how to do "a + b" in a sane way when "a" is a YMM register
and "b" is an XMM register, and I want the result in an XMM register.
I am able to emit something like "vaddps %ymm0, %xmm1, %xmm1", but
this is not valid assembly.

My initial thought was to pass "a" as an *explicit* YMM register so
that I could refer to the corresponding XMM register in inline
assembly; however, there is no helpful machine constraint for YMM
registers, only "Yz" for XMM0.

> > Is there any way to cast from a YMM to an XMM register without
> > incurring any performance penalty?
>
> ISTR that _mm256_extractf128_ps(*,0) is properly optimized to nothing
> when possible (I may misremember).

I tried this now, and gcc simply replaces vextractf128 with an equally
superfluous vmovaps.

-- 
Best regards,
Dag Lem
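
For reference, the computation discussed above can be written as a minimal, portable sketch using only GCC's generic vector extensions. The typedefs and the helper name add_low_hadd are assumptions made for illustration and are not taken from the original code. The low half of "a" is copied with memcpy, and the horizontal add is spelled out by hand instead of calling __builtin_ia32_haddps. The sketch only pins down the intended result; it says nothing about which instructions gcc will generate.

```c
#include <assert.h>
#include <string.h>

/* GCC generic vector types; names chosen to match the thread. */
typedef float v4sf __attribute__((vector_size(16)));
typedef float v8sf __attribute__((vector_size(32)));

/*
 * Hypothetical helper (not from the original code):
 *   c    = low four floats of *a, plus *b
 *   z[0] = c[0] + c[1]
 *   z[1] = c[2] + c[3]
 * This is the same result as haddps(c, c) followed by storing the
 * low two floats.
 */
static void add_low_hadd(const v8sf *a, const v4sf *b, float z[2])
{
    assert(a != NULL && b != NULL && z != NULL);

    v4sf lo;
    memcpy(&lo, a, sizeof lo);  /* low 128 bits of the 256-bit vector */

    v4sf c = lo + *b;           /* element-wise add on four floats */

    z[0] = c[0] + c[1];         /* manual horizontal add, low pair */
    z[1] = c[2] + c[3];         /* manual horizontal add, high pair */
}
```

With a = {1, 2, 3, 4, 5, 6, 7, 8} and b = {10, 20, 30, 40}, z receives 33 and 77.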