Re: AVX - cast from YMM to XMM without overhead?

Dag Lem <dag@xxxxxxxxx> · 24 Jun 2012 01:14:14 +0200

Marc Glisse <marc.glisse@xxxxxxxx> writes:

> On Sat, 23 Jun 2012, Dag Lem wrote:

[...]

> > However, gcc does not seem to allow casting between vectors of
> > different lengths (why?!).
> 
> Ugly.

So in your opinion,

  v4sf c = (v4sf)a + b;
  v4sf d = __builtin_ia32_haddps(c, c);
  *z = (v2sf)d;

is uglier than

  v4sf c = *((v4sf*)&a) + b;
  v4sf d = __builtin_ia32_haddps(c, c);
  __builtin_ia32_storelps(z, d);

???

I suspect that you must be referring to something other than apparent
aesthetics; can you please tell me where the ugliness is hidden in the
first example?
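
For reference, a minimal self-contained sketch of what both snippets
compute, written with GCC's generic vector extensions only. The helper
names low128 and add_low_hadd are hypothetical, and a and z are taken
by pointer here as an assumption. memcpy stands in for the pointer
cast, and the pairwise sums are spelled out instead of calling
__builtin_ia32_haddps. No claim is made about which instructions a
given gcc version emits for it.

```c
#include <string.h>

/* GCC/Clang generic vector types, named as in the snippets above. */
typedef float v2sf __attribute__((vector_size(8)));
typedef float v4sf __attribute__((vector_size(16)));
typedef float v8sf __attribute__((vector_size(32)));

/* Hypothetical helper: reinterpret the low 128 bits of *a as a v4sf.
   memcpy expresses the same reinterpretation as *((v4sf*)&a). */
static inline v4sf low128(const v8sf *a)
{
    v4sf lo;
    memcpy(&lo, a, sizeof lo);
    return lo;
}

/* Hypothetical wrapper: add b to the low half of *a, then store the
   pairwise sums of the result to *z. */
static inline void add_low_hadd(v2sf *z, const v8sf *a, v4sf b)
{
    v4sf c = low128(a) + b;
    /* The same values haddps(c, c) leaves in its two low lanes. */
    v2sf r = { c[0] + c[1], c[2] + c[3] };
    *z = r;
}
```

With a = {1, ..., 8} and b = {10, 20, 30, 40}, the two lanes of *z
are (1+10)+(2+20) and (3+30)+(4+40).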

> >
> > The problem with this is that in the calculation of "c = a + b", gcc
> > generates an intermediate instruction (vmovdqa) to store the contents
> > of the YMM register holding "a" to memory, instead of directly
> > accessing the low 128 bits via the corresponding XMM register.
> 
> http://gcc.gnu.org/bugzilla/show_bug.cgi?id=53101

OK, so *((v4sf*)&a) is supposed to be a no-op, but presently
isn't?

> 
> > I have also looked into using inline assembly to avoid the overhead,
> > however as far as I can tell it is not possible to use YMM registers
> > as parameters.
> 
> Yes it is, you may want to re-read the doc remembering that the YMM
> registers are the XMM registers (you just use them more fully).

Hmm, I have both read and re-read the documentation, and I am still at
a loss as to how to do "a + b" in a sane way when "a" is in a YMM
register and "b" is in an XMM register, and I want the result in an
XMM register. I am able to emit something like "vaddps %ymm0, %xmm1,
%xmm1", but this is not valid assembly, since the operand widths do
not match. My initial thought was to pass "a" as an *explicit* YMM
register so that I could refer to the corresponding XMM register in
inline assembly, however there is no helpful machine constraint for
YMM registers, only "Yz" for XMM0.
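
A hedged sketch of one way this could be written. It rests on two
assumptions that should be checked against the compiler in use: that
the "x" constraint accepts a 256-bit vector operand when AVX is
enabled, and that the operand modifier %x prints a register operand
under its XMM name. The function name add_low_asm and its array-based
interface are hypothetical, chosen only so that the example stands
alone.

```c
#include <string.h>

typedef float v4sf __attribute__((vector_size(16)));
typedef float v8sf __attribute__((vector_size(32)));

/* Hypothetical helper: out[0..3] = a8[0..3] + b4[0..3], with the add
   written as inline assembly. Arrays are passed by pointer so that
   callers compiled without AVX never pass 256-bit vectors by value. */
__attribute__((target("avx")))
void add_low_asm(float out[4], const float a8[8], const float b4[4])
{
    v8sf a;
    v4sf b, r;
    memcpy(&a, a8, sizeof a);
    memcpy(&b, b4, sizeof b);

    /* "x" lets the compiler pick any SSE/AVX register per operand.
       %x1 prints operand 1 (a 256-bit value) under its XMM name, so
       the instruction reads only the low 128 bits of a. */
    __asm__("vaddps %x1, %2, %0" : "=x"(r) : "x"(a), "x"(b));

    memcpy(out, &r, sizeof r);
}
```

The target("avx") attribute is used so that no -mavx flag is needed
for this one function; running it still requires a CPU with AVX.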

> 
> > Is there any way to cast from a YMM to an XMM register without
> > incurring any performance penalty?
> 
> ISTR that _mm256_extractf128_ps(*,0) is properly optimized to nothing
> when possible (I may misremember).

I tried this now, and gcc simply replaces vextractf128 with an equally
superfluous vmovaps.

-- 
Best regards,

Dag Lem