Re: AVX - cast from YMM to XMM without overhead?

Marc Glisse <marc.glisse@xxxxxxxx> · Sun, 24 Jun 2012 16:10:21 +0200 (CEST)

On Sun, 24 Jun 2012, Dag Lem wrote:

Marc Glisse <marc.glisse@xxxxxxxx> writes:

On Sat, 23 Jun 2012, Dag Lem wrote:

[...]

However, gcc does not seem to allow casting between vectors of
different lengths (why?!).

Ugly.

So in your opinion,

 v4sf c = (v4sf)a + b;
 v4sf d = __builtin_ia32_haddps(c, c);
 *z = (v2sf)d;

is uglier than

 v4sf c = *((v4sf*)&a) + b;
 v4sf d = __builtin_ia32_haddps(c, c);
 __builtin_ia32_storelps(z, d);

???

I suspect that you must be referring to something other than apparent
aesthetics; can you please tell me where the ugliness is hidden in the
first example?

Casting a vector to a vector with a different number of elements: I am not 
sure what that is supposed to mean. Casting a pointer: I know this means 
reinterpreting what is in memory there. In C++ you could also consider 
casting to a reference to v4sf (equivalent to the pointer cast) but I am 
not sure that it is supported yet (related to bug 53121).
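
To make the "reinterpreting what is in memory" reading concrete, here is a
minimal sketch using gcc's generic vector types. The type names follow the
thread; the helper names low_half and add_low are hypothetical, and memcpy
stands in for the pointer cast *((v4sf*)&a) purely as an assumption, to keep
the example clear of aliasing questions. It says nothing about which
instructions gcc emits for it.

```c
#include <string.h>

typedef float v4sf __attribute__((vector_size(16)));
typedef float v8sf __attribute__((vector_size(32)));

/* Hypothetical helper: copy the first 16 bytes (elements 0..3) of *a. */
static inline v4sf low_half(const v8sf *a)
{
    v4sf lo;
    memcpy(&lo, a, sizeof lo);
    return lo;
}

/* c = (low 128 bits of a) + b, element-wise. */
static inline v4sf add_low(const v8sf *a, v4sf b)
{
    return low_half(a) + b;
}
```

The 256-bit vector is passed by pointer so the sketch also builds when AVX is
not enabled on the command line.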

The problem with this is that in the calculation of "c = a + b", gcc
generates an intermediate instruction (vmovdqa) to store the contents
of the YMM register holding "a" to memory, instead of directly
accessing the low 128 bits via the corresponding XMM register.

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=53101

OK, so *((v4sf*)&a) is not supposed to be a no-op, but presently
isn't?

Strike the "not" in your sentence, and yes. It manages it for *(float*)&a 
(taking the first element) but not for subvectors (yet).

I have also looked into using inline assembly to avoid the overhead,
however as far as I can tell it is not possible to use YMM registers
as parameters.

Yes it is, you may want to re-read the doc remembering that the YMM
registers are the XMM registers (you just use them more fully).

Hmm, I have both read and re-read the documentation, and I am still at
a loss on how to do "a + b" in a sane way when "a" is a YMM register
and "b" is an XMM register, and I want the result in an XMM
register. I am able to emit something like "vaddps %ymm0, %xmm1,
%xmm1", but this is not valid assembly.

Ah... I see. Can't think of a nice solution right now, sorry.
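
One possibility, offered as a sketch rather than a recommendation: gcc's x86
inline assembly has an operand modifier "x" that prints the XMM name of a
vector register operand, so %x1 names the low half of whatever YMM register
operand 1 was allocated to. The function name add_low_asm and the
pointer-based interface are assumptions for illustration, and the target
attribute is only there so the example builds without -mavx.

```c
#include <immintrin.h>

/* Hypothetical example: out4 = (low 128 bits of a8[0..7]) + b4[0..3]. */
__attribute__((target("avx")))
void add_low_asm(const float *a8, const float *b4, float *out4)
{
    __m256 a = _mm256_loadu_ps(a8);   /* lives in a YMM register */
    __m128 b = _mm_loadu_ps(b4);      /* lives in an XMM register */
    __m128 r;

    /* %x1 prints operand 1 (a YMM operand) under its XMM name. */
    __asm__("vaddps %x1, %2, %0" : "=x"(r) : "x"(a), "x"(b));

    _mm_storeu_ps(out4, r);
}
```

The asm statement hides the addition from the optimizer, so this trades one
possible extra move for the loss of scheduling and constant folding around it.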

Is there any way to cast from a YMM to an XMM register without
incurring any performance penalty?

ISTR that _mm256_extractf128_ps(*,0) is properly optimized to nothing
when possible (I may misremember).

I tried this now, and gcc simply replaces vextractf128 with an equally
superfluous vmovaps.

:-(
Similar builtins sometimes work. You could try to file a bug about it if 
there isn't one already.
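
For comparison, the AVX intrinsics header also provides
_mm256_castps256_ps128, which is specified as a pure reinterpretation of the
low 128 bits that generates no instruction of its own. The sketch below
rewrites the snippet quoted earlier in this message with it. The function name
sum_low_pairs and the pointer interface are hypothetical, and whether a given
gcc version still emits an extra move has to be checked in the generated
assembly.

```c
#include <immintrin.h>

/* Hypothetical example: z[0] = c0 + c1, z[1] = c2 + c3,
   where c = (low 128 bits of a8[0..7]) + b4[0..3]. */
__attribute__((target("avx")))
void sum_low_pairs(const float *a8, const float *b4, float *z)
{
    __m256 a = _mm256_loadu_ps(a8);
    __m128 b = _mm_loadu_ps(b4);

    /* Reinterpret the low half of a as a 128-bit vector. */
    __m128 c = _mm_add_ps(_mm256_castps256_ps128(a), b);

    /* Horizontal add: d = { c0+c1, c2+c3, c0+c1, c2+c3 }. */
    __m128 d = _mm_hadd_ps(c, c);

    /* Store the low two floats of d to z. */
    _mm_storel_pi((__m64 *)z, d);
}
```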

A single optimistic note: the extra moves often have negligible cost, even 
in a tight loop of a dozen instructions.

--
Marc Glisse