> I prefer m256 too. I'm already using vector_size(4*sizeof(double)) for
> some calculation in 3D Euclidean space (only 3 elements are really used).

Then I'd expect m256 to give you good performance for a lot of it; i.e.,
1 operation vs 2 (1x m128 and 1x m64) or 3 (3x m64). I wish this were a
trick compilers did more often. ISPC might, but I haven't played with it
as much as I'd like.

> I want SIMD code and I don't need much indexing. But just curious, why
> is indexing SIMD vectors inefficient?

If you don't do much indexing, it should be fine. Basically, to extract an
element, compilers seem to store the vector and then load the desired
element. Here are a couple of simple examples:
https://godbolt.org/z/4EH_Nk
I used -O3 with gcc and clang.

index1 just indexes a pointer like normal; gcc and clang both use a single
vmovsd.

index2 loads a vector and then indexes the vector. clang is able to reduce
it to the equivalent of index1, but gcc loads the vector, re-stores it
elsewhere, and then uses vmovsd on the stored copy.

index3 takes a vector as an argument and then indexes it. Neither compiler
generated nice-looking code here: both stored the vector and then used
vmovsd to load the single element.

The index2 example with gcc was also particularly problematic, in that the
vector wasn't totally transparent to the optimizer. Still, it's not a big
deal if you don't do much indexing. If the autovectorizer had vectorized
your scalar code, it would still have to resort to the same tricks (storing
a vector and reloading a scalar) to extract a scalar, so you aren't losing
anything. Meaning it's probably totally fine, or much better than fine if
the SIMD is profitable.
On Mon, Dec 16, 2019 at 10:59 AM Xi Ruoyao <xry111@xxxxxxxxxxxxxxxx> wrote:
> On 2019-12-16 08:16 -0500, Chris Elrod wrote:
> > I'm not the asker, but I would strongly prefer m256 if code could be
> > generated masking the unused lane for safe loads/stores, at least on
> > architectures where this is efficient (e.g., Skylake-X).
> > This automatic masking would make writing SIMD code easier when you
> > don't have powers of 2, by saving the effort of passing the bitmask
> > to each operation (which is at least an option with immintrin.h, not
> > sure about GCC's built-ins).
>
> I prefer m256 too. I'm already using vector_size(4*sizeof(double)) for
> some calculation in 3D Euclidean space (only 3 elements are really used).
>
> > However, if the asker doesn't want this for SIMD code, but wants a
> > convenient vector to index for scalar code, I'd recommend defining
> > your own class. Indexing SIMD vectors is inefficient, and it may
> > interfere with optimizations like SROA. But I could be wrong; my
> > experience is mostly with Julia, which uses LLVM. GCC may do better.
>
> I want SIMD code and I don't need much indexing. But just curious, why
> is indexing SIMD vectors inefficient?
> --
> Xi Ruoyao <xry111@xxxxxxxxxxxxxxxx>
> School of Aerospace Science and Technology, Xidian University

--
https://github.com/chriselrod?tab=repositories
https://www.linkedin.com/in/chris-elrod-9720391a/