question on gcc vector extensions for vector sizes > native SIMD width

"Barragy, Edward" <Edward.Barragy@xxxxxxx> · Wed, 30 Nov 2011 02:16:27 +0000

Hi -
I'm hoping to get some help / direction with the vector extensions in gcc 4.6 running on AMD Interlagos.
I've been using these as an alternative to compiler-generated vectorization of loops.
Using typedefs such as typedef float v4sf __attribute__ ((vector_size (16))); has allowed
a fairly painless mapping of floats to 4 packed floats in parts of my finite element code.
Gcc then reliably maps the usual *, +, etc. into packed SSE3 instructions. It also works well with AVX.
This in turn is giving a speedup of roughly 3x or better over the original scalar code, whereas the loop vectorizer
more or less fails completely.
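
A minimal sketch of the pattern described above, assuming gcc's generic vector extensions. The helper names axpy4 and axpy4_arrays are hypothetical and only serve as illustration; they do not come from the finite element code.

```c
#include <string.h>

/* Four packed floats in one 16-byte vector, as in the typedef above. */
typedef float v4sf __attribute__ ((vector_size (16)));

/* Hypothetical helper: a*x + y on all four lanes, written with the
   ordinary operators. With SSE enabled, gcc can emit packed mul/add. */
static v4sf axpy4(v4sf a, v4sf x, v4sf y)
{
    return a * x + y;
}

/* Hypothetical helper: same operation on plain float[4] arrays.
   memcpy moves data in and out of the vector type without relying
   on vector subscripting or pointer casts. */
static void axpy4_arrays(const float *a, const float *x,
                         const float *y, float *out)
{
    v4sf va, vx, vy, vr;

    memcpy(&va, a, sizeof va);
    memcpy(&vx, x, sizeof vx);
    memcpy(&vy, y, sizeof vy);
    vr = axpy4(va, vx, vy);
    memcpy(out, &vr, sizeof vr);
}
```

Compiling with something like gcc -O2 -msse3 -S and inspecting the assembly is one way to check whether packed instructions were actually emitted for axpy4.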

What I'd like to try next is something like typedef float v16sf __attribute__((vector_size (64))) .
Gcc accepts that construct, but emits 16 x scalar instructions rather than 4 x packed SSE instructions.
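
To make the two forms concrete, here is a sketch under stated assumptions. mul_add16 uses the 64-byte typedef directly; mul_add16_split is a manual workaround (not a gcc flag or feature) that spells the same value out as four 16-byte pieces so that each operation is on a native-width vector. The names mul_add16, v16sf_split and mul_add16_split are hypothetical.

```c
#include <string.h>

typedef float v4sf  __attribute__ ((vector_size (16)));
typedef float v16sf __attribute__ ((vector_size (64)));

/* Direct form: one 64-byte vector. gcc accepts this; how it is
   lowered to machine code depends on the gcc version and flags. */
static void mul_add16(const v16sf *a, const v16sf *x,
                      const v16sf *y, v16sf *out)
{
    *out = *a * *x + *y;
}

/* Manual workaround (an assumption, not a gcc feature): the same
   16 floats stored as four native-width 16-byte vectors. */
typedef struct { v4sf q[4]; } v16sf_split;

static void mul_add16_split(const v16sf_split *a, const v16sf_split *x,
                            const v16sf_split *y, v16sf_split *out)
{
    int i;

    /* Four independent packed operations, one per 16-byte piece. */
    for (i = 0; i < 4; ++i)
        out->q[i] = a->q[i] * x->q[i] + y->q[i];
}
```

Both types occupy 64 bytes with the same lane order, so data can be copied between them with memcpy when comparing the generated code.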

How difficult would it be to modify gcc to map this to 4 x packed SSE instructions, and where in the code would I look to get started?
Or am I simply missing some flags / directives etc. to get the packed SSE mapping?

There are a couple of reasons for wanting this. On the CPU side, it gives something like a depth-4 unrolling.
That in turn gives the CPU's out-of-order execution engine lots of independent instructions to chew on, at least for
my data structure (which is an array of structures, each struct processed independently of the others). For the
GPU / APU side of things, where the SIMD width is nominally 64, widening the typedef transparently changes the layout
of structs in memory, which can be very important for performance on the GPU / APU. There it would be something like
typedef float v64sf __attribute__((vector_size(256))). With a little luck, this typedef technique
should give good performance on CPU SIMD as well as GPU and APU SIMD without any surgery on the code's data structures.
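
The layout point can be sketched as follows. The struct node_block, its fields, the macro VEC_LANES and the helpers are all hypothetical stand-ins, not the actual finite element data structures. The only assumption about gcc is that vector_size accepts a constant expression.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical knob: 4 for SSE, 16 or 64 for wider targets. */
#ifndef VEC_LANES
#define VEC_LANES 16
#endif

typedef float vfloat __attribute__ ((vector_size (VEC_LANES * sizeof (float))));

/* Hypothetical record: each field holds VEC_LANES values, so changing
   VEC_LANES changes how far apart consecutive fields sit in memory. */
typedef struct {
    vfloat x;
    vfloat y;
    vfloat w;
} node_block;

/* Bytes between the start of field x and the start of field y. */
static size_t field_stride_bytes(void)
{
    return offsetof(node_block, y) - offsetof(node_block, x);
}

/* The arithmetic is written once and is the same for any VEC_LANES. */
static void scale_block(node_block *b, float s)
{
    float tmp[VEC_LANES];
    vfloat vs;
    int i;

    /* Broadcast s into every lane via a plain array and memcpy. */
    for (i = 0; i < VEC_LANES; ++i)
        tmp[i] = s;
    memcpy(&vs, tmp, sizeof vs);

    b->x = b->x * vs;
    b->y = b->y * vs;
    b->w = b->w * vs;
}
```

Building with -DVEC_LANES=4 or -DVEC_LANES=64 changes the field stride and struct size while scale_block stays untouched, which is the property described above.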
Thanks -
Ted