    typedef union { float f[4]; v4sf v; } vector[100];

    for (i = 100; --i >= 0;) {
        row1[i] = vector[i].f[0];
        row2[i] = vector[i].f[1];
        row3[i] = vector[i].f[2];
        row4[i] = vector[i].f[3];
    }
Unroll by four: load four vectors, swap data around in registers, store four vectors.
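To make the "load four, swap, store four" shape concrete, here is a minimal portable C sketch. The names `vec4` and `unpack_rows` are made up for illustration and are not from the original code, and whether a given compiler turns the inner block into real vector loads, shuffles and stores is not guaranteed.

```c
#include <stddef.h>

/* Hypothetical type for this sketch: one 4-float element. */
typedef struct { float f[4]; } vec4;

/* Hypothetical helper: split n 4-float vectors into four rows.
 * The main loop is unrolled by four so that each iteration is a
 * 4x4 transpose: load four vectors, rearrange, store four vectors. */
void unpack_rows(const vec4 *in, float *row1, float *row2,
                 float *row3, float *row4, size_t n)
{
    size_t i = 0;

    for (; i + 4 <= n; i += 4) {
        /* Load four input vectors. */
        vec4 a = in[i], b = in[i + 1], c = in[i + 2], d = in[i + 3];

        /* Swap data around: each r* gathers one component from all four. */
        vec4 r1 = {{ a.f[0], b.f[0], c.f[0], d.f[0] }};
        vec4 r2 = {{ a.f[1], b.f[1], c.f[1], d.f[1] }};
        vec4 r3 = {{ a.f[2], b.f[2], c.f[2], d.f[2] }};
        vec4 r4 = {{ a.f[3], b.f[3], c.f[3], d.f[3] }};

        /* Store four output vectors, one per row. */
        for (size_t k = 0; k < 4; k++) {
            row1[i + k] = r1.f[k];
            row2[i + k] = r2.f[k];
            row3[i + k] = r3.f[k];
            row4[i + k] = r4.f[k];
        }
    }

    /* Scalar tail for counts that are not a multiple of four. */
    for (; i < n; i++) {
        row1[i] = in[i].f[0];
        row2[i] = in[i].f[1];
        row3[i] = in[i].f[2];
        row4[i] = in[i].f[3];
    }
}
```

With 100 elements, as in the question, the tail loop never runs because 100 is a multiple of four.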
Is this the only portable way to do a pack/unpack without asm()?
If you want it fully portable at the C level without using any conditionals, this is pretty much it. If you just don't want to use asm(), there are intrinsics you can use.
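As one example of the intrinsics route, here is a sketch that assumes an x86 target with SSE; other instruction sets have their own, different intrinsics. The function name `unpack_rows_intrin` and the flat `float` input layout are choices made for this sketch. `_mm_loadu_ps`, `_mm_storeu_ps` and `_MM_TRANSPOSE4_PS` come from `<xmmintrin.h>`. The `#if` is exactly the kind of conditional the portable version avoids, and without SSE only the scalar loop is compiled.

```c
#include <stddef.h>
#if defined(__SSE__)
#include <xmmintrin.h>
#endif

/* Hypothetical helper: `in` holds n vectors of 4 floats, back to back. */
void unpack_rows_intrin(const float *in, float *row1, float *row2,
                        float *row3, float *row4, size_t n)
{
    size_t i = 0;

#if defined(__SSE__)
    for (; i + 4 <= n; i += 4) {
        /* Load four vectors (unaligned loads, so no alignment assumed). */
        __m128 a = _mm_loadu_ps(in + 4 * i);
        __m128 b = _mm_loadu_ps(in + 4 * (i + 1));
        __m128 c = _mm_loadu_ps(in + 4 * (i + 2));
        __m128 d = _mm_loadu_ps(in + 4 * (i + 3));

        /* Swap data around in registers: in-place 4x4 transpose. */
        _MM_TRANSPOSE4_PS(a, b, c, d);

        /* Store four vectors, one per output row. */
        _mm_storeu_ps(row1 + i, a);
        _mm_storeu_ps(row2 + i, b);
        _mm_storeu_ps(row3 + i, c);
        _mm_storeu_ps(row4 + i, d);
    }
#endif

    /* Scalar fallback and tail: runs for everything when SSE is absent. */
    for (; i < n; i++) {
        row1[i] = in[4 * i + 0];
        row2[i] = in[4 * i + 1];
        row3[i] = in[4 * i + 2];
        row4[i] = in[4 * i + 3];
    }
}
```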
How do I set it up differently to trigger a pack/unpack optimization?
Perhaps the auto-vectorisers aren't smart enough (yet) to do this for you. If your goal is great performance, you really have to write a special version for every processor; although auto-vectorisation certainly can speed things up quite a bit, hand-written vector code can be *much* faster. A big part of the problem is that many vector instruction sets are very limited, or just "different".

Segher