Re: Packed-simd SSE for only vectorized loops

Tim Prince <n8tm@xxxxxxx> · Sat, 31 Oct 2009 06:20:55 -0700

Fahimeh Yazdanpanah wrote:

I am working on autovectorization. Using gcc-4.3.3 with vectorization flags on 64-bit Ubuntu on an Intel Core 2, I found that gcc produces packed-SIMD SSE opcodes for all instructions in vectorized loops and for some instructions in non-vectorized loops.

Would you please let me know if there is a flag or switch to stop gcc from producing packed-SIMD instructions for non-vectorized loops? Or is there any way to distinguish between the packed-SIMD SSE instructions within vectorized loops and those within non-vectorized loops?

There are many situations where it is correct for the compiler to use 
parallel (packed) instructions, even though only the scalar operand is 
used.  The parallel move instructions (e.g. movaps) have been preferred 
for register-to-register moves since first documented for Athlon-32: a 
scalar move must preserve the contents of the unused slots of the 
destination, while a parallel move drops that requirement and so 
permits hardware register renaming.  gcc also observes a similar 
work-around for performance stalls in certain conversions between float 
and double.
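A minimal sketch of the kind of scalar-only code where this can show up. 
The file name scalar_sse.c and the function names widen, narrow and mix 
are made up for illustration; whether a given gcc version and -march 
setting actually emits packed instructions here is not guaranteed, so 
check the generated assembly rather than assuming it.

```c
/* scalar_sse.c (hypothetical file name): scalar code only, with no
   loops for the vectorizer to touch. */

/* float -> double conversion.  The scalar form (cvtss2sd) writes only
   the low element of its destination register, so a compiler may pick
   a sequence involving packed instructions instead. */
double widen(float x)
{
    return (double)x;
}

/* double -> float conversion, the opposite direction. */
float narrow(double x)
{
    return (float)x;
}

/* 'a' is needed again after the multiply, so the compiler may have to
   copy a register.  That copy is where a packed move (movapd/movaps)
   may appear in place of a scalar movsd. */
double mix(double a, double b)
{
    double t = a * b;
    return t + a;
}
```

Compiling with "gcc -O2 -S scalar_sse.c" writes scalar_sse.s, which can 
then be searched for packed mnemonics such as movaps, movapd or 
cvtps2pd. Any that appear there come from scalar code generation, not 
from the vectorizer, since the file contains no loops.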
Back in the Athlon-32 days, these optimizations were performed only when 
-march=athlonxxx switches were set.  If it's worth it to you, you might 
try to find out what was changed to fix the performance problem for the 
Intel targets.
If you have an interesting story about what you intend to accomplish by 
spending the time to preserve the extra register slots during scalar 
moves, let's hear it.  However, you can't expect gcc to support some 
mode where you are caching data in those extra slots by asm while 
allowing normal compilation and optimization of source code.