Take a simple, suboptimal inline implementation of hadd with SSE2 using builtin intrinsics:

    typedef float v4sf __attribute__ ((vector_size (16)));

    inline v4sf hadd(v4sf src)
    {
        src = __builtin_ia32_addps(src, __builtin_ia32_movhlps(src, src));
        return __builtin_ia32_addss(src, __builtin_ia32_shufps(src, src, 0xE5));
    }

This gets compiled to:

    movaps  %xmm0, %xmm1
    movhlps %xmm0, %xmm1
    addps   %xmm1, %xmm0
    movaps  %xmm0, %xmm1
    shufps  $229, %xmm0, %xmm1
    addss   %xmm1, %xmm0

Apparently a printf("%f\n", hadd(something)) works with the parameter in %xmm0 and doesn't even convert the single-precision float to a double.

But if you want to pass the value in %xmm0 on to a function which takes a float, like this one:

    float foo(float a)
    {
        return a;
    }

then foo(hadd(something)) gives:

    error: incompatible type for argument 1 of `foo'

So consider this instead:

    inline float hadd(v4sf src)
    {
        src = __builtin_ia32_addps(src, __builtin_ia32_movhlps(src, src));
        return (float)__builtin_ia32_addss(src, __builtin_ia32_shufps(src, src, 0xE5));
    }

Oops: casting v4sf to float doesn't work, even though the value already sits in the very register it would be returned in.

This essentially prevents one from writing efficient inline functions that should not store return values on the stack or somewhere else in memory. I really would like to know if there is a way around it.

Jon Daniel
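
One possible way around the failing cast, offered only as a sketch: instead of casting the whole vector, read out its lowest element with _mm_cvtss_f32 from <xmmintrin.h>. The assumptions here are an x86 target with SSE enabled, a compiler that ships that header, and the use of __m128 plus the header's intrinsics in place of the v4sf typedef and the __builtin_ia32_* calls from the post. Whether the result really stays in %xmm0 without a trip through memory depends on the compiler version and target ABI, so it is worth checking the generated assembly.

```c
#include <xmmintrin.h>  /* SSE intrinsics: __m128, _mm_add_ps, _mm_cvtss_f32, ... */

/* Horizontal add of the four floats in src, returned as a scalar float.
   Assumption: __m128 stands in for the post's v4sf typedef. */
static inline float hadd(__m128 src)
{
    /* Add the high pair onto the low pair: lane 0 = s0+s2, lane 1 = s1+s3. */
    src = _mm_add_ps(src, _mm_movehl_ps(src, src));
    /* Copy lane 1 into lane 0 (same 0xE5 selector as the original shufps)
       and add it to lane 0 only. */
    src = _mm_add_ss(src, _mm_shuffle_ps(src, src, 0xE5));
    /* Read lane 0 as a float; the vector itself is never cast. */
    return _mm_cvtss_f32(src);
}

/* The scalar consumer from the post. */
static float foo(float a)
{
    return a;
}
```

With this shape, foo(hadd(something)) type-checks because hadd now returns a plain float.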