All of the SSE builtins listed in the GCC manual return a value. For example, v4sf __builtin_ia32_mulps (v4sf, v4sf) returns the component-wise product of the two given vectors. However, the actual SSE instruction mulps is accumulator-based: it takes two operands and overwrites the first one with the product. This seems to make GCC use quite a few temporary registers when I call __builtin_ia32_mulps, since it has to produce a return value for an instruction that doesn't work that way.

Is there a set of GCC builtins that accumulates into an operand rather than returning a value? That would be really nice, since that's how the instructions actually work, and that's how I want to use them.

Thanks for any feedback.

PS: Please reply-to-all, since I'm not subscribed to this list.
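
For reference, here is a minimal sketch of the usage pattern in question. The helper names mul_accumulate and scale_repeatedly are hypothetical, made up for illustration; only the v4sf type and __builtin_ia32_mulps come from the GCC manual. The sketch assumes a GCC build targeting x86 with SSE enabled, and falls back to GCC's generic vector multiplication otherwise. Assigning the result back to the first operand is the closest the value-returning builtin gets to the accumulator form. When the old value is not needed afterwards, I would expect the register allocator to be able to reuse the same register, but that is an assumption best checked in the output of gcc -O2 -S.

```c
#include <string.h>

/* Four packed single-precision floats, matching the manual's v4sf type. */
typedef float v4sf __attribute__((vector_size(16)));

/* Hypothetical helper: component-wise product of acc and x.
 * The caller writes the result back into acc to mimic the
 * destructive two-operand form of mulps. */
static inline v4sf mul_accumulate(v4sf acc, v4sf x)
{
#if defined(__SSE__) && defined(__GNUC__) && !defined(__clang__)
    return __builtin_ia32_mulps(acc, x);
#else
    /* Assumption: fall back to the generic vector extension
     * when the x86 builtin is not available. */
    return acc * x;
#endif
}

/* Hypothetical driver: multiplies `in` by `factor` component-wise,
 * `times` times in a row, and stores the result in `out`. */
void scale_repeatedly(float out[4], const float in[4],
                      const float factor[4], int times)
{
    v4sf acc, f;
    int i;

    memcpy(&acc, in, sizeof acc);
    memcpy(&f, factor, sizeof f);

    for (i = 0; i < times; ++i)
        acc = mul_accumulate(acc, f);   /* accumulate in place */

    memcpy(out, &acc, sizeof acc);
}
```

The loop keeps a single live value, acc, so no extra temporaries are required at the source level; whether any appear in the generated code depends on the optimization level.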