On Fri, Jun 6, 2008 at 4:11 AM, Tim Prince <TimothyPrince@xxxxxxxxxxxxx> wrote:
> Gautam Sewani wrote:
>>
>> That is very bad news indeed :-( .
>> Can anyone confirm this with some testing? (I am using a Core Duo, and
>> don't have access to a Core 2 Duo.)
>> Regards
>> Gautam
>> On Thu, Jun 5, 2008 at 7:26 PM, Frédéric Bastien <nouiz@xxxxxxxxx> wrote:
>>>
>>> Hi,
>>>
>>> With Intel processors before the Core 2, there was a bottleneck in the
>>> CPU that made every SSE instruction be split in two. Since an SSE
>>> instruction holds only two doubles, on a processor with such a
>>> bottleneck I see only one way to get a speedup: use float instead of
>>> double. I know this is not always an option. To my knowledge the
>>> Prescott CPUs have this bottleneck.
>
> bad mix of top and bottom posting, some elided
>
> I don't see how this relates to the beginning of the thread. It's true that
> some CPUs in the past (Pentium M, AMD before Barcelona) always split 128-bit
> operands into two 64-bit operands. This doesn't mean you should avoid
> parallel SSE2, although it may reinforce the point that you should consider
> whether you are going about your task the best way.
>

Hi,

Instead of the code I was originally referring to, I tried a very simple
task: adding an array of 2-dimensional vectors. For timing, I used the
Boost timer class. I made three versions: one without any SIMD instructions
(http://pastebin.com/m3e8838c2), one using SSE2 instructions via Intel
intrinsics (http://pastebin.com/m783f8e7d), and one using SSE2 instructions
through the GCC vector extensions (http://pastebin.com/m6f36194e). A rough
sketch of the three versions is appended below. The best times were
obtained without using any SIMD instructions. For compiling I used
-march=prescott and -O3. When I compiled without the -O3 flag, the code
with the GCC vector extensions was 1.5 times faster than the one without
SIMD instructions, and the Intel intrinsics code was the slowest. Any help
will be greatly appreciated.

Regards
Gautam
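
P.S. In case the pastebin links above stop working, here is a minimal
sketch of what the three versions could look like. The array names, sizes,
alignment handling, and initialization are assumptions for illustration,
not the exact pastebin code; timing is omitted.

// Three ways to add arrays of 2-D double vectors (sketch, not the
// original pastebin code). Build e.g. with: g++ -O3 -march=prescott add2d.cxx
#include <emmintrin.h>   // SSE2 intrinsics: __m128d, _mm_add_pd, ...
#include <cstddef>
#include <cstdio>

const std::size_t N = 1 << 16;   // number of 2-D vectors (assumed size)

// 16-byte alignment so the aligned SSE2 loads/stores below are legal.
static double a[2 * N] __attribute__((aligned(16)));
static double b[2 * N] __attribute__((aligned(16)));
static double c[2 * N] __attribute__((aligned(16)));

// 1) Plain scalar version: no explicit SIMD, the compiler does what it wants.
void add_scalar()
{
    for (std::size_t i = 0; i < 2 * N; ++i)
        c[i] = a[i] + b[i];
}

// 2) SSE2 via Intel intrinsics: one 2-D vector (two doubles) per addpd.
void add_sse2_intrinsics()
{
    for (std::size_t i = 0; i < 2 * N; i += 2) {
        __m128d va = _mm_load_pd(&a[i]);
        __m128d vb = _mm_load_pd(&b[i]);
        _mm_store_pd(&c[i], _mm_add_pd(va, vb));
    }
}

// 3) GCC vector extension: '+' on v2df maps to addpd on SSE2 targets.
typedef double v2df __attribute__((vector_size(16)));

void add_gcc_vector()
{
    // Casting double* to v2df* is common in practice, though not strictly
    // portable under the aliasing rules.
    const v2df *va = reinterpret_cast<const v2df *>(a);
    const v2df *vb = reinterpret_cast<const v2df *>(b);
    v2df *vc = reinterpret_cast<v2df *>(c);
    for (std::size_t i = 0; i < N; ++i)
        vc[i] = va[i] + vb[i];
}

int main()
{
    for (std::size_t i = 0; i < 2 * N; ++i) {
        a[i] = static_cast<double>(i);
        b[i] = 1.0;
    }
    add_scalar();
    add_sse2_intrinsics();
    add_gcc_vector();
    std::printf("c[3] = %f\n", c[3]);   // keep the work observable
    return 0;
}

One thing to keep in mind when comparing the timings: depending on the GCC
version, -O3 may enable the auto-vectorizer (-ftree-vectorize), in which
case the "scalar" version can itself be compiled to SSE2 code.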