Fwd: 4x4 single-precision matrix product with SSE

Nicolas Bock <nicolasbock@xxxxxxxxx> · Mon, 14 Mar 2011 15:43:08 +0000

Hi René,

you might be completely right; I have yet to discover a better way of
ordering the registers. But I wonder: wouldn't the statement in line
49, coupled with line 52, make sure that dpps is done? The blendps
instruction in line 52 cannot be computed until the xmm13 result
from dpps in line 49 is known. By the time the program hits line 55,
all dependencies on xmm8 are gone. I have to admit, though, that I am
just guessing here; I don't think I have a good understanding yet of
how to deal with instruction dependencies...

41  # Calculate C(1,:).
42  movaps %xmm4, %xmm10
43  dpps $0xf1, %xmm8, %xmm10
44  movaps %xmm5, %xmm11
45  dpps $0xf2, %xmm8, %xmm11
46  movaps %xmm6, %xmm12
47  dpps $0xf4, %xmm8, %xmm12
48  movaps %xmm7, %xmm13
49  dpps $0xf8, %xmm8, %xmm13
50  blendps $0x01, %xmm10, %xmm11
51  blendps $0x03, %xmm11, %xmm12
52  blendps $0x07, %xmm12, %xmm13
53  addps %xmm13, %xmm0
54
55  movaps 0x20(A), %xmm8
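
For reference, here is a minimal scalar sketch in C of what the dpps and
blendps immediates in lines 42-52 appear to do. It is a model for reasoning
about the data flow, not the attached code: vec4, dpps(), blendps() and
row_times_bt() are hypothetical names, and it assumes that xmm8 holds one row
of A and that xmm4..xmm7 hold the rows of the transposed B.

```c
#include <assert.h>

/* Scalar stand-in for one XMM register holding four floats. */
typedef struct { float f[4]; } vec4;

/* Model of "dpps $imm, src, dst" (AT&T operand order).
 * Bits 4-7 of imm select which products enter the sum,
 * bits 0-3 select which lanes of the result receive the sum;
 * every other lane is set to zero. */
static vec4 dpps(vec4 dst, vec4 src, unsigned imm)
{
    float p[4];
    float sum;
    vec4 r;

    assert(imm <= 0xffu); /* the immediate is a single byte */
    for (int i = 0; i < 4; i++)
        p[i] = (imm & (0x10u << i)) ? dst.f[i] * src.f[i] : 0.0f;
    sum = (p[0] + p[1]) + (p[2] + p[3]);
    for (int i = 0; i < 4; i++)
        r.f[i] = (imm & (1u << i)) ? sum : 0.0f;
    return r;
}

/* Model of "blendps $imm, src, dst": lane i is taken from src
 * if bit i of imm is set, otherwise it keeps the value from dst. */
static vec4 blendps(vec4 dst, vec4 src, unsigned imm)
{
    vec4 r;

    for (int i = 0; i < 4; i++)
        r.f[i] = (imm & (1u << i)) ? src.f[i] : dst.f[i];
    return r;
}

/* Lines 42-52 of the listing. a_row plays the role of xmm8 and
 * bt[0..3] the roles of xmm4..xmm7 (both are assumptions). */
static vec4 row_times_bt(vec4 a_row, const vec4 bt[4])
{
    vec4 x10 = dpps(bt[0], a_row, 0xf1); /* { d0, 0, 0, 0 } */
    vec4 x11 = dpps(bt[1], a_row, 0xf2); /* { 0, d1, 0, 0 } */
    vec4 x12 = dpps(bt[2], a_row, 0xf4); /* { 0, 0, d2, 0 } */
    vec4 x13 = dpps(bt[3], a_row, 0xf8); /* { 0, 0, 0, d3 } */

    x11 = blendps(x11, x10, 0x01);       /* { d0, d1, 0, 0 } */
    x12 = blendps(x12, x11, 0x03);       /* { d0, d1, d2, 0 } */
    x13 = blendps(x13, x12, 0x07);       /* { d0, d1, d2, d3 } */
    return x13;                          /* line 53 adds this to xmm0 */
}
```

In this model each blendps depends on the one before it, and the last one
needs all four dpps results, while xmm8 is read only by the four dpps
instructions. The model says nothing about latency or throughput; it only
shows which values each instruction consumes.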

On Sun, Mar 13, 2011 at 21:08, René Dudfield <renesd@xxxxxxxxx> wrote:
>
> Hi,
>
> I may be completely wrong... but I think you could avoid the dependencies between the following block and xmm8? dpps is high latency, so maybe you can do some non-dependent things while it does its business?
>
>   # Calculate C(1,:).
>   movaps %xmm4, %xmm10
>   dpps $0xf1, %xmm8, %xmm10
>   movaps %xmm5, %xmm11
>   dpps $0xf2, %xmm8, %xmm11
>   movaps %xmm6, %xmm12
>   dpps $0xf4, %xmm8, %xmm12
>   movaps %xmm7, %xmm13
>   dpps $0xf8, %xmm8, %xmm13
>   blendps $0x01, %xmm10, %xmm11
>   blendps $0x03, %xmm11, %xmm12
>   blendps $0x07, %xmm12, %xmm13
>   addps %xmm13, %xmm0
>
>   movaps 0x20(A), %xmm8
>
>
>
> 2011/3/13 Nicolas Bock <nicolasbock@xxxxxxxxx>
>>
>> I have attached a short test project that demonstrates what I am doing.
>>
>> I time this simply with the time function, i.e.
>>
>> $ time ./mul_SSE 100000000
>>
>> real    0m1.037s
>> user    0m1.036s
>> sys     0m0.001s
>>
>> $ time ./mul_SSE4_1 100000000
>>
>> real    0m2.006s
>> user    0m2.003s
>> sys     0m0.002s
>>
>> I assume that I have prepared the A matrix for SSE a little bit by
>> "dilating" the elements into A = { A11, A11, A11, A11, A12, A12, ... },
>> while for SSE4.1 I am calling the multiply with the transpose of B.
>>
>> As these matrices are really small, they should be completely in L1, so
>> the movaps operation should have pretty low latency. Since the SSE
>> version uses 4 times more data for A than the SSE4.1 version, I am
>> surprised that given the larger number of data movements for the SSE
>> version it still beats the SSE4.1 version. But maybe I am just not
>> coding this very intelligently.
>>
>> Any suggestions would be very welcome,
>>
>> Thanks already, nick
>>
>>
>> On 03/12/11 01:20, Frederic Marmond wrote:
>> > Hello Nicolas,
>> >
>> > Yes, it's the right place :)
>> > could you please paste your code as well as your benchmark context ?
>> >
>> > Fred
>> >
>> > 2011/3/11 Nicolas Bock <nicolasbock@xxxxxxxxx>
>> >
>> >     Hello list,
>> >
>> >     I am writing an assembly function that multiplies 2 4x4 single precision
>> >     matrices. I wrote 2 versions, one using SSE the other using SSE4.1. What
>> >     surprised me is that the SSE4.1 version fails to beat the SSE version,
>> >     it is in fact slightly slower.
>> >
>> >     Is this the right place to ask for help? If anyone is interested I can
>> >     post some code which would maybe clarify the situation a bit.
>> >
>> >     If this is not the right place, please ignore me...
>> >
>> >     nick
>> >
>> >
>
