Hi René,

you might be completely right, I have yet to discover a better way of ordering the registers. But I wonder, wouldn't the statement in line 49 coupled with line 52 make sure that dpps is done? The blendps instruction in line 52 cannot be computed unless the result of xmm13 from dpps in line 49 is known. By the time the program hits line 55, all dependencies on xmm8 are gone. I have to admit though that I am just guessing here, I don't think I have a good understanding yet as to how to deal with instruction dependencies...

41  # Calculate C(1,:).
42  movaps %xmm4, %xmm10
43  dpps $0xf1, %xmm8, %xmm10
44  movaps %xmm5, %xmm11
45  dpps $0xf2, %xmm8, %xmm11
46  movaps %xmm6, %xmm12
47  dpps $0xf4, %xmm8, %xmm12
48  movaps %xmm7, %xmm13
49  dpps $0xf8, %xmm8, %xmm13
50  blendps $0x01, %xmm10, %xmm11
51  blendps $0x03, %xmm11, %xmm12
52  blendps $0x07, %xmm12, %xmm13
53  addps %xmm13, %xmm0
54
55  movaps 0x20(A), %xmm8

On Sun, Mar 13, 2011 at 21:08, René Dudfield <renesd@xxxxxxxxx> wrote:
>
> Hi,
>
> I may be completely wrong... but I think you could avoid dependencies with this following block and xmm8? dpps is high latency, so maybe you can do some non-dependent things while it does its business?
>
>   # Calculate C(1,:).
>   movaps %xmm4, %xmm10
>   dpps $0xf1, %xmm8, %xmm10
>   movaps %xmm5, %xmm11
>   dpps $0xf2, %xmm8, %xmm11
>   movaps %xmm6, %xmm12
>   dpps $0xf4, %xmm8, %xmm12
>   movaps %xmm7, %xmm13
>   dpps $0xf8, %xmm8, %xmm13
>   blendps $0x01, %xmm10, %xmm11
>   blendps $0x03, %xmm11, %xmm12
>   blendps $0x07, %xmm12, %xmm13
>   addps %xmm13, %xmm0
>
>   movaps 0x20(A), %xmm8
>
>
> 2011/3/13 Nicolas Bock <nicolasbock@xxxxxxxxx>
>>
>> I have attached a short test project that demonstrates what I am doing.
>>
>> I time this simply with the time function, i.e.
>>
>> $ time ./mul_SSE 100000000
>>
>> real    0m1.037s
>> user    0m1.036s
>> sys     0m0.001s
>>
>> $ time ./mul_SSE4_1 100000000
>>
>> real    0m2.006s
>> user    0m2.003s
>> sys     0m0.002s
>>
>> I assume that I have prepared the A matrix for SSE a little bit by
>> "dilating" the elements into A = { A11, A11, A11, A11, A12, A12, ... },
>> while for SSE4.1 I am calling the multiply with the transpose of B.
>>
>> As these matrices are really small, they should be completely in L1, so
>> the movaps operation should have pretty low latency. Since the SSE
>> version uses 4 times more data for A than the SSE4.1 version, I am
>> surprised that given the larger number of data movements for the SSE
>> version it still beats the SSE4.1 version. But maybe I am just not
>> coding this very intelligently.
>>
>> Any suggestions would be very welcome,
>>
>> Thanks already, nick
>>
>>
>> On 03/12/11 01:20, Frederic Marmond wrote:
>> > Hello Nicolas,
>> >
>> > Yes, it's the right place :)
>> > could you please paste your code as well as your benchmark context ?
>> >
>> > Fred
>> >
>> > 2011/3/11 Nicolas Bock <nicolasbock@xxxxxxxxx
>> > <mailto:nicolasbock@xxxxxxxxx>>
>> >
>> >     Hello list,
>> >
>> >     I am writing an assembly function that multiplies 2 4x4 single precision
>> >     matrices. I wrote 2 versions, one using SSE the other using SSE4.1. What
>> >     surprised me is that the SSE4.1 version fails to beat the SSE version,
>> >     it is in fact slightly slower.
>> >
>> >     Is this the right place to ask for help? If anyone is interested I can
>> >     post some code which would maybe clarify the situation a bit.
>> >
>> >     If this is not the right place, please ignore me...
>> >
>> >     nick
>> >
>> >
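For readers following the thread, here is a rough scalar model in C of what the dpps/blendps sequence in lines 42 to 52 appears to compute. It is only a sketch under assumptions: the names vec4, dpps, blendps and row_times_bt are made up for illustration, xmm8 is assumed to hold one row of A, and xmm4 to xmm7 are assumed to hold the rows of the transposed B. The model follows the documented meaning of the immediates (high nibble of the dpps mask selects the products to sum, low nibble selects the output lanes; each set bit of the blendps mask takes that lane from the source), but it sums in plain left-to-right order, so it is not guaranteed to round exactly like the hardware.

```c
#include <stdio.h>

/* Hypothetical stand-in for one 128-bit xmm register holding 4 floats. */
typedef struct { float v[4]; } vec4;

/* Scalar model of "dpps $imm, src, dst" (AT&T operand order).
 * Bits 4..7 of imm pick which lane products enter the sum,
 * bits 0..3 pick which result lanes receive the sum; other lanes are 0. */
static vec4 dpps(vec4 dst, vec4 src, unsigned imm)
{
    float sum = 0.0f;
    for (int i = 0; i < 4; i++)
        if (imm & (0x10u << i))
            sum += dst.v[i] * src.v[i];

    vec4 r;
    for (int i = 0; i < 4; i++)
        r.v[i] = (imm & (1u << i)) ? sum : 0.0f;
    return r;
}

/* Scalar model of "blendps $imm, src, dst": lane i is taken from src
 * when bit i of imm is set, otherwise dst keeps its own lane. */
static vec4 blendps(vec4 dst, vec4 src, unsigned imm)
{
    for (int i = 0; i < 4; i++)
        if (imm & (1u << i))
            dst.v[i] = src.v[i];
    return dst;
}

/* Mirrors lines 42-52: four dot products of a_row (assumed xmm8) with the
 * rows of B transposed (assumed xmm4..xmm7), merged into one vector. */
static vec4 row_times_bt(vec4 a_row, const vec4 bt[4])
{
    vec4 t0 = dpps(bt[0], a_row, 0xf1);  /* sum lands in lane 0 */
    vec4 t1 = dpps(bt[1], a_row, 0xf2);  /* sum lands in lane 1 */
    vec4 t2 = dpps(bt[2], a_row, 0xf4);  /* sum lands in lane 2 */
    vec4 t3 = dpps(bt[3], a_row, 0xf8);  /* sum lands in lane 3 */

    t1 = blendps(t1, t0, 0x01);          /* needs t0 and t1 */
    t2 = blendps(t2, t1, 0x03);          /* needs t2 and the blend above */
    t3 = blendps(t3, t2, 0x07);          /* needs t3 and the blend above */
    return t3;
}

/* Small helper for inspecting a result by eye. */
static void print_vec4(vec4 x)
{
    printf("%g %g %g %g\n", x.v[0], x.v[1], x.v[2], x.v[3]);
}
```

Calling row_times_bt with a row of A and the four rows of the transposed B and passing the result to print_vec4 shows the four dot products side by side. Written this way, the last blend can only run once all four dpps results exist, which matches the observation above about lines 49 and 52; the model says nothing about latency or about how the hardware schedules the reload of xmm8 in line 55.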