On 18/07/16 13:56, Tvrtko Ursulin wrote:
>
> On 18/07/16 12:57, Dave Gordon wrote:
>> On 18/07/16 12:35, Chris Wilson wrote:
>>> On Mon, Jul 18, 2016 at 12:15:32PM +0100, Tvrtko Ursulin wrote:
>>>> I am not sure about this, but looking at the raid6 code for
>>>> example, it has a lot more annotations in cases like this.
>>>>
>>>> It seems to be telling the compiler which memory ranges each
>>>> instruction accesses, and it also uses "asm volatile" - whether or
>>>> not that is really needed I don't know.
>>>>
>>>> For example:
>>>>     asm volatile("movdqa %0,%%xmm4" :: "m" (dptr[z0][d]));
>>>>
>>>> And:
>>>>     asm volatile("movdqa %%xmm4,%0" : "=m" (q[d]));
>>>>
>>>> Each one is telling the compiler the instruction is either reading
>>>> or writing, respectively, from a certain memory address.
>>>>
>>>> You don't have any of that, and don't even specify anything as an
>>>> output parameter, so I am not sure if your code is safe.
>>>
>>> The asm is correct. We do not modify either of the two pointers
>>> which we pass in via register inputs, but the memory behind them -
>>> hence the memory clobber.
>>
>> This is a choice of how much we let the compiler decide about
>> addressing, and how much we tell it about what the asm code really
>> does. The examples above get the compiler to generate *any* suitable
>> addressing mode for each specific location involved in the
>> transfers, so the compiler knows a lot about what's happening and
>> can track where each datum comes from and goes to.
>>
>> OTOH Chris' code
>>
>> +	asm("movntdqa (%0), %%xmm0\n"
>> +	    "movntdqa 16(%0), %%xmm1\n"
>> +	    "movntdqa 32(%0), %%xmm2\n"
>> +	    "movntdqa 48(%0), %%xmm3\n"
>> +	    "movaps %%xmm0, (%1)\n"
>> +	    "movaps %%xmm1, 16(%1)\n"
>> +	    "movaps %%xmm2, 32(%1)\n"
>> +	    "movaps %%xmm3, 48(%1)\n"
>> +	    :: "r" (src), "r" (dst) : "memory");
>>
>> - doesn't need "volatile", because asm statements that have no
>>   output operands are implicitly volatile.
>>
>> - makes the compiler give us the source and destination *addresses*
>>   in a register each; beyond that, it doesn't know what we're doing
>>   with them, so the third ("clobbers") parameter has to say
>>   "memory", i.e. treat *all* memory contents as unknown after this.
>>
>> [[From the GCC docs: The "memory" clobber tells the compiler that
>> the assembly code performs memory reads or writes to items other
>> than those listed in the input and output operands (for example,
>> accessing the memory pointed to by one of the input parameters). To
>> ensure memory contains correct values, GCC may need to flush
>> specific register values to memory before executing the asm.
>> Further, the compiler does not assume that any values read from
>> memory before an asm remain unchanged after that asm; it reloads
>> them as needed. Using the "memory" clobber effectively forms a
>> read/write memory barrier for the compiler.]]
>>
>> BTW, should we not tell it we've *also* clobbered %xmm[0-3]?
>>
>> So they're both correct, just taking different approaches. I don't
>> know which would give the best performance for this specific case.
>
> Cool, learn something new every day. :)
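On the %xmm[0-3] clobber question above: as far as I understand it,
this should not matter in practice, since the kernel is built with
-mno-sse, so GCC never allocates %xmm registers for its own code, and
kernel_fpu_begin()/kernel_fpu_end() saves and restores whatever FPU
state the surrounding context owned. If we did want to spell the
clobbers out anyway, it would look something like the below (untested
sketch - I am not even certain GCC accepts %xmm clobber names in a
-mno-sse build):

	asm("movntdqa (%0), %%xmm0\n"
	    "movntdqa 16(%0), %%xmm1\n"
	    "movntdqa 32(%0), %%xmm2\n"
	    "movntdqa 48(%0), %%xmm3\n"
	    "movaps %%xmm0, (%1)\n"
	    "movaps %%xmm1, 16(%1)\n"
	    "movaps %%xmm2, 32(%1)\n"
	    "movaps %%xmm3, 48(%1)\n"
	    :: "r" (src), "r" (dst)
	    : "memory", "xmm0", "xmm1", "xmm2", "xmm3");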
>
> I've tried writing it as:
>
> struct qw2 {
> 	u64 q[2];
> } __attribute__((packed));
>
> static void __memcpy_ntdqa(struct qw2 *dst, const struct qw2 *src,
> 			   unsigned long len)
> {
> 	kernel_fpu_begin();
>
> 	len >>= 4;
> 	while (len >= 4) {
> 		asm("movntdqa (%0), %%xmm0" :: "r" (src), "m" (src[0]));
> 		asm("movntdqa 16(%0), %%xmm1" :: "r" (src), "m" (src[1]));
> 		asm("movntdqa 32(%0), %%xmm2" :: "r" (src), "m" (src[2]));
> 		asm("movntdqa 48(%0), %%xmm3" :: "r" (src), "m" (src[3]));
> 		asm("movaps %%xmm0, (%1)" : "=m" (dst[0]) : "r" (dst));
> 		asm("movaps %%xmm1, 16(%1)" : "=m" (dst[1]) : "r" (dst));
> 		asm("movaps %%xmm2, 32(%1)" : "=m" (dst[2]) : "r" (dst));
> 		asm("movaps %%xmm3, 48(%1)" : "=m" (dst[3]) : "r" (dst));
> 		src += 4;
> 		dst += 4;
> 		len -= 4;
> 	}
> 	while (len--) {
> 		asm("movntdqa (%0), %%xmm0" :: "r" (src), "m" (src[0]));
> 		asm("movaps %%xmm0, (%1)" : "=m" (dst[0]) : "r" (dst));
> 		src++;
> 		dst++;
> 	}
>
> 	kernel_fpu_end();
> }
>
> That appears to allow GCC to interleave SSE and normal instructions;
> presumably that means it is trying to utilize the execution units
> better?
>
> I wonder if it makes a difference in speed?
>
> Old code main loop assembly looks like:
>
>   58:	66 0f 38 2a 00		movntdqa (%rax),%xmm0
>   5d:	66 0f 38 2a 48 10	movntdqa 0x10(%rax),%xmm1
>   63:	66 0f 38 2a 50 20	movntdqa 0x20(%rax),%xmm2
>   69:	66 0f 38 2a 58 30	movntdqa 0x30(%rax),%xmm3
>   6f:	0f 29 01		movaps %xmm0,(%rcx)
>   72:	0f 29 49 10		movaps %xmm1,0x10(%rcx)
>   76:	0f 29 51 20		movaps %xmm2,0x20(%rcx)
>   7a:	0f 29 59 30		movaps %xmm3,0x30(%rcx)
>   7e:	49 83 e8 04		sub $0x4,%r8
>   82:	48 83 c0 40		add $0x40,%rax
>   86:	48 83 c1 40		add $0x40,%rcx
>   8a:	49 83 f8 03		cmp $0x3,%r8
>   8e:	77 c8			ja 58 <i915_memcpy_from_wc+0x58>
>
> While the above version generates:
>
>   58:	66 0f 38 2a 00		movntdqa (%rax),%xmm0
>   5d:	66 0f 38 2a 48 10	movntdqa 0x10(%rax),%xmm1
>   63:	66 0f 38 2a 50 20	movntdqa 0x20(%rax),%xmm2
>   69:	66 0f 38 2a 58 30	movntdqa 0x30(%rax),%xmm3
>   6f:	49 83 e8 04		sub $0x4,%r8
>   73:	48 83 c0 40		add $0x40,%rax
>   77:	0f 29 01		movaps %xmm0,(%rcx)
>   7a:	0f 29 49 10		movaps %xmm1,0x10(%rcx)
>   7e:	0f 29 51 20		movaps %xmm2,0x20(%rcx)
>   82:	0f 29 59 30		movaps %xmm3,0x30(%rcx)
>   86:	48 83 c1 40		add $0x40,%rcx
>   8a:	49 83 f8 03		cmp $0x3,%r8
>   8e:	77 c8			ja 58 <i915_memcpy_from_wc+0x58>
>
> Interestingly, in both cases GCC does some, to my mind futile,
> shuffling around between the two loops. Instead of just carrying on
> with src, dst and len as they are, it switches to a different
> register set for the second loop.
>
> So this reshuffling:
>
>   90:	48 8d 42 fc		lea -0x4(%rdx),%rax
>   94:	83 e2 03		and $0x3,%edx
>   97:	48 c1 e8 02		shr $0x2,%rax
>   9b:	48 83 c0 01		add $0x1,%rax
>   9f:	48 c1 e0 06		shl $0x6,%rax
>   a3:	48 01 c6		add %rax,%rsi
>   a6:	48 01 c7		add %rax,%rdi
>   a9:	48 8d 42 ff		lea -0x1(%rdx),%rax
>   ad:	48 85 d2		test %rdx,%rdx
>   b0:	74 1a			je cc <i915_memcpy_from_wc+0xcc>
>
> And then the second loop:
>
>   b2:	66 0f 38 2a 06		movntdqa (%rsi),%xmm0
>   b7:	48 83 e8 01		sub $0x1,%rax
>   bb:	48 83 c6 10		add $0x10,%rsi
>   bf:	0f 29 07		movaps %xmm0,(%rdi)
>   c2:	48 83 c7 10		add $0x10,%rdi
>   c6:	48 83 f8 ff		cmp $0xffffffffffffffff,%rax
>   ca:	75 e6			jne b2 <i915_memcpy_from_wc+0xb2>
>
> Any thoughts on this?
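On the speed question in the quoted mail: I have not measured it yet.
A quick-and-dirty check could look something like the below (untested
sketch; dst/src/len here are whatever buffers the caller has, and a
single sample like this is of course very noisy):

	u64 t0, t1;

	t0 = ktime_get_ns();
	__memcpy_ntdqa(dst, src, len);
	t1 = ktime_get_ns();
	pr_info("copy of %lu bytes took %llu ns\n", len, t1 - t0);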
As far as I can tell, the reshuffling is GCC rewriting the first loop
onto its own induction variables (%rax/%rcx above) and then
recomputing src and dst for the tail loop from the original len,
rather than carrying on with the pointers it has already incremented.

This version generates the smallest code:

static void __memcpy_ntdqa(struct qw2 *dst, const struct qw2 *src,
			   unsigned long len)
{
	unsigned long l4;

	kernel_fpu_begin();

	l4 = len / 4;
	while (l4) {
		asm("movntdqa (%0), %%xmm0" :: "r" (src), "m" (src[0]));
		asm("movntdqa 16(%0), %%xmm1" :: "r" (src), "m" (src[1]));
		asm("movntdqa 32(%0), %%xmm2" :: "r" (src), "m" (src[2]));
		asm("movntdqa 48(%0), %%xmm3" :: "r" (src), "m" (src[3]));
		asm("movaps %%xmm0, (%1)" : "=m" (dst[0]) : "r" (dst));
		asm("movaps %%xmm1, 16(%1)" : "=m" (dst[1]) : "r" (dst));
		asm("movaps %%xmm2, 32(%1)" : "=m" (dst[2]) : "r" (dst));
		asm("movaps %%xmm3, 48(%1)" : "=m" (dst[3]) : "r" (dst));
		src += 4;
		dst += 4;
		l4--;
	}

	len %= 4;
	while (len) {
		asm("movntdqa (%0), %%xmm0" :: "r" (src), "m" (src[0]));
		asm("movaps %%xmm0, (%1)" : "=m" (dst[0]) : "r" (dst));
		src++;
		dst++;
		len--;
	}

	kernel_fpu_end();
}

Although I still haven't figured out a way to convince it to use the
same registers for src and dst between the two loops.
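One more idea I have not tried properly: drive both loops off the same
src/dst pointers with precomputed end markers, so there is no pointer
arithmetic left for GCC to rematerialize between the loops. An
untested sketch, with the same qw2 layout and len counted in 16-byte
units as above:

static void __memcpy_ntdqa(struct qw2 *dst, const struct qw2 *src,
			   unsigned long len)
{
	const struct qw2 *end;

	kernel_fpu_begin();

	/* Bulk loop: four 16-byte transfers per iteration. */
	end = src + (len & ~3UL);
	while (src < end) {
		asm("movntdqa (%0), %%xmm0" :: "r" (src), "m" (src[0]));
		asm("movntdqa 16(%0), %%xmm1" :: "r" (src), "m" (src[1]));
		asm("movntdqa 32(%0), %%xmm2" :: "r" (src), "m" (src[2]));
		asm("movntdqa 48(%0), %%xmm3" :: "r" (src), "m" (src[3]));
		asm("movaps %%xmm0, (%1)" : "=m" (dst[0]) : "r" (dst));
		asm("movaps %%xmm1, 16(%1)" : "=m" (dst[1]) : "r" (dst));
		asm("movaps %%xmm2, 32(%1)" : "=m" (dst[2]) : "r" (dst));
		asm("movaps %%xmm3, 48(%1)" : "=m" (dst[3]) : "r" (dst));
		src += 4;
		dst += 4;
	}

	/* Tail loop: the remaining 0-3 transfers, reusing src/dst. */
	end = src + (len & 3);
	while (src < end) {
		asm("movntdqa (%0), %%xmm0" :: "r" (src), "m" (src[0]));
		asm("movaps %%xmm0, (%1)" : "=m" (dst[0]) : "r" (dst));
		src++;
		dst++;
	}

	kernel_fpu_end();
}

Regards,

Tvrtko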