Re: [PATCH] drm/i915: Use SSE4.1 movntdqa to accelerate reads from WC memory

On 18/07/16 13:56, Tvrtko Ursulin wrote:

On 18/07/16 12:57, Dave Gordon wrote:
On 18/07/16 12:35, Chris Wilson wrote:
On Mon, Jul 18, 2016 at 12:15:32PM +0100, Tvrtko Ursulin wrote:
I am not sure about this, but looking at the raid6 code for example, it
has a lot more annotations in cases like this.

It seems to be telling the compiler which memory ranges each
instruction accesses, and it also uses "asm volatile" - whether or not
that is really needed I don't know.

For example:
                  asm volatile("movdqa %0,%%xmm4" :: "m" (dptr[z0][d]));

And:
                  asm volatile("movdqa %%xmm4,%0" : "=m" (q[d]));

Each one is telling the compiler the instruction is either reading
or writing respectively from a certain memory address.

You don't have any of that, and don't even specify anything as an
output parameter, so I am not sure if your code is safe.

The asm is correct. We do not modify either of the two pointers which we
pass in via register inputs, only the memory behind them - hence the
memory clobber.

This is a choice of how much we let the compiler decide about
addressing, and how much we tell it about what the asm code really does.
The examples above get the compiler to generate *any* suitable
addressing mode for each specific location involved in the transfers, so
the compiler knows a lot about what's happening and can track where each
datum comes from and goes to.
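For illustration (this isn't in the patch), a single 16-byte transfer
written in that style might look something like the sketch below,
assuming src and dst point at some 16-byte type (a struct of two u64s,
say):

	/* Sketch only: the "m" operands let GCC pick the addressing mode
	 * and tell it exactly which memory is read and written, so no
	 * blanket "memory" clobber is needed.  movntdqa still requires
	 * the source to be 16-byte aligned. */
	asm volatile("movntdqa %1, %%xmm0\n\t"
		     "movaps %%xmm0, %0"
		     : "=m" (dst[0])
		     : "m" (src[0])
		     : "xmm0");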

OTOH Chris' code

+        asm("movntdqa   (%0), %%xmm0\n"
+            "movntdqa 16(%0), %%xmm1\n"
+            "movntdqa 32(%0), %%xmm2\n"
+            "movntdqa 48(%0), %%xmm3\n"
+            "movaps %%xmm0,   (%1)\n"
+            "movaps %%xmm1, 16(%1)\n"
+            "movaps %%xmm2, 32(%1)\n"
+            "movaps %%xmm3, 48(%1)\n"
+            :: "r" (src), "r" (dst) : "memory");

- doesn't need "volatile" because asm statements that have no output
operands are implicitly volatile.

- makes the compiler give us the source and destination *addresses* in a
register each; beyond that, it doesn't know what we're doing with them,
so the third ("clobbers") parameter has to say "memory" i.e. treat *all*
memory contents as unknown after this.

[[From GCC docs: The "memory" clobber tells the compiler that the
assembly code performs memory reads or writes to items other than those
listed in the input and output operands (for example, accessing the
memory pointed to by one of the input parameters). To ensure memory
contains correct values, GCC may need to flush specific register values
to memory before executing the asm. Further, the compiler does not
assume that any values read from memory before an asm remain unchanged
after that asm; it reloads them as needed. Using the "memory" clobber
effectively forms a read/write memory barrier for the compiler.]]

BTW, should we not tell it we've *also* clobbered %xmm[0-3]?

So they're both correct, just taking different approaches. I don't know
which would give the best performance for this specific case.
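For completeness, declaring the xmm clobbers is just a matter of naming
the registers in the clobber list; an untested variant of Chris' block
would read:

        asm("movntdqa   (%0), %%xmm0\n"
            "movntdqa 16(%0), %%xmm1\n"
            "movntdqa 32(%0), %%xmm2\n"
            "movntdqa 48(%0), %%xmm3\n"
            "movaps %%xmm0,   (%1)\n"
            "movaps %%xmm1, 16(%1)\n"
            "movaps %%xmm2, 32(%1)\n"
            "movaps %%xmm3, 48(%1)\n"
            :: "r" (src), "r" (dst)
            : "memory", "xmm0", "xmm1", "xmm2", "xmm3");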

Cool, learn something new every day. :)

I've tried writing it as:

struct qw2 {
	u64	q[2];
} __attribute__((packed));

static void __memcpy_ntdqa(struct qw2 *dst, const struct qw2 *src, unsigned long len)
{
	kernel_fpu_begin();

	len >>= 4;
	while (len >= 4) {
		asm("movntdqa   (%0), %%xmm0" :: "r" (src), "m" (src[0]));
		asm("movntdqa 16(%0), %%xmm1" :: "r" (src), "m" (src[1]));

Couldn't this be just:
		asm("movntdqa %1, %%xmm1" :: "r" (src), "m" (src[1]));
thus letting the compiler supply the offset?

Does the compiler know about %xmm* registers? If so, you could maybe get it to choose which one to use in each instruction, or you should at least tell it which ones are being clobbered.

		asm("movntdqa 32(%0), %%xmm2" :: "r" (src), "m" (src[2]));
		asm("movntdqa 48(%0), %%xmm3" :: "r" (src), "m" (src[3]));
		asm("movaps %%xmm0,   (%1)" : "=m" (dst[0]) : "r" (dst));
		asm("movaps %%xmm1, 16(%1)" : "=m" (dst[1]) : "r" (dst));
		asm("movaps %%xmm2, 32(%1)" : "=m" (dst[2]) : "r" (dst));
		asm("movaps %%xmm3, 48(%1)" : "=m" (dst[3]) : "r" (dst));
		src += 4;
		dst += 4;
		len -= 4;
	}
	while (len--) {
		asm("movntdqa (%0), %%xmm0" :: "r" (src), "m" (src[0]));
		asm("movaps %%xmm0, (%1)" : "=m" (dst[0]) : "r" (dst));
		src++;
		dst++;
	}

	kernel_fpu_end();
}

That appears to allow GCC to interleave SSE and normal instructions;
presumably that means it is trying to utilize the execution units better?

I wonder if it makes a difference in speed?
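If anyone wants to measure it, a very crude way to compare the two
builds would be to time a largish copy in a loop - a rough sketch only
(dst, src and len here are whatever test buffers you have lying around,
and ktime resolution means you want plenty of iterations):

	u64 t0, t1;
	int i;

	t0 = ktime_get_ns();
	for (i = 0; i < 1000; i++)
		__memcpy_ntdqa(dst, src, len);
	t1 = ktime_get_ns();
	pr_info("ntdqa copy: %llu ns per %lu byte copy\n",
		div_u64(t1 - t0, 1000), len);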

Old code main loop assembly looks like:

   58:   66 0f 38 2a 00          movntdqa (%rax),%xmm0
   5d:   66 0f 38 2a 48 10       movntdqa 0x10(%rax),%xmm1
   63:   66 0f 38 2a 50 20       movntdqa 0x20(%rax),%xmm2
   69:   66 0f 38 2a 58 30       movntdqa 0x30(%rax),%xmm3
   6f:   0f 29 01                movaps %xmm0,(%rcx)
   72:   0f 29 49 10             movaps %xmm1,0x10(%rcx)
   76:   0f 29 51 20             movaps %xmm2,0x20(%rcx)
   7a:   0f 29 59 30             movaps %xmm3,0x30(%rcx)
   7e:   49 83 e8 04             sub    $0x4,%r8
   82:   48 83 c0 40             add    $0x40,%rax
   86:   48 83 c1 40             add    $0x40,%rcx
   8a:   49 83 f8 03             cmp    $0x3,%r8
   8e:   77 c8                   ja     58 <i915_memcpy_from_wc+0x58>

While the above version generates:

   58:   66 0f 38 2a 00          movntdqa (%rax),%xmm0
   5d:   66 0f 38 2a 48 10       movntdqa 0x10(%rax),%xmm1
   63:   66 0f 38 2a 50 20       movntdqa 0x20(%rax),%xmm2
   69:   66 0f 38 2a 58 30       movntdqa 0x30(%rax),%xmm3
   6f:   49 83 e8 04             sub    $0x4,%r8
   73:   48 83 c0 40             add    $0x40,%rax
   77:   0f 29 01                movaps %xmm0,(%rcx)
   7a:   0f 29 49 10             movaps %xmm1,0x10(%rcx)
   7e:   0f 29 51 20             movaps %xmm2,0x20(%rcx)
   82:   0f 29 59 30             movaps %xmm3,0x30(%rcx)
   86:   48 83 c1 40             add    $0x40,%rcx
   8a:   49 83 f8 03             cmp    $0x3,%r8
   8e:   77 c8                   ja     58 <i915_memcpy_from_wc+0x58>

Perhaps marginally better as it separates the load and store more, but I don't think a couple of integer operations on registers are actually going to make any difference.

Interestingly, in both cases GCC does some (to my mind futile)
shuffling around between the two loops. Instead of just carrying on
with src, dst and len as they are, it switches to a different register
set for the second loop:

So this reshuffling:

   90:   48 8d 42 fc             lea    -0x4(%rdx),%rax
   94:   83 e2 03                and    $0x3,%edx
   97:   48 c1 e8 02             shr    $0x2,%rax
   9b:   48 83 c0 01             add    $0x1,%rax
   9f:   48 c1 e0 06             shl    $0x6,%rax
   a3:   48 01 c6                add    %rax,%rsi
   a6:   48 01 c7                add    %rax,%rdi
   a9:   48 8d 42 ff             lea    -0x1(%rdx),%rax
   ad:   48 85 d2                test   %rdx,%rdx
   b0:   74 1a                   je     cc <i915_memcpy_from_wc+0xcc>

Strange ... more investigation required!

.Dave.

And then the second loop:

   b2:   66 0f 38 2a 06          movntdqa (%rsi),%xmm0
   b7:   48 83 e8 01             sub    $0x1,%rax
   bb:   48 83 c6 10             add    $0x10,%rsi
   bf:   0f 29 07                movaps %xmm0,(%rdi)
   c2:   48 83 c7 10             add    $0x10,%rdi
   c6:   48 83 f8 ff             cmp    $0xffffffffffffffff,%rax
   ca:   75 e6                   jne    b2 <i915_memcpy_from_wc+0xb2>

Any thoughts on this?

Regards,
Tvrtko


_______________________________________________
Intel-gfx mailing list
Intel-gfx@xxxxxxxxxxxxxxxxxxxxx
https://lists.freedesktop.org/mailman/listinfo/intel-gfx



