Re: Missed optimization opportunity wrt load chains

Jeff Law <law@xxxxxxxxxx> · Wed, 20 Sep 2017 11:33:20 -0600



On 09/20/2017 09:54 AM, Mason wrote:
> Hello,
> 
> Consider the following test case.
> 
> typedef unsigned int u32;
> u32 foo(const u32 *u, const u32 *v)
> {
> 	u32 t0 = u[0] + u[3] + u[6] + u[9];
> 	u32 t1 = v[1] + v[3] + v[5] + v[7];
> 	return t0 + t1;
> }
> 
> AFAIU, for several years, x86 implementations have been able
> to issue two loads per cycle, and I expected gcc to compute
> t0 and t1 in parallel. But instead, it creates a single
> dependency chain.
> 
> $ gcc-7 -march=skylake -O3 -S testcase.c
> 
> foo:
> 	movl	12(%rsi), %eax
> 	addl	4(%rsi), %eax
> 	addl	20(%rsi), %eax
> 	addl	28(%rsi), %eax
> 	addl	(%rdi), %eax
> 	addl	12(%rdi), %eax
> 	addl	24(%rdi), %eax
> 	addl	36(%rdi), %eax
> 	ret
> 
> I don't think this code would benefit from SSE or auto-vectorization.
> But computing t0 and t1 in parallel might give a non-trivial speedup,
> especially for longer chains. What do you think?
It should.  However, the reassociation pass has comments that indicate
that these situations are fairly rare in practice.  As a result, given
the cost in complexity to get them right (particularly when you include
the interactions with CSE), it just punts on these chains.
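
For illustration, the shape being asked for can be written by hand at
the source level.  This is only a sketch: foo_split is a hypothetical
name, not part of the original test case, and nothing here reflects what
the reassociation pass does internally.  Because unsigned addition
wraps, regrouping the terms cannot change the result, but for the same
reason the compiler remains free to fold everything back into a single
chain, so the generated assembly has to be checked.

```c
#include <stdint.h>

typedef uint32_t u32;

/* Original test case: one expression per input array. */
u32 foo(const u32 *u, const u32 *v)
{
	u32 t0 = u[0] + u[3] + u[6] + u[9];
	u32 t1 = v[1] + v[3] + v[5] + v[7];
	return t0 + t1;
}

/*
 * Hypothetical hand-split variant: four independent partial sums
 * combined in a balanced tree, so no add waits on more than two
 * earlier adds.  Unsigned arithmetic is modulo 2^32, so this
 * returns the same value as foo() for all inputs.
 */
u32 foo_split(const u32 *u, const u32 *v)
{
	u32 a = u[0] + u[3];
	u32 b = u[6] + u[9];
	u32 c = v[1] + v[3];
	u32 d = v[5] + v[7];
	return (a + b) + (c + d);
}
```

Comparing the output of -S for both functions shows whether the split
survives optimization on a given compiler version and target.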

jeff