Missed optimization opportunity wrt load chains

Hello,

Consider the following test case.

typedef unsigned int u32;
u32 foo(const u32 *u, const u32 *v)
{
	u32 t0 = u[0] + u[3] + u[6] + u[9];
	u32 t1 = v[1] + v[3] + v[5] + v[7];
	return t0 + t1;
}

As far as I understand, x86 implementations have been able to
issue two loads per cycle for several years, so I expected gcc
to compute t0 and t1 as two independent dependency chains.
Instead, it folds all eight additions into a single chain
through %eax.

$ gcc-7 -march=skylake -O3 -S testcase.c

foo:
	movl	12(%rsi), %eax
	addl	4(%rsi), %eax
	addl	20(%rsi), %eax
	addl	28(%rsi), %eax
	addl	(%rdi), %eax
	addl	12(%rdi), %eax
	addl	24(%rdi), %eax
	addl	36(%rdi), %eax
	ret

I don't think this code would benefit from SSE or auto-vectorization.
But computing t0 and t1 in parallel might give a non-trivial speedup,
especially for longer chains. What do you think?
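
To make the idea concrete, here is a rough sketch of one way to experiment with it. The name foo_split is mine and purely illustrative. The empty asm statements are a GNU C extension that makes t0 and t1 opaque to the optimizer at that point, so I would expect each sum to be materialized in its own register before the final add. I have not measured this, and I am not suggesting it as the fix, only as a way to compare the two shapes.

```c
typedef unsigned int u32;

/* Original test case: gcc is free to reassociate the two sums
 * into one long chain of adds. */
u32 foo(const u32 *u, const u32 *v)
{
	u32 t0 = u[0] + u[3] + u[6] + u[9];
	u32 t1 = v[1] + v[3] + v[5] + v[7];
	return t0 + t1;
}

/* Hypothetical variant (assumption: GNU C inline asm is available).
 * Each empty asm takes its operand as an in/out register, so the
 * compiler must finish t0 and t1 as separate values before adding
 * them. The result is identical to foo(). */
u32 foo_split(const u32 *u, const u32 *v)
{
	u32 t0 = u[0] + u[3] + u[6] + u[9];
	u32 t1 = v[1] + v[3] + v[5] + v[7];
	__asm__ ("" : "+r"(t0));
	__asm__ ("" : "+r"(t1));
	return t0 + t1;
}
```

Compiling both functions with the same flags as above and comparing the -S output should show whether the second one really ends up with two separate chains.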

Regards.


