On 09/20/2017 09:54 AM, Mason wrote: > Hello, > > Consider the following test case. > > typedef unsigned int u32; > u32 foo(const u32 *u, const u32 *v) > { > u32 t0 = u[0] + u[3] + u[6] + u[9]; > u32 t1 = v[1] + v[3] + v[5] + v[7]; > return t0 + t1; > } > > AFAIU, for several years, x86 implementations have been able > to issue two loads per cycle, and I expected gcc to compute > t0 and t1 in parallel. But instead, it creates a single > dependency chain. > > $ gcc-7 -march=skylake -O3 -S testcase.c > > foo: > movl 12(%rsi), %eax > addl 4(%rsi), %eax > addl 20(%rsi), %eax > addl 28(%rsi), %eax > addl (%rdi), %eax > addl 12(%rdi), %eax > addl 24(%rdi), %eax > addl 36(%rdi), %eax > ret > > I don't think this code would benefit from SSE or auto-vectorization. > But computing t0 and t1 in parallel might give a non-trivial speedup, > especially for longer chains. What do you think? It should. However, the reassociation pass has comments that indicate that these situations are fairly rare in practice. As a result it just punts these chains given the cost in complexity to get them right (particularly when you include the interactions with CSE) it just punts. jeff