But at least it can be enabled manually:

$ gcc test.c -S -O3 -march=skylake --param tree-reassoc-width=2

This produces the following code:

foo:
        movl    12(%rsi), %eax
        movl    28(%rsi), %edx
        addl    4(%rsi), %eax
        addl    20(%rsi), %edx
        addl    %edx, %eax
        movl    12(%rdi), %edx
        addl    (%rdi), %edx
        addl    %edx, %eax
        movl    36(%rdi), %edx
        addl    24(%rdi), %edx
        addl    %edx, %eax
        ret

On Wed, Sep 20, 2017 at 8:33 PM, Jeff Law <law@xxxxxxxxxx> wrote:
> On 09/20/2017 09:54 AM, Mason wrote:
> > Hello,
> >
> > Consider the following test case.
> >
> > typedef unsigned int u32;
> > u32 foo(const u32 *u, const u32 *v)
> > {
> >       u32 t0 = u[0] + u[3] + u[6] + u[9];
> >       u32 t1 = v[1] + v[3] + v[5] + v[7];
> >       return t0 + t1;
> > }
> >
> > AFAIU, for several years, x86 implementations have been able
> > to issue two loads per cycle, and I expected gcc to compute
> > t0 and t1 in parallel. But instead, it creates a single
> > dependency chain.
> >
> > $ gcc-7 -march=skylake -O3 -S testcase.c
> >
> > foo:
> >       movl    12(%rsi), %eax
> >       addl    4(%rsi), %eax
> >       addl    20(%rsi), %eax
> >       addl    28(%rsi), %eax
> >       addl    (%rdi), %eax
> >       addl    12(%rdi), %eax
> >       addl    24(%rdi), %eax
> >       addl    36(%rdi), %eax
> >       ret
> >
> > I don't think this code would benefit from SSE or auto-vectorization.
> > But computing t0 and t1 in parallel might give a non-trivial speedup,
> > especially for longer chains. What do you think?
> It should.  However, the reassociation pass has comments that indicate
> that these situations are fairly rare in practice.  As a result, it just
> punts on these chains, given the cost in complexity of getting them
> right (particularly when you include the interactions with CSE).
>
> jeff
>

-- 
Regards,
   Mikhail Maltsev
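
For reference, here is a minimal sketch of what a reassociation width of 2 amounts to at the source level: each sum is split into independent pairs, so the partial sums do not depend on each other until the final additions. The names foo_paired and versions_agree are hypothetical and not part of the original test case. The rewrite is value-preserving because unsigned arithmetic wraps modulo 2^N, so regrouping the additions cannot change the result. Whether GCC keeps this grouping in the generated code is not guaranteed, since it is free to reassociate unsigned additions either way.

```c
typedef unsigned int u32;

/* Original test case: each sum is written as one left-to-right chain. */
u32 foo(const u32 *u, const u32 *v)
{
    u32 t0 = u[0] + u[3] + u[6] + u[9];
    u32 t1 = v[1] + v[3] + v[5] + v[7];
    return t0 + t1;
}

/* Hypothetical hand-reassociated variant: the parenthesized pairs are
 * independent of each other, so they could be computed in parallel. */
u32 foo_paired(const u32 *u, const u32 *v)
{
    u32 t0 = (u[0] + u[3]) + (u[6] + u[9]);
    u32 t1 = (v[1] + v[3]) + (v[5] + v[7]);
    return t0 + t1;
}

/* Hypothetical helper: returns 1 if both versions agree on inputs
 * large enough to make the sums wrap around. */
int versions_agree(void)
{
    u32 u[10], v[8];
    u32 i;

    for (i = 0; i < 10; i++)
        u[i] = 0xF0000000u + i;
    for (i = 0; i < 8; i++)
        v[i] = 0x0FFFFFF0u + 3u * i;

    return foo(u, v) == foo_paired(u, v);
}
```

Compiling both functions with -O3 -S, with and without --param tree-reassoc-width=2, and comparing the assembly should show whether the pairing survives on a given GCC version.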