On 20/09/2017 19:36, Mikhail Maltsev wrote: > But at least it can be enabled manually: > > $ gcc test.c -S -O3 -march=skylake --param tree-reassoc-width=2 > Produces the following code: > > foo: > movl 12(%rsi), %eax > movl 28(%rsi), %edx > addl 4(%rsi), %eax > addl 20(%rsi), %edx > addl %edx, %eax > movl 12(%rdi), %edx > addl (%rdi), %edx > addl %edx, %eax > movl 36(%rdi), %edx > addl 24(%rdi), %edx > addl %edx, %eax > ret Thanks for pointing out that particular knob, I'll play with it for my actual use-case (a much longer RMW chain). The generated code is not quite as nice as my hand-written code, but I think it achieves the same latency, with the help of OOE. On typical x86 implementations, where L1 load = 3 cycles, and RMW op = 3+1 cycles, this code would execute in 8 cycles. movl 4*0(%rdi), %eax # START=0 END=3 movl 4*1(%rsi), %edx # START=0 END=3 addl 4*3(%rdi), %eax # START=1 END=5 addl 4*3(%rsi), %edx # START=1 END=5 addl 4*6(%rdi), %eax # START=2 END=6 addl 4*5(%rsi), %edx # START=2 END=6 addl 4*9(%rdi), %eax # START=3 END=7 addl 4*7(%rsi), %edx # START=3 END=7 addl %edx, %eax # START=7 END=8 VS movl 12(%rsi), %eax # START=0 END=3 movl 28(%rsi), %edx # START=0 END=3 addl 4(%rsi), %eax # START=1 END=5 addl 20(%rsi), %edx # START=1 END=5 addl %edx, %eax # START=5 END=6 movl 12(%rdi), %edx # START=2 END=5 addl (%rdi), %edx # START=2 END=6 addl %edx, %eax # START=6 END=7 movl 36(%rdi), %edx # START=3 END=6 addl 24(%rdi), %edx # START=3 END=7 addl %edx, %eax # START=7 END=8 Regards. > On Wed, Sep 20, 2017 at 8:33 PM, Jeff Law wrote: > > On 09/20/2017 09:54 AM, Mason wrote: > > Hello, > > > > Consider the following test case. > > > > typedef unsigned int u32; > > u32 foo(const u32 *u, const u32 *v) > > { > > u32 t0 = u[0] + u[3] + u[6] + u[9]; > > u32 t1 = v[1] + v[3] + v[5] + v[7]; > > return t0 + t1; > > } > > > > AFAIU, for several years, x86 implementations have been able > > to issue two loads per cycle, and I expected gcc to compute > > t0 and t1 in parallel. But instead, it creates a single > > dependency chain. > > > > $ gcc-7 -march=skylake -O3 -S testcase.c > > > > foo: > > movl 12(%rsi), %eax > > addl 4(%rsi), %eax > > addl 20(%rsi), %eax > > addl 28(%rsi), %eax > > addl (%rdi), %eax > > addl 12(%rdi), %eax > > addl 24(%rdi), %eax > > addl 36(%rdi), %eax > > ret > > > > I don't think this code would benefit from SSE or auto-vectorization. > > But computing t0 and t1 in parallel might give a non-trivial speedup, > > especially for longer chains. What do you think? > > It should. However, the re-association pass has comments that indicate > that these situations are fairly rare in practice. As a result it just > punts these chains given the cost in complexity to get them right > (particularly when you include the interactions with CSE) it just punts. > > jeff