Re: Missed optimization opportunity wrt load chains

Mason <slash.tmp@xxxxxxx> · Wed, 20 Sep 2017 22:36:34 +0200

On 20/09/2017 19:36, Mikhail Maltsev wrote:

> But at least it can be enabled manually:
> 
> $ gcc test.c -S -O3 -march=skylake --param tree-reassoc-width=2
> Produces the following code:
> 
> foo: 
>        movl    12(%rsi), %eax 
>        movl    28(%rsi), %edx 
>        addl    4(%rsi), %eax 
>        addl    20(%rsi), %edx 
>        addl    %edx, %eax 
>        movl    12(%rdi), %edx 
>        addl    (%rdi), %edx 
>        addl    %edx, %eax 
>        movl    36(%rdi), %edx 
>        addl    24(%rdi), %edx 
>        addl    %edx, %eax 
>        ret

Thanks for pointing out that particular knob, I'll play with it
for my actual use-case (a much longer RMW chain).

The generated code is not quite as nice as my hand-written code,
but I think it achieves the same latency, with the help of OOE.
On typical x86 implementations, where L1 load = 3 cycles, and
RMW op = 3+1 cycles, this code would execute in 8 cycles.

	movl	4*0(%rdi), %eax  # START=0 END=3
	movl	4*1(%rsi), %edx  # START=0 END=3
	addl	4*3(%rdi), %eax  # START=1 END=5
	addl	4*3(%rsi), %edx  # START=1 END=5
	addl	4*6(%rdi), %eax  # START=2 END=6
	addl	4*5(%rsi), %edx  # START=2 END=6
	addl	4*9(%rdi), %eax  # START=3 END=7
	addl	4*7(%rsi), %edx  # START=3 END=7
	addl	%edx, %eax       # START=7 END=8

VS

       movl    12(%rsi), %eax  # START=0 END=3
       movl    28(%rsi), %edx  # START=0 END=3
       addl    4(%rsi), %eax   # START=1 END=5
       addl    20(%rsi), %edx  # START=1 END=5
       addl    %edx, %eax      # START=5 END=6
       movl    12(%rdi), %edx  # START=2 END=5
       addl    (%rdi), %edx    # START=2 END=6
       addl    %edx, %eax      # START=6 END=7
       movl    36(%rdi), %edx  # START=3 END=6
       addl    24(%rdi), %edx  # START=3 END=7
       addl    %edx, %eax      # START=7 END=8

Regards.

> On Wed, Sep 20, 2017 at 8:33 PM, Jeff Law wrote:
> 
>     On 09/20/2017 09:54 AM, Mason wrote:
>     > Hello,
>     >
>     > Consider the following test case.
>     >
>     > typedef unsigned int u32;
>     > u32 foo(const u32 *u, const u32 *v)
>     > {
>     >       u32 t0 = u[0] + u[3] + u[6] + u[9];
>     >       u32 t1 = v[1] + v[3] + v[5] + v[7];
>     >       return t0 + t1;
>     > }
>     >
>     > AFAIU, for several years, x86 implementations have been able
>     > to issue two loads per cycle, and I expected gcc to compute
>     > t0 and t1 in parallel. But instead, it creates a single
>     > dependency chain.
>     >
>     > $ gcc-7 -march=skylake -O3 -S testcase.c
>     >
>     > foo:
>     >       movl    12(%rsi), %eax
>     >       addl    4(%rsi), %eax
>     >       addl    20(%rsi), %eax
>     >       addl    28(%rsi), %eax
>     >       addl    (%rdi), %eax
>     >       addl    12(%rdi), %eax
>     >       addl    24(%rdi), %eax
>     >       addl    36(%rdi), %eax
>     >       ret
>     >
>     > I don't think this code would benefit from SSE or auto-vectorization.
>     > But computing t0 and t1 in parallel might give a non-trivial speedup,
>     > especially for longer chains. What do you think?
>
>     It should.  However, the re-association pass has comments that indicate
>     that these situations are fairly rare in practice.  As a result it just
>     punts these chains given the cost in complexity to get them right
>     (particularly when you include the interactions with CSE) it just punts.
> 
>     jeff