Re: Missed optimization opportunity wrt load chains

Mikhail Maltsev <maltsevm@xxxxxxxxx> · Wed, 20 Sep 2017 20:36:26 +0300

But at least it can be enabled manually:

$ gcc test.c -S -O3 -march=skylake --param tree-reassoc-width=2
Produces the following code:

foo:
       movl    12(%rsi), %eax
       movl    28(%rsi), %edx
       addl    4(%rsi), %eax
       addl    20(%rsi), %edx
       addl    %edx, %eax
       movl    12(%rdi), %edx
       addl    (%rdi), %edx
       addl    %edx, %eax
       movl    36(%rdi), %edx
       addl    24(%rdi), %edx
       addl    %edx, %eax
       ret

On Wed, Sep 20, 2017 at 8:33 PM, Jeff Law <law@xxxxxxxxxx> wrote:

> On 09/20/2017 09:54 AM, Mason wrote:
> > Hello,
> >
> > Consider the following test case.
> >
> > typedef unsigned int u32;
> > u32 foo(const u32 *u, const u32 *v)
> > {
> >       u32 t0 = u[0] + u[3] + u[6] + u[9];
> >       u32 t1 = v[1] + v[3] + v[5] + v[7];
> >       return t0 + t1;
> > }
> >
> > AFAIU, for several years, x86 implementations have been able
> > to issue two loads per cycle, and I expected gcc to compute
> > t0 and t1 in parallel. But instead, it creates a single
> > dependency chain.
> >
> > $ gcc-7 -march=skylake -O3 -S testcase.c
> >
> > foo:
> >       movl    12(%rsi), %eax
> >       addl    4(%rsi), %eax
> >       addl    20(%rsi), %eax
> >       addl    28(%rsi), %eax
> >       addl    (%rdi), %eax
> >       addl    12(%rdi), %eax
> >       addl    24(%rdi), %eax
> >       addl    36(%rdi), %eax
> >       ret
> >
> > I don't think this code would benefit from SSE or auto-vectorization.
> > But computing t0 and t1 in parallel might give a non-trivial speedup,
> > especially for longer chains. What do you think?
> It should.  However, the reassociation pass has comments that indicate
> that these situations are fairly rare in practice.  As a result it just
> punts these chains given the cost in complexity to get them right
> (particularly when you include the interactions with CSE) it just punts.
>
> jeff
>

-- 
Regards,
   Mikhail Maltsev