Re: Regular gcc benchmark runs for sparse-matrix vector multiplication?

Richard Biener <richard.guenther@xxxxxxxxx> · Mon, 17 Dec 2018 11:25:37 +0100

On Mon, Dec 17, 2018 at 11:16 AM Richard Biener
<richard.guenther@xxxxxxxxx> wrote:
>
> On Mon, Dec 17, 2018 at 8:34 AM Thomas Koenig <tkoenig@xxxxxxxxxxxxx> wrote:
> >
> > Hi Harald,
> >
> > > If there's interest, I can create a bugzilla with test program
> > > and test data.
> >
> > Please do.
> >
> > In an ideal world, people would always use bounds checking,
> > with almost zero overhead.  This is not realistic, but we should
> > not regress on our way there :-)
>
> GCC 9 IL looks saner than the GCC 7/8 one.  Note both compilers
> have bound checks inside the innermost loop.  The main difference
> seems to be in loop header copying where GCC 9 is behaving
> much "better" IMHO.  It would be interesting to see whether
> -fno-tree-ch brings the results of the compilers back in line (even
> if it makes the code run even slower).

Oh, and -funroll-loops might be an issue as well given the large
number of branches inside the loop body.  Citing non-unrolled
innermost loop body from GCC 9:

.L19:
        testq   %r8, %r8
        jle     .L25
        cmpq    %r8, %r12
        jl      .L26
        movslq  (%rax), %rdx
        testq   %rdx, %rdx
        jle     .L27
        cmpq    %rdx, %r14
        jl      .L28
        cmpq    %r8, %rcx
        jl      .L29
        cmpq    %r11, %r13
        jl      .L30
        imulq   %r10, %rdx
        vxorpd  %xmm0, %xmm0, %xmm0
        vcvtss2sd       (%rsi), %xmm0, %xmm0
        subq    %r10, %rdx
        leaq    (%r15,%rdx,8), %rdx
        vmovsd  (%rdx), %xmm1
        incq    %r8
        vfmadd132sd     (%r9), %xmm1, %xmm0
        addq    %rbp, %rsi
        addq    %rbx, %rax
        vmovsd  %xmm0, (%rdx)
        cmpl    %r8d, %edi
        jg      .L19

The bounds-checking code makes this loop very branch-dense (seven
conditional branches in 26 instructions), and I suspect predictors
don't like that very much.  You might want to look at perf output
that counts branch mispredicts.

For GCC it might make sense to combine tests and branches that lead
to abort()s more (read: very) aggressively.  Maybe the FE can already
do this for the bound checks coming from a single statement?
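
To illustrate the idea, here is a minimal C sketch, not what the FE
or the middle end emits today.  All names (scatter_checked,
scatter_combined and their arguments) are made up, indexing is
0-based rather than Fortran's 1-based, and both functions return -1
where real checking code would call a noreturn error routine.  The
first variant uses one test and branch per check, similar to the
loop above; the second folds the checks of one statement together.

```c
/* Hypothetical sketch: y[col[j]] += (double)a[j] * xi for j in [0, n).
   ny, na and ncol are the element counts of y, a and col. */

/* Variant 1: every bound check is its own test and branch. */
static int scatter_checked(double *y, long ny, const float *a, long na,
                           const int *col, long ncol, double xi, long n)
{
    for (long j = 0; j < n; j++) {
        if (j >= ncol) return -1;   /* col[j] out of bounds */
        long c = col[j];
        if (c < 0) return -1;       /* y[c] below lower bound */
        if (c >= ny) return -1;     /* y[c] above upper bound */
        if (j >= na) return -1;     /* a[j] out of bounds */
        y[c] += (double)a[j] * xi;
    }
    return 0;
}

/* Variant 2: checks that lead to the same error exit are combined. */
static int scatter_combined(double *y, long ny, const float *a, long na,
                            const int *col, long ncol, double xi, long n)
{
    for (long j = 0; j < n; j++) {
        /* Both checks on j, OR-ed without short-circuit evaluation.
           They must come before col[j] is read. */
        if ((j >= ncol) | (j >= na)) return -1;
        long c = col[j];
        /* One unsigned compare covers c < 0 and c >= ny (ny >= 0):
           a negative c becomes a huge unsigned value. */
        if ((unsigned long)c >= (unsigned long)ny) return -1;
        y[c] += (double)a[j] * xi;
    }
    return 0;
}
```

The combined form can no longer tell which bound failed, so the cold
path would have to redo the individual tests to print a precise
message.  Whether this actually reduces mispredicts would need
measuring.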

Richard.

> Richard.
>
> > Regards
> >
> >         Thomas