Re: Regular gcc benchmark runs for sparse-matrix vector multiplication?

Harald Anlauf <anlauf@xxxxxx> · Sun, 16 Dec 2018 23:14:06 +0100

On 12/15/18 17:50, Harald Anlauf wrote:
>> 1. Do you have a small test case that shows the problem ?
> 
> not yet.  I'd need to sample from the data (matrices) used,
> in case it turns out due to different tuning in gcc-9.

I have been able to reduce my application so that I better
understand when the apparent performance degradation shows up.
For the code below, performance is very similar for gcc 7, 8,
and 9 if no bounds-checking is used.  (I use bounds-checking
for development).  However, if bounds-checking is enabled,
I am seeing roughly the following penalty:

gcc 7, 8: +60% runtime
gcc 9:    +90% runtime

Thus a bounds-checked run takes roughly 20-25% longer with gcc 9
than with gcc 7 or 8, which I find hard to understand.  To my
untrained eyes, the -fdump-tree-original output is essentially the
same for all three compiler versions, but the -fdump-tree-optimized
output shows significant differences between gcc 9 and the earlier
versions.

Here's the code and compiler options used:

module csc
  implicit none
  integer, parameter :: sp = 4, dp = 8, mp = sp, wp = dp, ip = 4
contains
  subroutine csc_times_vector (a, ja, ia, x, y, n)
  real(mp)   ,intent(in)    ::  a (:) ! coefficients of matrix A
  integer(ip),intent(in)    :: ja (:) ! row indices  of matrix A
  integer    ,intent(in)    :: ia (:) ! start index into a,ja for each column
  real(wp)   ,intent(in)    ::  x (:) ! right hand side
  real(wp)   ,intent(inout) ::  y (:) ! left hand side
  integer    ,intent(in)    ::  n     ! number of columns
    integer :: i, j, k
    do j=1,n                    ! Outer loop j: columns of A
!CDIR ALTCODE=LOOPCNT
!CDIR NODEP
!DIR$ IVDEP
      do k = ia(j), ia(j+1)-1   ! Inner loop k: rows i of (sparse) A
        i = ja(k)               ! (the i's are distinct for different j's)
        y(i) = y(i) + a(k) * x(j)
      end do
    end do
  end subroutine csc_times_vector
end module csc

FFLAGS="-O2 -g -march=skylake -mfpmath=sse -ftree-vectorize
-funroll-loops -fno-realloc-lhs -fopt-info -fcheck=bounds"

If there's interest, I can create a bugzilla with test program
and test data.  If people think that bounds-checking must be
expensive, then I will not waste anybody's time.
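
For illustration, a single-file test case could look roughly like the
sketch below.  It repeats the module from above (without the directive
comments) so that it compiles on its own.  The synthetic banded matrix
and the names test_csc, nrep, bw, and i8 are assumptions made for this
sketch; they are not taken from the real application or its data, so
the timings it produces need not match the numbers quoted above.

```fortran
! Sketch of a self-contained test case.  The banded matrix is synthetic
! (an assumption), not sampled from the real application data.
module csc
  implicit none
  integer, parameter :: sp = 4, dp = 8, mp = sp, wp = dp, ip = 4
contains
  subroutine csc_times_vector (a, ja, ia, x, y, n)
    real(mp)   ,intent(in)    ::  a (:) ! coefficients of matrix A
    integer(ip),intent(in)    :: ja (:) ! row indices  of matrix A
    integer    ,intent(in)    :: ia (:) ! column start indices into a, ja
    real(wp)   ,intent(in)    ::  x (:) ! right hand side
    real(wp)   ,intent(inout) ::  y (:) ! left hand side
    integer    ,intent(in)    ::  n     ! number of columns
    integer :: i, j, k
    do j = 1, n
      do k = ia(j), ia(j+1)-1
        i = ja(k)
        y(i) = y(i) + a(k) * x(j)
      end do
    end do
  end subroutine csc_times_vector
end module csc

program test_csc
  use csc
  implicit none
  integer, parameter :: i8   = selected_int_kind (18)
  integer, parameter :: n    = 100000  ! hypothetical matrix order
  integer, parameter :: bw   = 5       ! hypothetical half bandwidth
  integer, parameter :: nrep = 200     ! hypothetical repetition count
  real(mp),    allocatable :: a(:)
  integer(ip), allocatable :: ja(:)
  integer,     allocatable :: ia(:)
  real(wp),    allocatable :: x(:), y(:)
  integer     :: i, j, k, rep
  integer(i8) :: t0, t1, rate

  allocate (a((2*bw+1)*n), ja((2*bw+1)*n), ia(n+1), x(n), y(n))

  ! Build a banded matrix in CSC form with all coefficients equal to 1.
  k = 0
  do j = 1, n
    ia(j) = k + 1
    do i = max (1, j-bw), min (n, j+bw)
      k     = k + 1
      ja(k) = i
      a(k)  = 1.0_mp
    end do
  end do
  ia(n+1) = k + 1

  x = 1.0_wp
  y = 0.0_wp

  call system_clock (t0, rate)
  do rep = 1, nrep
    call csc_times_vector (a(1:k), ja(1:k), ia, x, y, n)
  end do
  call system_clock (t1)

  ! Each pass adds the number of entries in row i to y(i);
  ! an interior row of the band holds 2*bw+1 entries.
  if (abs (y(n/2) - real ((2*bw+1)*nrep, wp)) > 1.0e-6_wp) &
       error stop "unexpected result in interior row"

  print '(a,f8.3,a)', "elapsed: ", real (t1-t0, wp) / real (rate, wp), " s"
end program test_csc
```

Building this once with the FFLAGS above and once without
-fcheck=bounds, for each compiler version, would give the runtimes
to compare.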

Thanks,
Harald