Re: Regular gcc benchmark runs for sparse-matrix vector multiplication?

On 12/15/18 17:50, Harald Anlauf wrote:
>> 1. Do you have a small test case that shows the problem ?
> 
> Not yet.  I'd need to sample from the data (matrices) used,
> in case it turns out to be due to different tuning in gcc-9.

I have been able to reduce my application so that I better
understand when the apparent performance degradation shows up.
For the code below, performance is very similar for gcc 7, 8,
and 9 if no bounds-checking is used.  (I use bounds-checking
for development).  However, if bounds-checking is enabled,
I am seeing roughly the following penalty:

gcc 7, 8: +60% runtime
gcc 9:    +90% runtime

Thus, with bounds-checking enabled, gcc 9 is roughly 20-25% slower
than gcc 7 and 8, which I find hard to understand.  To my untrained
eyes, the -fdump-tree-original output is essentially the same for
all 3 compiler versions, but the -fdump-tree-optimized output shows
significant differences between 9 and earlier versions.

Here's the code and compiler options used:

module csc
  implicit none
  integer, parameter :: sp = 4, dp = 8, mp = sp, wp = dp, ip = 4
contains
  subroutine csc_times_vector (a, ja, ia, x, y, n)
  real(mp)   ,intent(in)    ::  a (:) ! coefficients of matrix A
  integer(ip),intent(in)    :: ja (:) ! row indices  of matrix A
  integer    ,intent(in)    :: ia (:) ! indices to a,ja for column indices
  real(wp)   ,intent(in)    ::  x (:) ! right hand side
  real(wp)   ,intent(inout) ::  y (:) ! left hand side
  integer    ,intent(in)    ::  n     ! number of columns
    integer :: i, j, k
    do j=1,n                    ! Outer loop j: columns of A
!CDIR ALTCODE=LOOPCNT
!CDIR NODEP
!DIR$ IVDEP
      do k = ia(j), ia(j+1)-1   ! Inner loop k: rows of (sparse) A
        i = ja(k)               ! (the i's are distinct for different j's)
        y(i) = y(i) + a(k) * x(j)
      end do
    end do
  end subroutine csc_times_vector
end module csc


FFLAGS="-O2 -g -march=skylake -mfpmath=sse -ftree-vectorize
-funroll-loops -fno-realloc-lhs -fopt-info -fcheck=bounds"
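
To show how the routine is called, here is a minimal standalone sketch.
The 3x3 matrix, the program name csc_demo and the file name used further
down are made up for illustration; this is not the real test data and
it is far too small to show any timing difference.  The module is
repeated without the vendor directives only so that the file compiles
on its own.

```fortran
! Illustrative sketch only: the matrix below is made up, not real test data.
module csc
  implicit none
  integer, parameter :: sp = 4, dp = 8, mp = sp, wp = dp, ip = 4
contains
  subroutine csc_times_vector (a, ja, ia, x, y, n)
    real(mp)   ,intent(in)    ::  a (:) ! coefficients of matrix A
    integer(ip),intent(in)    :: ja (:) ! row indices of matrix A
    integer    ,intent(in)    :: ia (:) ! start of each column in a, ja
    real(wp)   ,intent(in)    ::  x (:) ! right hand side
    real(wp)   ,intent(inout) ::  y (:) ! left hand side
    integer    ,intent(in)    ::  n     ! number of columns
    integer :: i, j, k
    do j = 1, n
      do k = ia(j), ia(j+1)-1
        i = ja(k)
        y(i) = y(i) + a(k) * x(j)
      end do
    end do
  end subroutine csc_times_vector
end module csc

program csc_demo                 ! hypothetical driver name
  use csc
  implicit none
  integer, parameter :: n = 3
  ! A = [ 1 0 4 ]
  !     [ 0 3 0 ]
  !     [ 2 0 5 ]   stored column by column
  real(mp)    :: a(5)    = [1.0_mp, 2.0_mp, 3.0_mp, 4.0_mp, 5.0_mp]
  integer(ip) :: ja(5)   = [1_ip, 3_ip, 2_ip, 1_ip, 3_ip]  ! row of each a(k)
  integer     :: ia(n+1) = [1, 3, 4, 6]   ! ia(j)..ia(j+1)-1 spans column j
  real(wp)    :: x(n)    = [1.0_wp, 2.0_wp, 3.0_wp]
  real(wp)    :: y(n)

  y = 0.0_wp                     ! y is intent(inout), so start from zero
  call csc_times_vector (a, ja, ia, x, y, n)
  print '(3f8.2)', y             ! y = A*x
end program csc_demo
```

Saved as, say, csc_demo.f90, this can be built with
"gfortran -O2 -fcheck=bounds csc_demo.f90" and should print
13.00, 6.00 and 17.00.  Note that ia needs n+1 entries, since the
inner loop reads ia(j+1) for the last column.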

If there's interest, I can create a bugzilla with test program
and test data.  If people think that bounds-checking must be
expensive, then I will not waste anybody's time.

Thanks,
Harald



