On 12/15/18 17:50, Harald Anlauf wrote:
>> 1. Do you have a small test case that shows the problem ?
>
> not yet.  I'd need to sample from the data (matrices) used,
> in case it turns out due to different tuning in gcc-9.

I have been able to reduce my application so that I better
understand when the apparent performance degradation shows up.

For the code below, performance is very similar for gcc 7, 8, and 9
if no bounds-checking is used.  (I use bounds-checking for development.)

However, if bounds-checking is enabled, I am seeing roughly the
following penalty:

gcc 7, 8: + 60% runtime
gcc 9:    + 90% runtime

Thus the bounds-checking overhead is roughly 20-25% higher,
which I find hard to understand.

To my untrained eyes, the dump-tree-original is essentially the
same for all 3 compiler versions, but the dump-tree-optimized
shows significant differences between 9 and former versions.

Here's the code and compiler options used:

module csc
  implicit none
  integer, parameter :: sp = 4, dp = 8, mp = sp, wp = dp, ip = 4
contains
  subroutine csc_times_vector (a, ja, ia, x, y, n)
    real(mp)   ,intent(in)    :: a  (:) ! coefficients of matrix A
    integer(ip),intent(in)    :: ja (:) ! row indices of matrix A
    integer    ,intent(in)    :: ia (:) ! indices to a,ja for column indices
    real(wp)   ,intent(in)    :: x  (:) ! right hand side
    real(wp)   ,intent(inout) :: y  (:) ! left hand side
    integer    ,intent(in)    :: n      ! number of columns

    integer :: i, j, k

    do j=1,n                  ! Outer loop j: columns of A
!CDIR ALTCODE=LOOPCNT
!CDIR NODEP
!DIR$ IVDEP
       do k = ia(j), ia(j+1)-1 ! Inner loop i: rows of (sparse) A
          i = ja(k)            ! (the i's are distinct for different j's)
          y(i) = y(i) + a(k) * x(j)
       end do
    end do
  end subroutine csc_times_vector
end module csc

FFLAGS="-O2 -g -march=skylake -mfpmath=sse -ftree-vectorize -funroll-loops -fno-realloc-lhs -fopt-info -fcheck=bounds"

If there's interest, I can create a bugzilla with test program
and test data.  If people think that bounds-checking must be
expensive, then I will not waste anybody's time.

Thanks,
Harald
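
P.S. For anyone who wants to time the routine before a bugzilla entry exists, a minimal driver might look like the sketch below. It is not the reduced application and does not use its data. The matrix is synthetic, and the sizes (n, nnzcol, nrep), the row stride of 37 and the program name csc_driver are assumptions made up for illustration. The module is repeated, without the directive comments, only so that the file compiles on its own.

```fortran
! Sketch only: synthetic matrix, NOT the data from the real application.
module csc
  implicit none
  integer, parameter :: sp = 4, dp = 8, mp = sp, wp = dp, ip = 4
contains
  subroutine csc_times_vector (a, ja, ia, x, y, n)
    real(mp)   ,intent(in)    :: a  (:) ! coefficients of matrix A
    integer(ip),intent(in)    :: ja (:) ! row indices of matrix A
    integer    ,intent(in)    :: ia (:) ! column start indices into a, ja
    real(wp)   ,intent(in)    :: x  (:)
    real(wp)   ,intent(inout) :: y  (:)
    integer    ,intent(in)    :: n      ! number of columns
    integer :: i, j, k
    do j = 1, n
       do k = ia(j), ia(j+1)-1
          i = ja(k)
          y(i) = y(i) + a(k) * x(j)
       end do
    end do
  end subroutine csc_times_vector
end module csc

program csc_driver
  use csc
  implicit none
  integer, parameter :: i8     = selected_int_kind (18)
  integer, parameter :: n      = 100000 ! columns = rows   (assumed size)
  integer, parameter :: nnzcol = 8      ! nonzeros/column  (assumed)
  integer, parameter :: nrep   = 200    ! repetitions      (assumed)
  real(mp),    allocatable :: a (:)
  integer(ip), allocatable :: ja(:)
  integer,     allocatable :: ia(:)
  real(wp),    allocatable :: x (:), y(:)
  integer     :: j, m, k, rep
  integer(i8) :: t0, t1, rate

  allocate (a(n*nnzcol), ja(n*nnzcol), ia(n+1), x(n), y(n))

  ! Build a CSC matrix with nnzcol unit entries per column.
  ! The stride 37 is arbitrary; rows within a column are distinct
  ! because 37*(nnzcol-1) < n.
  k = 0
  do j = 1, n
     ia(j) = k + 1
     do m = 1, nnzcol
        k     = k + 1
        ja(k) = modulo (j - 1 + (m-1)*37, n) + 1
        a (k) = 1.0_mp
     end do
  end do
  ia(n+1) = k + 1

  x = 1.0_wp
  y = 0.0_wp

  call system_clock (t0, rate)
  do rep = 1, nrep
     call csc_times_vector (a, ja, ia, x, y, n)
  end do
  call system_clock (t1)

  ! Every row receives nnzcol unit contributions per repetition.
  if (any (y /= real (nrep*nnzcol, wp))) error stop 'unexpected result'

  print '(a,f8.3,a)', 'elapsed: ', real (t1-t0, wp) / real (rate, wp), ' s'
end program csc_driver
```

Compiling this once with the FFLAGS above and once with -fcheck=bounds removed, for each gcc version, should give numbers comparable in kind to the ones quoted. Whether a regular synthetic pattern like this reproduces the gcc 9 difference is not known; the real matrices may behave differently.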