Hi,

I just learned that for memory-bound codes one can improve the memory bandwidth significantly by using the non-temporal stores that are part of SSE2. However, I could not find out how to get gcc to generate the corresponding code on a Core2.

For the stream triad kernel,

    subroutine stream_kernel_triad (a, b, c, n, s)
      integer, intent(in) :: n
      double precision :: a(*), b(*), c(*)
      double precision, intent(in) :: s
      integer :: j
      do j = 1,n
         a(j) = b(j) + s*c(j)
      end do
    end subroutine stream_kernel_triad

the Intel compiler shows a performance difference of +25% between "-opt-streaming-stores never" and "... auto (default)" or "... always". On my system, gcc-4.8's performance for this fragment exactly matches that of Intel's compiler without NT stores.

Is there some trick or magic flag I need to specify?

Harald
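
For reference, the kind of code being asked about can be sketched by hand in C with SSE2 intrinsics. This is only an illustration of non-temporal stores applied to the triad above, not what any compiler necessarily emits. The function name triad_nt and the scalar peel/tail handling are assumptions made for the example. _mm_stream_pd requires a 16-byte aligned destination, so the sketch peels scalar iterations until a is aligned.

```c
#include <emmintrin.h>  /* SSE2 intrinsics: _mm_stream_pd, _mm_loadu_pd, ... */
#include <stddef.h>
#include <stdint.h>

/* Hypothetical hand-written triad, a[j] = b[j] + s*c[j], using
   non-temporal (streaming) stores for the destination array a. */
void triad_nt(double *a, const double *b, const double *c, size_t n, double s)
{
    size_t j = 0;

    /* Scalar peel until a + j is 16-byte aligned, as _mm_stream_pd requires. */
    while (j < n && ((uintptr_t)(a + j) & 15u) != 0) {
        a[j] = b[j] + s * c[j];
        ++j;
    }

    const __m128d vs = _mm_set1_pd(s);  /* broadcast s to both lanes */

    /* Main loop: two doubles per iteration, stored past the cache. */
    for (; j + 2 <= n; j += 2) {
        __m128d vb = _mm_loadu_pd(b + j);  /* unaligned loads are always safe */
        __m128d vc = _mm_loadu_pd(c + j);
        _mm_stream_pd(a + j, _mm_add_pd(vb, _mm_mul_pd(vs, vc)));
    }

    /* Scalar tail for a leftover element. */
    for (; j < n; ++j)
        a[j] = b[j] + s * c[j];

    /* Order the streaming stores before any later stores. */
    _mm_sfence();
}
```

This builds with gcc on x86-64, where SSE2 is part of the baseline; on 32-bit x86 it needs -msse2. Whether it reproduces the +25% seen with the Intel compiler would have to be measured.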