Fahimeh Yazdanpanah wrote:
Would you please let me know if it is possible to vectorize both inner and outer loops with gcc? If yes, please give me an example.
No. Short of SSE4, on targets that support parallel loads only with stride 1, the inner loop has to be arranged so that it accesses memory with stride 1, and that inner loop is the one gcc vectorizes. Parallelizing the outer loop with OpenMP while the inner loop is vectorizable may be effective.
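
Since an example was requested, here is a minimal sketch of that combination. The function name scaled_add, the row-major layout, and the arithmetic are made up for illustration and are not taken from your code. The outer loop over rows is divided among threads by OpenMP; the inner loop walks contiguous memory (stride 1), which makes it a candidate for the vectorizer.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical example: c[i][j] = a[i][j] + s * b[i][j] for
   rows x cols matrices stored row-major in flat arrays.
   restrict tells the compiler the arrays do not overlap, which
   removes one common obstacle to vectorization. */
void scaled_add(size_t rows, size_t cols,
                const float *restrict a, const float *restrict b,
                float *restrict c, float s)
{
    assert(a != NULL && b != NULL && c != NULL);

    /* Outer loop: iterations (rows) are shared among OpenMP threads.
       Without -fopenmp the pragma is ignored and the loop runs serially. */
    #pragma omp parallel for
    for (size_t i = 0; i < rows; i++) {
        /* Inner loop: consecutive j touch consecutive addresses
           (stride 1), so the compiler may vectorize it. */
        for (size_t j = 0; j < cols; j++)
            c[i * cols + j] = a[i * cols + j] + s * b[i * cols + j];
    }
}
```

With a recent gcc, something like "gcc -O3 -fopenmp -fopt-info-vec" should build this and report which loops were vectorized. Whether the inner loop actually gets vectorized depends on the gcc version and target, so check the report rather than assuming it.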