Hi,

I’ve run across a loop that, when compiled with -O3, was successfully vectorized by gcc 5.4.0, but fails to be vectorized by gcc 6.3.0 for Intel Xeon E5462 (Harpertown). The resulting loss in performance is very close to a factor of two. On two newer machines, a Xeon E5-1650 v2 (Ivy Bridge) and Core i5 5257U (Broadwell), the loop continues to be vectorized by gcc6. I am using MacPorts builds of gcc, and these are all Apple machines.

The problem code segment is:

      .
      .
    809         case LINESHAPE_GROSS:
    810             {
    811                 double gamma = line_data[i].gamma;
    812                 double r0, r1, r2;
    813                 r0 = S * FOUR_ON_PI * gamma;
    814                 r1 = 4. * gamma * gamma;
    815                 r2 = f0 * f0;
    816                 if (zero_tol || pass == 0) {
    817                     gridsize_t j;
    818                     for (j = 0; j < ngrid; ++j) {
    819                         double r3, r4;
    820                         r3 = r1 * f2[j];
    821                         r4 = f2[j] - r2;
    822                         r4 *= r4;
    823                         k[j] += r0 / (r4 + r3);
    824                     }
    825                 } else {
    826                     gridsize_t j;
    827                     unsigned int dkflag = 0;
    828                     for (j = 0; j < ngrid; ++j) {
    829                         double r3, r4;
    830                         double delta_k;
    831                         r3 = r1 * f2[j];
    832                         r4 = f2[j] - r2;
    833                         r4 *= r4;
    834                         delta_k = r0 / (r4 + r3);
    835                         k[j] += delta_k;
    836                         dkflag += delta_k > dktol * k[j];
    837                     }
    838                     if (!dkflag)
    839                         Smin = S * (1. + DBL_EPSILON);
    840                 }
    841             }
    842             break;
      .
      .

Here, there are two versions of the same loop: a streamlined version starting at line 818, and one with extra tolerance monitoring code starting at line 828. (The type gridsize_t of the loop counter j is simply int.) On all three machines mentioned, the loop starting at line 818 is vectorized by both gcc5 and gcc6, but on the Harpertown CPU, the loop starting at line 828 is vectorized by gcc5 but not vectorized by gcc6.

I’ve tried various options, including

    -O2 instead of -O3
    -fvect-cost-model=unlimited
    -march=core2
    -msse4.1
    -march=native

but none of these have any effect. I’ve looked at the outputs from -fopt-info-vec-missed, but I don’t really understand them.
For both compilers there are lots of messages about things not vectorized in the loop starting at line 828:

    $ grep -n 828: vec.miss.5
    182:linesum.c:828:25: note: step unknown.
    183:linesum.c:828:25: note: versioning for alias required: can't determine dependence between *_513 and *_520
    185:linesum.c:828:25: note: Unknown misalignment, is_packed = 0
    186:linesum.c:828:25: note: Unknown misalignment, is_packed = 0
    187:linesum.c:828:25: note: Unknown misalignment, is_packed = 0
    188:linesum.c:828:25: note: reduc phi. skip.
    189:linesum.c:828:25: note: virtual phi. skip.
    190:linesum.c:828:25: note: not ssa-name.
    191:linesum.c:828:25: note: use not simple.
    192:linesum.c:828:25: note: not ssa-name.
    193:linesum.c:828:25: note: use not simple.
    194:linesum.c:828:25: note: not ssa-name.
    195:linesum.c:828:25: note: use not simple.
    196:linesum.c:828:25: note: not ssa-name.
    197:linesum.c:828:25: note: use not simple.
    198:linesum.c:828:25: note: not ssa-name.
    199:linesum.c:828:25: note: use not simple.
    200:linesum.c:828:25: note: reduc op not supported by target.
    201:linesum.c:828:25: note: reduc phi. skip.
    202:linesum.c:828:25: note: virtual phi. skip.
    203:linesum.c:828:25: note: reduc phi. skip.
    204:linesum.c:828:25: note: virtual phi. skip.
    205:linesum.c:828:25: note: reduc op not supported by target.
    765:linesum.c:828:25: note: not vectorized: not enough data-refs in basic block.

    $ grep -n 828: vec.miss.6
    157:linesum.c:828:25: note: step unknown.
    158:linesum.c:828:25: note: versioning for alias required: can't determine dependence between *_510 and *_517
    160:linesum.c:828:25: note: Unknown misalignment, is_packed = 0
    161:linesum.c:828:25: note: Unknown misalignment, is_packed = 0
    162:linesum.c:828:25: note: Unknown misalignment, is_packed = 0
    163:linesum.c:828:25: note: reduc phi. skip.
    164:linesum.c:828:25: note: virtual phi. skip.
    165:linesum.c:828:25: note: not ssa-name.
    166:linesum.c:828:25: note: use not simple.
    167:linesum.c:828:25: note: not ssa-name.
    168:linesum.c:828:25: note: use not simple.
    169:linesum.c:828:25: note: not ssa-name.
    170:linesum.c:828:25: note: use not simple.
    171:linesum.c:828:25: note: not ssa-name.
    172:linesum.c:828:25: note: use not simple.
    173:linesum.c:828:25: note: no optab.
    174:linesum.c:828:25: note: no optab.
    175:linesum.c:828:25: note: not vectorized: relevant stmt not supported: patt_1730 = _521 ? 1 : 0;
    176:linesum.c:828:25: note: bad operation or unsupported loop bound.
    558:linesum.c:828:25: note: not vectorized: not enough data-refs in basic block.
    582:linesum.c:828:25: note: not vectorized: no grouped stores in basic block.

Looking at the assembly output from

    $ gcc-mp-5 -g -fverbose-asm -O3 -S linesum.c -o linesum.s.5
    $ gcc-mp-6 -g -fverbose-asm -O3 -S linesum.c -o linesum.s.6

gcc5 clearly generates vectorized code (plus what looks like scalar pre- and post-conditioning) for the loop starting at line 828, whereas gcc6 generates a simple scalar loop. And, as mentioned, the measured performance drops by a factor of 2.
Grepping for “delta_k” locates the relevant parts of the assembly and also gives a quick sense of the differences:

    $ grep -n delta_k linesum.s.5
    3428:   movapd   %xmm2, %xmm1     # r0, delta_k
    3430:   divsd    %xmm0, %xmm1     # D.6268, delta_k
    3434:   addsd    %xmm1, %xmm0     # delta_k, D.6268
    3439:   ucomisd  %xmm0, %xmm1     # D.6268, delta_k
    3481:   movapd   %xmm10, %xmm13   # vect_cst_.146, vect_delta_k_542.145
    3483:   movapd   %xmm10, %xmm8    # vect_cst_.146, vect_delta_k_542.145
    3484:   divpd    %xmm1, %xmm13    # vect__541.144, vect_delta_k_542.145
    3488:   divpd    %xmm0, %xmm8     # vect__541.144, vect_delta_k_542.145
    3492:   addpd    %xmm13, %xmm0    # vect_delta_k_542.145, vect__545.151
    3496:   cmpltpd  %xmm13, %xmm0    #, vect_delta_k_542.145, tmp2636
    3499:   addpd    %xmm8, %xmm1     # vect_delta_k_542.145, vect__545.151
    3505:   cmpltpd  %xmm8, %xmm1     #, vect_delta_k_542.145, tmp2641
    3546:   movapd   %xmm2, %xmm1     # r0, delta_k
    3548:   divsd    %xmm0, %xmm1     # D.6268, delta_k
    3552:   addsd    %xmm1, %xmm0     # delta_k, D.6268
    3558:   ucomisd  %xmm0, %xmm1     # D.6268, delta_k
    3587:   movapd   %xmm2, %xmm1     # r0, delta_k
    3589:   divsd    %xmm0, %xmm1     # D.6268, delta_k
    3593:   addsd    %xmm1, %xmm0     # delta_k, D.6268
    3599:   ucomisd  %xmm0, %xmm1     # D.6268, delta_k
    3631:   divsd    %xmm3, %xmm2     # D.6268, delta_k
    3637:   addsd    %xmm2, %xmm0     # delta_k, D.6268
    3642:   ucomisd  %xmm0, %xmm2     # D.6268, delta_k
    5469:   movapd   %xmm2, %xmm1     # r0, delta_k
    5471:   divsd    %xmm0, %xmm1     # D.6268, delta_k
    5475:   addsd    %xmm1, %xmm0     # delta_k, D.6268
    5480:   ucomisd  %xmm0, %xmm1     # D.6268, delta_k
    8359:   .ascii "delta_k\0"

    $ grep -n delta_k linesum.s.6
    2017:   movapd   %xmm2, %xmm1     # r0, delta_k
    2019:   divsd    %xmm0, %xmm1     # tmp1859, delta_k
    2023:   addsd    %xmm1, %xmm0     # delta_k, _542
    2028:   ucomisd  %xmm0, %xmm1     # tmp1860, delta_k
    8625:   .ascii "delta_k\0"

I’ve attached the files mentioned above, with the assembly edited down for size. I’d be very grateful for any help understanding whether there’s something I could be doing differently, or if this is a genuine regression from gcc5 to gcc6.
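In case a smaller test case is useful, the loop starting at line 828 could be pulled out into a standalone function along the following lines. This is only a sketch: the function name gross_accumulate and its argument list are invented here for illustration (in linesum.c these are local variables of a larger function), and it has not been checked that this reduced form shows the same gcc5/gcc6 difference as the full source.

```c
/* In the original source, gridsize_t is simply int. */
typedef int gridsize_t;

/*
 * Hypothetical standalone version of the loop at line 828.
 * The name and parameter list are assumptions; the loop body is
 * copied unchanged.  Returns the number of grid points at which
 * delta_k exceeded dktol * k[j].
 */
unsigned int gross_accumulate(double *k, const double *f2,
                              gridsize_t ngrid, double r0, double r1,
                              double r2, double dktol)
{
    gridsize_t j;
    unsigned int dkflag = 0;

    for (j = 0; j < ngrid; ++j) {
        double r3, r4;
        double delta_k;
        r3 = r1 * f2[j];
        r4 = f2[j] - r2;
        r4 *= r4;
        delta_k = r0 / (r4 + r3);
        k[j] += delta_k;
        /* The comparison result (0 or 1) is summed into an unsigned int. */
        dkflag += delta_k > dktol * k[j];
    }
    return dkflag;
}
```

Compiling this file with -O3 -c -fopt-info-vec-missed under each compiler should show whether the "relevant stmt not supported" message appears for the reduced loop as well.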
Thanks,

Scott Paine
Smithsonian Astrophysical Observatory
Attachments: linesum.s.5, linesum.s.6, vec.miss.5, vec.miss.6