Vectorization regression (?) from gcc5 to gcc6 on Harpertown

Scott Paine <spaine@xxxxxxxxxxxxxxx> · Thu, 2 Mar 2017 14:52:08 -0500

Hi,

I’ve run across a loop that, when compiled with -O3, was successfully vectorized by gcc 5.4.0, but fails to be vectorized by gcc 6.3.0 for Intel Xeon E5462 (Harpertown).  The resulting loss in performance is very close to a factor of two.  On two newer machines, a Xeon E5-1650 v2 (Ivy Bridge) and Core i5 5257U (Broadwell), the loop continues to be vectorized by gcc6.  I am using MacPorts builds of gcc, and these are all Apple machines.

The problem code segment is:

.
.
 809             case LINESHAPE_GROSS:
 810                 {
 811                     double gamma = line_data[i].gamma;
 812                     double r0, r1, r2;
 813                     r0 = S * FOUR_ON_PI * gamma;
 814                     r1 = 4. * gamma * gamma;
 815                     r2 = f0 * f0;
 816                     if (zero_tol || pass == 0) {
 817                         gridsize_t j;
 818                         for (j = 0; j < ngrid; ++j) {
 819                             double r3, r4;
 820                             r3 = r1 * f2[j];
 821                             r4 = f2[j] - r2;
 822                             r4 *= r4;
 823                             k[j] += r0 / (r4 + r3);
 824                         }
 825                     } else {
 826                         gridsize_t j;
 827                         unsigned int dkflag = 0;
 828                         for (j = 0; j < ngrid; ++j) {
 829                             double r3, r4;
 830                             double delta_k;
 831                             r3 = r1 * f2[j];
 832                             r4 = f2[j] - r2;
 833                             r4 *= r4;
 834                             delta_k = r0 / (r4 + r3);
 835                             k[j] += delta_k;
 836                             dkflag += delta_k > dktol * k[j];
 837                         }
 838                         if (!dkflag)
 839                             Smin = S * (1. + DBL_EPSILON);
 840                     }
 841                 }
 842                 break;
.
.

Here, there are two versions of the same loop: a streamlined version starting at line 818, and one with extra tolerance-monitoring code starting at line 828.  (The type gridsize_t of the loop counter j is simply int.)  On all three machines mentioned, the loop starting at line 818 is vectorized by both gcc5 and gcc6, but on the Harpertown CPU, the loop starting at line 828 is vectorized by gcc5 but not by gcc6.
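
For anyone who wants to try this without the full source, the loop at line 828 can be pulled out into a standalone function along the following lines.  This is only a sketch: the function name and parameter list are made up for illustration (in linesum.c these are locals of a much larger function), and I have not confirmed that the isolated loop shows the same gcc5/gcc6 difference as the loop in context.

```c
typedef int gridsize_t;   /* as in the original source: simply int */

/*
 * Standalone copy of the loop body starting at line 828.
 * The name gross_loop_with_tol and the argument list are hypothetical;
 * r0, r1, r2 and dktol are passed in rather than computed here.
 * Returns the number of grid points at which delta_k > dktol * k[j].
 */
unsigned int gross_loop_with_tol(double *k, const double *f2,
                                 gridsize_t ngrid,
                                 double r0, double r1, double r2,
                                 double dktol)
{
    gridsize_t j;
    unsigned int dkflag = 0;
    for (j = 0; j < ngrid; ++j) {
        double r3, r4;
        double delta_k;
        r3 = r1 * f2[j];
        r4 = f2[j] - r2;
        r4 *= r4;
        delta_k = r0 / (r4 + r3);
        k[j] += delta_k;
        /* comparison result (0 or 1) is summed into an unsigned int */
        dkflag += delta_k > dktol * k[j];
    }
    return dkflag;
}
```

Compiling this file on its own with -O3 -c and -fopt-info-vec-missed should show whether the vectorizer treats the isolated loop the same way.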

I’ve tried various options, including

    -O2 instead of -O3
    -fvect-cost-model=unlimited
    -march=core2 -msse4.1
    -march=native

but none of these have any effect.

I’ve looked at the output from -fopt-info-vec-missed, but I don’t really understand it.  In both cases there are lots of messages about things not vectorized in the loop starting at line 828:

$ grep -n 828: vec.miss.5
182:linesum.c:828:25: note: step unknown.
183:linesum.c:828:25: note: versioning for alias required: can't determine dependence between *_513 and *_520
185:linesum.c:828:25: note: Unknown misalignment, is_packed = 0
186:linesum.c:828:25: note: Unknown misalignment, is_packed = 0
187:linesum.c:828:25: note: Unknown misalignment, is_packed = 0
188:linesum.c:828:25: note: reduc phi. skip.
189:linesum.c:828:25: note: virtual phi. skip.
190:linesum.c:828:25: note: not ssa-name.
191:linesum.c:828:25: note: use not simple.
192:linesum.c:828:25: note: not ssa-name.
193:linesum.c:828:25: note: use not simple.
194:linesum.c:828:25: note: not ssa-name.
195:linesum.c:828:25: note: use not simple.
196:linesum.c:828:25: note: not ssa-name.
197:linesum.c:828:25: note: use not simple.
198:linesum.c:828:25: note: not ssa-name.
199:linesum.c:828:25: note: use not simple.
200:linesum.c:828:25: note: reduc op not supported by target.
201:linesum.c:828:25: note: reduc phi. skip.
202:linesum.c:828:25: note: virtual phi. skip.
203:linesum.c:828:25: note: reduc phi. skip.
204:linesum.c:828:25: note: virtual phi. skip.
205:linesum.c:828:25: note: reduc op not supported by target.
765:linesum.c:828:25: note: not vectorized: not enough data-refs in basic block.

$ grep -n 828: vec.miss.6
157:linesum.c:828:25: note: step unknown.
158:linesum.c:828:25: note: versioning for alias required: can't determine dependence between *_510 and *_517
160:linesum.c:828:25: note: Unknown misalignment, is_packed = 0
161:linesum.c:828:25: note: Unknown misalignment, is_packed = 0
162:linesum.c:828:25: note: Unknown misalignment, is_packed = 0
163:linesum.c:828:25: note: reduc phi. skip.
164:linesum.c:828:25: note: virtual phi. skip.
165:linesum.c:828:25: note: not ssa-name.
166:linesum.c:828:25: note: use not simple.
167:linesum.c:828:25: note: not ssa-name.
168:linesum.c:828:25: note: use not simple.
169:linesum.c:828:25: note: not ssa-name.
170:linesum.c:828:25: note: use not simple.
171:linesum.c:828:25: note: not ssa-name.
172:linesum.c:828:25: note: use not simple.
173:linesum.c:828:25: note: no optab.
174:linesum.c:828:25: note: no optab.
175:linesum.c:828:25: note: not vectorized: relevant stmt not supported: patt_1730 = _521 ? 1 : 0;
176:linesum.c:828:25: note: bad operation or unsupported loop bound.
558:linesum.c:828:25: note: not vectorized: not enough data-refs in basic block.
582:linesum.c:828:25: note: not vectorized: no grouped stores in basic block.
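
The only statement the gcc6 output names explicitly is "patt_1730 = _521 ? 1 : 0", which looks like the conversion of the comparison result into the integer that is added to dkflag.  This is only a guess, but if the obstacle is the mix of a 64-bit double comparison feeding a 32-bit unsigned sum, then a variant with a 64-bit accumulator might behave differently.  The sketch below is untested with respect to vectorization; the function name, the parameter list, and the choice of uint64_t are assumptions made for illustration, not something taken from linesum.c.

```c
#include <stdint.h>

typedef int gridsize_t;   /* as in the original source: simply int */

/*
 * Hypothetical variant of the loop at line 828 in which the flag
 * accumulator has the same width as a double (64 bits).
 * Whether this changes what the gcc6 vectorizer does is NOT verified.
 * Returns the number of grid points at which delta_k > dktol * k[j];
 * the caller only needs to know whether the result is zero.
 */
uint64_t gross_loop_wide_flag(double *k, const double *f2,
                              gridsize_t ngrid,
                              double r0, double r1, double r2,
                              double dktol)
{
    gridsize_t j;
    uint64_t dkflag = 0;
    for (j = 0; j < ngrid; ++j) {
        double r3, r4;
        double delta_k;
        r3 = r1 * f2[j];
        r4 = f2[j] - r2;
        r4 *= r4;
        delta_k = r0 / (r4 + r3);
        k[j] += delta_k;
        /* same test as before, summed into a 64-bit integer */
        dkflag += (uint64_t)(delta_k > dktol * k[j]);
    }
    return dkflag;
}
```

The arithmetic on k[] is unchanged, and the flag is zero in exactly the same cases as in the original loop, so the "if (!dkflag)" test at line 838 would not need to change.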

Looking at the assembly output from

$ gcc-mp-5 -g -fverbose-asm -O3 -S linesum.c -o linesum.s.5
$ gcc-mp-6 -g -fverbose-asm -O3 -S linesum.c -o linesum.s.6

gcc5 clearly generates vectorized code (plus what looks like scalar pre- and post-conditioning) for the loop starting at line 828, whereas gcc6 generates a simple scalar loop.  And, as mentioned, the measured performance drops by a factor of 2.

Grepping for “delta_k” locates the relevant parts of the assembly and also gives a quick sense of the differences:

$ grep -n delta_k linesum.s.5
3428:	movapd	%xmm2, %xmm1	# r0, delta_k
3430:	divsd	%xmm0, %xmm1	# D.6268, delta_k
3434:	addsd	%xmm1, %xmm0	# delta_k, D.6268
3439:	ucomisd	%xmm0, %xmm1	# D.6268, delta_k
3481:	movapd	%xmm10, %xmm13	# vect_cst_.146, vect_delta_k_542.145
3483:	movapd	%xmm10, %xmm8	# vect_cst_.146, vect_delta_k_542.145
3484:	divpd	%xmm1, %xmm13	# vect__541.144, vect_delta_k_542.145
3488:	divpd	%xmm0, %xmm8	# vect__541.144, vect_delta_k_542.145
3492:	addpd	%xmm13, %xmm0	# vect_delta_k_542.145, vect__545.151
3496:	cmpltpd	%xmm13, %xmm0	#, vect_delta_k_542.145, tmp2636
3499:	addpd	%xmm8, %xmm1	# vect_delta_k_542.145, vect__545.151
3505:	cmpltpd	%xmm8, %xmm1	#, vect_delta_k_542.145, tmp2641
3546:	movapd	%xmm2, %xmm1	# r0, delta_k
3548:	divsd	%xmm0, %xmm1	# D.6268, delta_k
3552:	addsd	%xmm1, %xmm0	# delta_k, D.6268
3558:	ucomisd	%xmm0, %xmm1	# D.6268, delta_k
3587:	movapd	%xmm2, %xmm1	# r0, delta_k
3589:	divsd	%xmm0, %xmm1	# D.6268, delta_k
3593:	addsd	%xmm1, %xmm0	# delta_k, D.6268
3599:	ucomisd	%xmm0, %xmm1	# D.6268, delta_k
3631:	divsd	%xmm3, %xmm2	# D.6268, delta_k
3637:	addsd	%xmm2, %xmm0	# delta_k, D.6268
3642:	ucomisd	%xmm0, %xmm2	# D.6268, delta_k
5469:	movapd	%xmm2, %xmm1	# r0, delta_k
5471:	divsd	%xmm0, %xmm1	# D.6268, delta_k
5475:	addsd	%xmm1, %xmm0	# delta_k, D.6268
5480:	ucomisd	%xmm0, %xmm1	# D.6268, delta_k
8359:	.ascii "delta_k\0"

$ grep -n delta_k linesum.s.6
2017:	movapd	%xmm2, %xmm1	# r0, delta_k
2019:	divsd	%xmm0, %xmm1	# tmp1859, delta_k
2023:	addsd	%xmm1, %xmm0	# delta_k, _542
2028:	ucomisd	%xmm0, %xmm1	# tmp1860, delta_k
8625:	.ascii "delta_k\0"

I’ve attached the files mentioned above, with the assembly edited down for size.  I’d be very grateful for any help understanding whether there’s something I could be doing differently, or if this is a genuine regression from gcc5 to gcc6.

Thanks,

Scott Paine
Smithsonian Astrophysical Observatory

Attachments:
linesum.s.5
linesum.s.6
vec.miss.5
vec.miss.6