On 2021/8/4 4:31 PM, Richard Biener wrote:
> On Wed, 4 Aug 2021, Richard Sandiford wrote:
>
>> Hongtao Liu <crazylht@xxxxxxxxx> writes:
>>> On Tue, May 18, 2021 at 4:27 AM Richard Sandiford via Gcc-help
>>> <gcc-help@xxxxxxxxxxx> wrote:
>>>>
>>>> Jan Hubicka <hubicka@xxxxxx> writes:
>>>>> Hi,
>>>>> here are updated scores.
>>>>> https://lnt.opensuse.org/db_default/v4/SPEC/latest_runs_report?younger_in_days=14&older_in_days=0&all_elf_detail_stats=on&min_percentage_change=0.001&revisions=9388fc7bf0da61a8104e8501e5965120e9159e12%2Cea21f32198432a490dd490696322838d94b3d3b2%2C4f5431c5768bbba81a422f6fed6a6e2454c700ee%2C&include_user_branches=on
>>>>> compares
>>>>> base: mainline
>>>>> 1st column: mainline with very cheap vectorization at -O2 and -O3
>>>>> 2nd column: mainline with cheap vectorization at -O2 and -O3.
>>>>>
>>>>> The short story is:
>>>>>
>>>>> 1) -O2 generic performance
>>>>> kabylake (Intel):
>>>>>                                very cheap   cheap
>>>>> SPEC/SPEC2006/FP/total            ~          8.32%
>>>>> SPEC/SPEC2006/total             -0.38%       4.74%
>>>>> SPEC/SPEC2006/INT/total         -0.91%      -0.14%
>>>>>
>>>>> SPEC/SPEC2017/INT/total          4.71%       7.11%
>>>>> SPEC/SPEC2017/total              2.22%       6.52%
>>>>> SPEC/SPEC2017/FP/total           0.34%       6.06%
>>>>> zen
>>>>> SPEC/SPEC2006/FP/total           0.61%      10.23%
>>>>> SPEC/SPEC2006/total              0.26%       6.27%
>>>>> SPEC/SPEC2006/INT/total  34.006 -0.24%       0.90%
>>>>>
>>>>> SPEC/SPEC2017/INT/total  3.937   5.34%       7.80%
>>>>> SPEC/SPEC2017/total              3.02%       6.55%
>>>>> SPEC/SPEC2017/FP/total           1.26%       5.60%
>>>>>
>>>>> 2) -O2 size:
>>>>> -0.78% (very cheap) 6.51% (cheap) for spec2k2006
>>>>> -0.32% (very cheap) 6.75% (cheap) for spec2k2017
>>>>> 3) build times:
>>>>> 0%, 0.16%, 0.71%, 0.93% (very cheap) 6.05% 4.80% 6.75% 7.15% (cheap) for spec2k2006
>>>>> 0.39% 0.57% 0.71% (very cheap) 5.40% 6.23% 8.44% (cheap) for spec2k2017
>>>>> here I simply copied data from different configurations
>>>>>
>>>>> So for SPEC I would say that most of the compile-time cost is derived
>>>>> from code size growth, which is a problem with the cheap model but not
>>>>> with very cheap.  Very cheap indeed results in code size improvements,
>>>>> and the compile-time impact is probably somewhere around 0.5%.
>>>>>
>>>>> So from these scores alone it would seem to me that vectorization makes
>>>>> sense at -O2 with the very cheap model (I am sure we have other
>>>>> optimizations with worse benefit-to-compile-time tradeoffs).
>>>>
>>>> Thanks for running these.
>>>>
>>>> The biggest issue I know of for enabling very-cheap at -O2 is:
>>>>
>>>>   https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100089
>>>>
>>>> Perhaps we could get around that by (hopefully temporarily) disabling
>>>> BB SLP within loop vectorisation for the very-cheap model.  This would
>>>> purely be a workaround and we should remove it once the PR is fixed.
>>>> (It would even be a compile-time win in the meantime :-))
>>>>
>>>> Thanks,
>>>> Richard
>>>>
>>>>> However there are the usual arguments against:
>>>>>
>>>>> 1) Vectorizer being tuned for SPEC.  I think the only way to overcome
>>>>>    that argument is to enable it by default :)
>>>>> 2) Workloads improved are more of the -Ofast type workloads
>>>>>
>>>>> Here are the non-SPEC benchmarks we track:
>>>>> https://lnt.opensuse.org/db_default/v4/CPP/latest_runs_report?younger_in_days=14&older_in_days=0&min_percentage_change=0.02&revisions=9388fc7bf0da61a8104e8501e5965120e9159e12%2Cea21f32198432a490dd490696322838d94b3d3b2%2C4f5431c5768bbba81a422f6fed6a6e2454c700ee%2C&include_user_branches=on
>>>>>
>>>>> I also tried to run Firefox some time ago.  Results are not surprising -
>>>>> vectorization helps rendering benchmarks, which are those compiled with
>>>>> aggressive flags anyway.
>>>>>
>>>>> Honza
>>>
>>> Hi:
>>>   I would like to ask if we can turn on -O2 vectorization now?
>>
>> I think we still need to deal with the PR100089 issue that I mentioned above.
>> Like I say, “dealing with” it could be as simple as disabling:
>>
>>   /* If we applied if-conversion then try to vectorize the
>>      BB of innermost loops.
>>      ??? Ideally BB vectorization would learn to vectorize
>>      control flow by applying if-conversion on-the-fly, the
>>      following retains the if-converted loop body even when
>>      only non-if-converted parts took part in BB vectorization.  */
>>   if (flag_tree_slp_vectorize != 0
>>       && loop_vectorized_call
>>       && ! loop->inner)
>>
>> for the very-cheap vector cost model until the PR is fixed properly.
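
Just to check I understand the suggested workaround correctly: I guess it
would amount to something like the below?  (Untested sketch only; I am
assuming the quoted condition lives in vectorize_loops () and that the
selected cost model is visible there as flag_vect_cost_model /
VECT_COST_MODEL_VERY_CHEAP -- the exact names and placement may differ.)

  /* Hypothetical PR100089 workaround: keep BB SLP over if-converted
     loop bodies disabled under the very-cheap cost model until the PR
     is fixed properly.  The added check is only illustrative.  */
  if (flag_tree_slp_vectorize != 0
      && flag_vect_cost_model != VECT_COST_MODEL_VERY_CHEAP
      && loop_vectorized_call
      && ! loop->inner)
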
>
> Alternatively only enable loop vectorization at -O2 (the above checks
> flag_tree_slp_vectorize as well).  At least the cost model kind
> does not have any influence on BB vectorization, that is, we get the
> same pros and cons as we do for -O3.
>
> Did anyone benchmark -O2 -ftree-{loop,slp}-vectorize separately yet?
>

Here is the measured performance speedup at -O2 vectorization with the
very-cheap cost model on both Power8 and Power9.

INT: -O2 -mcpu=power{8,9} -ftree-{,loop-,slp-}vectorize -fvect-cost-model=very-cheap
FP:  INT + -ffast-math

Column titles are:
  <bmks>  <both loop and slp>  <loop only>  <slp only>
  (+: improvement, -: degradation)

Power8:

500.perlbench_r      0.00%     0.00%     0.00%
502.gcc_r            0.39%     0.78%     0.00%
505.mcf_r            0.00%     0.00%     0.00%
520.omnetpp_r        1.21%     0.30%     0.00%
523.xalancbmk_r      0.00%     0.00%    -0.57%
525.x264_r          41.84%    42.55%     0.00%
531.deepsjeng_r      0.00%    -0.63%     0.00%
541.leela_r         -3.44%    -2.75%     0.00%
548.exchange2_r      1.66%     1.66%     0.00%
557.xz_r             1.39%     1.04%     0.00%
Geomean              3.67%     3.64%    -0.06%

503.bwaves_r         0.00%     0.00%     0.00%
507.cactuBSSN_r      0.00%     0.29%     0.44%
508.namd_r           0.00%     0.29%     0.00%
510.parest_r         0.00%    -0.36%    -0.54%
511.povray_r         0.63%     0.31%     0.94%
519.lbm_r            2.71%     2.71%     0.00%
521.wrf_r            1.04%     1.04%     0.00%
526.blender_r       -1.31%    -0.78%     0.00%
527.cam4_r          -0.62%    -0.31%    -0.62%
538.imagick_r        0.21%     0.21%    -0.21%
544.nab_r            0.00%     0.00%     0.00%
549.fotonik3d_r      0.00%     0.00%     0.00%
554.roms_r           0.30%     0.00%     0.00%
Geomean              0.22%     0.26%     0.00%

Power9:

500.perlbench_r      0.62%     0.62%    -1.54%
502.gcc_r           -0.60%    -0.60%    -0.81%
505.mcf_r            2.05%     2.05%     0.00%
520.omnetpp_r       -2.41%    -0.30%    -0.60%
523.xalancbmk_r     -1.44%    -2.30%    -1.44%
525.x264_r          24.26%    23.93%    -0.33%
531.deepsjeng_r      0.32%     0.32%     0.00%
541.leela_r          0.39%     1.18%    -0.39%
548.exchange2_r      0.76%     0.76%     0.00%
557.xz_r             0.36%     0.36%    -0.36%
Geomean              2.19%     2.38%    -0.55%

503.bwaves_r         0.00%     0.36%     0.00%
507.cactuBSSN_r      0.00%     0.00%     0.00%
508.namd_r          -3.73%    -0.31%    -3.73%
510.parest_r        -0.21%    -0.42%    -0.42%
511.povray_r        -0.96%    -1.59%     0.64%
519.lbm_r            2.31%     2.31%     0.17%
521.wrf_r            2.66%     2.66%     0.00%
526.blender_r       -1.96%    -1.68%     1.40%
527.cam4_r           0.00%     0.91%     1.81%
538.imagick_r        0.39%    -0.19%   -10.29%  // known noise, imagick_r can have big jitter on our P9 box sometimes
544.nab_r            0.25%     0.00%     0.00%
549.fotonik3d_r      0.94%     0.94%     0.00%
554.roms_r           0.00%     0.00%    -1.05%
Geomean             -0.03%     0.22%    -0.93%

As above, the gains come mainly from loop vectorization.

Btw, the Power8 data may be more representative, since some benchmarks can
have jitter on our P9 perf box.

BR,
Kewen
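
p.s. For reference, my rough reading of Richard B.'s alternative (enabling
only loop vectorization at -O2) is a change along the lines below in the
default option tables.  This is only a sketch: the entries follow the
existing default_options_table convention in gcc/opts.c, and making
very-cheap the default cost model at -O2 would still need to be wired up
separately.

  /* Hypothetical sketch: move loop vectorization (but not BB SLP) from
     the -O3+ set of default options into the -O2+ set.  */
  { OPT_LEVELS_2_PLUS, OPT_ftree_loop_vectorize, NULL, 1 },
  /* BB SLP vectorization stays at -O3 and above.  */
  { OPT_LEVELS_3_PLUS, OPT_ftree_slp_vectorize, NULL, 1 },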