On 2021/8/4 4:31 PM, Richard Biener wrote:
> On Wed, 4 Aug 2021, Richard Sandiford wrote:
>
>> Hongtao Liu <crazylht@xxxxxxxxx> writes:
>>> On Tue, May 18, 2021 at 4:27 AM Richard Sandiford via Gcc-help
>>> <gcc-help@xxxxxxxxxxx> wrote:
>>>>
>>>> Jan Hubicka <hubicka@xxxxxx> writes:
>>>>> Hi,
>>>>> here are updated scores.
>>>>> https://lnt.opensuse.org/db_default/v4/SPEC/latest_runs_report?younger_in_days=14&older_in_days=0&all_elf_detail_stats=on&min_percentage_change=0.001&revisions=9388fc7bf0da61a8104e8501e5965120e9159e12%2Cea21f32198432a490dd490696322838d94b3d3b2%2C4f5431c5768bbba81a422f6fed6a6e2454c700ee%2C&include_user_branches=on
>>>>> compares
>>>>> base: mainline
>>>>> 1st column: mainline with very cheap vectorization at -O2 and -O3
>>>>> 2nd column: mainline with cheap vectorization at -O2 and -O3.
>>>>>
>>>>> The short story is:
>>>>>
>>>>> 1) -O2 generic performance
>>>>> kabylake (Intel):
>>>>>                                very cheap   cheap
>>>>> SPEC/SPEC2006/FP/total            ~          8.32%
>>>>> SPEC/SPEC2006/total             -0.38%       4.74%
>>>>> SPEC/SPEC2006/INT/total         -0.91%      -0.14%
>>>>>
>>>>> SPEC/SPEC2017/INT/total          4.71%       7.11%
>>>>> SPEC/SPEC2017/total              2.22%       6.52%
>>>>> SPEC/SPEC2017/FP/total           0.34%       6.06%
>>>>> zen
>>>>> SPEC/SPEC2006/FP/total           0.61%      10.23%
>>>>> SPEC/SPEC2006/total              0.26%       6.27%
>>>>> SPEC/SPEC2006/INT/total  34.006 -0.24%       0.90%
>>>>>
>>>>> SPEC/SPEC2017/INT/total  3.937   5.34%       7.80%
>>>>> SPEC/SPEC2017/total              3.02%       6.55%
>>>>> SPEC/SPEC2017/FP/total           1.26%       5.60%
>>>>>
>>>>> 2) -O2 size:
>>>>> -0.78% (very cheap) 6.51% (cheap) for spec2k2006
>>>>> -0.32% (very cheap) 6.75% (cheap) for spec2k2017
>>>>> 3) build times:
>>>>> 0%, 0.16%, 0.71%, 0.93% (very cheap) 6.05% 4.80% 6.75% 7.15% (cheap) for spec2k2006
>>>>> 0.39% 0.57% 0.71% (very cheap) 5.40% 6.23% 8.44% (cheap) for spec2k2017
>>>>> here I simply copied data from different configurations
>>>>>
>>>>> So for SPEC I would say that most of the compile-time cost is derived
>>>>> from code size growth, which is a problem with the cheap model but not
>>>>> with very cheap.  Very cheap indeed results in code size improvements,
>>>>> and the compile-time impact is probably somewhere around 0.5%.
>>>>>
>>>>> So from these scores alone it would seem to me that vectorization makes
>>>>> sense at -O2 with the very cheap model (I am sure we have other
>>>>> optimizations with worse benefit-to-compile-time tradeoffs).
>>>>
>>>> Thanks for running these.
>>>>
>>>> The biggest issue I know of for enabling very-cheap at -O2 is:
>>>>
>>>>   https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100089
>>>>
>>>> Perhaps we could get around that by (hopefully temporarily) disabling
>>>> BB SLP within loop vectorisation for the very-cheap model.  This would
>>>> purely be a workaround and we should remove it once the PR is fixed.
>>>> (It would even be a compile-time win in the meantime :-))
>>>>
>>>> Thanks,
>>>> Richard
>>>>
>>>>> However there are the usual arguments against:
>>>>>
>>>>> 1) Vectorizer being tuned for SPEC.  I think the only way to overcome
>>>>>    that argument is to enable it by default :)
>>>>> 2) Workloads improved are more of the -Ofast type workloads
>>>>>
>>>>> Here are the non-SPEC benchmarks we track:
>>>>> https://lnt.opensuse.org/db_default/v4/CPP/latest_runs_report?younger_in_days=14&older_in_days=0&min_percentage_change=0.02&revisions=9388fc7bf0da61a8104e8501e5965120e9159e12%2Cea21f32198432a490dd490696322838d94b3d3b2%2C4f5431c5768bbba81a422f6fed6a6e2454c700ee%2C&include_user_branches=on
>>>>>
>>>>> I also tried to run Firefox some time ago.  Results are not surprising -
>>>>> vectorization helps rendering benchmarks, which are those compiled with
>>>>> aggressive flags anyway.
>>>>>
>>>>> Honza
>>>
>>> Hi:
>>>   I would like to ask if we can turn on -O2 vectorization now?
>>
>> I think we still need to deal with the PR100089 issue that I mentioned above.
>> Like I say, “dealing with” it could be as simple as disabling:
>>
>>   /* If we applied if-conversion then try to vectorize the
>>      BB of innermost loops.
>>      ??? Ideally BB vectorization would learn to vectorize
>>      control flow by applying if-conversion on-the-fly, the
>>      following retains the if-converted loop body even when
>>      only non-if-converted parts took part in BB vectorization.  */
>>   if (flag_tree_slp_vectorize != 0
>>       && loop_vectorized_call
>>       && ! loop->inner)
>>
>> for the very-cheap vector cost model until the PR is fixed properly.
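
Just to check I understand the suggested workaround correctly: I guess it
would amount to something like the below?  (Untested sketch only; I am
assuming the quoted condition lives in vectorize_loops () and that the
selected cost model is visible there as flag_vect_cost_model /
VECT_COST_MODEL_VERY_CHEAP -- the exact names and placement may differ.)

  /* Hypothetical PR100089 workaround: keep BB SLP over if-converted
     loop bodies disabled under the very-cheap cost model until the PR
     is fixed properly.  The added check is only illustrative.  */
  if (flag_tree_slp_vectorize != 0
      && flag_vect_cost_model != VECT_COST_MODEL_VERY_CHEAP
      && loop_vectorized_call
      && ! loop->inner)
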
>
> Alternatively only enable loop vectorization at -O2 (the above checks
> flag_tree_slp_vectorize as well).  At least the cost model kind
> does not have any influence on BB vectorization, that is, we get the
> same pros and cons as we do for -O3.
>
> Did anyone benchmark -O2 -ftree-{loop,slp}-vectorize separately yet?
>

Here is the measured performance speedup at -O2 vectorization with the
very-cheap cost model on both Power8 and Power9.

INT: -O2 -mcpu=power{8,9} -ftree-{,loop-,slp-}vectorize -fvect-cost-model=very-cheap
FP:  INT + -ffast-math

Column titles are:
  <bmks>  <both loop and slp>  <loop only>  <slp only>
  (+: improvement, -: degradation)

Power8:

500.perlbench_r      0.00%     0.00%     0.00%
502.gcc_r            0.39%     0.78%     0.00%
505.mcf_r            0.00%     0.00%     0.00%
520.omnetpp_r        1.21%     0.30%     0.00%
523.xalancbmk_r      0.00%     0.00%    -0.57%
525.x264_r          41.84%    42.55%     0.00%
531.deepsjeng_r      0.00%    -0.63%     0.00%
541.leela_r         -3.44%    -2.75%     0.00%
548.exchange2_r      1.66%     1.66%     0.00%
557.xz_r             1.39%     1.04%     0.00%
Geomean              3.67%     3.64%    -0.06%

503.bwaves_r         0.00%     0.00%     0.00%
507.cactuBSSN_r      0.00%     0.29%     0.44%
508.namd_r           0.00%     0.29%     0.00%
510.parest_r         0.00%    -0.36%    -0.54%
511.povray_r         0.63%     0.31%     0.94%
519.lbm_r            2.71%     2.71%     0.00%
521.wrf_r            1.04%     1.04%     0.00%
526.blender_r       -1.31%    -0.78%     0.00%
527.cam4_r          -0.62%    -0.31%    -0.62%
538.imagick_r        0.21%     0.21%    -0.21%
544.nab_r            0.00%     0.00%     0.00%
549.fotonik3d_r      0.00%     0.00%     0.00%
554.roms_r           0.30%     0.00%     0.00%
Geomean              0.22%     0.26%     0.00%

Power9:

500.perlbench_r      0.62%     0.62%    -1.54%
502.gcc_r           -0.60%    -0.60%    -0.81%
505.mcf_r            2.05%     2.05%     0.00%
520.omnetpp_r       -2.41%    -0.30%    -0.60%
523.xalancbmk_r     -1.44%    -2.30%    -1.44%
525.x264_r          24.26%    23.93%    -0.33%
531.deepsjeng_r      0.32%     0.32%     0.00%
541.leela_r          0.39%     1.18%    -0.39%
548.exchange2_r      0.76%     0.76%     0.00%
557.xz_r             0.36%     0.36%    -0.36%
Geomean              2.19%     2.38%    -0.55%

503.bwaves_r         0.00%     0.36%     0.00%
507.cactuBSSN_r      0.00%     0.00%     0.00%
508.namd_r          -3.73%    -0.31%    -3.73%
510.parest_r        -0.21%    -0.42%    -0.42%
511.povray_r        -0.96%    -1.59%     0.64%
519.lbm_r            2.31%     2.31%     0.17%
521.wrf_r            2.66%     2.66%     0.00%
526.blender_r       -1.96%    -1.68%     1.40%
527.cam4_r           0.00%     0.91%     1.81%
538.imagick_r        0.39%    -0.19%   -10.29%  // known noise, imagick_r can have big jitter on our P9 box sometimes
544.nab_r            0.25%     0.00%     0.00%
549.fotonik3d_r      0.94%     0.94%     0.00%
554.roms_r           0.00%     0.00%    -1.05%
Geomean             -0.03%     0.22%    -0.93%

As above, the gains come mainly from loop vectorization.

Btw, the Power8 data may be more representative, since some benchmarks can
have jitter on our P9 perf box.

BR,
Kewen
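
p.s. For reference, my rough reading of Richard B.'s alternative (enabling
only loop vectorization at -O2) is a change along the lines below in the
default option tables.  This is only a sketch: the entries follow the
existing default_options_table convention in gcc/opts.c, and making
very-cheap the default cost model at -O2 would still need to be wired up
separately.

  /* Hypothetical sketch: move loop vectorization (but not BB SLP) from
     the -O3+ set of default options into the -O2+ set.  */
  { OPT_LEVELS_2_PLUS, OPT_ftree_loop_vectorize, NULL, 1 },
  /* BB SLP vectorization stays at -O3 and above.  */
  { OPT_LEVELS_3_PLUS, OPT_ftree_slp_vectorize, NULL, 1 },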