On Mon, Aug 16, 2021 at 11:23 AM Kewen.Lin via Gcc-help <gcc-help@xxxxxxxxxxx> wrote:
>
> on 2021/8/4 at 4:31 PM, Richard Biener wrote:
> > On Wed, 4 Aug 2021, Richard Sandiford wrote:
> >
> >> Hongtao Liu <crazylht@xxxxxxxxx> writes:
> >>> On Tue, May 18, 2021 at 4:27 AM Richard Sandiford via Gcc-help
> >>> <gcc-help@xxxxxxxxxxx> wrote:
> >>>>
> >>>> Jan Hubicka <hubicka@xxxxxx> writes:
> >>>>> Hi,
> >>>>> here are the updated scores:
> >>>>> https://lnt.opensuse.org/db_default/v4/SPEC/latest_runs_report?younger_in_days=14&older_in_days=0&all_elf_detail_stats=on&min_percentage_change=0.001&revisions=9388fc7bf0da61a8104e8501e5965120e9159e12%2Cea21f32198432a490dd490696322838d94b3d3b2%2C4f5431c5768bbba81a422f6fed6a6e2454c700ee%2C&include_user_branches=on
> >>>>> It compares:
> >>>>> base:       mainline
> >>>>> 1st column: mainline with very cheap vectorization at -O2 and -O3
> >>>>> 2nd column: mainline with cheap vectorization at -O2 and -O3
> >>>>>
> >>>>> The short story is:
> >>>>>
> >>>>> 1) -O2 generic performance
> >>>>>
> >>>>> kabylake (Intel):           very cheap     cheap
> >>>>> SPEC/SPEC2006/FP/total          ~           8.32%
> >>>>> SPEC/SPEC2006/total          -0.38%         4.74%
> >>>>> SPEC/SPEC2006/INT/total      -0.91%        -0.14%
> >>>>>
> >>>>> SPEC/SPEC2017/INT/total       4.71%         7.11%
> >>>>> SPEC/SPEC2017/total           2.22%         6.52%
> >>>>> SPEC/SPEC2017/FP/total        0.34%         6.06%
> >>>>>
> >>>>> zen:                        very cheap     cheap
> >>>>> SPEC/SPEC2006/FP/total        0.61%        10.23%
> >>>>> SPEC/SPEC2006/total           0.26%         6.27%
> >>>>> SPEC/SPEC2006/INT/total  34.006  -0.24%     0.90%
> >>>>>
> >>>>> SPEC/SPEC2017/INT/total  3.937    5.34%     7.80%
> >>>>> SPEC/SPEC2017/total           3.02%         6.55%
> >>>>> SPEC/SPEC2017/FP/total        1.26%         5.60%
> >>>>>
> >>>>> 2) -O2 size:
> >>>>> -0.78% (very cheap)  6.51% (cheap) for spec2k2006
> >>>>> -0.32% (very cheap)  6.75% (cheap) for spec2k2017
> >>>>>
> >>>>> 3) build times:
> >>>>> 0%, 0.16%, 0.71%, 0.93% (very cheap)  6.05%, 4.80%, 6.75%, 7.15% (cheap) for spec2k2006
> >>>>> 0.39%, 0.57%, 0.71% (very cheap)  5.40%, 6.23%, 8.44% (cheap) for spec2k2017
> >>>>> (here I simply copied data from the different configurations)
> >>>>>
> >>>>> So for SPEC I would say that most of the compile-time cost derives
> >>>>> from code-size growth, which is a problem with the cheap model but
> >>>>> not with very cheap. Very cheap indeed results in code-size
> >>>>> improvements, and the compile-time impact is probably somewhere
> >>>>> around 0.5%.
> >>>>>
> >>>>> So from these scores alone it would seem to me that vectorization
> >>>>> makes sense at -O2 with the very cheap model (I am sure we have
> >>>>> other optimizations with worse benefit-to-compile-time tradeoffs).
> >>>>
> >>>> Thanks for running these.
> >>>>
> >>>> The biggest issue I know of for enabling very-cheap at -O2 is:
> >>>>
> >>>>   https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100089
> >>>>
> >>>> Perhaps we could get around that by (hopefully temporarily) disabling
> >>>> BB SLP within loop vectorisation for the very-cheap model. This would
> >>>> purely be a workaround and we should remove it once the PR is fixed.
> >>>> (It would even be a compile-time win in the meantime :-))
> >>>>
> >>>> Thanks,
> >>>> Richard
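For concreteness, here is a minimal example of the kind of loop the two cost models treat differently. The function and compile line are illustrative (mine, not from the runs above), and the code-size reasoning in the comment is my understanding of the models, not a statement from the thread:

  /* Compile with, e.g.:
       gcc -O2 -ftree-vectorize -fvect-cost-model=very-cheap -c saxpy.c
     Because the trip count (1024) is a known multiple of any vector
     length, the vector code can fully replace the scalar loop with no
     peeling or epilogue, so even the very-cheap model accepts it.  The
     cheap model additionally accepts loops that need such extra scalar
     copies, which is consistent with the code-size growth reported
     above for cheap versus the code-size improvement for very cheap.  */
  void
  saxpy (float *restrict y, const float *restrict x, float a)
  {
    for (int i = 0; i < 1024; i++)
      y[i] += a * x[i];
  }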
> >>>>> However, there are the usual arguments against:
> >>>>>
> >>>>> 1) The vectorizer being tuned for SPEC. I think the only way to
> >>>>> overcome that argument is to enable it by default :)
> >>>>> 2) The workloads improved are more of the -Ofast type.
> >>>>>
> >>>>> Here are the non-SPEC benchmarks we track:
> >>>>> https://lnt.opensuse.org/db_default/v4/CPP/latest_runs_report?younger_in_days=14&older_in_days=0&min_percentage_change=0.02&revisions=9388fc7bf0da61a8104e8501e5965120e9159e12%2Cea21f32198432a490dd490696322838d94b3d3b2%2C4f5431c5768bbba81a422f6fed6a6e2454c700ee%2C&include_user_branches=on
> >>>>>
> >>>>> I also tried to run Firefox some time ago. The results are not
> >>>>> surprising: vectorization helps the rendering benchmarks, which are
> >>>>> the ones compiled with aggressive flags anyway.
> >>>>>
> >>>>> Honza
> >>>
> >>> Hi:
> >>> I would like to ask if we can turn on -O2 vectorization now?
> >>
> >> I think we still need to deal with the PR100089 issue that I mentioned
> >> above. Like I say, “dealing with” it could be as simple as disabling:
> >>
> >>   /* If we applied if-conversion then try to vectorize the
> >>      BB of innermost loops.
> >>      ??? Ideally BB vectorization would learn to vectorize
> >>      control flow by applying if-conversion on-the-fly, the
> >>      following retains the if-converted loop body even when
> >>      only non-if-converted parts took part in BB vectorization. */
> >>   if (flag_tree_slp_vectorize != 0
> >>       && loop_vectorized_call
> >>       && ! loop->inner)
> >>
> >> for the very-cheap vector cost model until the PR is fixed properly.
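For reference, one possible shape of that workaround, sketched against the snippet above. This is an illustration, not a committed patch; it assumes GCC's existing flag_vect_cost_model option variable and the VECT_COST_MODEL_VERY_CHEAP enumerator from flag-types.h:

  /* Sketch: skip BB SLP of the if-converted loop body when the
     very-cheap cost model is in effect, so the PR100089 issue cannot
     trigger on this path.  The exact condition and placement are
     illustrative only.  */
  if (flag_tree_slp_vectorize != 0
      && flag_vect_cost_model != VECT_COST_MODEL_VERY_CHEAP
      && loop_vectorized_call
      && ! loop->inner)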
> > Alternatively, only enable loop vectorization at -O2 (the above
> > checks flag_tree_slp_vectorize as well). At least the cost model kind
> > does not have any influence on BB vectorization; that is, we get the
> > same pros and cons as we do for -O3.
> >
> > Did anyone benchmark -O2 -ftree-{loop,slp}-vectorize separately yet?
>
> Here is the measured performance speedup at -O2 vectorization with the
> very cheap cost model on both Power8 and Power9.
>
> INT: -O2 -mcpu=power{8,9} -ftree-{,loop-,slp-}vectorize -fvect-cost-model=very-cheap
> FP:  INT + -ffast-math
>
> Column titles are:
>
> <bmks>  <both loop and slp>  <loop only>  <slp only>   (+: improvement, -: degradation)
>
> Power8:
> 500.perlbench_r     0.00%    0.00%    0.00%
> 502.gcc_r           0.39%    0.78%    0.00%
> 505.mcf_r           0.00%    0.00%    0.00%
> 520.omnetpp_r       1.21%    0.30%    0.00%
> 523.xalancbmk_r     0.00%    0.00%   -0.57%
> 525.x264_r         41.84%   42.55%    0.00%
> 531.deepsjeng_r     0.00%   -0.63%    0.00%
> 541.leela_r        -3.44%   -2.75%    0.00%
> 548.exchange2_r     1.66%    1.66%    0.00%
> 557.xz_r            1.39%    1.04%    0.00%
> Geomean             3.67%    3.64%   -0.06%
>
> 503.bwaves_r        0.00%    0.00%    0.00%
> 507.cactuBSSN_r     0.00%    0.29%    0.44%
> 508.namd_r          0.00%    0.29%    0.00%
> 510.parest_r        0.00%   -0.36%   -0.54%
> 511.povray_r        0.63%    0.31%    0.94%
> 519.lbm_r           2.71%    2.71%    0.00%
> 521.wrf_r           1.04%    1.04%    0.00%
> 526.blender_r      -1.31%   -0.78%    0.00%
> 527.cam4_r         -0.62%   -0.31%   -0.62%
> 538.imagick_r       0.21%    0.21%   -0.21%
> 544.nab_r           0.00%    0.00%    0.00%
> 549.fotonik3d_r     0.00%    0.00%    0.00%
> 554.roms_r          0.30%    0.00%    0.00%
> Geomean             0.22%    0.26%    0.00%
>
> Power9:
>
> 500.perlbench_r     0.62%    0.62%   -1.54%
> 502.gcc_r          -0.60%   -0.60%   -0.81%
> 505.mcf_r           2.05%    2.05%    0.00%
> 520.omnetpp_r      -2.41%   -0.30%   -0.60%
> 523.xalancbmk_r    -1.44%   -2.30%   -1.44%
> 525.x264_r         24.26%   23.93%   -0.33%
> 531.deepsjeng_r     0.32%    0.32%    0.00%
> 541.leela_r         0.39%    1.18%   -0.39%
> 548.exchange2_r     0.76%    0.76%    0.00%
> 557.xz_r            0.36%    0.36%   -0.36%
> Geomean             2.19%    2.38%   -0.55%
>
> 503.bwaves_r        0.00%    0.36%    0.00%
> 507.cactuBSSN_r     0.00%    0.00%    0.00%
> 508.namd_r         -3.73%   -0.31%   -3.73%
> 510.parest_r       -0.21%   -0.42%   -0.42%
> 511.povray_r       -0.96%   -1.59%    0.64%
> 519.lbm_r           2.31%    2.31%    0.17%
> 521.wrf_r           2.66%    2.66%    0.00%
> 526.blender_r      -1.96%   -1.68%    1.40%
> 527.cam4_r          0.00%    0.91%    1.81%
> 538.imagick_r       0.39%   -0.19%  -10.29%  // known noise; imagick_r can have big jitter on the P9 box sometimes
> 544.nab_r           0.25%    0.00%    0.00%
> 549.fotonik3d_r     0.94%    0.94%    0.00%
> 554.roms_r          0.00%    0.00%   -1.05%
> Geomean            -0.03%    0.22%   -0.93%
>
> As above, the gains are mainly from loop vectorization.
> Btw, the Power8 data may be more representative, since some benchmarks
> can have jitter on our P9 perf box.
>
> BR,
> Kewen
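As a reminder of what the two sub-options cover (illustrative code of mine, not taken from the suites above): -ftree-loop-vectorize targets loops, while -ftree-slp-vectorize packs isomorphic straight-line statements within a basic block.

  /* Candidate for the loop vectorizer: one vector operation per group
     of scalar iterations.  */
  void
  scale_loop (float *restrict out, const float *restrict in)
  {
    for (int i = 0; i < 1024; i++)
      out[i] = in[i] * 2.0f;
  }

  /* Candidate for the SLP (basic-block) vectorizer: four independent,
     isomorphic statements that can become one vector multiply and one
     vector store, with no loop involved.  */
  void
  scale_slp (float *restrict out, const float *restrict in)
  {
    out[0] = in[0] * 2.0f;
    out[1] = in[1] * 2.0f;
    out[2] = in[2] * 2.0f;
    out[3] = in[3] * 2.0f;
  }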
Here is the data on CLX. "+" for performance means better; "-" for codesize
means better. We notice a codesize increase in 549.fotonik3d_r (3.36%) that
did not exist in our last measurement with gcc 11.0.0 20210317; it is not
related to the fix of PR100089. The others are about the same as in the
last measurement.

                     -O2 -ftree-vectorize very-cheap   loop vectorizer          bb vectorizer
                     codesize   performance            codesize  performance   codesize  performance
500.perlbench_r       0.34%      0.55%                  0.05%     0.80%          0.29%     0.84%
502.gcc_r             0.29%     -0.32%                  0.02%    -0.12%          0.27%    -0.23%
505.mcf_r             1.36%     -1.20% (noise)          0.00%    -0.69%          1.16%    -0.85%
520.omnetpp_r        -0.65%     -0.83%                  0.05%    -0.97%         -0.70%    -0.52%
523.xalancbmk_r       0.04%     -0.59%                  0.26%    -0.56%         -0.04%    -0.52%
525.x264_r            1.29%     62.62%                  1.18%    64.80%          0.13%    -0.29%
531.deepsjeng_r       0.18%     -0.44%                  0.16%    -0.03%         -0.05%    -0.50%
541.leela_r          -1.10%     -0.12%                 -0.11%     0.59%         -0.99%    -1.12%
548.exchange2_r      -1.19%      0.34%                 -0.27%    -0.29%         -1.02%     0.17%
557.xz_r             -0.53%     -1.01% (cost model)    -0.76%    -0.10%         -0.10%    -1.28%
geomean for intrate   0.00%      4.60%                  0.06%     4.98%         -0.11%    -0.43%

503.bwaves_r         -0.29%     -1.19% (noise)          0.00%    -0.86%         -0.25%    -0.43%
507.cactuBSSN_r       0.01%     -0.55%                  0.01%    -0.35%          0.01%    -0.37%
508.namd_r           -0.61%      2.38%                 -0.13%    -0.09%         -0.67%     2.45%
510.parest_r         -0.41%      0.10%                 -0.16%     0.62%         -0.50%     0.72%
511.povray_r         -1.76%      3.79%                 -0.03%     0.41%         -1.74%     4.61%
519.lbm_r             0.38%     -0.33%                  0.00%    -0.31%          0.38%     0.05%
521.wrf_r            -0.85%      1.23%                 -0.03%     1.60%         -0.94%     0.00%
526.blender_r        -0.40%     -1.21% (noise)          0.00%    -1.49%         -0.43%    -1.64%
527.cam4_r           -0.27%      0.06%                  0.10%    -0.06%         -0.39%    -0.01%
538.imagick_r        -0.97%      1.10%                 -0.09%     0.32%         -0.90%     2.49%
544.nab_r            -0.65%      0.09%                  0.02%     0.20%         -0.69%     0.09%
549.fotonik3d_r       3.36%      0.30%                  2.42%     0.44%          0.93%    -0.08%
554.roms_r           -0.28%     -0.20%                  0.25%     0.06%         -0.52%     0.00%
geomean for fprate   -0.22%      0.42%                  0.18%     0.04%         -0.44%     0.59%
geomean              -0.12%      2.22%                  0.13%     2.16%         -0.30%     0.15%

--
BR,
Hongtao
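For anyone digging into individual results such as the 549.fotonik3d_r code-size change above, GCC's -fopt-info options report, per loop, what the vectorizer did. A minimal illustration (the file name and function below are placeholders of mine; the options themselves are standard GCC ones):

  /* foo.c -- build with, e.g.:
       gcc -O2 -ftree-vectorize -fvect-cost-model=very-cheap \
           -fopt-info-vec-optimized -fopt-info-vec-missed -c foo.c
     -fopt-info-vec-optimized prints each loop that was vectorized;
     -fopt-info-vec-missed prints why a loop was rejected, e.g. by the
     cost model.  With an unknown trip count like this one, the
     very-cheap model will typically report a miss.  */
  void
  clamp_nonnegative (int *restrict a, int n)
  {
    for (int i = 0; i < n; i++)
      if (a[i] < 0)
        a[i] = 0;   /* needs if-conversion before loop vectorization */
  }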