On Mon, Aug 16, 2021 at 2:00 PM Hongtao Liu <crazylht@xxxxxxxxx> wrote:
>
> On Mon, Aug 16, 2021 at 11:23 AM Kewen.Lin via Gcc-help
> <gcc-help@xxxxxxxxxxx> wrote:
> >
> > On 2021/8/4 4:31 PM, Richard Biener wrote:
> > > On Wed, 4 Aug 2021, Richard Sandiford wrote:
> > >
> > >> Hongtao Liu <crazylht@xxxxxxxxx> writes:
> > >>> On Tue, May 18, 2021 at 4:27 AM Richard Sandiford via Gcc-help
> > >>> <gcc-help@xxxxxxxxxxx> wrote:
> > >>>>
> > >>>> Jan Hubicka <hubicka@xxxxxx> writes:
> > >>>>> Hi,
> > >>>>> here are updated scores:
> > >>>>> https://lnt.opensuse.org/db_default/v4/SPEC/latest_runs_report?younger_in_days=14&older_in_days=0&all_elf_detail_stats=on&min_percentage_change=0.001&revisions=9388fc7bf0da61a8104e8501e5965120e9159e12%2Cea21f32198432a490dd490696322838d94b3d3b2%2C4f5431c5768bbba81a422f6fed6a6e2454c700ee%2C&include_user_branches=on
> > >>>>> It compares
> > >>>>> base: mainline
> > >>>>> 1st column: mainline with very cheap vectorization at -O2 and -O3
> > >>>>> 2nd column: mainline with cheap vectorization at -O2 and -O3.
> > >>>>>
> > >>>>> The short story is:
> > >>>>>
> > >>>>> 1) -O2 generic performance
> > >>>>> kabylake (Intel):
> > >>>>>                            very cheap   cheap
> > >>>>> SPEC/SPEC2006/FP/total         ~        8.32%
> > >>>>> SPEC/SPEC2006/total        -0.38%       4.74%
> > >>>>> SPEC/SPEC2006/INT/total    -0.91%      -0.14%
> > >>>>>
> > >>>>> SPEC/SPEC2017/INT/total     4.71%       7.11%
> > >>>>> SPEC/SPEC2017/total         2.22%       6.52%
> > >>>>> SPEC/SPEC2017/FP/total      0.34%       6.06%
> > >>>>> zen
> > >>>>> SPEC/SPEC2006/FP/total      0.61%      10.23%
> > >>>>> SPEC/SPEC2006/total         0.26%       6.27%
> > >>>>> SPEC/SPEC2006/INT/total    34.006      -0.24%    0.90%
> > >>>>>
> > >>>>> SPEC/SPEC2017/INT/total     3.937       5.34%    7.80%
> > >>>>> SPEC/SPEC2017/total         3.02%       6.55%
> > >>>>> SPEC/SPEC2017/FP/total      1.26%       5.60%
> > >>>>>
> > >>>>> 2) -O2 size:
> > >>>>> -0.78% (very cheap) 6.51% (cheap) for spec2k2006
> > >>>>> -0.32% (very cheap) 6.75% (cheap) for spec2k2017
> > >>>>> 3) build times:
> > >>>>> 0%, 0.16%, 0.71%, 0.93% (very cheap) 6.05% 4.80% 6.75% 7.15% (cheap) for spec2k2006
> > >>>>> 0.39% 0.57% 0.71% (very cheap) 5.40% 6.23% 8.44% (cheap) for spec2k2017
> > >>>>> (here I simply copied data from different configurations)
> > >>>>>
> > >>>>> So for SPEC I would say that most of the compile-time cost is derived
> > >>>>> from code size growth, which is a problem with the cheap model but not
> > >>>>> with very cheap. Very cheap indeed results in code size improvements,
> > >>>>> and the compile-time impact is probably somewhere around 0.5%.
> > >>>>>
> > >>>>> So from these scores alone it would seem to me that vectorization makes
> > >>>>> sense at -O2 with the very cheap model (I am sure we have other
> > >>>>> optimizations with worse benefit-to-compile-time tradeoffs).
> > >>>>
> > >>>> Thanks for running these.
> > >>>>
> > >>>> The biggest issue I know of for enabling very-cheap at -O2 is:
> > >>>>
> > >>>> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100089
> > >>>>
> > >>>> Perhaps we could get around that by (hopefully temporarily) disabling
> > >>>> BB SLP within loop vectorisation for the very-cheap model. This would
> > >>>> purely be a workaround and we should remove it once the PR is fixed.
> > >>>> (It would even be a compile-time win in the meantime :-))
> > >>>>
> > >>>> Thanks,
> > >>>> Richard
> > >>>>
> > >>>>> However there are the usual arguments against:
> > >>>>>
> > >>>>> 1) The vectorizer being tuned for SPEC. I think the only way to
> > >>>>> overcome that argument is to enable it by default :)
> > >>>>> 2) The workloads improved are more of -Ofast type workloads
> > >>>>>
> > >>>>> Here are non-SPEC benchmarks we track:
> > >>>>> https://lnt.opensuse.org/db_default/v4/CPP/latest_runs_report?younger_in_days=14&older_in_days=0&min_percentage_change=0.02&revisions=9388fc7bf0da61a8104e8501e5965120e9159e12%2Cea21f32198432a490dd490696322838d94b3d3b2%2C4f5431c5768bbba81a422f6fed6a6e2454c700ee%2C&include_user_branches=on
> > >>>>>
> > >>>>> I also tried to run Firefox some time ago. Results are not surprising -
> > >>>>> vectorization helps rendering benchmarks, which are those compiled with
> > >>>>> aggressive flags anyway.
> > >>>>>
> > >>>>> Honza
> > >>>
> > >>> Hi:
> > >>> I would like to ask if we can turn on O2 vectorization now?
> > >>
> > >> I think we still need to deal with the PR100089 issue that I mentioned above.
> > >> Like I say, "dealing with" it could be as simple as disabling:
> > >>
> > >>       /* If we applied if-conversion then try to vectorize the
> > >>          BB of innermost loops.
> > >>          ??? Ideally BB vectorization would learn to vectorize
> > >>          control flow by applying if-conversion on-the-fly, the
> > >>          following retains the if-converted loop body even when
> > >>          only non-if-converted parts took part in BB vectorization.  */
> > >>       if (flag_tree_slp_vectorize != 0
> > >>           && loop_vectorized_call
> > >>           && ! loop->inner)
> > >>
> > >> for the very-cheap vector cost model until the PR is fixed properly.
> > >
> > > Alternatively, only enable loop vectorization at -O2 (the above checks
> > > flag_tree_slp_vectorize as well). At least the cost model kind
> > > does not have any influence on BB vectorization, that is, we get the
> > > same pros and cons as we do for -O3.
> > >
> > > Did anyone benchmark -O2 -ftree-{loop,slp}-vectorize separately yet?
> > >
> >
> > Here is the measured performance speedup at -O2 vectorization with the
> > very cheap cost model on both Power8 and Power9.
> >
> > INT: -O2 -mcpu=power{8,9} -ftree-{,loop-,slp-}vectorize -fvect-cost-model=very-cheap
> > FP:  INT + -ffast-math
> >
> > Column titles are:
> >
> > <bmks>  <both loop and slp>  <loop only>  <slp only>   (+: improvement, -: degradation)
> >
> > Power8:
> > 500.perlbench_r     0.00%    0.00%    0.00%
> > 502.gcc_r           0.39%    0.78%    0.00%
> > 505.mcf_r           0.00%    0.00%    0.00%
> > 520.omnetpp_r       1.21%    0.30%    0.00%
> > 523.xalancbmk_r     0.00%    0.00%   -0.57%
> > 525.x264_r         41.84%   42.55%    0.00%
> > 531.deepsjeng_r     0.00%   -0.63%    0.00%
> > 541.leela_r        -3.44%   -2.75%    0.00%
> > 548.exchange2_r     1.66%    1.66%    0.00%
> > 557.xz_r            1.39%    1.04%    0.00%
> > Geomean             3.67%    3.64%   -0.06%
> >
> > 503.bwaves_r        0.00%    0.00%    0.00%
> > 507.cactuBSSN_r     0.00%    0.29%    0.44%
> > 508.namd_r          0.00%    0.29%    0.00%
> > 510.parest_r        0.00%   -0.36%   -0.54%
> > 511.povray_r        0.63%    0.31%    0.94%
> > 519.lbm_r           2.71%    2.71%    0.00%
> > 521.wrf_r           1.04%    1.04%    0.00%
> > 526.blender_r      -1.31%   -0.78%    0.00%
> > 527.cam4_r         -0.62%   -0.31%   -0.62%
> > 538.imagick_r       0.21%    0.21%   -0.21%
> > 544.nab_r           0.00%    0.00%    0.00%
> > 549.fotonik3d_r     0.00%    0.00%    0.00%
> > 554.roms_r          0.30%    0.00%    0.00%
> > Geomean             0.22%    0.26%    0.00%
> >
> > Power9:
> >
> > 500.perlbench_r     0.62%    0.62%   -1.54%
> > 502.gcc_r          -0.60%   -0.60%   -0.81%
> > 505.mcf_r           2.05%    2.05%    0.00%
> > 520.omnetpp_r      -2.41%   -0.30%   -0.60%
> > 523.xalancbmk_r    -1.44%   -2.30%   -1.44%
> > 525.x264_r         24.26%   23.93%   -0.33%
> > 531.deepsjeng_r     0.32%    0.32%    0.00%
> > 541.leela_r         0.39%    1.18%   -0.39%
> > 548.exchange2_r     0.76%    0.76%    0.00%
> > 557.xz_r            0.36%    0.36%   -0.36%
> > Geomean             2.19%    2.38%   -0.55%
> >
> > 503.bwaves_r        0.00%    0.36%    0.00%
> > 507.cactuBSSN_r     0.00%    0.00%    0.00%
> > 508.namd_r         -3.73%   -0.31%   -3.73%
> > 510.parest_r       -0.21%   -0.42%   -0.42%
> > 511.povray_r       -0.96%   -1.59%    0.64%
> > 519.lbm_r           2.31%    2.31%    0.17%
> > 521.wrf_r           2.66%    2.66%    0.00%
> > 526.blender_r      -1.96%   -1.68%    1.40%
> > 527.cam4_r          0.00%    0.91%    1.81%
> > 538.imagick_r       0.39%   -0.19%  -10.29%  // known noise, imagick_r can have big jitter on P9 box sometimes.
> > 544.nab_r           0.25%    0.00%    0.00%
> > 549.fotonik3d_r     0.94%    0.94%    0.00%
> > 554.roms_r          0.00%    0.00%   -1.05%
> > Geomean            -0.03%    0.22%   -0.93%
> >
> > As above, the gains are mainly from loop vectorization.
> > Btw, Power8 data can be more representative, since some bmks can have
> > jitters on our P9 perf box.
> >
> > BR,
> > Kewen
>
> Here is data on CLX.
> + for performance means better.
> - for codesize means better.
>
> We notice there's a codesize increase of 3.36% in 549.fotonik3d_r which
> did not exist in our last measurement with gcc 11.0.0 20210317; it's not
> related to the fix of PR100089.
> Others are about the same as the last measurement.
> -O2 -ftree-vectorize, very-cheap cost model:
>
>                      codesize  performance
> 500.perlbench_r       0.34%     0.55%
> 502.gcc_r             0.29%    -0.32%
> 505.mcf_r             1.36%    -1.20% (noise)
> 520.omnetpp_r        -0.65%    -0.83%
> 523.xalancbmk_r       0.04%    -0.59%
> 525.x264_r            1.29%    62.62%
> 531.deepsjeng_r       0.18%    -0.44%
> 541.leela_r          -1.10%    -0.12%
> 548.exchange2_r      -1.19%     0.34%
> 557.xz_r             -0.53%    -1.01% (cost model)
> geomean for intrate   0.00%     4.60%
>
> 503.bwaves_r         -0.29%    -1.19%
> 507.cactuBSSN_r       0.01%    -0.55%
> 508.namd_r           -0.61%     2.38%
> 510.parest_r         -0.41%     0.10%
> 511.povray_r         -1.76%     3.79%
> 519.lbm_r             0.38%    -0.33%
> 521.wrf_r            -0.85%     1.23%
> 526.blender_r        -0.40%    -1.21% (noise)
> 527.cam4_r           -0.27%     0.06%
> 538.imagick_r        -0.97%     1.10%
> 544.nab_r            -0.65%     0.09%
> 549.fotonik3d_r       3.36%     0.30%
> 554.roms_r           -0.28%    -0.20%
> geomean for fprate   -0.22%     0.42%
> geomean              -0.12%     2.22%

                     loop vectorizer         bb vectorizer
                     codesize performance    codesize performance
500.perlbench_r       0.05%    0.80%          0.29%    0.84%
502.gcc_r             0.02%   -0.12%          0.27%   -0.23%
505.mcf_r             0.00%   -0.69%          1.16%   -0.85%
520.omnetpp_r         0.05%   -0.97%         -0.70%   -0.52%
523.xalancbmk_r       0.26%   -0.56%         -0.04%   -0.52%
525.x264_r            1.18%   64.80%          0.13%   -0.29%
531.deepsjeng_r       0.16%   -0.03%         -0.05%   -0.50%
541.leela_r          -0.11%    0.59%         -0.99%   -1.12%
548.exchange2_r      -0.27%   -0.29%         -1.02%    0.17%
557.xz_r             -0.76%   -0.10%         -0.10%   -1.28%
geomean for intrate   0.06%    4.98%         -0.11%   -0.43%

503.bwaves_r          0.00%   -0.86%         -0.25%   -0.43%
507.cactuBSSN_r       0.01%   -0.35%          0.01%   -0.37%
508.namd_r           -0.13%   -0.09%         -0.67%    2.45%
510.parest_r         -0.16%    0.62%         -0.50%    0.72%
511.povray_r         -0.03%    0.41%         -1.74%    4.61%
519.lbm_r             0.00%   -0.31%          0.38%    0.05%
521.wrf_r            -0.03%    1.60%         -0.94%    0.00%
526.blender_r         0.00%   -1.49%         -0.43%   -1.64%
527.cam4_r            0.10%   -0.06%         -0.39%   -0.01%
538.imagick_r        -0.09%    0.32%         -0.90%    2.49%
544.nab_r             0.02%    0.20%         -0.69%    0.09%
549.fotonik3d_r       2.42%    0.44%          0.93%   -0.08%
554.roms_r            0.25%    0.06%         -0.52%    0.00%
geomean for fprate    0.18%    0.04%         -0.44%    0.59%
geomean               0.13%    2.16%         -0.30%    0.15%

--
BR,
Hongtao