On Mon, Aug 16, 2021 at 2:00 PM Hongtao Liu <crazylht@xxxxxxxxx> wrote:
>
> On Mon, Aug 16, 2021 at 11:23 AM Kewen.Lin via Gcc-help
> <gcc-help@xxxxxxxxxxx> wrote:
> >
> > On 2021/8/4 4:31 PM, Richard Biener wrote:
> > > On Wed, 4 Aug 2021, Richard Sandiford wrote:
> > >
> > >> Hongtao Liu <crazylht@xxxxxxxxx> writes:
> > >>> On Tue, May 18, 2021 at 4:27 AM Richard Sandiford via Gcc-help
> > >>> <gcc-help@xxxxxxxxxxx> wrote:
> > >>>>
> > >>>> Jan Hubicka <hubicka@xxxxxx> writes:
> > >>>>> Hi,
> > >>>>> here are updated scores:
> > >>>>> https://lnt.opensuse.org/db_default/v4/SPEC/latest_runs_report?younger_in_days=14&older_in_days=0&all_elf_detail_stats=on&min_percentage_change=0.001&revisions=9388fc7bf0da61a8104e8501e5965120e9159e12%2Cea21f32198432a490dd490696322838d94b3d3b2%2C4f5431c5768bbba81a422f6fed6a6e2454c700ee%2C&include_user_branches=on
> > >>>>> It compares
> > >>>>> base: mainline
> > >>>>> 1st column: mainline with very cheap vectorization at -O2 and -O3
> > >>>>> 2nd column: mainline with cheap vectorization at -O2 and -O3.
> > >>>>>
> > >>>>> The short story is:
> > >>>>>
> > >>>>> 1) -O2 generic performance
> > >>>>> kabylake (Intel):
> > >>>>>                            very cheap   cheap
> > >>>>> SPEC/SPEC2006/FP/total         ~        8.32%
> > >>>>> SPEC/SPEC2006/total        -0.38%       4.74%
> > >>>>> SPEC/SPEC2006/INT/total    -0.91%      -0.14%
> > >>>>>
> > >>>>> SPEC/SPEC2017/INT/total     4.71%       7.11%
> > >>>>> SPEC/SPEC2017/total         2.22%       6.52%
> > >>>>> SPEC/SPEC2017/FP/total      0.34%       6.06%
> > >>>>> zen
> > >>>>> SPEC/SPEC2006/FP/total      0.61%      10.23%
> > >>>>> SPEC/SPEC2006/total         0.26%       6.27%
> > >>>>> SPEC/SPEC2006/INT/total    34.006      -0.24%    0.90%
> > >>>>>
> > >>>>> SPEC/SPEC2017/INT/total     3.937       5.34%    7.80%
> > >>>>> SPEC/SPEC2017/total         3.02%       6.55%
> > >>>>> SPEC/SPEC2017/FP/total      1.26%       5.60%
> > >>>>>
> > >>>>> 2) -O2 size:
> > >>>>> -0.78% (very cheap) 6.51% (cheap) for spec2k2006
> > >>>>> -0.32% (very cheap) 6.75% (cheap) for spec2k2017
> > >>>>> 3) build times:
> > >>>>> 0%, 0.16%, 0.71%, 0.93% (very cheap) 6.05% 4.80% 6.75% 7.15% (cheap) for spec2k2006
> > >>>>> 0.39% 0.57% 0.71% (very cheap) 5.40% 6.23% 8.44% (cheap) for spec2k2017
> > >>>>> (here I simply copied data from different configurations)
> > >>>>>
> > >>>>> So for SPEC I would say that most of the compile-time cost is derived
> > >>>>> from code size growth, which is a problem with the cheap model but not
> > >>>>> with very cheap. Very cheap indeed results in code size improvements,
> > >>>>> and the compile-time impact is probably somewhere around 0.5%.
> > >>>>>
> > >>>>> So from these scores alone it would seem to me that vectorization makes
> > >>>>> sense at -O2 with the very cheap model (I am sure we have other
> > >>>>> optimizations with worse benefit-to-compile-time tradeoffs).
> > >>>>
> > >>>> Thanks for running these.
> > >>>>
> > >>>> The biggest issue I know of for enabling very-cheap at -O2 is:
> > >>>>
> > >>>> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100089
> > >>>>
> > >>>> Perhaps we could get around that by (hopefully temporarily) disabling
> > >>>> BB SLP within loop vectorisation for the very-cheap model. This would
> > >>>> purely be a workaround and we should remove it once the PR is fixed.
> > >>>> (It would even be a compile-time win in the meantime :-))
> > >>>>
> > >>>> Thanks,
> > >>>> Richard
> > >>>>
> > >>>>> However there are the usual arguments against:
> > >>>>>
> > >>>>> 1) The vectorizer being tuned for SPEC. I think the only way to
> > >>>>> overcome that argument is to enable it by default :)
> > >>>>> 2) The workloads improved are more of -Ofast type workloads
> > >>>>>
> > >>>>> Here are non-SPEC benchmarks we track:
> > >>>>> https://lnt.opensuse.org/db_default/v4/CPP/latest_runs_report?younger_in_days=14&older_in_days=0&min_percentage_change=0.02&revisions=9388fc7bf0da61a8104e8501e5965120e9159e12%2Cea21f32198432a490dd490696322838d94b3d3b2%2C4f5431c5768bbba81a422f6fed6a6e2454c700ee%2C&include_user_branches=on
> > >>>>>
> > >>>>> I also tried to run Firefox some time ago. Results are not surprising -
> > >>>>> vectorization helps rendering benchmarks, which are those compiled with
> > >>>>> aggressive flags anyway.
> > >>>>>
> > >>>>> Honza
> > >>>
> > >>> Hi:
> > >>> I would like to ask if we can turn on O2 vectorization now?
> > >>
> > >> I think we still need to deal with the PR100089 issue that I mentioned above.
> > >> Like I say, "dealing with" it could be as simple as disabling:
> > >>
> > >>       /* If we applied if-conversion then try to vectorize the
> > >>          BB of innermost loops.
> > >>          ??? Ideally BB vectorization would learn to vectorize
> > >>          control flow by applying if-conversion on-the-fly, the
> > >>          following retains the if-converted loop body even when
> > >>          only non-if-converted parts took part in BB vectorization.  */
> > >>       if (flag_tree_slp_vectorize != 0
> > >>           && loop_vectorized_call
> > >>           && ! loop->inner)
> > >>
> > >> for the very-cheap vector cost model until the PR is fixed properly.
> > >
> > > Alternatively, only enable loop vectorization at -O2 (the above checks
> > > flag_tree_slp_vectorize as well). At least the cost model kind
> > > does not have any influence on BB vectorization, that is, we get the
> > > same pros and cons as we do for -O3.
> > >
> > > Did anyone benchmark -O2 -ftree-{loop,slp}-vectorize separately yet?
> > >
> >
> > Here is the measured performance speedup at -O2 vectorization with the
> > very cheap cost model on both Power8 and Power9.
> >
> > INT: -O2 -mcpu=power{8,9} -ftree-{,loop-,slp-}vectorize -fvect-cost-model=very-cheap
> > FP:  INT + -ffast-math
> >
> > Column titles are:
> >
> > <bmks>  <both loop and slp>  <loop only>  <slp only>   (+: improvement, -: degradation)
> >
> > Power8:
> > 500.perlbench_r     0.00%    0.00%    0.00%
> > 502.gcc_r           0.39%    0.78%    0.00%
> > 505.mcf_r           0.00%    0.00%    0.00%
> > 520.omnetpp_r       1.21%    0.30%    0.00%
> > 523.xalancbmk_r     0.00%    0.00%   -0.57%
> > 525.x264_r         41.84%   42.55%    0.00%
> > 531.deepsjeng_r     0.00%   -0.63%    0.00%
> > 541.leela_r        -3.44%   -2.75%    0.00%
> > 548.exchange2_r     1.66%    1.66%    0.00%
> > 557.xz_r            1.39%    1.04%    0.00%
> > Geomean             3.67%    3.64%   -0.06%
> >
> > 503.bwaves_r        0.00%    0.00%    0.00%
> > 507.cactuBSSN_r     0.00%    0.29%    0.44%
> > 508.namd_r          0.00%    0.29%    0.00%
> > 510.parest_r        0.00%   -0.36%   -0.54%
> > 511.povray_r        0.63%    0.31%    0.94%
> > 519.lbm_r           2.71%    2.71%    0.00%
> > 521.wrf_r           1.04%    1.04%    0.00%
> > 526.blender_r      -1.31%   -0.78%    0.00%
> > 527.cam4_r         -0.62%   -0.31%   -0.62%
> > 538.imagick_r       0.21%    0.21%   -0.21%
> > 544.nab_r           0.00%    0.00%    0.00%
> > 549.fotonik3d_r     0.00%    0.00%    0.00%
> > 554.roms_r          0.30%    0.00%    0.00%
> > Geomean             0.22%    0.26%    0.00%
> >
> > Power9:
> >
> > 500.perlbench_r     0.62%    0.62%   -1.54%
> > 502.gcc_r          -0.60%   -0.60%   -0.81%
> > 505.mcf_r           2.05%    2.05%    0.00%
> > 520.omnetpp_r      -2.41%   -0.30%   -0.60%
> > 523.xalancbmk_r    -1.44%   -2.30%   -1.44%
> > 525.x264_r         24.26%   23.93%   -0.33%
> > 531.deepsjeng_r     0.32%    0.32%    0.00%
> > 541.leela_r         0.39%    1.18%   -0.39%
> > 548.exchange2_r     0.76%    0.76%    0.00%
> > 557.xz_r            0.36%    0.36%   -0.36%
> > Geomean             2.19%    2.38%   -0.55%
> >
> > 503.bwaves_r        0.00%    0.36%    0.00%
> > 507.cactuBSSN_r     0.00%    0.00%    0.00%
> > 508.namd_r         -3.73%   -0.31%   -3.73%
> > 510.parest_r       -0.21%   -0.42%   -0.42%
> > 511.povray_r       -0.96%   -1.59%    0.64%
> > 519.lbm_r           2.31%    2.31%    0.17%
> > 521.wrf_r           2.66%    2.66%    0.00%
> > 526.blender_r      -1.96%   -1.68%    1.40%
> > 527.cam4_r          0.00%    0.91%    1.81%
> > 538.imagick_r       0.39%   -0.19%  -10.29%  // known noise, imagick_r can have big jitter on P9 box sometimes.
> > 544.nab_r           0.25%    0.00%    0.00%
> > 549.fotonik3d_r     0.94%    0.94%    0.00%
> > 554.roms_r          0.00%    0.00%   -1.05%
> > Geomean            -0.03%    0.22%   -0.93%
> >
> > As above, the gains are mainly from loop vectorization.
> > Btw, Power8 data can be more representative, since some bmks can have
> > jitters on our P9 perf box.
> >
> > BR,
> > Kewen
>
> Here is data on CLX.
> + for performance means better.
> - for codesize means better.
>
> We notice there's a codesize increase of 3.36% in 549.fotonik3d_r which
> did not exist in our last measurement with gcc 11.0.0 20210317; it's not
> related to the fix of PR100089.
> Others are about the same as the last measurement.
> -O2 -ftree-vectorize, very-cheap cost model:
>
>                      codesize  performance
> 500.perlbench_r       0.34%     0.55%
> 502.gcc_r             0.29%    -0.32%
> 505.mcf_r             1.36%    -1.20% (noise)
> 520.omnetpp_r        -0.65%    -0.83%
> 523.xalancbmk_r       0.04%    -0.59%
> 525.x264_r            1.29%    62.62%
> 531.deepsjeng_r       0.18%    -0.44%
> 541.leela_r          -1.10%    -0.12%
> 548.exchange2_r      -1.19%     0.34%
> 557.xz_r             -0.53%    -1.01% (cost model)
> geomean for intrate   0.00%     4.60%
>
> 503.bwaves_r         -0.29%    -1.19%
> 507.cactuBSSN_r       0.01%    -0.55%
> 508.namd_r           -0.61%     2.38%
> 510.parest_r         -0.41%     0.10%
> 511.povray_r         -1.76%     3.79%
> 519.lbm_r             0.38%    -0.33%
> 521.wrf_r            -0.85%     1.23%
> 526.blender_r        -0.40%    -1.21% (noise)
> 527.cam4_r           -0.27%     0.06%
> 538.imagick_r        -0.97%     1.10%
> 544.nab_r            -0.65%     0.09%
> 549.fotonik3d_r       3.36%     0.30%
> 554.roms_r           -0.28%    -0.20%
> geomean for fprate   -0.22%     0.42%
> geomean              -0.12%     2.22%

                     loop vectorizer         bb vectorizer
                     codesize performance    codesize performance
500.perlbench_r       0.05%    0.80%          0.29%    0.84%
502.gcc_r             0.02%   -0.12%          0.27%   -0.23%
505.mcf_r             0.00%   -0.69%          1.16%   -0.85%
520.omnetpp_r         0.05%   -0.97%         -0.70%   -0.52%
523.xalancbmk_r       0.26%   -0.56%         -0.04%   -0.52%
525.x264_r            1.18%   64.80%          0.13%   -0.29%
531.deepsjeng_r       0.16%   -0.03%         -0.05%   -0.50%
541.leela_r          -0.11%    0.59%         -0.99%   -1.12%
548.exchange2_r      -0.27%   -0.29%         -1.02%    0.17%
557.xz_r             -0.76%   -0.10%         -0.10%   -1.28%
geomean for intrate   0.06%    4.98%         -0.11%   -0.43%

503.bwaves_r          0.00%   -0.86%         -0.25%   -0.43%
507.cactuBSSN_r       0.01%   -0.35%          0.01%   -0.37%
508.namd_r           -0.13%   -0.09%         -0.67%    2.45%
510.parest_r         -0.16%    0.62%         -0.50%    0.72%
511.povray_r         -0.03%    0.41%         -1.74%    4.61%
519.lbm_r             0.00%   -0.31%          0.38%    0.05%
521.wrf_r            -0.03%    1.60%         -0.94%    0.00%
526.blender_r         0.00%   -1.49%         -0.43%   -1.64%
527.cam4_r            0.10%   -0.06%         -0.39%   -0.01%
538.imagick_r        -0.09%    0.32%         -0.90%    2.49%
544.nab_r             0.02%    0.20%         -0.69%    0.09%
549.fotonik3d_r       2.42%    0.44%          0.93%   -0.08%
554.roms_r            0.25%    0.06%         -0.52%    0.00%
geomean for fprate    0.18%    0.04%         -0.44%    0.59%
geomean               0.13%    2.16%         -0.30%    0.15%

--
BR,
Hongtao