On Mon, Aug 16, 2021 at 11:23 AM Kewen.Lin via Gcc-help <gcc-help@xxxxxxxxxxx> wrote:
>
> on 2021/8/4 at 4:31 PM, Richard Biener wrote:
> > On Wed, 4 Aug 2021, Richard Sandiford wrote:
> >
> >> Hongtao Liu <crazylht@xxxxxxxxx> writes:
> >>> On Tue, May 18, 2021 at 4:27 AM Richard Sandiford via Gcc-help
> >>> <gcc-help@xxxxxxxxxxx> wrote:
> >>>>
> >>>> Jan Hubicka <hubicka@xxxxxx> writes:
> >>>>> Hi,
> >>>>> here are the updated scores:
> >>>>> https://lnt.opensuse.org/db_default/v4/SPEC/latest_runs_report?younger_in_days=14&older_in_days=0&all_elf_detail_stats=on&min_percentage_change=0.001&revisions=9388fc7bf0da61a8104e8501e5965120e9159e12%2Cea21f32198432a490dd490696322838d94b3d3b2%2C4f5431c5768bbba81a422f6fed6a6e2454c700ee%2C&include_user_branches=on
> >>>>> It compares:
> >>>>> base:       mainline
> >>>>> 1st column: mainline with very cheap vectorization at -O2 and -O3
> >>>>> 2nd column: mainline with cheap vectorization at -O2 and -O3
> >>>>>
> >>>>> The short story is:
> >>>>>
> >>>>> 1) -O2 generic performance
> >>>>>
> >>>>> kabylake (Intel):           very cheap     cheap
> >>>>> SPEC/SPEC2006/FP/total          ~           8.32%
> >>>>> SPEC/SPEC2006/total          -0.38%         4.74%
> >>>>> SPEC/SPEC2006/INT/total      -0.91%        -0.14%
> >>>>>
> >>>>> SPEC/SPEC2017/INT/total       4.71%         7.11%
> >>>>> SPEC/SPEC2017/total           2.22%         6.52%
> >>>>> SPEC/SPEC2017/FP/total        0.34%         6.06%
> >>>>>
> >>>>> zen:                        very cheap     cheap
> >>>>> SPEC/SPEC2006/FP/total        0.61%        10.23%
> >>>>> SPEC/SPEC2006/total           0.26%         6.27%
> >>>>> SPEC/SPEC2006/INT/total  34.006  -0.24%     0.90%
> >>>>>
> >>>>> SPEC/SPEC2017/INT/total  3.937    5.34%     7.80%
> >>>>> SPEC/SPEC2017/total           3.02%         6.55%
> >>>>> SPEC/SPEC2017/FP/total        1.26%         5.60%
> >>>>>
> >>>>> 2) -O2 size:
> >>>>> -0.78% (very cheap)  6.51% (cheap) for spec2k2006
> >>>>> -0.32% (very cheap)  6.75% (cheap) for spec2k2017
> >>>>>
> >>>>> 3) build times:
> >>>>> 0%, 0.16%, 0.71%, 0.93% (very cheap)  6.05%, 4.80%, 6.75%, 7.15% (cheap) for spec2k2006
> >>>>> 0.39%, 0.57%, 0.71% (very cheap)  5.40%, 6.23%, 8.44% (cheap) for spec2k2017
> >>>>> (here I simply copied data from the different configurations)
> >>>>>
> >>>>> So for SPEC I would say that most of the compile-time cost derives
> >>>>> from code-size growth, which is a problem with the cheap model but
> >>>>> not with very cheap. Very cheap indeed results in code-size
> >>>>> improvements, and the compile-time impact is probably somewhere
> >>>>> around 0.5%.
> >>>>>
> >>>>> So from these scores alone it would seem to me that vectorization
> >>>>> makes sense at -O2 with the very cheap model (I am sure we have
> >>>>> other optimizations with worse benefit-to-compile-time tradeoffs).
> >>>>
> >>>> Thanks for running these.
> >>>>
> >>>> The biggest issue I know of for enabling very-cheap at -O2 is:
> >>>>
> >>>>   https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100089
> >>>>
> >>>> Perhaps we could get around that by (hopefully temporarily) disabling
> >>>> BB SLP within loop vectorisation for the very-cheap model. This would
> >>>> purely be a workaround and we should remove it once the PR is fixed.
> >>>> (It would even be a compile-time win in the meantime :-))
> >>>>
> >>>> Thanks,
> >>>> Richard
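For concreteness, here is a minimal example of the kind of loop the two cost models treat differently. The function and compile line are illustrative (mine, not from the runs above), and the code-size reasoning in the comment is my understanding of the models, not a statement from the thread:

  /* Compile with, e.g.:
       gcc -O2 -ftree-vectorize -fvect-cost-model=very-cheap -c saxpy.c
     Because the trip count (1024) is a known multiple of any vector
     length, the vector code can fully replace the scalar loop with no
     peeling or epilogue, so even the very-cheap model accepts it.  The
     cheap model additionally accepts loops that need such extra scalar
     copies, which is consistent with the code-size growth reported
     above for cheap versus the code-size improvement for very cheap.  */
  void
  saxpy (float *restrict y, const float *restrict x, float a)
  {
    for (int i = 0; i < 1024; i++)
      y[i] += a * x[i];
  }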
> >>>>> However, there are the usual arguments against:
> >>>>>
> >>>>> 1) The vectorizer being tuned for SPEC. I think the only way to
> >>>>> overcome that argument is to enable it by default :)
> >>>>> 2) The workloads improved are more of the -Ofast type.
> >>>>>
> >>>>> Here are the non-SPEC benchmarks we track:
> >>>>> https://lnt.opensuse.org/db_default/v4/CPP/latest_runs_report?younger_in_days=14&older_in_days=0&min_percentage_change=0.02&revisions=9388fc7bf0da61a8104e8501e5965120e9159e12%2Cea21f32198432a490dd490696322838d94b3d3b2%2C4f5431c5768bbba81a422f6fed6a6e2454c700ee%2C&include_user_branches=on
> >>>>>
> >>>>> I also tried to run Firefox some time ago. The results are not
> >>>>> surprising: vectorization helps the rendering benchmarks, which are
> >>>>> the ones compiled with aggressive flags anyway.
> >>>>>
> >>>>> Honza
> >>>
> >>> Hi:
> >>> I would like to ask if we can turn on -O2 vectorization now?
> >>
> >> I think we still need to deal with the PR100089 issue that I mentioned
> >> above. Like I say, “dealing with” it could be as simple as disabling:
> >>
> >>   /* If we applied if-conversion then try to vectorize the
> >>      BB of innermost loops.
> >>      ??? Ideally BB vectorization would learn to vectorize
> >>      control flow by applying if-conversion on-the-fly, the
> >>      following retains the if-converted loop body even when
> >>      only non-if-converted parts took part in BB vectorization. */
> >>   if (flag_tree_slp_vectorize != 0
> >>       && loop_vectorized_call
> >>       && ! loop->inner)
> >>
> >> for the very-cheap vector cost model until the PR is fixed properly.
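For reference, one possible shape of that workaround, sketched against the snippet above. This is an illustration, not a committed patch; it assumes GCC's existing flag_vect_cost_model option variable and the VECT_COST_MODEL_VERY_CHEAP enumerator from flag-types.h:

  /* Sketch: skip BB SLP of the if-converted loop body when the
     very-cheap cost model is in effect, so the PR100089 issue cannot
     trigger on this path.  The exact condition and placement are
     illustrative only.  */
  if (flag_tree_slp_vectorize != 0
      && flag_vect_cost_model != VECT_COST_MODEL_VERY_CHEAP
      && loop_vectorized_call
      && ! loop->inner)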
> > Alternatively, only enable loop vectorization at -O2 (the above
> > checks flag_tree_slp_vectorize as well). At least the cost model kind
> > does not have any influence on BB vectorization; that is, we get the
> > same pros and cons as we do for -O3.
> >
> > Did anyone benchmark -O2 -ftree-{loop,slp}-vectorize separately yet?
>
> Here is the measured performance speedup at -O2 vectorization with the
> very cheap cost model on both Power8 and Power9.
>
> INT: -O2 -mcpu=power{8,9} -ftree-{,loop-,slp-}vectorize -fvect-cost-model=very-cheap
> FP:  INT + -ffast-math
>
> Column titles are:
>
> <bmks>  <both loop and slp>  <loop only>  <slp only>   (+: improvement, -: degradation)
>
> Power8:
> 500.perlbench_r     0.00%    0.00%    0.00%
> 502.gcc_r           0.39%    0.78%    0.00%
> 505.mcf_r           0.00%    0.00%    0.00%
> 520.omnetpp_r       1.21%    0.30%    0.00%
> 523.xalancbmk_r     0.00%    0.00%   -0.57%
> 525.x264_r         41.84%   42.55%    0.00%
> 531.deepsjeng_r     0.00%   -0.63%    0.00%
> 541.leela_r        -3.44%   -2.75%    0.00%
> 548.exchange2_r     1.66%    1.66%    0.00%
> 557.xz_r            1.39%    1.04%    0.00%
> Geomean             3.67%    3.64%   -0.06%
>
> 503.bwaves_r        0.00%    0.00%    0.00%
> 507.cactuBSSN_r     0.00%    0.29%    0.44%
> 508.namd_r          0.00%    0.29%    0.00%
> 510.parest_r        0.00%   -0.36%   -0.54%
> 511.povray_r        0.63%    0.31%    0.94%
> 519.lbm_r           2.71%    2.71%    0.00%
> 521.wrf_r           1.04%    1.04%    0.00%
> 526.blender_r      -1.31%   -0.78%    0.00%
> 527.cam4_r         -0.62%   -0.31%   -0.62%
> 538.imagick_r       0.21%    0.21%   -0.21%
> 544.nab_r           0.00%    0.00%    0.00%
> 549.fotonik3d_r     0.00%    0.00%    0.00%
> 554.roms_r          0.30%    0.00%    0.00%
> Geomean             0.22%    0.26%    0.00%
>
> Power9:
>
> 500.perlbench_r     0.62%    0.62%   -1.54%
> 502.gcc_r          -0.60%   -0.60%   -0.81%
> 505.mcf_r           2.05%    2.05%    0.00%
> 520.omnetpp_r      -2.41%   -0.30%   -0.60%
> 523.xalancbmk_r    -1.44%   -2.30%   -1.44%
> 525.x264_r         24.26%   23.93%   -0.33%
> 531.deepsjeng_r     0.32%    0.32%    0.00%
> 541.leela_r         0.39%    1.18%   -0.39%
> 548.exchange2_r     0.76%    0.76%    0.00%
> 557.xz_r            0.36%    0.36%   -0.36%
> Geomean             2.19%    2.38%   -0.55%
>
> 503.bwaves_r        0.00%    0.36%    0.00%
> 507.cactuBSSN_r     0.00%    0.00%    0.00%
> 508.namd_r         -3.73%   -0.31%   -3.73%
> 510.parest_r       -0.21%   -0.42%   -0.42%
> 511.povray_r       -0.96%   -1.59%    0.64%
> 519.lbm_r           2.31%    2.31%    0.17%
> 521.wrf_r           2.66%    2.66%    0.00%
> 526.blender_r      -1.96%   -1.68%    1.40%
> 527.cam4_r          0.00%    0.91%    1.81%
> 538.imagick_r       0.39%   -0.19%  -10.29%  // known noise; imagick_r can have big jitter on the P9 box sometimes
> 544.nab_r           0.25%    0.00%    0.00%
> 549.fotonik3d_r     0.94%    0.94%    0.00%
> 554.roms_r          0.00%    0.00%   -1.05%
> Geomean            -0.03%    0.22%   -0.93%
>
> As above, the gains are mainly from loop vectorization.
> Btw, the Power8 data may be more representative, since some benchmarks
> can have jitter on our P9 perf box.
>
> BR,
> Kewen
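As a reminder of what the two sub-options cover (illustrative code of mine, not taken from the suites above): -ftree-loop-vectorize targets loops, while -ftree-slp-vectorize packs isomorphic straight-line statements within a basic block.

  /* Candidate for the loop vectorizer: one vector operation per group
     of scalar iterations.  */
  void
  scale_loop (float *restrict out, const float *restrict in)
  {
    for (int i = 0; i < 1024; i++)
      out[i] = in[i] * 2.0f;
  }

  /* Candidate for the SLP (basic-block) vectorizer: four independent,
     isomorphic statements that can become one vector multiply and one
     vector store, with no loop involved.  */
  void
  scale_slp (float *restrict out, const float *restrict in)
  {
    out[0] = in[0] * 2.0f;
    out[1] = in[1] * 2.0f;
    out[2] = in[2] * 2.0f;
    out[3] = in[3] * 2.0f;
  }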
Here is the data on CLX. "+" for performance means better; "-" for codesize
means better. We notice a codesize increase in 549.fotonik3d_r (3.36%) that
did not exist in our last measurement with gcc 11.0.0 20210317; it is not
related to the fix of PR100089. The others are about the same as in the
last measurement.

                     -O2 -ftree-vectorize very-cheap   loop vectorizer          bb vectorizer
                     codesize   performance            codesize  performance   codesize  performance
500.perlbench_r       0.34%      0.55%                  0.05%     0.80%          0.29%     0.84%
502.gcc_r             0.29%     -0.32%                  0.02%    -0.12%          0.27%    -0.23%
505.mcf_r             1.36%     -1.20% (noise)          0.00%    -0.69%          1.16%    -0.85%
520.omnetpp_r        -0.65%     -0.83%                  0.05%    -0.97%         -0.70%    -0.52%
523.xalancbmk_r       0.04%     -0.59%                  0.26%    -0.56%         -0.04%    -0.52%
525.x264_r            1.29%     62.62%                  1.18%    64.80%          0.13%    -0.29%
531.deepsjeng_r       0.18%     -0.44%                  0.16%    -0.03%         -0.05%    -0.50%
541.leela_r          -1.10%     -0.12%                 -0.11%     0.59%         -0.99%    -1.12%
548.exchange2_r      -1.19%      0.34%                 -0.27%    -0.29%         -1.02%     0.17%
557.xz_r             -0.53%     -1.01% (cost model)    -0.76%    -0.10%         -0.10%    -1.28%
geomean for intrate   0.00%      4.60%                  0.06%     4.98%         -0.11%    -0.43%

503.bwaves_r         -0.29%     -1.19% (noise)          0.00%    -0.86%         -0.25%    -0.43%
507.cactuBSSN_r       0.01%     -0.55%                  0.01%    -0.35%          0.01%    -0.37%
508.namd_r           -0.61%      2.38%                 -0.13%    -0.09%         -0.67%     2.45%
510.parest_r         -0.41%      0.10%                 -0.16%     0.62%         -0.50%     0.72%
511.povray_r         -1.76%      3.79%                 -0.03%     0.41%         -1.74%     4.61%
519.lbm_r             0.38%     -0.33%                  0.00%    -0.31%          0.38%     0.05%
521.wrf_r            -0.85%      1.23%                 -0.03%     1.60%         -0.94%     0.00%
526.blender_r        -0.40%     -1.21% (noise)          0.00%    -1.49%         -0.43%    -1.64%
527.cam4_r           -0.27%      0.06%                  0.10%    -0.06%         -0.39%    -0.01%
538.imagick_r        -0.97%      1.10%                 -0.09%     0.32%         -0.90%     2.49%
544.nab_r            -0.65%      0.09%                  0.02%     0.20%         -0.69%     0.09%
549.fotonik3d_r       3.36%      0.30%                  2.42%     0.44%          0.93%    -0.08%
554.roms_r           -0.28%     -0.20%                  0.25%     0.06%         -0.52%     0.00%
geomean for fprate   -0.22%      0.42%                  0.18%     0.04%         -0.44%     0.59%
geomean              -0.12%      2.22%                  0.13%     2.16%         -0.30%     0.15%

--
BR,
Hongtao
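For anyone digging into individual results such as the 549.fotonik3d_r code-size change above, GCC's -fopt-info options report, per loop, what the vectorizer did. A minimal illustration (the file name and function below are placeholders of mine; the options themselves are standard GCC ones):

  /* foo.c -- build with, e.g.:
       gcc -O2 -ftree-vectorize -fvect-cost-model=very-cheap \
           -fopt-info-vec-optimized -fopt-info-vec-missed -c foo.c
     -fopt-info-vec-optimized prints each loop that was vectorized;
     -fopt-info-vec-missed prints why a loop was rejected, e.g. by the
     cost model.  With an unknown trip count like this one, the
     very-cheap model will typically report a miss.  */
  void
  clamp_nonnegative (int *restrict a, int n)
  {
    for (int i = 0; i < n; i++)
      if (a[i] < 0)
        a[i] = 0;   /* needs if-conversion before loop vectorization */
  }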