Re: Why vectorization didn't turn on by -O2

Hongtao Liu via Gcc-help <gcc-help@xxxxxxxxxxx> · Wed, 4 Aug 2021 17:12:11 +0800



On Wed, Aug 4, 2021 at 4:31 PM Richard Biener <rguenther@xxxxxxx> wrote:
>
> On Wed, 4 Aug 2021, Richard Sandiford wrote:
>
> > Hongtao Liu <crazylht@xxxxxxxxx> writes:
> > > On Tue, May 18, 2021 at 4:27 AM Richard Sandiford via Gcc-help
> > > <gcc-help@xxxxxxxxxxx> wrote:
> > >>
> > >> Jan Hubicka <hubicka@xxxxxx> writes:
> > >> > Hi,
> > >> > here are updated scores.
> > >> > https://lnt.opensuse.org/db_default/v4/SPEC/latest_runs_report?younger_in_days=14&older_in_days=0&all_elf_detail_stats=on&min_percentage_change=0.001&revisions=9388fc7bf0da61a8104e8501e5965120e9159e12%2Cea21f32198432a490dd490696322838d94b3d3b2%2C4f5431c5768bbba81a422f6fed6a6e2454c700ee%2C&include_user_branches=on
> > >> > compares
> > >> >   base:  mainline
> > >> >   1st column: mainline with very cheap vectorization at -O2 and -O3
> > >> >   2nd column: mainline with cheap vectorization at -O2 and -O3.
> > >> >
> > >> > The short story is:
> > >> >
> > >> > 1) -O2 generic performance
> > >> >     kabylake (Intel):
> > >> >                                        very    cheap
> > >> >       SPEC/SPEC2006/FP/total           ~       8.32%
> > >> >       SPEC/SPEC2006/total              -0.38%  4.74%
> > >> >       SPEC/SPEC2006/INT/total          -0.91%  -0.14%
> > >> >
> > >> >       SPEC/SPEC2017/INT/total          4.71%   7.11%
> > >> >       SPEC/SPEC2017/total              2.22%   6.52%
> > >> >       SPEC/SPEC2017/FP/total           0.34%   6.06%
> > >> >     zen
> > >> >       SPEC/SPEC2006/FP/total           0.61%   10.23%
> > >> >       SPEC/SPEC2006/total              0.26%   6.27%
> > >> >       SPEC/SPEC2006/INT/total  34.006  -0.24%  0.90%
> > >> >
> > >> >       SPEC/SPEC2017/INT/total  3.937   5.34%   7.80%
> > >> >       SPEC/SPEC2017/total              3.02%   6.55%
> > >> >       SPEC/SPEC2017/FP/total           1.26%   5.60%
> > >> >
> > >> >  2) -O2 size:
> > >> >      -0.78% (very cheap) 6.51% (cheap) for spec2k2006
> > >> >      -0.32% (very cheap) 6.75% (cheap) for spec2k2017
> > >> >  3) build times:
> > >> >      0%, 0.16%, 0.71%, 0.93% (very cheap) 6.05% 4.80% 6.75% 7.15% (cheap) for spec2k2006
> > >> >      0.39% 0.57% 0.71%       (very cheap) 5.40% 6.23% 8.44%       (cheap) for spec2k2017
> > >> >     here I simply copied data from different configurations
> > >> >
> > >> > So for SPEC I would say that most of the compile-time cost is derived
> > >> > from code size growth, which is a problem with the cheap model but not
> > >> > with very cheap.  Very cheap indeed results in code size improvements,
> > >> > and the compile-time impact is probably somewhere around 0.5%.
> > >> >
> > >> > So from these scores alone, vectorization at -O2 with the very cheap
> > >> > model seems to make sense to me (I am sure we have other optimizations
> > >> > with worse benefit to compile-time tradeoffs).
> > >>
> > >> Thanks for running these.
> > >>
> > >> The biggest issue I know of for enabling very-cheap at -O2 is:
> > >>
> > >>    https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100089
> > >>
> > >> Perhaps we could get around that by (hopefully temporarily) disabling
> > >> BB SLP within loop vectorisation for the very-cheap model.  This would
> > >> purely be a workaround and we should remove it once the PR is fixed.
> > >> (It would even be a compile-time win in the meantime :-))
> > >>
> > >> Thanks,
> > >> Richard
> > >>
> > >> > However there are usual arguments against:
> > >> >
> > >> >   1) Vectorizer being tuned for SPEC.  I think the only way to overcome
> > >> >      that argument is to enable it by default :)
> > >> >   2) Workloads improved are more of -Ofast type workloads
> > >> >
> > >> > Here are non-spec benchmarks we track:
> > >> > https://lnt.opensuse.org/db_default/v4/CPP/latest_runs_report?younger_in_days=14&older_in_days=0&min_percentage_change=0.02&revisions=9388fc7bf0da61a8104e8501e5965120e9159e12%2Cea21f32198432a490dd490696322838d94b3d3b2%2C4f5431c5768bbba81a422f6fed6a6e2454c700ee%2C&include_user_branches=on
> > >> >
> > >> > I also tried to run Firefox some time ago. Results are not surprising -
> > >> > vectorization helps rendering benchmarks, which are the ones compiled
> > >> > with aggressive flags anyway.
> > >> >
> > >> > Honza
> > >
> > > Hi:
> > >   I would like to ask whether we can turn on -O2 vectorization now.
> >
> > I think we still need to deal with the PR100089 issue that I mentioned above.
> > Like I say, “dealing with” it could be as simple as disabling:
> >
> >       /* If we applied if-conversion then try to vectorize the
> >        BB of innermost loops.
> >        ???  Ideally BB vectorization would learn to vectorize
> >        control flow by applying if-conversion on-the-fly, the
> >        following retains the if-converted loop body even when
> >        only non-if-converted parts took part in BB vectorization.  */
> >       if (flag_tree_slp_vectorize != 0
> >         && loop_vectorized_call
> >         && ! loop->inner)
> >
> > for the very-cheap vector cost model until the PR is fixed properly.
>
> Alternatively only enable loop vectorization at -O2 (the above checks
> flag_tree_slp_vectorize as well).  At least the cost model kind
> does not have any influence on BB vectorization, that is, we get the
> same pros and cons as we do for -O3.
>
> Did anyone benchmark -O2 -ftree-{loop,slp}-vectorize separately yet?
I can collect 4 sets of data, covering both code size and performance,
on SPEC2017:
1.  baseline: -O2
2.  baseline + both SLP and loop vectorizers: -O2 -ftree-vectorize
    -fvect-cost-model=very-cheap
3.  baseline + loop vectorizer only: -O2 -ftree-loop-vectorize
    -fvect-cost-model=very-cheap
4.  baseline + BB (SLP) vectorizer only: -O2 -ftree-slp-vectorize
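
For a quick sanity check before the full SPEC runs, here is a minimal sketch of the kind of kernels these four flag sets should treat differently. The file name kernels.c and the function names are hypothetical, made up for illustration and not taken from SPEC or GCC. Whether each kernel is actually vectorized depends on the GCC version and the target, so the comments describe what I would expect rather than guaranteed behaviour.

```c
/* kernels.c: hypothetical test kernels, not taken from SPEC.
 *
 * Compile each configuration and compare the vectorizer reports, e.g.:
 *   gcc -O2 -c kernels.c -fopt-info-vec
 *   gcc -O2 -ftree-vectorize -fvect-cost-model=very-cheap -c kernels.c -fopt-info-vec
 *   gcc -O2 -ftree-loop-vectorize -fvect-cost-model=very-cheap -c kernels.c -fopt-info-vec
 *   gcc -O2 -ftree-slp-vectorize -c kernels.c -fopt-info-vec
 */
#include <stddef.h>

#define N 1024 /* compile-time trip count, a multiple of common vector widths */

/* Loop with a constant trip count: no scalar epilogue should be needed,
   so the vector loop can entirely replace the scalar one.  This is the
   kind of loop the very-cheap model is expected to accept. */
void add_fixed(float *restrict dst, const float *restrict a,
               const float *restrict b)
{
    for (size_t i = 0; i < N; ++i)
        dst[i] = a[i] + b[i];
}

/* Loop with a runtime trip count: vectorizing it normally needs an
   epilogue, so the very-cheap model is expected to reject it on many
   targets, while the cheap model may still accept it. */
void add_runtime(float *restrict dst, const float *restrict a,
                 const float *restrict b, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        dst[i] = a[i] + b[i];
}

/* Straight-line code with no loop: only the BB (SLP) vectorizer can
   combine these four statements into one vector operation. */
void add4(float *restrict dst, const float *restrict a,
          const float *restrict b)
{
    dst[0] = a[0] + b[0];
    dst[1] = a[1] + b[1];
    dst[2] = a[2] + b[2];
    dst[3] = a[3] + b[3];
}
```

Comparing the -fopt-info-vec output (and the object size) across the four command lines should show which vectorizer is responsible for each transformation, which is the same split the SPEC2017 runs are meant to measure.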
>
> Richard.


-- 
BR,
Hongtao