Re: Why vectorization didn't turn on by -O2

Richard Sandiford via Gcc-help <gcc-help@xxxxxxxxxxx> · Wed, 04 Aug 2021 10:10:36 +0100

Richard Biener <rguenther@xxxxxxx> writes:
> On Wed, 4 Aug 2021, Richard Sandiford wrote:
>
>> Hongtao Liu <crazylht@xxxxxxxxx> writes:
>> > On Tue, May 18, 2021 at 4:27 AM Richard Sandiford via Gcc-help
>> > <gcc-help@xxxxxxxxxxx> wrote:
>> >>
>> >> Jan Hubicka <hubicka@xxxxxx> writes:
>> >> > Hi,
>> >> > here are updated scores.
>> >> > https://lnt.opensuse.org/db_default/v4/SPEC/latest_runs_report?younger_in_days=14&older_in_days=0&all_elf_detail_stats=on&min_percentage_change=0.001&revisions=9388fc7bf0da61a8104e8501e5965120e9159e12%2Cea21f32198432a490dd490696322838d94b3d3b2%2C4f5431c5768bbba81a422f6fed6a6e2454c700ee%2C&include_user_branches=on
>> >> > compares
>> >> >   base:  mainline
>> >> >   1st column: mainline with very cheap vectorization at -O2 and -O3
>> >> >   2nd column: mainline with cheap vectorization at -O2 and -O3.
>> >> >
>> >> > The short story is:
>> >> >
>> >> > 1) -O2 generic performance
>> >> >     kabylake (Intel):
>> >> >                                          very     cheap
>> >> >       SPEC/SPEC2006/FP/total             ~        8.32%
>> >> >       SPEC/SPEC2006/total                -0.38%   4.74%
>> >> >       SPEC/SPEC2006/INT/total            -0.91%   -0.14%
>> >> >
>> >> >       SPEC/SPEC2017/INT/total            4.71%    7.11%
>> >> >       SPEC/SPEC2017/total                2.22%    6.52%
>> >> >       SPEC/SPEC2017/FP/total             0.34%    6.06%
>> >> >     zen:
>> >> >       SPEC/SPEC2006/FP/total             0.61%    10.23%
>> >> >       SPEC/SPEC2006/total                0.26%    6.27%
>> >> >       SPEC/SPEC2006/INT/total   34.006   -0.24%   0.90%
>> >> >
>> >> >       SPEC/SPEC2017/INT/total   3.937    5.34%    7.80%
>> >> >       SPEC/SPEC2017/total                3.02%    6.55%
>> >> >       SPEC/SPEC2017/FP/total             1.26%    5.60%
>> >> >
>> >> >  2) -O2 size:
>> >> >      -0.78% (very cheap) 6.51% (cheap) for spec2k2006
>> >> >      -0.32% (very cheap) 6.75% (cheap) for spec2k2017
>> >> >  3) build times:
>> >> >      0%, 0.16%, 0.71%, 0.93% (very cheap) 6.05% 4.80% 6.75% 7.15% (cheap) for spec2k2006
>> >> >      0.39% 0.57% 0.71%       (very cheap) 5.40% 6.23% 8.44%       (cheap) for spec2k2017
>> >> >     here I simply copied data from different configurations
>> >> >
>> >> > So for SPEC I would say that most of the compile time costs are derived
>> >> > from code size growth, which is a problem with the cheap model but not with
>> >> > very cheap.  Very cheap indeed results in code size improvements, and the
>> >> > compile time impact is probably somewhere around 0.5%.
>> >> >
>> >> > So from these scores alone it would seem to me that vectorization makes
>> >> > sense at -O2 with the very cheap model (I am sure we have other
>> >> > optimizations with worse benefit to compile time tradeoffs).
>> >>
>> >> Thanks for running these.
>> >>
>> >> The biggest issue I know of for enabling very-cheap at -O2 is:
>> >>
>> >>    https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100089
>> >>
>> >> Perhaps we could get around that by (hopefully temporarily) disabling
>> >> BB SLP within loop vectorisation for the very-cheap model.  This would
>> >> purely be a workaround and we should remove it once the PR is fixed.
>> >> (It would even be a compile-time win in the meantime :-))
>> >>
>> >> Thanks,
>> >> Richard
>> >>
>> >> > However, there are the usual arguments against:
>> >> >
>> >> >   1) Vectorizer being tuned for SPEC.  I think the only way to overcome
>> >> >      that argument is to enable it by default :)
>> >> >   2) Workloads improved are more of -Ofast type workloads
>> >> >
>> >> > Here are non-spec benchmarks we track:
>> >> > https://lnt.opensuse.org/db_default/v4/CPP/latest_runs_report?younger_in_days=14&older_in_days=0&min_percentage_change=0.02&revisions=9388fc7bf0da61a8104e8501e5965120e9159e12%2Cea21f32198432a490dd490696322838d94b3d3b2%2C4f5431c5768bbba81a422f6fed6a6e2454c700ee%2C&include_user_branches=on
>> >> >
>> >> > I also tried to run Firefox some time ago. Results are not surprising -
>> >> > vectorization helps rendering benchmarks, which are the ones compiled with
>> >> > aggressive flags anyway.
>> >> >
>> >> > Honza
>> >
>> > Hi:
>> >   I would like to ask if we can turn on -O2 vectorization now?
>> 
>> I think we still need to deal with the PR100089 issue that I mentioned above.
>> Like I say, “dealing with” it could be as simple as disabling:
>> 
>>       /* If we applied if-conversion then try to vectorize the
>> 	 BB of innermost loops.
>> 	 ???  Ideally BB vectorization would learn to vectorize
>> 	 control flow by applying if-conversion on-the-fly, the
>> 	 following retains the if-converted loop body even when
>> 	 only non-if-converted parts took part in BB vectorization.  */
>>       if (flag_tree_slp_vectorize != 0
>> 	  && loop_vectorized_call
>> 	  && ! loop->inner)
>> 
>> for the very-cheap vector cost model until the PR is fixed properly.
>
> Alternatively only enable loop vectorization at -O2 (the above checks
> flag_tree_slp_vectorize as well).  At least the cost model kind
> does not have any influence on BB vectorization, that is, we get the
> same pros and cons as we do for -O3.

Yeah, but a lot of the loop vector cost model choice is about controlling
code size growth and avoiding excessive runtime versioning tests.  BB SLP
should be a win on both code size and performance (barring significant
target costing issues).
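
To illustrate the distinction, here is a minimal sketch.  The function
names are made up for illustration and are not taken from GCC, the PR or
any benchmark, and what a given GCC version does with them depends on the
target and its costs.

```c
#include <assert.h>
#include <stddef.h>

/* Loop-vectorisation candidate.  The trip count n is unknown and dst/src
   are not restrict-qualified, so vectorising this loop typically means
   keeping a scalar copy around as well (for an overlap check and/or for
   leftover iterations).  That is the kind of code size growth and runtime
   test that a loop cost model has to weigh.  */
void scale_loop(float *dst, const float *src, size_t n, float k)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = src[i] * k;
}

/* BB SLP candidate: four isomorphic statements on adjacent elements in
   straight-line code.  If they are combined into vector operations, the
   scalar statements are simply replaced; no second copy is needed.  */
void scale4(float *dst, const float *src, float k)
{
    float a = src[0] * k;
    float b = src[1] * k;
    float c = src[2] * k;
    float d = src[3] * k;
    dst[0] = a;
    dst[1] = b;
    dst[2] = c;
    dst[3] = d;
}

/* Self-check: both forms compute the same values.  The inputs are chosen
   so that the float results are exact.  */
void scale_self_check(void)
{
    const float src[4] = {1.0f, 2.0f, 3.0f, 4.0f};
    float by_loop[4] = {0}, by_slp[4] = {0};

    scale_loop(by_loop, src, 4, 2.0f);
    scale4(by_slp, src, 2.0f);
    for (size_t i = 0; i < 4; i++)
        assert(by_loop[i] == by_slp[i]);
}
```

One way to see what the vectoriser decides for each function is to compile
with something like "gcc -O2 -ftree-vectorize -fvect-cost-model=very-cheap
-fopt-info-vec -c sketch.c" (the file name is hypothetical; very-cheap needs
GCC 11 or later).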

PR100089 was an exception because we ended up keeping unvectorised
scalar code that would never have existed otherwise.  BB SLP proper
shouldn't have that problem.
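
For reference, the if-conversion that the quoted comment refers to turns
control flow in a loop body into selects, so that the body becomes a single
basic block.  A hand-written sketch of the before and after shapes follows.
The function names are hypothetical, and this is not the testcase from
PR100089, only the general shape of the transformation.

```c
#include <assert.h>
#include <stddef.h>

/* Original shape: the loop body contains control flow.  */
void clamp_branchy(int *out, const int *in, size_t n, int limit)
{
    for (size_t i = 0; i < n; i++) {
        if (in[i] > limit)
            out[i] = limit;
        else
            out[i] = in[i];
    }
}

/* Hand-written equivalent of an if-converted body: the branch has become
   a select and the body is one straight-line block.  */
void clamp_selects(int *out, const int *in, size_t n, int limit)
{
    for (size_t i = 0; i < n; i++) {
        int v = in[i];
        out[i] = (v > limit) ? limit : v;
    }
}

/* Self-check: the two shapes must produce identical results.  */
void clamp_self_check(void)
{
    const int in[5] = {1, 5, 9, -3, 4};
    int a[5], b[5];

    clamp_branchy(a, in, 5, 4);
    clamp_selects(b, in, 5, 4);
    for (size_t i = 0; i < 5; i++)
        assert(a[i] == b[i]);
}
```

Going by the comment quoted above, the concern is the second shape being
retained in scalar form when only the non-if-converted parts of the body
took part in BB vectorisation.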

Thanks,
Richard