Re: Why vectorization didn't turn on by -O2

Segher Boessenkool <segher@xxxxxxxxxxxxxxxxxxx> writes:
> On Wed, Aug 04, 2021 at 10:10:36AM +0100, Richard Sandiford wrote:
>> Richard Biener <rguenther@xxxxxxx> writes:
>> > Alternatively only enable loop vectorization at -O2 (the above checks
>> > flag_tree_slp_vectorize as well).  At least the cost model kind
>> > does not have any influence on BB vectorization, that is, we get the
>> > same pros and cons as we do for -O3.
>> 
>> Yeah, but a lot of the loop vector cost model choice is about controlling
>> code size growth and avoiding excessive runtime versioning tests.
>
> Both of those depend a lot on the target, and on target-specific
> conditions as well (which CPU model is selected, for example).  Can we
> factor that
> in somehow?  Maybe we need some target hook that returns the expected
> percentage code growth for vectorising a given loop, for example, and
> -O2 vs. -O3 then selects what percentage is acceptable.
>
>> BB SLP
>> should be a win on both code size and performance (barring significant
>> target costing issues).
>
> Yeah -- but this could use a similar hook as well (just a straight-line
> piece of code instead of a loop).
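
For concreteness, the kind of hook being floated might look something
like this.  It's purely a hypothetical sketch: neither the hook nor its
name exists in GCC, and the thresholds are made up.

  /* Hypothetical target hook, invented here only to make the idea
     concrete.  A target would return its expected percentage code-size
     growth for vectorising LOOP, including any runtime versioning
     tests, and the caller would compare that against a budget chosen
     by -O2 vs. -O3.  */
  /* int TARGET_VECTORIZE_ESTIMATED_GROWTH (class loop *loop);  */

  static bool
  vectorization_growth_acceptable_p (class loop *loop)
  {
    int growth = targetm.vectorize.estimated_growth (loop); /* invented */
    int budget = optimize >= 3 ? 50 : 10;  /* made-up percentages */
    return growth <= budget;
  }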

I think anything like that should be driven by motivating use cases.
It's not something that we can easily decide in the abstract.

The results so far with using very-cheap at -O2 have been promising,
so I don't think new hooks should block that becoming the default.
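
For anyone who wants to try the candidate default ahead of time, it can
be approximated on current GCC with explicit flags.  The loop below is
only an illustration:

  /* Compile with:

       gcc -O2 -ftree-vectorize -fvect-cost-model=very-cheap

     and compare against plain -O3.  Without `restrict', the default
     model may version this loop with a runtime alias check plus a
     scalar fallback copy; that is exactly the kind of code growth the
     cheaper cost models are meant to limit.  */
  void
  saxpy (float *a, const float *b, float x, int n)
  {
    for (int i = 0; i < n; i++)
      a[i] += x * b[i];
  }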

>> PR100089 was an exception because we ended up keeping unvectorised
>> scalar code that would never have existed otherwise.  BB SLP proper
>> shouldn't have that problem.
>
> It also is a tiny piece of code.  There will always be tiny examples
> that are much worse (or much better) than average.

Yeah, what makes PR100089 important isn't IMO the test itself, but the
underlying problem that the PR exposed.  Enabling this “BB SLP in loop
vectorisation” code can lead to the generation of scalar COND_EXPRs even
though we know that ifcvt doesn't have a proper cost model for deciding
whether scalar COND_EXPRs are a win.

Introducing scalar COND_EXPRs at -O3 is arguably an acceptable risk
(although still dubious), but I think it's something we need to avoid
for -O2, even if that means losing the optimisation.
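
To make the risk concrete, here's an illustrative loop (deliberately
not the PR100089 testcase; the names are invented).  If-conversion
rewrites the branch into a scalar COND_EXPR so that the loop body is
straight-line code for the vectorizer.  If vectorisation then doesn't
happen but the if-converted form is kept, we've committed to branchless
scalar code without any cost model saying that's a win:

  void
  f (int *a, const int *b, const int *c, int n)
  {
    for (int i = 0; i < n; i++)
      /* After ifcvt this select becomes a scalar COND_EXPR.  */
      a[i] = b[i] > 0 ? c[i] : 5;
  }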

Thanks,
Richard