Re: Why vectorization didn't turn on by -O2

Segher Boessenkool <segher@xxxxxxxxxxxxxxxxxxx> · Wed, 4 Aug 2021 16:18:12 -0500

On Wed, Aug 04, 2021 at 11:22:53AM +0100, Richard Sandiford wrote:
> Segher Boessenkool <segher@xxxxxxxxxxxxxxxxxxx> writes:
> > On Wed, Aug 04, 2021 at 10:10:36AM +0100, Richard Sandiford wrote:
> >> Richard Biener <rguenther@xxxxxxx> writes:
> >> > Alternatively only enable loop vectorization at -O2 (the above checks
> >> > flag_tree_slp_vectorize as well).  At least the cost model kind
> >> > does not have any influence on BB vectorization, that is, we get the
> >> > same pros and cons as we do for -O3.
> >> 
> >> Yeah, but a lot of the loop vector cost model choice is about controlling
> >> code size growth and avoiding excessive runtime versioning tests.
> >
> > Both of those depend a lot on the target, and target-specific conditions
> > as well (which CPU model is selected for example).  Can we factor that
> > in somehow?  Maybe we need some target hook that returns the expected
> > percentage code growth for vectorising a given loop, for example, and
> > -O2 vs. -O3 then selects what percentage is acceptable.
> >
> >> BB SLP
> >> should be a win on both code size and performance (barring significant
> >> target costing issues).
> >
> > Yeah -- but this could use a similar hook as well (just a straightline
> > piece of code instead of a loop).
> 
> I think anything like that should be driven by motivating use cases.
> It's not something that we can easily decide in the abstract.
> 
> The results so far with using very-cheap at -O2 have been promising,
> so I don't think new hooks should block that becoming the default.

Right, but it wouldn't hurt to think a sec if we are on the right path
forward.  It is crystal clear that to make good decisions about what
and how to vectorise you need to take *some* target characteristics into
account, and that will have to happen sooner rather than later.

This was all in reply to

> >> Yeah, but a lot of the loop vector cost model choice is about controlling
> >> code size growth and avoiding excessive runtime versioning tests.

It was not meant to hold up these patches :-)

> >> PR100089 was an exception because we ended up keeping unvectorised
> >> scalar code that would never have existed otherwise.  BB SLP proper
> >> shouldn't have that problem.
> >
> > It also is a tiny piece of code.  There will always be tiny examples
> > that are much worse (or much better) than average.
> 
> Yeah, what makes PR100089 important isn't IMO the test itself, but the
> underlying problem that the PR exposed.  Enabling this “BB SLP in loop
> vectorisation” code can lead to the generation of scalar COND_EXPRs even
> though we know that ifcvt doesn't have a proper cost model for deciding
> whether scalar COND_EXPRs are a win.
> 
> Introducing scalar COND_EXPRs at -O3 is arguably an acceptable risk
> (although still dubious), but I think it's something we need to avoid
> for -O2, even if that means losing the optimisation.

Yeah -- -O2 should almost always do the right thing, while -O3 can do
bad things more often; it just has to be better "on average".

Segher