Re: NOT LIKE much faster than LIKE?

Andrea Arcangeli <andrea@xxxxxxxxxxxx> · Tue, 10 Jan 2006 03:23:03 +0100

On Mon, Jan 09, 2006 at 09:04:48PM -0500, Tom Lane wrote:
> Andrea Arcangeli <andrea@xxxxxxxxxxxx> writes:
> > It just makes no sense to me that the planner takes a difference
> > decision based on a "not".
> 
> Why in the world would you think that?  In general a NOT will change the
> selectivity of the WHERE condition tremendously.  If the planner weren't
> sensitive to that, *that* would be a bug.  The only case where it's
> irrelevant is if the selectivity of the base condition is exactly 50%,
> which is not a very reasonable default guess for LIKE.

How do you know that "LIKE" will have a selectivity above 50% in the
first place? I think 50% should be the default unless the selectively is
measured at runtime against the data being queried.

If you don't know the data, I think it's a bug that LIKE is assumed to
have a selectivity above 50%. You can't know that, only the author of
the code can know that and that's why I talked about hints. It'd be fine
to give hints like:

	UNLIKELY string LIKE '%% PREEMPT %%'

or:

	LIKELY string NOT LIKE '%% PREEMPT %%'

Then you could assume that very little data will be returned or a lot of
data will be returned. 

If you don't get hints NOT LIKE or LIKE should be assumed to have the
same selectivity.

> It sounds to me that the problem is misestimation of the selectivity
> of the LIKE pattern --- the planner is going to think that
> LIKE '%% PREEMPT %%' is fairly selective because of the rather long
> match text, when in reality it's probably not so selective on your
> data.  But we don't keep any statistics that would allow the actual

True, there's a lot of data that matches %% PREEMPT %% (even if less
than the NOT case).

> number of matching rows to be estimated well.  You might want to think
> about changing your data representation so that the pattern-match can be
> replaced by a boolean column, or some such, so that the existing
> statistics code can make a more reasonable estimate how many rows are
> selected.

I see. I can certainly fix it by stopping using LIKE. But IMHO this
remains a bug, since until the statistics about the numberof matching
rows isn't estimated well, you should not make assumptions on LIKE/NOT
LIKE. I think you can change the code in a way that I won't have to
touch anything, and this will lead to fewer surprises in the future IMHO.

Thanks!