Re: Boyer-Moore string searching in LIKE, ILIKE, and/or POSITION?

David Rowley <dgrowleyml@xxxxxxxxx> · Thu, 9 Feb 2023 14:30:00 +1300

On Thu, 9 Feb 2023 at 13:05, Martin L. Buchanan
<martinlbuchanan@xxxxxxxxx> wrote:
> For the common and simple cases of find this string anywhere in another string:
>
> str LIKE '%foo%'
>
> str ILIKE  '%foo%'
>
> position('foo' in str) > 0
>
> Is Boyer-Moore string searching now used by any of these three?

We use a sort of "lossy" Boyer-Moore-Horspool algorithm. See
text_position_setup() and text_position_next() in varlena.c (the lossy
part comes from the skiptables possibly sharing the same entry for
multiple characters as defined by what skiptablemask is set to).

> I checked the PG documentation and found no info about this other than what was in the Todo wiki, https://wiki.postgresql.org/wiki/Todo, under Functions. Tom Lane gave a thumbs down to the idea back in 2008, but that was a long time ago: https://www.postgresql.org/message-id/27645.1220635769@xxxxxxxxxxxxx .

Tom's argument seems to think it's impossible, so if you find that
it's definitely not impossible, then you can assume he's wrong about
that.  I've not given it much thought, but I think providing there are
only leading and trailing %'s and no _'s that the LIKE would be
equivalent to checking if position(like_arg2 in like_arg1).

There are probably 2 ways that you could consider going about making this work:

Method 1: See if it's possible to rewrite the query to swap the LIKE
for position().

I'm really unsure where you'd want to put the logic that looks for
such patterns. I don't recall any code that swaps one operator out for
another.  It seems like you could only apply such a transformation if
the like pattern was a Const.  I'm not sure if we'd want to do
anything like introduce some special logic in oper() in parse_oper.c
to do such a rewrite. Given we allow users to define their own
datatypes and operators, it would seem a bit limiting to go and hard
code this transformation. I'm really not sure what it would take to
make such a thing expandable. Maybe some kind of rewrite function
could be defined in pg_operator that can try and fail to come up with
some more efficient alternative OpExpr.

You'd also need to consider that conditions such as: str LIKE 'foo%'
can use an index scan in some cases. If you were to rewrite in that
case you'd likely kill the performance of queries that could exploit
that feature.

Method 2: Adjust like_match.c to look for these patterns and use BMH
when possible.

The problem with this is that you'd need to check the pattern each
call to see if it's compatible. Since you need to check for %'s and
_'s in the middle of the string, then that means adding a complete
preprocessing step that would need to be done on every call to
MatchText(). If that passes then it's maybe going to be faster, but if
it fails then it'll certainly be slower. You might very well struggle
to convince people that this would be a good idea. It sounds like it
would be very easy to demonstrate performance regressions here just by
passing a pattern that always fails that preprocess step. Getting
stuff committed that causes performance regressions in some cases and
wins in other cases can be a pretty difficult and frustrating process.
You have to remember, even if you think the slowdown is some corner
case that only applies ~1% of the time, for some users in the real
world, that might be 100% of their queries.

There are probably other ways you could consider doing this, I just
can't think of them right now.

David