Re: Boyer-Moore string searching in LIKE, ILIKE, and/or POSITION?

Tom Lane <tgl@xxxxxxxxxxxxx> · Wed, 08 Feb 2023 20:49:43 -0500

David Rowley <dgrowleyml@xxxxxxxxx> writes:
> On Thu, 9 Feb 2023 at 13:05, Martin L. Buchanan
> <martinlbuchanan@xxxxxxxxx> wrote:
>> str LIKE '%foo%'
>> str ILIKE  '%foo%'
>> position('foo' in str) > 0
>> Is Boyer-Moore string searching now used by any of these three?

> We use a sort of "lossy" Boyer-Moore-Horspool algorithm. See
> text_position_setup() and text_position_next() in varlena.c (the lossy
> part comes from the skiptables possibly sharing the same entry for
> multiple characters as defined by what skiptablemask is set to).

Note that that's used only by the functions in that file; thus,
position() yes, but (I)LIKE no.

> Tom's argument seems to think it's impossible, so if you find that
> it's definitely not impossible, then you can assume he's wrong about
> that.

My point was that it seems like you'd need a separate BMH engine for
each %-separated segment of the LIKE pattern.  I'm not quite clear on
whether BMH can handle '_' (single-char wildcard) conveniently by
itself, although my gut feel is that you can probably make that part
work.  Maybe you can even extend the idea to embedded %, but that
seems more difficult.

Given that the fixed substrings of LIKE patterns are usually rather
short, I'm unconvinced that BMH would buy much compared to its
setup costs.  But hey, prove me wrong.

(One way to amortize the setup costs could be to cache pre-built
BMH data structures keyed by the pattern strings, in a similar fashion
to what utils/adt/regexp.c does for compiled regular expressions.)

> ... Getting
> stuff committed that causes performance regressions in some cases and
> wins in other cases can be a pretty difficult and frustrating process.
> You have to remember, even if you think the slowdown is some corner
> case that only applies ~1% of the time, for some users in the real
> world, that might be 100% of their queries.

Yeah, whatever the shape of the preprocessing might be, it seems
likely that it would be a net loss in some use-cases.  We do manage
to get past that --- the position() code didn't have BMH to start
with --- but it definitely requires solid evidence.

			regards, tom lane