Re: [pg_trgm] Making similarity(?, ?) < ? use an index

Jeff Janes <jeff.janes@xxxxxxxxx> · Sat, 4 Jun 2016 11:48:10 -0700

On Sat, Jun 4, 2016 at 2:50 AM, Greg Navis <contact@xxxxxxxxxxxxx> wrote:
> Thanks for your replies.
>
> Sorry for confusion. Instead of `similarity(lhs, rhs) >= show_limit()`,
> which of course is completely equivalent to `lhs % rhs`, I wanted to write
> `similarity(lhs, rhs) >= my_custom_threshold`. It seems that the approach
> with ternary operators is quite a bit of work. I might have a simpler idea:

The problem is that you would have to build your index on the
expression (similarity(lhs,rhs)), and the then one of the lhs or rhs
would have to be a column, and the other would have to be a constant
string at the time you define the index, which is the same constant
string as specified when you want to use the index.  This could work
as long as there are only a few strings you ever care about checking
similarity to, and you would build an index for each one.  These would
be ordinary btree indexes, not gin indexes.   I've done this before,
only using substring rather than similarity.

> pg_trgm also provides `<->` but it seems this operator doesn't use indexes
> either. It seems the shortest path to per-query thresholds, without
> compromising the design, is making this operator use the index.

To be efficient, the index has to know where to start and stop its
scan.  Such a thing doesn't exists for "a<->b", there is no stopping
point.  It would have to be "a <-> b < c", at which point you are back
to ternary again.

> Please help
> me understand whether my reasoning is correct. If it is, I'd appreciate a
> high-level overview of what needs to be done. I can block a few hours to
> work on this in the upcoming weeks.

I think that you are proposing rewriting the indexing architecture
from scratch.  Doing that in a way that is robust and tested and
backward compatible would  probably take years, not hours, for a
single individual who is not already extensively experienced in the
internals of the code.  And, it is not really clear even what you want
to rewrite it into.

Tom's proposal is to do it in a way that works with the existing
architecture, rather than trying to re-write that architecture.
(Which could probably not be done in a few hours, either, especially
not once you write documentation and upgrade scripts and all the other
stuff that goes into production-quality code.)

I don't know if this would even be appropriate as an addition to
pg_trgm.  We might want to fork that code instead.  That would be a
shame, because the underlying c code would be the fundamentally the
same, but the alternative would be to force people who like % and
set_limit() to carry around the baggage of new operators and types
they have no interest in using, and vice versa.  True, we did just add
several new functions and operators to pg_trgm that many people will
have no interest in, so maybe that is not a big deal.

Cheers,

Jeff

-- 
Sent via pgsql-general mailing list (pgsql-general@xxxxxxxxxxxxxx)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general