Re: pg_trgm word_similarity query does not use index for input strings longer than 8 characters

Thank you both for your insights and input.

> Yeah, this test case seems very unrealistic, both as to table size
> and as to the lack of variability of the table entries.

The example was distilled from real data and a more complicated query,
which is what prompted me to investigate the issue. With the real data
the distinction between slow and fast queries is not as clear-cut as
with the generated data, but the general problem remains.

>> Since you have SSDs, you should tune "random_page_cost = 1.1".

I tested different values of random_page_cost with various queries.
Values that were too small increased the execution time again, because
the index was then used too eagerly. The optimum for my use case turned
out to be 1.4. This solved my problem, thanks.
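For the record, the tuning loop was essentially the following (the
first value is just illustrative; 1.4 is what I settled on, and
ALTER SYSTEM is one way to persist it):

SET random_page_cost = 1.1;  -- too low: the index is chosen too eagerly
EXPLAIN ANALYZE
SELECT song.artist, song.title
FROM song
WHERE (song.artist %> :'query' OR song.title %> :'query');

SET random_page_cost = 1.4;  -- the sweet spot for my data and queries
-- (repeat the EXPLAIN ANALYZE above and compare plans and timings)

ALTER SYSTEM SET random_page_cost = 1.4;  -- persist for the whole cluster
SELECT pg_reload_conf();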

Regards
Jonathan

On 07.12.21 18:08, Tom Lane wrote:
> Laurenz Albe <laurenz.albe@xxxxxxxxxxx> writes:
>> On Tue, 2021-11-30 at 22:38 +0100, pgsql-performance@xxxxxxxxxx wrote:
>>> INSERT INTO song (artist, title)
>>> SELECT 'artist','title'
>>> FROM generate_series(1,10000);
>>>
>>> \set query '12345678'
>>>
>>> -- This query is slow
>>> EXPLAIN ANALYZE
>>> SELECT song.artist, song.title
>>> FROM song
>>> WHERE (song.artist %> :'query' OR song.title %> :'query')
>>> ;
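For anyone replaying the test case: the setup statements were trimmed
from the quote above. Something along these lines, with pg_trgm
installed and made-up index names, has to come first:

CREATE EXTENSION IF NOT EXISTS pg_trgm;
CREATE TABLE song (artist text, title text);
-- index names are hypothetical; %> can be served by a GIN trigram index
CREATE INDEX song_artist_trgm_idx ON song USING gin (artist gin_trgm_ops);
CREATE INDEX song_title_trgm_idx ON song USING gin (title gin_trgm_ops);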

>> The table is quite small; with a bigger table, the test would be more meaningful.

> Yeah, this test case seems very unrealistic, both as to table size
> and as to the lack of variability of the table entries.  I think the
> latter is causing the indexscans to take less time than they otherwise
> might, because none of the extracted trigrams find any matches.

>> Since you have SSDs, you should tune "random_page_cost = 1.1".

> Right.  Poking at gincostestimate a bit, I see that for this
> operator the indexscan cost estimate is basically driven by the
> number of trigrams extracted from the query string (nine in this
> test case) and the index size; those lead to a predicted number
> of index page fetches that's then scaled by random_page_cost.
> That's coming out to make it look more expensive than the seqscan.
> It's actually not more expensive, but that's partially because
> page fetch costs are really zero in this test case (everything
> will stay in shared buffers the whole time), and partially because
> the unrealistic data pattern is leading to not having to look at
> as much of the index as gincostestimate expected.
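The trigram extraction can be inspected directly with pg_trgm's
show_trgm(); for this query string it indeed yields nine trigrams:

SELECT show_trgm('12345678');
-- {"  1"," 12",123,234,345,456,567,678,"78 "}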

> In general, it appears correct that longer query strings lead to a
> higher index cost estimate, because they produce more trigrams so
> there's more work for the index match to do.  (At some level, a
> longer query means more work in the seqscan case too; but our cost
> models are inadequate to predict that.)
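Concretely: a single word of n distinct characters produces n + 1
trigrams (pg_trgm pads each word with two leading and one trailing
space and deduplicates), so the count grows linearly with query length:

SELECT q, array_length(show_trgm(q), 1) AS n_trgm
FROM (VALUES ('abcd'), ('abcdefgh'), ('abcdefghijklmnop')) AS t(q);
-- 5, 9, and 17 trigrams respectively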

> 			regards, tom lane
