Re: pg_trgm word_similarity query does not use index for input strings longer than 8 characters

Tom Lane <tgl@xxxxxxxxxxxxx> · Tue, 07 Dec 2021 12:08:05 -0500

Laurenz Albe <laurenz.albe@xxxxxxxxxxx> writes:
> On Tue, 2021-11-30 at 22:38 +0100, pgsql-performance@xxxxxxxxxx wrote:
>> INSERT INTO song (artist, title)
>> SELECT 'artist','title'
>> FROM generate_series(1,10000);
>>
>> \set query '12345678'
>> 
>> -- This query is slow
>> EXPLAIN ANALYZE
>> SELECT song.artist, song.title
>> FROM song
>> WHERE (song.artist %> :'query' OR song.title %> :'query')
>> ;

> The table is quite small; with a bigger table, the test would be more meaningful.

Yeah, this test case seems very unrealistic, both as to table size
and as to the lack of variability of the table entries.  I think the
latter is causing the indexscans to take less time than they otherwise
might, because none of the extracted trigrams find any matches.

> Since you have SSDs, you should tune "random_page_cost = 1.1".

Right.  Poking at gincostestimate a bit, I see that for this
operator the indexscan cost estimate is basically driven by the
number of trigrams extracted from the query string (nine in this
test case) and the index size; those lead to a predicted number
of index page fetches that's then scaled by random_page_cost.
That's coming out to make it look more expensive than the seqscan.
It's actually not more expensive, but that's partially because
page fetch costs are really zero in this test case (everything
will stay in shared buffers the whole time), and partially because
the unrealistic data pattern is leading to not having to look at
as much of the index as gincostestimate expected.

In general, it appears correct that longer query strings lead to a
higher index cost estimate, because they produce more trigrams so
there's more work for the index match to do.  (At some level, a
longer query means more work in the seqscan case too; but our cost
models are inadequate to predict that.)

			regards, tom lane