On 5/24/23 22:28, Pavel Horal wrote:
> Hello PostgreSQL community,
>
> I am trying to improve the performance of similarity-based queries on
> large datasets and I would like to confirm my understanding of how GIN
> indexes work and how pg_trgm uses them.
>
> If I understand it correctly, using a GIN index is always a two-step
> process: first, potential matches are searched in the index; then, as a
> second step, the tuples are actually fetched and rechecked to see if
> they really match the query. This two-step process can lead to degraded
> performance when the index scan matches too many elements that are then
> read from disk only to be dropped as non-matching during the recheck
> phase. *Is that understanding correct?*

Correct. GIN gives you "candidate" rows, but the recheck may still
eliminate those.

> Now to the issue... pg_trgm's similarity search can use the similarity
> operator % to search for "similar" documents. The concept of
> "similarity" is based on the similarity between the trigram array
> extracted from the query string and the trigram arrays of the searched
> values. This concept is tricky in the sense that just by matching
> trigrams in the GIN index PostgreSQL cannot tell whether the final
> value will match, as it does not know how many trigrams there are in
> that value overall...
> Consider the following example:
>
> CREATE TABLE test (id SERIAL, value TEXT);
> CREATE INDEX test_idx ON test USING GIN (value gin_trgm_ops);
> INSERT INTO test (value)
>     SELECT 'lorem ipsum ' || id || repeat('foo bar', CAST(random() * 100 AS INT))
>     FROM generate_series(1, 100000) source(id);
>
> SET pg_trgm.similarity_threshold TO 0.5;
> EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM test WHERE value % 'lorem';
>                                                          QUERY PLAN
> ----------------------------------------------------------------------------------------------------------------------------
>  Bitmap Heap Scan on test  (cost=2024.08..2062.86 rows=10 width=374) (actual time=2261.727..2261.728 rows=0 loops=1)
>    Recheck Cond: (value % 'lorem'::text)
>    Rows Removed by Index Recheck: 100000
>    Heap Blocks: exact=5025
>    Buffers: shared hit=5636
>    ->  Bitmap Index Scan on test_idx  (cost=0.00..2024.08 rows=10 width=0) (actual time=19.242..19.242 rows=100000 loops=1)
>          Index Cond: (value % 'lorem'::text)
>          Buffers: shared hit=611
>  Planning:
>    Buffers: shared hit=1
>  Planning Time: 2.417 ms
>  Execution Time: 2261.765 ms
> (12 rows)
>
> If I understand this correctly, the *index scan really matches all
> 100000 items, which are then read from disk* only *to be discarded
> during the recheck*. So that is 2 seconds of work to return zero
> results (and I was lucky in my example to only have shared buffer hits,
> so no real disk I/O).
>
> *Is my understanding correct that this happens only because pg_trgm is
> not able to determine that the matched item from the index search is
> actually much, much longer than the query?*

Yes, I think that's a mostly correct understanding. The trouble here is
that the posting list for "lorem" is very long - it contains a TID for
pretty much every row in the table. We can't calculate similarity at
that point from trigrams alone, so we have to fetch the rows.

> Is there any way the performance can be improved in this case?
> I thought that I could store the number of trigrams in the index, but
> that is not being used by the query planner:
>
> CREATE INDEX test_idx2 ON test USING GIN (value gin_trgm_ops,
>     array_length(show_trgm(value), 1));
>
> EXPLAIN (ANALYZE, BUFFERS)
>     SELECT * FROM test
>     WHERE value % 'lorem'
>       AND array_length(show_trgm(value), 1) < array_length(show_trgm('lorem'), 1) / 0.5;
>                                                         QUERY PLAN
> ---------------------------------------------------------------------------------------------------------------------------
>  Bitmap Heap Scan on test  (cost=56.08..94.96 rows=3 width=376) (actual time=2273.225..2273.226 rows=0 loops=1)
>    Recheck Cond: (value % 'lorem'::text)
>    Rows Removed by Index Recheck: 100000
>    Filter: ((array_length(show_trgm(value), 1))::numeric < 12.0000000000000000)
>    Heap Blocks: exact=5025
>    Buffers: shared hit=5134
>    ->  Bitmap Index Scan on test_idx3  (cost=0.00..56.08 rows=10 width=0) (actual time=15.945..15.946 rows=100000 loops=1)
>          Index Cond: (value % 'lorem'::text)
>          Buffers: shared hit=109
>  Planning:
>    Buffers: shared hit=3
>  Planning Time: 2.394 ms
>  Execution Time: 2273.256 ms
> (13 rows)
>
> Thank you for any sort of insight into this.

I don't think indexing the number of trigrams like this can help, and
I'm not sure how to improve this (at least for the built-in GIN). It
seems similarity searches are bound to be proportional to the most
frequent trigram in the query.

I wonder if the "newer" GIN variants like RUM [1] could improve this,
but I don't think it has trgm opclasses.

regards

[1] https://github.com/postgrespro/rum

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
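[Editor's note: the arithmetic behind the recheck rejections above can be made concrete outside the database. The sketch below is a minimal Python approximation of pg_trgm's documented trigram extraction (each lowercased word padded with two spaces in front and one behind) and its set-based similarity; it is not pg_trgm's actual C implementation, and the `long_value` string is an illustrative stand-in for the generated rows, not the exact SQL output.]

```python
def trigrams(text):
    """Approximate pg_trgm trigram extraction: pad each lowercased word
    with two leading spaces and one trailing space, then collect all
    3-character windows into a set (duplicates count once)."""
    tgs = set()
    for word in text.lower().split():
        padded = "  " + word + " "
        for i in range(len(padded) - 2):
            tgs.add(padded[i : i + 3])
    return tgs

def similarity(a, b):
    """Shared trigrams divided by the size of the union of both trigram
    sets - the same ratio pg_trgm compares against the threshold."""
    ta, tb = trigrams(a), trigrams(b)
    if not ta and not tb:
        return 0.0
    shared = len(ta & tb)
    return shared / (len(ta) + len(tb) - shared)

# A long value in the spirit of the generated test rows (illustrative).
long_value = "lorem ipsum 1 " + "foo bar " * 100

# All 6 trigrams of 'lorem' occur in the long value, which is all the
# GIN index can see - so the row becomes a candidate...
assert trigrams("lorem") <= trigrams(long_value)

# ...but the union of trigram sets is much larger, so the similarity is
# well below the 0.5 threshold and the recheck drops the row.
print(round(similarity("lorem", long_value), 3))
```

This illustrates why the index alone cannot decide the match: the numerator (shared trigrams) is visible in the index, but the denominator (the value's total trigram count) is only known once the row is fetched.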