On 5/24/23 22:28, Pavel Horal wrote:
> Hello PostgreSQL community,
>
> I am trying to improve the performance of similarity-based queries on
> large datasets and I would like to confirm my understanding of how GIN
> indexes work and how pg_trgm uses them.
>
> If I understand it correctly, using a GIN index is always a two-step
> process: first, potential matches are searched in the index; then, as a
> second step, the tuples are actually fetched and rechecked to see if
> they really match the query. This two-step process can lead to degraded
> performance when the index scan matches too many elements that are then
> read from disk only to be dropped as non-matching during the recheck
> phase. *Is that understanding correct?*

Correct. GIN gives you "candidate" rows, but the recheck may still
eliminate those.

> Now to the issue... pg_trgm's similarity search can use the similarity
> operator % to search for "similar" documents. The concept of
> "similarity" is based on the similarity between the trigram array
> extracted from the query string and the trigram arrays of the searched
> values. This concept is tricky in the sense that just by matching
> trigrams in the GIN index PostgreSQL cannot tell whether the final
> value will match, as it does not know how many trigrams there are in
> that value overall...
> Consider the following example:
>
> CREATE TABLE test (id SERIAL, value TEXT);
> CREATE INDEX test_idx ON test USING GIN (value gin_trgm_ops);
> INSERT INTO test (value)
>     SELECT 'lorem ipsum ' || id || repeat('foo bar', CAST(random() * 100 AS INT))
>     FROM generate_series(1, 100000) source(id);
>
> SET pg_trgm.similarity_threshold TO 0.5;
> EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM test WHERE value % 'lorem';
>                                                          QUERY PLAN
> ----------------------------------------------------------------------------------------------------------------------------
>  Bitmap Heap Scan on test  (cost=2024.08..2062.86 rows=10 width=374) (actual time=2261.727..2261.728 rows=0 loops=1)
>    Recheck Cond: (value % 'lorem'::text)
>    Rows Removed by Index Recheck: 100000
>    Heap Blocks: exact=5025
>    Buffers: shared hit=5636
>    ->  Bitmap Index Scan on test_idx  (cost=0.00..2024.08 rows=10 width=0) (actual time=19.242..19.242 rows=100000 loops=1)
>          Index Cond: (value % 'lorem'::text)
>          Buffers: shared hit=611
>  Planning:
>    Buffers: shared hit=1
>  Planning Time: 2.417 ms
>  Execution Time: 2261.765 ms
> (12 rows)
>
> If I understand this correctly, the *index scan really matches all
> 100000 items, which are then read from disk* only *to be discarded
> during the recheck*. So that is 2 seconds of work to return zero
> results (and I was lucky in my example to only have shared buffer hits,
> so no real disk I/O).
>
> *Is my understanding correct that this happens only because pg_trgm is
> not able to determine that the matched item from the index search is
> actually much, much longer than the query?*

Yes, I think that's a mostly correct understanding. The trouble here is
that the posting list for "lorem" is very long - it contains a TID for
pretty much every row in the table. We can't calculate similarity at
that point from trigrams alone, so we have to fetch the rows.

> Is there any way the performance can be improved in this case?
> I thought that I could store the number of trigrams in the index, but
> that is not being used by the query planner:
>
> CREATE INDEX test_idx2 ON test USING GIN (value gin_trgm_ops,
>     array_length(show_trgm(value), 1));
>
> EXPLAIN (ANALYZE, BUFFERS)
>     SELECT * FROM test
>     WHERE value % 'lorem'
>       AND array_length(show_trgm(value), 1) < array_length(show_trgm('lorem'), 1) / 0.5;
>                                                         QUERY PLAN
> ---------------------------------------------------------------------------------------------------------------------------
>  Bitmap Heap Scan on test  (cost=56.08..94.96 rows=3 width=376) (actual time=2273.225..2273.226 rows=0 loops=1)
>    Recheck Cond: (value % 'lorem'::text)
>    Rows Removed by Index Recheck: 100000
>    Filter: ((array_length(show_trgm(value), 1))::numeric < 12.0000000000000000)
>    Heap Blocks: exact=5025
>    Buffers: shared hit=5134
>    ->  Bitmap Index Scan on test_idx3  (cost=0.00..56.08 rows=10 width=0) (actual time=15.945..15.946 rows=100000 loops=1)
>          Index Cond: (value % 'lorem'::text)
>          Buffers: shared hit=109
>  Planning:
>    Buffers: shared hit=3
>  Planning Time: 2.394 ms
>  Execution Time: 2273.256 ms
> (13 rows)
>
> Thank you for any sort of insight into this.

I don't think indexing the number of trigrams like this can help, and
I'm not sure how to improve this (at least for the built-in GIN). It
seems similarity searches are bound to be proportional to the most
frequent trigram in the query.

I wonder if the "newer" GIN variants like RUM [1] could improve this,
but I don't think it has trgm opclasses.

regards

[1] https://github.com/postgrespro/rum

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
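[Editor's note: the arithmetic behind the recheck rejections above can be made concrete outside the database. The sketch below is a minimal Python approximation of pg_trgm's documented trigram extraction (each lowercased word padded with two spaces in front and one behind) and its set-based similarity; it is not pg_trgm's actual C implementation, and the `long_value` string is an illustrative stand-in for the generated rows, not the exact SQL output.]

```python
def trigrams(text):
    """Approximate pg_trgm trigram extraction: pad each lowercased word
    with two leading spaces and one trailing space, then collect all
    3-character windows into a set (duplicates count once)."""
    tgs = set()
    for word in text.lower().split():
        padded = "  " + word + " "
        for i in range(len(padded) - 2):
            tgs.add(padded[i : i + 3])
    return tgs

def similarity(a, b):
    """Shared trigrams divided by the size of the union of both trigram
    sets - the same ratio pg_trgm compares against the threshold."""
    ta, tb = trigrams(a), trigrams(b)
    if not ta and not tb:
        return 0.0
    shared = len(ta & tb)
    return shared / (len(ta) + len(tb) - shared)

# A long value in the spirit of the generated test rows (illustrative).
long_value = "lorem ipsum 1 " + "foo bar " * 100

# All 6 trigrams of 'lorem' occur in the long value, which is all the
# GIN index can see - so the row becomes a candidate...
assert trigrams("lorem") <= trigrams(long_value)

# ...but the union of trigram sets is much larger, so the similarity is
# well below the 0.5 threshold and the recheck drops the row.
print(round(similarity("lorem", long_value), 3))
```

This illustrates why the index alone cannot decide the match: the numerator (shared trigrams) is visible in the index, but the denominator (the value's total trigram count) is only known once the row is fetched.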