Re: Full Text Search combined with Fuzzy

Artur Zakirov <a.zakirov@xxxxxxxxxxxxxx> · Fri, 3 Mar 2017 16:08:19 +0300

On 03.03.2017 15:49, Nicolas Paris wrote:

Hi Oleg,

Thanks. I thought pgtrgm was not able to index my long texts because of
limitation of 8191 bytes per index row for btree.

Then I found out it is possible to use pgtrgm over a GIN/GIST index.
My final use case is phrase mining in texts.

I want my application returns texts that contains approximatly the user
entry:

Eg: user search "Hello Word"
a text containing "blah blah blah hello world blah blah blah" would be
returned.

Test:
postgres=# CREATE table test_trgm (texts text);
CREATE TABLE
postgres=# CREATE INDEX ON test_trgm USING GIN(texts gin_trgm_ops);
CREATE INDEX
postgres=# SET enable_seqscan = OFF;
SET
postgres=# insert into test_trgm VALUES ('blah blah blah hello world blah blah blah');
INSERT 0 1
postgres=# insert into test_trgm VALUES ('blah blah blah hello word blah blah blah');
INSERT 0 1
postgres=# SELECT texts, similarity(texts, 'hello word') FROM test_trgm WHERE texts % 'hello word';
                   texts                   | similarity
-------------------------------------------+------------
 blah blah blah hello world blah blah blah |   0.473684
 blah blah blah hello word blah blah blah  |     0.6875
(2 rows)

postgres=# EXPLAIN SELECT texts, similarity(texts, 'hello word') FROM test_trgm WHERE texts % 'hello word';
                                    QUERY PLAN
-----------------------------------------------------------------------------------
 Bitmap Heap Scan on test_trgm  (cost=52.01..56.03 rows=1 width=32)
   Recheck Cond: (texts % 'hello word'::text)
   ->  Bitmap Index Scan on test_trgm_texts_idx  (cost=0.00..52.01 rows=1 width=0)
         Index Cond: (texts % 'hello word'::text)
(4 rows)

Conclusion: If I d'say 0.4 is my threshold, would this methodology meet
my requirements ?

Thanks for the help !

Hello,

If you use PostgreSQL 9.6, then word_similarity() can help you [1]. For 
example:

postgres=# SELECT texts, word_similarity('hello word', texts) FROM 
test_trgm WHERE 'hello word' <% texts;
                   texts                   | word_similarity
-------------------------------------------+-----------------
 blah blah blah hello world blah blah blah |        0.818182
 blah blah blah hello word blah blah blah  |               1
(2 rows)

1. https://www.postgresql.org/docs/9.6/static/pgtrgm.html

--
Artur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company

--
Sent via pgsql-general mailing list (pgsql-general@xxxxxxxxxxxxxx)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general