Re: Optimizing Trigram searches in PG 9.1

Dave Crooke <dcrooke@xxxxxxxxx> · Thu, 22 Sep 2011 18:03:51 -0500

Depending on your needs, you might consider putting the data into a columnar text search engine like Lucene, having it return the integer id's which can then be used for row lookups in PG.

On Thu, Sep 22, 2011 at 11:40 AM, Jonathan Bartlett <jonathan.l.bartlett@xxxxxxxxx> wrote:

I am working on a fuzzy search of a large dataset.  Basically, it is a list of all of the songs, albums, artists, movies, and celebrities exported from Freebase.  Anyway, I was hoping to get a fuzzy search that was nearly as fast as the full-text search with the new nearest-neighbor GIST indexes, but, while it is improved from 9.0, it is still taking some time.  The table has about 16 million rows, each with a "name" column that is usually 2-10 words.

My query using full-text search is this:
explain analyze select * from entities where to_tsvector('unaccented_english', entities.name) @@ plainto_tsquery('unaccented_english', 'bon jovi');

                                                                  QUERY PLAN                                                                  
----------------------------------------------------------------------------------------------------------------------------------------------

 Bitmap Heap Scan on entities  (cost=42.64..1375.62 rows=340 width=1274) (actual time=0.422..0.617 rows=109 loops=1)
   Recheck Cond: (to_tsvector('unaccented_english'::regconfig, (name)::text) @@ '''bon'' & ''jovi'''::tsquery)

   ->  Bitmap Index Scan on entity_unaccented_name_gin_index  (cost=0.00..42.56 rows=340 width=0) (actual time=0.402..0.402 rows=109 loops=1)
         Index Cond: (to_tsvector('unaccented_english'::regconfig, (name)::text) @@ '''bon'' & ''jovi'''::tsquery)

 Total runtime: 0.728 ms
(5 rows)

My new query using trigrams is this:
explain analyze select * from entities where name % 'bon jovi';
                                                                     QUERY PLAN                                                                     

----------------------------------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on entities  (cost=913.73..46585.14 rows=13615 width=1274) (actual time=7769.380..7772.739 rows=326 loops=1)

   Recheck Cond: ((name)::text % 'bon jovi'::text)
   ->  Bitmap Index Scan on tmp_entity_name_trgm_gist_idx  (cost=0.00..910.33 rows=13615 width=0) (actual time=7769.307..7769.307 rows=326 loops=1)

         Index Cond: ((name)::text % 'bon jovi'::text)
 Total runtime: 7773.008 ms

If I put a limit on it, it gets better, but is still pretty bad:

explain analyze select * from entities where name % 'bon jovi' limit 50;
                                                                         QUERY PLAN                                                                          

-------------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.00..200.14 rows=50 width=1274) (actual time=1.246..1226.146 rows=50 loops=1)

   ->  Index Scan using tmp_entity_name_trgm_gist_idx on entities  (cost=0.00..54498.48 rows=13615 width=1274) (actual time=1.243..1226.016 rows=50 loops=1)
         Index Cond: ((name)::text % 'bon jovi'::text)

 Total runtime: 1226.261 ms
(4 rows)

And if I try to get the "best" matches, the performance goes completely down the tubes, even with a limit:
explain analyze select * from entities where name % 'bon jovi' order by name <-> 'bon jovi' limit 50;

                                                                          QUERY PLAN                                                                           
---------------------------------------------------------------------------------------------------------------------------------------------------------------

 Limit  (cost=0.00..200.39 rows=50 width=1274) (actual time=421.811..8058.877 rows=50 loops=1)
   ->  Index Scan using tmp_entity_name_trgm_gist_idx on entities  (cost=0.00..54566.55 rows=13615 width=1274) (actual time=421.808..8058.766 rows=50 loops=1)

         Index Cond: ((name)::text % 'bon jovi'::text)
         Order By: ((name)::text <-> 'bon jovi'::text)
 Total runtime: 8060.760 ms

Anyway, this may just be a limitation of the trigram indexing, but my hope was to get a fuzzy search that at least approached the performance of the full-text searching.  Am I missing something, or am I just bumping into the limits?  I also noticed that different strings get radically different performance.  Searching for "hello" drops the search time down to 310ms!  But searching for 'hello my friend' brings the search time to 9616ms!

explain analyze select * from entities where name % 'hello' order by name <-> 'hello' limit 50;
                                                                         QUERY PLAN                                                                         

------------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.00..200.39 rows=50 width=1274) (actual time=5.056..309.492 rows=50 loops=1)

   ->  Index Scan using tmp_entity_name_trgm_gist_idx on entities  (cost=0.00..54566.55 rows=13615 width=1274) (actual time=5.053..309.393 rows=50 loops=1)
         Index Cond: ((name)::text % 'hello'::text)
         Order By: ((name)::text <-> 'hello'::text)
 Total runtime: 309.637 ms

explain analyze select * from entities where name % 'hello my friend' order by name <-> 'hello my friend' limit 50;

                                                                          QUERY PLAN                                                                          
--------------------------------------------------------------------------------------------------------------------------------------------------------------

 Limit  (cost=0.00..200.39 rows=50 width=1274) (actual time=76.358..9616.066 rows=50 loops=1)
   ->  Index Scan using tmp_entity_name_trgm_gist_idx on entities  (cost=0.00..54566.55 rows=13615 width=1274) (actual time=76.356..9615.968 rows=50 loops=1)

         Index Cond: ((name)::text % 'hello my friend'::text)
         Order By: ((name)::text <-> 'hello my friend'::text)
 Total runtime: 9616.203 ms

For reference, here is my table structure:
\d entities
                                          Table "public.entities"
        Column        |            Type             |                       Modifiers                       

----------------------+-----------------------------+-------------------------------------------------------
 id                   | integer                     | not null default nextval('entities_id_seq'::regclass)

 name                 | character varying(255)      | 
 disambiguation       | character varying(255)      | 
 description          | text                        | 
 entity_basic_type    | character varying(255)      | 

 entity_extended_type | character varying(255)      | 
 primary              | boolean                     | default true
 semantic_world_id    | integer                     | 
 calc_completed       | boolean                     | default true

 source               | text                        | 
 source_entity_id     | integer                     | 
 created_at           | timestamp without time zone | 
 updated_at           | timestamp without time zone | 

 data_import_id       | integer                     | 
 validated            | boolean                     | default true
 weight               | integer                     | 
 description_source   | text                        | 

 description_url      | text                        | 
 rating               | text                        | 
Indexes:
    "entities_pkey" PRIMARY KEY, btree (id)
    "entity_lower_name_idx" btree (lower(name::text) text_pattern_ops)

    "entity_name_gin_index" gin (to_tsvector('english'::regconfig, name::text))
    "entity_unaccented_name_gin_index" gin (to_tsvector('unaccented_english'::regconfig, name::text))

    "index_entities_on_data_import_id" btree (data_import_id)
    "index_entities_on_name" btree (name)
    "index_entities_on_source" btree (source)
    "tmp_entity_name_trgm_gist_idx" gist (name gist_trgm_ops)

Thanks for the help!

Jon