On Thu, Sep 22, 2011 at 11:40 AM, Jonathan Bartlett <jonathan.l.bartlett@xxxxxxxxx> wrote:
I am working on a fuzzy search of a large dataset. Basically, it is a list of all of the songs, albums, artists, movies, and celebrities exported from Freebase. Anyway, I was hoping to get a fuzzy search that was nearly as fast as the full-text search with the new nearest-neighbor GIST indexes, but, while it is improved from 9.0, it is still taking some time. The table has about 16 million rows, each with a "name" column that is usually 2-10 words.My query using full-text search is this:explain analyze select * from entities where to_tsvector('unaccented_english', entities.name) @@ plainto_tsquery('unaccented_english', 'bon jovi');QUERY PLAN----------------------------------------------------------------------------------------------------------------------------------------------Bitmap Heap Scan on entities (cost=42.64..1375.62 rows=340 width=1274) (actual time=0.422..0.617 rows=109 loops=1)Recheck Cond: (to_tsvector('unaccented_english'::regconfig, (name)::text) @@ '''bon'' & ''jovi'''::tsquery)-> Bitmap Index Scan on entity_unaccented_name_gin_index (cost=0.00..42.56 rows=340 width=0) (actual time=0.402..0.402 rows=109 loops=1)Index Cond: (to_tsvector('unaccented_english'::regconfig, (name)::text) @@ '''bon'' & ''jovi'''::tsquery)Total runtime: 0.728 ms(5 rows)My new query using trigrams is this:explain analyze select * from entities where name % 'bon jovi';QUERY PLAN----------------------------------------------------------------------------------------------------------------------------------------------------Bitmap Heap Scan on entities (cost=913.73..46585.14 rows=13615 width=1274) (actual time=7769.380..7772.739 rows=326 loops=1)Recheck Cond: ((name)::text % 'bon jovi'::text)-> Bitmap Index Scan on tmp_entity_name_trgm_gist_idx (cost=0.00..910.33 rows=13615 width=0) (actual time=7769.307..7769.307 rows=326 loops=1)Index Cond: ((name)::text % 'bon jovi'::text)Total runtime: 7773.008 msIf I put a limit on it, it gets better, but is still pretty bad:explain analyze select * from entities where name % 'bon jovi' limit 50;QUERY PLAN-------------------------------------------------------------------------------------------------------------------------------------------------------------Limit (cost=0.00..200.14 rows=50 width=1274) (actual time=1.246..1226.146 rows=50 loops=1)-> Index Scan using tmp_entity_name_trgm_gist_idx on entities (cost=0.00..54498.48 rows=13615 width=1274) (actual time=1.243..1226.016 rows=50 loops=1)Index Cond: ((name)::text % 'bon jovi'::text)Total runtime: 1226.261 ms(4 rows)And if I try to get the "best" matches, the performance goes completely down the tubes, even with a limit:explain analyze select * from entities where name % 'bon jovi' order by name <-> 'bon jovi' limit 50;QUERY PLAN---------------------------------------------------------------------------------------------------------------------------------------------------------------Limit (cost=0.00..200.39 rows=50 width=1274) (actual time=421.811..8058.877 rows=50 loops=1)-> Index Scan using tmp_entity_name_trgm_gist_idx on entities (cost=0.00..54566.55 rows=13615 width=1274) (actual time=421.808..8058.766 rows=50 loops=1)Index Cond: ((name)::text % 'bon jovi'::text)Order By: ((name)::text <-> 'bon jovi'::text)Total runtime: 8060.760 msAnyway, this may just be a limitation of the trigram indexing, but my hope was to get a fuzzy search that at least approached the performance of the full-text searching. Am I missing something, or am I just bumping into the limits? I also noticed that different strings get radically different performance. Searching for "hello" drops the search time down to 310ms! But searching for 'hello my friend' brings the search time to 9616ms!explain analyze select * from entities where name % 'hello' order by name <-> 'hello' limit 50;QUERY PLAN------------------------------------------------------------------------------------------------------------------------------------------------------------Limit (cost=0.00..200.39 rows=50 width=1274) (actual time=5.056..309.492 rows=50 loops=1)-> Index Scan using tmp_entity_name_trgm_gist_idx on entities (cost=0.00..54566.55 rows=13615 width=1274) (actual time=5.053..309.393 rows=50 loops=1)Index Cond: ((name)::text % 'hello'::text)Order By: ((name)::text <-> 'hello'::text)Total runtime: 309.637 msexplain analyze select * from entities where name % 'hello my friend' order by name <-> 'hello my friend' limit 50;QUERY PLAN--------------------------------------------------------------------------------------------------------------------------------------------------------------Limit (cost=0.00..200.39 rows=50 width=1274) (actual time=76.358..9616.066 rows=50 loops=1)-> Index Scan using tmp_entity_name_trgm_gist_idx on entities (cost=0.00..54566.55 rows=13615 width=1274) (actual time=76.356..9615.968 rows=50 loops=1)Index Cond: ((name)::text % 'hello my friend'::text)Order By: ((name)::text <-> 'hello my friend'::text)Total runtime: 9616.203 msFor reference, here is my table structure:\d entitiesTable "public.entities"Column | Type | Modifiers----------------------+-----------------------------+-------------------------------------------------------id | integer | not null default nextval('entities_id_seq'::regclass)name | character varying(255) |disambiguation | character varying(255) |description | text |entity_basic_type | character varying(255) |entity_extended_type | character varying(255) |primary | boolean | default truesemantic_world_id | integer |calc_completed | boolean | default truesource | text |source_entity_id | integer |created_at | timestamp without time zone |updated_at | timestamp without time zone |data_import_id | integer |validated | boolean | default trueweight | integer |description_source | text |description_url | text |rating | text |Indexes:"entities_pkey" PRIMARY KEY, btree (id)"entity_lower_name_idx" btree (lower(name::text) text_pattern_ops)"entity_name_gin_index" gin (to_tsvector('english'::regconfig, name::text))"entity_unaccented_name_gin_index" gin (to_tsvector('unaccented_english'::regconfig, name::text))"index_entities_on_data_import_id" btree (data_import_id)"index_entities_on_name" btree (name)"index_entities_on_source" btree (source)"tmp_entity_name_trgm_gist_idx" gist (name gist_trgm_ops)Thanks for the help!Jon