Full text search - query plan? PG 8.4.1

Jesper Krogh <jesper@xxxxxxxx> · Sun, 18 Oct 2009 18:49:45 +0200

Hi.

I'm currently testing out PostgreSQL's Full Text Search capabillities.
We're currenly using Xapian, it has some nice features and some
drawbacks (sorting), so it is especially this area I'm investigating.

I've loaded the database with 50K documents, and the table definition
is:

ftstest=# \d uniprot
                               Table "public.uniprot"
      Column      |   Type   |                      Modifiers

------------------+----------+------------------------------------------------------
 id               | integer  | not null default
nextval('textbody_id_seq'::regclass)
 body             | text     | not null default ''::text
 textbody_body_fts | tsvector |
 accession_number | text     | not null default ''::text
Indexes:
    "accno_unique_idx" UNIQUE, btree (accession_number)
    "textbody_tfs_idx" gin (textbody_body_fts)
Triggers:
    tsvectorupdate BEFORE INSERT OR UPDATE ON textbody FOR EACH ROW
EXECUTE PROCEDURE tsvector_update_trigger('textbody_body_fts',
'pg_catalog.english', 'body')

"commonterm" matches 37K of the 50K documents (majority), but the query
plan is "odd" in my eyes.

* Why does it mis-guess the cost of a Seq Scan on textbody so much?
* Why doesn't it use the index in "id" to fetch the 10 records?

ftstest=#  ANALYZE textbody;
ANALYZE
ftstest=# explain analyze select body from textbody where
textbody_body_fts @@ to_tsquery('commonterm') order by id limit 10 offset 0
                                                          QUERY PLAN

------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=2841.08..2841.11 rows=10 width=5) (actual
time=48031.563..48031.568 rows=10 loops=1)
   ->  Sort  (cost=2841.08..2933.01 rows=36771 width=5) (actual
time=48031.561..48031.564 rows=10 loops=1)
         Sort Key: id
         Sort Method:  top-N heapsort  Memory: 31kB
         ->  Seq Scan on textbody  (cost=0.00..2046.47 rows=36771
width=5) (actual time=100.107..47966.590 rows=37133 loops=1)
               Filter: (textbody_body_fts @@ to_tsquery('commonterm'::text))
 Total runtime: 48031.612 ms
(7 rows)

This query-plan doesn't answer the questions above, but it does indeed
speed it up significantly (by heading into a Bitmap Index Scan instead
of a Seq Scan)

ftstest=# set enable_seqscan=off;
SET

ftstest=# explain analyze select body from textbody where
textbody_body_fts @@ to_tsquery('commonterm') order by id limit 10 offset 0

QUERY PLAN

---------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=269942.41..269942.43 rows=10 width=5) (actual
time=47.567..47.572 rows=10 loops=1)
   ->  Sort  (cost=269942.41..270034.34 rows=36771 width=5) (actual
time=47.565..47.567 rows=10 loops=1)
         Sort Key: id
         Sort Method:  top-N heapsort  Memory: 31kB
         ->  Bitmap Heap Scan on textbody  (cost=267377.23..269147.80
rows=36771 width=5) (actual time=15.763..30.576 rows=37133 loops=1)
               Recheck Cond: (textbody_body_fts @@
to_tsquery('commonterm'::text))
               ->  Bitmap Index Scan on textbody_tfs_idx
(cost=0.00..267368.04 rows=36771 width=0) (actual time=15.419..15.419
rows=37134 loops=1)
                     Index Cond: (textbody_body_fts @@
to_tsquery('commonterm'::text))
 Total runtime: 47.634 ms
(9 rows)

To me it seems like the query planner could do a better job?

On "rare" terms everything seems to work excellent.

N.B.: looks a lot like this:
http://archives.postgresql.org/pgsql-performance/2009-07/msg00190.php

-- 
Jesper

-- 
Sent via pgsql-performance mailing list (pgsql-performance@xxxxxxxxxxxxxx)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance