I don't see your query uses index :) On Tue, 12 Jul 2011, Nicolas Grilly wrote:
Hello, I'm testing PostgreSQL full-text search on a table containing 1.000.000 documents. Document average length is 5.700 chars. Performance is good and very similar to what I can get with Xapian if I don't use ts_rank. But response time collapses if I use ts_rank to select the 50 best matching documents. This is the table and index definition: create table posts_1000000 ( id serial primary key, document_vector tsvector ); create index index_posts_documents_1000000 ON posts_1000000 USING gin(document_vector); This is the query without ts_rank (the word 'crare' matches 5 % of documents): select id from posts_1000000 where to_tsquery('english', 'crare') @@ document_vector limit 50 Limit (cost=0.00..27.93 rows=50 width=4) (actual time=0.303..12.559 rows=50 loops=1) Output: id -> Seq Scan on posts_1000000 (cost=0.00..27472.51 rows=49184 width=4) (actual time=0.299..12.451 rows=50 loops=1) Output: id Filter: ('''crare'''::tsquery @@ document_vector) Total runtime: 12.642 ms Now, this is the query using ts_rank: select id from posts_1000000 where to_tsquery('english', 'crare') @@ document_vector order by ts_rank_cd(document_vector, to_tsquery('english', 'crare'), 32) desc limit 50 Limit (cost=29229.33..29229.45 rows=50 width=22) (actual time=355516.233..355516.339 rows=50 loops=1) Output: id -> Sort (cost=29229.33..29352.29 rows=49184 width=22) (actual time=355516.230..355516.268 rows=50 loops=1) Output: id Sort Key: (ts_rank_cd(document_vector, '''crare'''::tsquery, 32)) Sort Method: top-N heapsort Memory: 27kB -> Seq Scan on posts_1000000 (cost=0.00..27595.47 rows=49184 width=22) (actual time=0.251..355389.367 rows=49951 loops=1) Output: id Filter: ('''crare'''::tsquery @@ document_vector) Total runtime: 355535.063 ms The ranking is very slow: 140 ranked documents / second on my machine! I'm afraid this is because ts_rank needs to read document_vector, and because that column is stored in TOAST table, it triggers a random access for each matching row. Am I correct? Is it the expected behavior? Is there a way to reduce the execution time? I use PostgreSQL 8.4 with shared_buffers = 256 MB, work_mem = 256 MB. Thanks for your help and advice.
Regards, Oleg _____________________________________________________________ Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru), Sternberg Astronomical Institute, Moscow University, Russia Internet: oleg@xxxxxxxxxx, http://www.sai.msu.su/~megera/ phone: +007(495)939-16-83, +007(495)939-23-83 -- Sent via pgsql-general mailing list (pgsql-general@xxxxxxxxxxxxxx) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-general