Re: ts_rank seems very slow (140 ranked documents / second on my machine)

Oleg Bartunov <oleg@xxxxxxxxxx> · Wed, 13 Jul 2011 19:36:11 +0400 (MSD)

I didn't notice, reading 40K tuples in random order takes a long time and this
is a problem of any database. Can you measure time to read all documents found ?
 :( The only solution I see is to store enough
information for ranking in index.

Oleg
On Wed, 13 Jul 2011, Nicolas Grilly wrote:

Hi Oleg and all,

On Wed, Jul 13, 2011 at 08:16, Oleg Bartunov <oleg@xxxxxxxxxx> wrote:
there is problem with estimating of cost scanning gin index in < 9.1
versions,
so you can set enable_seqscan=off;
or try 9.1 which beta3 now.

I re-ran my queries using enable seqscan=off.

Now the first query, without ts_rank, uses the GIN index:

set enable_seqscan=off;
explain analyze select id
from posts_1000000
where to_tsquery('english', 'crare') @@ document_vector
limit 50;

Limit  (cost=42290.12..42306.31 rows=50 width=4) (actual
time=16.259..16.412 rows=50 loops=1)
 ->  Bitmap Heap Scan on posts_1000000  (cost=42290.12..57877.02
rows=48152 width=4) (actual time=16.256..16.344 rows=50 loops=1)
       Recheck Cond: ('''crare'''::tsquery @@ document_vector)
       ->  Bitmap Index Scan on index_posts_documents_1000000
(cost=0.00..42278.08 rows=48152 width=0) (actual time=13.265..13.265
rows=49951 loops=1)
             Index Cond: ('''crare'''::tsquery @@ document_vector)
Total runtime: 16.484 ms

But the second query, the one that uses ts_rank, is still very slow...
Any idea why? Is ts_rank efficient enough to find the best 50 matches
among 50 000 documents?

set enable_seqscan=off;
explain analyze select id
from posts_1000000
where to_tsquery('english', 'crare') @@ document_vector
order by ts_rank_cd(document_vector, to_tsquery('english', 'crare'), 32) desc
limit 50;

Limit  (cost=59596.98..59597.10 rows=50 width=22) (actual
time=296212.052..296212.257 rows=50 loops=1)
 ->  Sort  (cost=59596.98..59717.36 rows=48152 width=22) (actual
time=296186.928..296187.007 rows=50 loops=1)"
       Sort Key: (ts_rank_cd(document_vector, '''crare'''::tsquery, 32))
       Sort Method:  top-N heapsort  Memory: 27kB
       ->  Bitmap Heap Scan on posts_1000000
(cost=42290.12..57997.40 rows=48152 width=22) (actual
time=70.861..296059.515 rows=49951 loops=1)
             Recheck Cond: ('''crare'''::tsquery @@ document_vector)
             ->  Bitmap Index Scan on index_posts_documents_1000000
(cost=0.00..42278.08 rows=48152 width=0) (actual time=24.922..24.922
rows=49951 loops=1)
                   Index Cond: ('''crare'''::tsquery @@ document_vector)
Total runtime: 296220.493 ms

By the way, does ts_rank is supposed to use a GIN index when it's
available?

no, I see no benefit :)

Ok. But what is the solution to improve ts_rank execution time? Am I
doing something wrong?

Thanks for your help,

Nicolas

	Regards,
		Oleg
_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg@xxxxxxxxxx, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83

--
Sent via pgsql-general mailing list (pgsql-general@xxxxxxxxxxxxxx)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general