Tom Lane wrote:
> Jesper Krogh <jesper@xxxxxxxx> writes:
>> Tom Lane wrote:
>>> ... There's something strange about your tsvector index.  Maybe
>>> it's really huge because the documents are huge?
>
>> huge is a relative term, but length(ts_vector(body)) is about 200 for
>> each document. Is that huge?
>
> It's bigger than the toy example I was trying, but not *that* much
> bigger.  I think maybe your index is bloated.  Try dropping and
> recreating it and see if the estimates change any.

I'm a bit reluctant to drop and re-create it; it would take a couple of
days to regenerate, so hopefully this is not a common situation for the
system.

I have set the statistics target to 1000 for the tsvector column; the
documentation didn't mention any serious downsides of doing that, and
since then I haven't seen row estimates that are orders of magnitude off.

The table was built from scratch using inserts all the way up to around
10M rows. Should that result in index bloat? Can I inspect the amount of
bloat without rebuilding (or a similarly locking operation)?

The query still has a "wrong" tipping point between the two query plans:

ftstest=# explain analyze select body from ftstest where ftstest_body_fts @@ to_tsquery('testterm') order by id limit 100;
                                                                       QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.00..7357.77 rows=100 width=738) (actual time=3978.974..8595.086 rows=100 loops=1)
   ->  Index Scan using ftstest_id_pri_idx on ftstest  (cost=0.00..1436458.05 rows=19523 width=738) (actual time=3978.971..8594.932 rows=100 loops=1)
         Filter: (ftstest_body_fts @@ to_tsquery('testterm'::text))
 Total runtime: 8595.222 ms
(4 rows)

ftstest=# set enable_indexscan=off;
SET
ftstest=# explain analyze select body from ftstest where ftstest_body_fts @@ to_tsquery('testterm') order by id limit 100;
                                                                    QUERY PLAN
------------------------------------------------------------------------------------------------------------------
 Limit  (cost=59959.61..59959.86 rows=100 width=738) (actual time=338.832..339.055 rows=100 loops=1)
   ->  Sort  (cost=59959.61..60008.41 rows=19523 width=738) (actual time=338.828..338.908 rows=100 loops=1)
         Sort Key: id
         Sort Method:  top-N heapsort  Memory: 32kB
         ->  Bitmap Heap Scan on ftstest  (cost=22891.18..59213.45 rows=19523 width=738) (actual time=5.097..316.780 rows=19444 loops=1)
               Recheck Cond: (ftstest_body_fts @@ to_tsquery('testterm'::text))
               ->  Bitmap Index Scan on ftstest_tfs_idx  (cost=0.00..22886.30 rows=19523 width=0) (actual time=4.259..4.259 rows=20004 loops=1)
                     Index Cond: (ftstest_body_fts @@ to_tsquery('testterm'::text))
 Total runtime: 339.201 ms
(9 rows)

So, to fetch 100 rows where the term exists in 19,444 of the 10,000,000
documents, it chooses the index scan, which (given a random distribution
of the matching documents) should have to scan about
100*(10000000/19444) = 51429 documents. So it either believes the cost
of the bitmap index scan is higher than it actually is, or that the cost
of the index scan is lower than it actually is.

Is it possible to manually set the cost for the @@ operator? It seems
natural that matching a tsvector against a tsquery, which is a much
heavier operation than "=" on an integer in plain storage (the tsvector
is even kept in EXTENDED storage), should carry a much higher cost. I
tried searching the docs for operator cost, but I only found the overall
base values in the configuration file.
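
As a possible answer to the operator-cost question above: as far as I can
tell the cost knob hangs off the function that implements the operator,
not the operator itself. A rough sketch, assuming @@ on
(tsvector, tsquery) is backed by ts_match_vq (worth confirming against
pg_operator first):

  -- Look up which function implements @@ for (tsvector, tsquery):
  SELECT oprcode
    FROM pg_operator
   WHERE oprname = '@@'
     AND oprleft = 'tsvector'::regtype
     AND oprright = 'tsquery'::regtype;

  -- Raise its estimated per-call cost (the number is a multiplier of
  -- cpu_operator_cost; the default for built-in functions is 1):
  ALTER FUNCTION ts_match_vq(tsvector, tsquery) COST 100;

A higher function cost should make the planner charge more for evaluating
the @@ filter on every row it walks past in the id-ordered index scan,
which is roughly the effect being asked for here.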
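
On the bloat question, a minimal way to at least compare on-disk sizes
without any rebuild or heavy lock (index and table names taken from the
plans above):

  -- Size of the GIN index versus the table it indexes:
  SELECT pg_size_pretty(pg_relation_size('ftstest_tfs_idx')) AS fts_index_size,
         pg_size_pretty(pg_relation_size('ftstest'))         AS table_size,
         pg_size_pretty(pg_total_relation_size('ftstest'))   AS table_incl_indexes_and_toast;

This only gives raw sizes, not a bloat percentage, but an index that is a
large multiple of what a freshly built copy comes out to on a test box
would point towards bloat.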
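
And for completeness, the statistics-target change mentioned above was
done along these lines (column name as it appears in the plans):

  ALTER TABLE ftstest ALTER COLUMN ftstest_body_fts SET STATISTICS 1000;
  ANALYZE ftstest;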
Jesper

--
Jesper