Re: Queryplan within FTS/GIN index -search.

Jesper Krogh <jesper@xxxxxxxx> · Tue, 03 Nov 2009 19:36:13 +0100

Tom Lane wrote:
> It may well be that Jesper's identified a place where the GIN code could
> be improved --- it seems like having the top-level search logic be more
> aware of the AND/OR structure of queries would be useful.  But the
> particular example shown here doesn't make a very good case for that,
> because it's hard to tell how much of a penalty would be taken in more
> realistic examples.

With a term sitting in:
80% of the docs the penalty is: x23
60% of the docs the penalty is: x17
40% of the docs the penalty is: x13
of doing
vectorcol @@ ts_query('term & commonterm')
compared to
vectorcol @@ ts_query('term) and vectorcol @@ ts_query('commonterm');
where term is non-existing (or rare).

(in query execution performance on a fully memory recident dataset,
doing test with "drop_caches" and restart pg to simulate a dead disk the
numbers are a bit higher).

http://article.gmane.org/gmane.comp.db.postgresql.performance/22496/match=

Would you ever quantify a term sitting in 60-80% as a stop-word candidate?

I dont know if x13 in execution performance is worth hunting or there
are lower hanging fruits sitting in the fts-search-system.

This is essentially the penalty the user will get for adding a terms to
their search that rarely restricts the results.

In term of the usual "set theory" that databases work in, a search for a
stop-word translated into the full set. This is just not the case in
where it throws a warning and returns the empty set. This warning can be
caught by application code to produce the "correct" result to the users,
but just slightly more complex queries dont do this:

ftstest=# select id from ftstest where body_fts @@ to_tsquery('random |
the') limit 10;
 id
----
(0 rows)

Here I would have expected the same error.. I basically have to hook in
the complete stop-word dictionary in a FTS-preparser to give the user
the expected results or have I missed a feature somwhere?

My reason for not pushing "commonterms" into the stopword list is that
they actually perform excellent in PG.

Same body as usual, but commonterm99 is sitting in 99% of the documents.

ftstest=# set enable_seqscan=off;
SET
ftstest=# explain analyze select id from ftstest where body_fts @@
to_tsquery('commonterm99');
                                                                QUERY
PLAN
------------------------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on ftstest  (cost=1051476.74..1107666.07 rows=197887
width=4) (actual time=51.036..121.348 rows=197951 loops=1)
   Recheck Cond: (body_fts @@ to_tsquery('commonterm99'::text))
   ->  Bitmap Index Scan on ftstest_gin_idx  (cost=0.00..1051427.26
rows=197887 width=0) (actual time=49.602..49.602 rows=197951 loops=1)
         Index Cond: (body_fts @@ to_tsquery('commonterm99'::text))
 Total runtime: 147.350 ms
(5 rows)

ftstest=# set enable_seqscan=on;
SET
ftstest=# explain analyze select id from ftstest where body_fts @@
to_tsquery('commonterm99');
                                                    QUERY PLAN

------------------------------------------------------------------------------------------------------------------
 Seq Scan on ftstest  (cost=0.00..56744.00 rows=197887 width=4) (actual
time=0.086..7134.384 rows=197951 loops=1)
   Filter: (body_fts @@ to_tsquery('commonterm99'::text))
 Total runtime: 7194.182 ms
(3 rows)

So in order to get the result with a speedup of more than x50 I simply
cannot add these terms to the stop-words because then the first query
would resolve to an error and getting results would then be up to the
second query.

My bet is that doing a seq_scan will "never" be beneficial for this type
of query.

As far as I can see the only consequence of simply not remove stop-words
at all is a (fairly small) increase in index-size. It seems to me that
stop-words were invented when it was hard to get more than 2GB of memory
into a computer to get the index-size reduced to a size that better
could fit into memory. But nowadays it seems like the downsides are hard
to see?

Jesper
-- 
Jesper

-- 
Sent via pgsql-performance mailing list (pgsql-performance@xxxxxxxxxxxxxx)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance