Re: Ugh - bad plan with LIMIT in a complex SELECT, any way to fix this?

Overall comment: try reading through these old threads; most of your
problem is the same issue:

http://article.gmane.org/gmane.comp.db.postgresql.performance/22395/match=gin
http://thread.gmane.org/gmane.comp.db.postgresql.performance/22331/focus=22434


                          Table "public.post"
   Column   |           Type           |                       Modifiers
 -----------+--------------------------+--------------------------------------------------------
  subject   | text                     |
  message   | text                     |
  inserted  | timestamp with time zone |
  modified  | timestamp with time zone |
  replied   | timestamp with time zone |
  ordinal   | integer                  | not null default nextval('post_ordinal_seq'::regclass)
 Indexes:
     "post_pkey" PRIMARY KEY, btree (ordinal)
     "idx_message" gin (to_tsvector('english'::text, message))
     "idx_subject" gin (to_tsvector('english'::text, subject))

There's a bunch of other stuff in the table and many more indices, plus
foreign references, but stripping the table down to JUST THIS shows the
problem.

ticker=# explain analyze select * from post where to_tsvector('english',
message) @@ to_tsquery('violence') order by modified desc;
                                                             QUERY PLAN
 -----------------------------------------------------------------------------------------------------------------------------------
  Sort  (cost=31795.16..31819.68 rows=9808 width=436) (actual time=14.222..17.213 rows=3421 loops=1)
    Sort Key: modified
    Sort Method:  quicksort  Memory: 3358kB
    ->  Bitmap Heap Scan on post  (cost=1418.95..31144.90 rows=9808 width=436) (actual time=1.878..7.514 rows=3421 loops=1)
          Recheck Cond: (to_tsvector('english'::text, message) @@ to_tsquery('violence'::text))
          ->  Bitmap Index Scan on idx_message  (cost=0.00..1416.49 rows=9808 width=0) (actual time=1.334..1.334 rows=3434 loops=1)
                Index Cond: (to_tsvector('english'::text, message) @@ to_tsquery('violence'::text))
  Total runtime: 20.547 ms
 (8 rows)

Ok, very nice.  20ms.  I like that.

Now let's limit the return to 100 items:

ticker=# explain analyze select * from post where to_tsvector('english',
message) @@ to_tsquery('violence') order by modified desc limit 100;
                                                                      QUERY PLAN
 ----------------------------------------------------------------------------------------------------------------------------------------------------
  Limit  (cost=0.00..5348.69 rows=100 width=436) (actual time=198.047..2607.077 rows=100 loops=1)
    ->  Index Scan Backward using post_modified on post  (cost=0.00..524599.31 rows=9808 width=436) (actual time=198.043..2606.864 rows=100 loops=1)
          Filter: (to_tsvector('english'::text, message) @@ to_tsquery('violence'::text))
  Total runtime: 2607.231 ms
 (4 rows)

Bad.  Notice that the optimizer decided to do an index scan with an
internal filter on it!  That's BACKWARD; what I want is for the
planner to first execute the scan on the GIN index, then sort and
limit the returned data set.
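One way to coax that shape of plan (a workaround I'm sketching here, not something from this thread) is to materialize the GIN-filtered rows in a subquery with OFFSET 0, which acts as an optimization fence in PostgreSQL of this vintage, and only then sort and limit:

```sql
-- Sketch: OFFSET 0 fences the subquery, so the planner runs the GIN
-- bitmap scan first; the sort and LIMIT then apply to its (small)
-- result instead of walking post_modified backward over the whole table.
SELECT *
FROM (SELECT *
      FROM post
      WHERE to_tsvector('english', message) @@ to_tsquery('violence')
      OFFSET 0) AS matched
ORDER BY modified DESC
LIMIT 100;
```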

But it gets much worse - let's use something that's NOT in the message
base (the table in question has some ~2m rows by the way and consumes
several gigabytes on disk - anything that actually READS the table is
instant "bad news!")

The one problem is that the query planner doesn't have any
specific knowledge about the cost of the GIN index search. That's
mentioned in one of the above threads.

The other problem is that the costs of "to_tsvector" and "ts_match_vq"
are set way too conservatively in the default installation. Bumping those
up will increase your share of correct plans, but it doesn't solve all of it,
since the above problem is also interfering. But try upping the cost
of those two functions significantly:

alter function ts_match_vq(tsvector, tsquery) cost 500;

(upping the cost 500x for that one). If I've got it right, that is "more in
the correct ballpark"; the cost more or less translates to "how much more
expensive the function is compared to really simple operators".
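The companion statement for to_tsvector would look like this (the two-argument signature is assumed from a stock installation; 500 is the same illustrative factor, not a tuned value):

```sql
-- Illustrative: raise the planner's cost estimate for the
-- two-argument form of to_tsvector by the same factor.
ALTER FUNCTION to_tsvector(regconfig, text) COST 500;
```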

Another thing you can do, which favours the running time of queries
using to_tsvector(), is to store the tsvector explicitly in the table and
create an index on that. At run time that translates into fewer
calls to to_tsvector (0, to be precise), leaving only the cost of
ts_match_vq.
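A minimal sketch of that approach (column, index, and trigger names are made up for illustration):

```sql
-- Hypothetical names; adjust to taste.
ALTER TABLE post ADD COLUMN message_tsv tsvector;
UPDATE post SET message_tsv = to_tsvector('english', message);

CREATE INDEX idx_message_tsv ON post USING gin (message_tsv);

-- Keep the column current with the built-in trigger helper.
CREATE TRIGGER post_tsv_update
  BEFORE INSERT OR UPDATE ON post
  FOR EACH ROW
  EXECUTE PROCEDURE tsvector_update_trigger(message_tsv, 'pg_catalog.english', message);

-- Queries then match against the stored vector directly:
-- SELECT * FROM post WHERE message_tsv @@ to_tsquery('violence');
```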

Why is the planner "taking into consideration" the LIMIT (I know the
docs say it does) and choosing to sequentially scan a table of nearly 2
million rows?!  I don't see how that makes sense.... irrespective of the
query being LIMITed.

If it matters, setting enable_seqscan OFF does not impact the results.

No, because you end up with index scans on non-GIN indexes in that
situation, so turning seqscan off has no effect. The planner assumes the
estimated 9808 matches are spread evenly across modified, so it expects a
backward scan of post_modified to find 100 of them after reading only a
small fraction of the index - that is why the LIMIT makes that plan look
cheap.


--
Jesper





--
Sent via pgsql-performance mailing list (pgsql-performance@xxxxxxxxxxxxxx)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance
