"explain analyse" much slower than actual query

"Phil Endecott" <spam_from_postgresql_general@xxxxxxxxxxxx> · Sun, 28 Jan 2007 16:31:06 +0000

Dear All,

I want to find all of the msg_ids from "messages" that are not in table 
"part_tsearch", where both tables are large but the result of the 
query is normally small, and often empty.  To 'protect' the application
against the unusual case where the result of the query is large I have 
added a limit clause:

select msg_id from messages where msg_id not in (select msg_id from part_tsearch) limit 10;

Currently there are about 30,000 rows in each table and about 500
rows in the result of the (un-limit-ed) query.  So it is unusual in
the sense that the size of the result is relatively large and the limit
clause will take effect.  Both tables are indexed by msg_id.

This was taking longer than I expected, so I decided to 
explain-analyse it.  However, when I run it under explain-analyse, 
it takes even longer than before:

decimail=> select msg_id from messages where msg_id not in (select msg_id from part_tsearch) limit 10;
 msg_id 
--------
  67894
  67809
  52548
  67745
  67538
  67540
  67329
  67246
  67235
  67140
(10 rows)

(that takes about 2 seconds)

decimail=> explain analyse select msg_id from messages where msg_id not in (select msg_id from part_tsearch) limit 10;
                                                               QUERY PLAN                                                               
----------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=4301.99..10534.34 rows=10 width=4) (actual time=6677.791..72417.068 rows=10 loops=1)
   ->  Seq Scan on messages  (cost=4301.99..11966058.86 rows=19193 width=4) (actual time=6677.725..72416.427 rows=10 loops=1)
         Filter: (NOT (subplan))
         SubPlan
           ->  Materialize  (cost=4301.99..4830.07 rows=37908 width=4) (actual time=0.097..39.494 rows=862 loops=903)
                 ->  Seq Scan on part_tsearch  (cost=0.00..4115.08 rows=37908 width=4) (actual time=0.104..1679.258 rows=37901 loops=1)
 Total runtime: 72424.066 ms
(7 rows)

As I increase the limit the runtime increases as follows:

limit   normal runtime   explain-anlyse runtime
10      2                77
20      5                309
40      12               807
80      88               
160     149              
320     1016

I was not patient enough to wait for the remaining explain-analyse results,
but I feel that there is a linear slowdown of about 60x between the raw 
query and the explain-analyse version.

In general, for other queries, explain-analyse reports runtimes that agree 
with the runtimes of the actual queries.

But the peculiar behaviour of explain-analyse is really a distraction from
the fact that the query is slow, especially when the limit value is large.

The system seems to be largely CPU-bound during these long run-times.

The query plan reported by explain-analyse is the same in each case.  How
many times is it actually doing the seq-scan on part_tsearch?  I see
that the rows value reported for "Materialize" is rather different in the
planned and actual numbers.  What is this telling me?  I analysed the
tables immediately before starting on this.

I was hoping that it would be implemented using some sort of index scan on
the two tables, maybe something involing bitmaps.  Is there something that 
I can do to the query, or to my indexes, to make this happen?

I tried using "except" rather than "not in <subquery>"; this isn't 
exactly identical as "except" is required to return sorted results, which 
I don't need; when the limit clause is in effect I don't care
which of the possible ids are returned and I never care about the order.
In this case the runtimes are as follows:

limit   normal runtime   explain-anlyse runtime
10      15               54
20      15               55
40      19               94
80      9                55
160     20               68
320     10               70

Note that again explain-analyse is slower than the normal runtime, but now
by a factor of about 5 rather than the previous factor of about 60.

Basically, the query runtimes are now essentially constant; I imagine that 
they would be flatter if the machine wasn't also running other processes.
The query plan shows that it is doing a sequential pass over each of the
tables, sorting and then doing a set-op - exactly as expected and OK
for large limits, but not for small limits.

I feel that it should be possible to do this in miliseconds, not seconds,
using the existing indexes.  

This is with Postgresql 8.1 on Debian GNU/Linux.

(BTW, this is for Decimail, my postgresql-based IMAP mail server.  I have
recently added tsearch2-based searching.  It is still somewhat experimental
but has never lost any mail.  If you're interested, have a look at
http://decimail.org/.)

Many thanks for any advice,

Phil.

(You are welcome to CC: me in any replies)