Re: suggestions on improving a query

On Tue, 2007-02-13 at 22:04 -0500, Tom Lane wrote:
> "Adam Rich" <adam.r@xxxxxxxxxxxxx> writes:
> > This line:
> > Index Scan using plp_total_idx on dockscore_plp  
> > (cost=0.00..16733229.92 rows=4669988 width=80) 
> > (actual time=98.323..322537.605 rows=25197 loops=1)
> > Means the planner did what it did, because it estimated there would be
> > nearly 5 million rows.  However, there were only 25,000.

Sorry for not doing the obvious beforehand! I increased the statistics
target for some of the columns in some of the tables and then did a
vacuum analyze. Rerunning the query gives:

                                                                              QUERY PLAN                                                                               
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.00..397.24 rows=10 width=268) (actual time=98322.597..171721.583 rows=10 loops=1)
   ->  Nested Loop  (cost=0.00..37182572.57 rows=936023 width=268) (actual time=98322.590..171721.543 rows=10 loops=1)
         ->  Nested Loop  (cost=0.00..31580822.05 rows=936023 width=90) (actual time=98236.963..171379.151 rows=10 loops=1)
               ->  Index Scan using plp_total_idx on dockscore_plp  (cost=0.00..16858401.83 rows=4669988 width=80) (actual time=54.989..102775.761 rows=25197 loops=1)
               ->  Index Scan using dock_pkey on dock  (cost=0.00..3.14 rows=1 width=18) (actual time=2.718..2.718 rows=0 loops=25197)
                     Index Cond: (dock.dockid = "outer".id)
                     Filter: (target = '1YC1'::text)
         ->  Index Scan using pubchem_compound_pkey on pubchem_compound  (cost=0.00..5.97 rows=1 width=187) (actual time=34.221..34.223 rows=1 loops=10)
               Index Cond: (("outer".cid)::text = (pubchem_compound.cid)::text)
 Total runtime: 171722.964 ms
(10 rows)

Clearly a big improvement in performance.
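
For reference, the statistics bump was along these lines (the exact columns
I changed aren't listed here, so dock.target is just illustrative):

    ALTER TABLE dock ALTER COLUMN target SET STATISTICS 100;
    VACUUM ANALYZE dock;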

(One question not directly related to the problem: when looking at the
output of explain analyze, I know that one is supposed to start at the
bottom and move up. Does that mean that the index scan on pubchem_compound is
being performed first? Or should I start from the innermost line?)

However, it seems that it could still be improved:

   ->  Index Scan using plp_total_idx on dockscore_plp  (cost=0.00..16858401.83 rows=4669988 width=80) (actual time=54.989..102775.761 rows=25197 loops=1)

It looks like there's a big mismatch between the expected and observed costs and times.
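
One way to check what the planner now believes about a column's distribution
is the pg_stats view; for example, to look at the stats gathered for
dock.target (just a diagnostic sketch):

    SELECT attname, n_distinct, null_frac, most_common_vals
    FROM pg_stats
    WHERE tablename = 'dock' AND attname = 'target';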

> The real problem here is that the planner is guessing that it won't take
> very long to find 10 rows satisfying the target = '1YC1' condition while
> scanning in dockscore_plp.total order.  So it chooses a plan that would
> have a long total runtime (notice the large cost estimates below the
> Limit) expecting that only a small fraction of that total will actually
> be expended.  The expectation seems a bit off unfortunately :-(.
> I can't tell from the given data whether the problem is just an
> overestimate of the frequency of target = '1YC1', or if there's an
> additional effect. 

I think that increasing the statistics target has improved that.

>  For example, if that target value tended to only be
> associated with larger values of dockscore_plp.total, then a plan like
> this could lose big-time because it will have to scan a long way to find
> those rows.

This is not the case. The value '1YC1' will be associated with both high
and low values of dockscore_plp.total.

What I would like my query to do is this:

1. From dock.target find all rows = '1YC1'

2. Using dock.dockid of these rows, get the corresponding rows in
dockscore_plp

3. Using dock.cid from the rows in 2., get the corresponding rows in
pubchem_compound

4. Sort and take the top 10 from step 2 (and associated rows in step 3)

However, now that I have written this, it seems that what I really want to
do is:

1. From dock.target find all rows = '1YC1'

2. Using dock.dockid of these rows, get the corresponding rows in
dockscore_plp

3. Sort and take the top 10

4. Get the corresponding rows from pubchem_compound.cid

The problem with this is that step 4 is represented by the

dock.cid = pubchem_compound.cid

clause. It seems that if I had the cid column in dockscore_plp, then I
could do a sort+limit in dockscore_plp and then simply look up the
corresponding (10) rows in pubchem_compound (rather than looking up ~936K
rows). The downside to this is that there are 4 more tables like
dockscore_plp, and I would have to add a cid column to each of them -
which seems redundant.
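
One alternative I haven't tried: express the sort+limit as a subquery, so
that only the 10 surviving rows get joined to pubchem_compound. A sketch
(my original query isn't quoted in this message, so the join conditions
below are inferred from the EXPLAIN output above, and the sort direction
may need flipping):

    SELECT p.*, top.total
    FROM (SELECT d.cid, s.total
          FROM dock d
          JOIN dockscore_plp s ON s.id = d.dockid
          WHERE d.target = '1YC1'
          ORDER BY s.total
          LIMIT 10) AS top
    JOIN pubchem_compound p ON p.cid = top.cid
    ORDER BY top.total;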

Is it useful to increase redundancy to improve performance?

Thanks for the pointers,

-------------------------------------------------------------------
Rajarshi Guha <rguha@xxxxxxxxxxx>
GPG Fingerprint: 0CCA 8EE2 2EEB 25E2 AB04 06F7 1BB9 E634 9B87 56EE
-------------------------------------------------------------------
There's no problem so bad that you can't add some guilt to it to make
it worse.
-Calvin



