On Tue, 2007-02-13 at 22:04 -0500, Tom Lane wrote: > "Adam Rich" <adam.r@xxxxxxxxxxxxx> writes: > > This line: > > Index Scan using plp_total_idx on dockscore_plp > > (cost=0.00..16733229.92 rows=4669988 width=80) > > (actual time=98.323..322537.605 rows=25197 loops=1) > > Means the planner did what it did, because it estimated there would be > > nearly 5 million rows. However, there were only 25,000. Sorry for not doing the obvious beforehand! I increased the statistics target for some of the columns in some of the tables and then did a vacuum analyze. Rerunning the query gives: QUERY PLAN ----------------------------------------------------------------------------------------------------------------------------------------------------------------------- Limit (cost=0.00..397.24 rows=10 width=268) (actual time=98322.597..171721.583 rows=10 loops=1) -> Nested Loop (cost=0.00..37182572.57 rows=936023 width=268) (actual time=98322.590..171721.543 rows=10 loops=1) -> Nested Loop (cost=0.00..31580822.05 rows=936023 width=90) (actual time=98236.963..171379.151 rows=10 loops=1) -> Index Scan using plp_total_idx on dockscore_plp (cost=0.00..16858401.83 rows=4669988 width=80) (actual time=54.989..102775.761 rows=25197 loops=1) -> Index Scan using dock_pkey on dock (cost=0.00..3.14 rows=1 width=18) (actual time=2.718..2.718 rows=0 loops=25197) Index Cond: (dock.dockid = "outer".id) Filter: (target = '1YC1'::text) -> Index Scan using pubchem_compound_pkey on pubchem_compound (cost=0.00..5.97 rows=1 width=187) (actual time=34.221..34.223 rows=1 loops=10) Index Cond: (("outer".cid)::text = (pubchem_compound.cid)::text) Total runtime: 171722.964 ms (10 rows) Clearly a big improvement in performance. (One question not directly related to the problem: when looking at the output of explain analyze, I know that one is supposed to start at the bottom and move up. Does that that the index scan on pubchem_compound is being performed first? Or should I start from the innermost line?) However it seems that it could still be improved: -> Index Scan using plp_total_idx on dockscore_plp (cost=0.00..16858401.83 rows=4669988 width=80) (actual time=54.989..102775.761 rows=25197 loops=1) It looks like theres a big mismatch on the expected and observed costs and times. > The real problem here is that the planner is guessing that it won't take > very long to find 10 rows satisfying the target = '1YC1' condition while > scanning in dockscore_plp.total order. So it chooses a plan that would > have a long total runtime (notice the large cost estimates below the > Limit) expecting that only a small fraction of that total will actually > be expended. The expectation seems a bit off unfortunately :-(. > I can't tell from the given data whether the problem is just an > overestimate of the frequency of target = '1YC1', or if there's an > additional effect. I think that increasing the statistics has improved that. > For example, if that target value tended to only be > associated with larger values of dockscore_plp.total, then a plan like > this could lose big-time because it will have to scan a long way to find > those rows. This is not the case. The value '1YC1' will be associated with both high and low values of dockscore_plp.total What I would like my query to do is this: 1. From dock.target find all rows = '1YC1' 2. Using dock.dockid of these rows, get the corresponding rows in dockscore_plp 3. Using dock.cid from the rows in 2., get the corresponding rows in pubchem_compound 4. Sort and take the top 10 from step 2 (and associated rows in step 3) However now that I have written this it seems that what I really want to do is: 1. From dock.target find all rows = '1YC1' 2. Using dock.dockid of these rows, get the corresponding rows in dockscore_plp 3. Sort and take the top 10 4. Get the corresponding rows from pubchem_compound.cid The problem with this is that step is represented by the dock.cid = pubchem_compound.cid clause. It seems that if I had the cid column in dockscore_plp, then I could do a sort+limit in dockscore_plp and then simply lookup the corresponding (10) rows in pubchem_compound (rather than looking up 960K rows). The downside to this is that there are 4 more tables like dockscore_plp, and I would have to add a cid column to each of them - which seems redundant. Is it useful to increase redundancy to improve performance? Thanks for the pointers, ------------------------------------------------------------------- Rajarshi Guha <rguha@xxxxxxxxxxx> GPG Fingerprint: 0CCA 8EE2 2EEB 25E2 AB04 06F7 1BB9 E634 9B87 56EE ------------------------------------------------------------------- There's no problem so bad that you can't add some guilt to it to make it worse. -Calvin