We have traced this to the *addition* of a two-column index.

The two tables in question both have single-column indexes on two foreign
keys, say columns A and B. The query joins the two large tables on A and B.
With only the two single-column indexes, the query plan does a bitmap AND
on the index scan results and performance is stable. I added an index on
(A,B), and this caused the planner to use the new index, but I was never
able to get the query to complete. In one instance I let it run 18 hours.
The only difference was the addition of the index.

Summary:
- With index on (A,B) -- query time is "infinite"
- Without index on (A,B), relying on the individual indexes and bitmap
  AND -- query time is about 4 minutes (as expected given the data volume)

Does this sound like a bug in the query planner?

> -----Original Message-----
> From: pgsql-performance-owner@xxxxxxxxxxxxxx [mailto:pgsql-
> performance-owner@xxxxxxxxxxxxxx] On Behalf Of Jim Garrison
> Sent: Wednesday, September 25, 2013 8:58 AM
> To: pgsql-performance@xxxxxxxxxxxxxx
> Subject: Troubleshooting query performance issues
>
> I spent about a week optimizing a query in our performance-testing
> environment, which has hardware similar to production.
>
> I was able to refactor the query and reduce the runtime from hours to
> about 40 seconds, through the use of CTEs and a couple of new indexes.
>
> The database was rebuilt and refreshed with very similar data from
> production, but now the query takes hours again.
>
> In the query plan, it is clear that the row-count estimates are WAY too
> low, even though the statistics are up to date.
> Here's a sample query plan:
>
> CTE Scan on stef  (cost=164.98..165.00 rows=1 width=38)
>   CTE terms
>     ->  Nested Loop  (cost=0.00..62.40 rows=1 width=12)
>           ->  Index Scan using term_idx1 on term t  (cost=0.00..52.35 rows=1 width=12)
>                 Index Cond: (partner_id = 497)
>                 Filter: (recalculate_district_averages_yn AND (NOT is_deleted_yn))
>           ->  Index Scan using growth_measurement_window_fk1 on growth_measurement_window gw  (cost=0.00..10.04 rows=1 width=4)
>                 Index Cond: (term_id = t.term_id)
>                 Filter: (test_window_complete_yn AND (NOT is_deleted_yn) AND ((growth_window_type)::text = 'DISTRICT'::text))
>   CTE stef
>     ->  Nested Loop  (cost=0.00..102.58 rows=1 width=29)
>           Join Filter: ((ssef.student_id = terf.student_id) AND (ssef.grade_id = terf.grade_id))
>           ->  Nested Loop  (cost=0.00..18.80 rows=3 width=28)
>                 ->  CTE Scan on terms t  (cost=0.00..0.02 rows=1 width=8)
>                 ->  Index Scan using student_school_enrollment_fact_idx2 on student_school_enrollment_fact ssef  (cost=0.00..18.74 rows=3 width=20)
>                       Index Cond: ((partner_id = t.partner_id) AND (term_id = t.term_id))
>                       Filter: primary_yn
>           ->  Index Scan using test_event_result_fact_idx3 on test_event_result_fact terf  (cost=0.00..27.85 rows=4 width=25)
>                 Index Cond: ((partner_id = t.partner_id) AND (term_id = t.term_id))
>                 Filter: growth_event_yn
>
> The estimates in the first CTE are correct, but in the second, the scan
> on student_school_enrollment_fact will return about 1.5 million rows,
> and the scan on test_event_result_fact actually returns about 1.1
> million. The top-level join should return about 900K rows. I believe
> the fundamental issue is that the outer nested loop in the CTE stef
> should be a merge join instead, but I cannot figure out why the
> optimizer is estimating one row when it has the statistics to correctly
> estimate the count.
>
> What would cause PG to so badly estimate the row counts?
>
> I've already regenerated the indexes and re-analyzed the tables involved.
>
> What else can I do to find out why it's running so slowly?

--
Sent via pgsql-performance mailing list (pgsql-performance@xxxxxxxxxxxxxx)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance
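
P.S. For anyone trying to reproduce the comparison above: since DROP INDEX
is transactional in PostgreSQL, the two plans can be compared without
permanently losing the composite index. This is only a sketch -- the index
name, table names, and join below are placeholders, not our real schema:

```sql
-- Compare planner behavior with and without the (A,B) index.
-- The DROP INDEX is rolled back, so the index survives the test.
BEGIN;
DROP INDEX t1_a_b_idx;            -- hypothetical name for the index on (A,B)
EXPLAIN (ANALYZE, BUFFERS)        -- with the index present, use plain EXPLAIN
    SELECT ...                    --   instead, since ANALYZE runs the query
    FROM t1 JOIN t2 USING (a, b); -- placeholder for the problematic join
ROLLBACK;                         -- restores the index
```

Note that the DROP INDEX takes an exclusive lock on the table until the
ROLLBACK, so this should be done on a test copy rather than production.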