We have a table "entry" that is partitioned into two sub-tables named "entry_part_new" and "entry_part_old", as described here:
We find that in some cases, when joining to a partitioned table, the planner makes wilding inaccurate estimates of the number of rows that will be returned from the partitioned table.
This is on PostgreSql 8.3.3 with recently analyzed tables. The sql_inheritance and constraint_exclusion settings are on, and both sub-tables have non-overlapping check constraints on the primary key column, id.
Here is an example query and the explain analyze plan. The thing to note here is that the planner is estimating 5.9 millions rows in the outer nested loop, when in fact only 16 rows will be affected.
prod_2=> explain analyze
SELECT e.*
FROM publication_entry_pin as pep
INNER JOIN entry e
ON (pep.entry_id = e.id)
WHERE pep.publication_id = 850
prod_2-> \g
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Nested Loop (cost=3.34..440.53 rows=5949391 width=1276) (actual time=0.116..0.438 rows=16 loops=1)
Join Filter: (pep.entry_id = e.id)
-> Bitmap Heap Scan on publication_entry_pin pep (cost=3.34..19.45 rows=11 width=4) (actual time=0.065..0.133 rows=16 loops=1)
Recheck Cond: (publication_id = 850)
-> Bitmap Index Scan on index_publication_entry_pin_on_publication_id_and_place_index (cost=0.00..3.33 rows=11 width=0) (actual time=0.052..0.052 rows=237 loops=1)
Index Cond: (publication_id = 850)
-> Append (cost=0.00..38.24 rows=3 width=777) (actual time=0.015..0.016 rows=1 loops=16)
-> Index Scan using entry_pkey1 on entry e (cost=0.00..3.27 rows=1 width=777) (actual time=0.001..0.001 rows=0 loops=16)
Index Cond: (e.id = pep.entry_id)
-> Index Scan using entry_pkey on entry_part_old e (cost=0.00..20.09 rows=1 width=725) (actual time=0.004..0.004 rows=0 loops=16)
Index Cond: (e.id = pep.entry_id)
-> Index Scan using entry_part_new_pkey on entry_part_new e (cost=0.00..14.88 rows=1 width=742) (actual time=0.009..0.010 rows=1 loops=16)
Index Cond: (e.id = pep.entry_id)
Total runtime: 0.519 ms
(14 rows)
If I modify the query to directly query the entry_part_new partition table, the plan is much more accurate, estimating 11 rows:
prod_2=> \p
explain analyze
SELECT e.*
FROM publication_entry_pin as pep
INNER JOIN entry_part_new e
ON (pep.entry_id = e.id)
WHERE pep.publication_id = 850
prod_2=> \g
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Nested Loop (cost=3.34..183.29 rows=11 width=742) (actual time=0.102..0.323 rows=16 loops=1)
-> Bitmap Heap Scan on publication_entry_pin pep (cost=3.34..19.45 rows=11 width=4) (actual time=0.078..0.125 rows=16 loops=1)
Recheck Cond: (publication_id = 850)
-> Bitmap Index Scan on index_publication_entry_pin_on_publication_id_and_place_index (cost=0.00..3.33 rows=11 width=0) (actual time=0.060..0.060 rows=237 loops=1)
Index Cond: (publication_id = 850)
-> Index Scan using entry_part_new_pkey on entry_part_new e (cost=0.00..14.88 rows=1 width=742) (actual time=0.009..0.010 rows=1 loops=16)
Index Cond: (e.id = pep.entry_id)
Total runtime: 0.381 ms
(8 rows)
prod_2=>
This mis-estimation at times results in dramatically slower queries, especially when more joins are involved. Because the estimated number of rows is so high, the planner more often selects a seq scan plan in more situations, when it really shouldn't.
So, my questions are:
1.) Is this expected behavior of the partitioning feature? If so, it seems to be a major limitation.
2.) What can I do if I need to use partitioned tables, to help the planner make better estimates?
So far, we've been adding application logic to dynamically re-write the queries to directly query the partitioned sub-tables, in effect, side-stepping Postgres' built-in partitioning feature altogether.
thanks,
Mason