Hi all, Postgres version: 8.2.4 Tables: table_a(a bigint, b bigint, primary key(a, b) ); table_b1(b bigint primary key, more columns...); table_b2(b bigint primary key references table_b1(b), more columns...); table_b1: ~ 27M rows; ~25 more columns; width=309 (as reported by explain select *); table_a: ~400M rows; - column "b" should reference table_b1, but it does not for performance reasons (it is an insert only table); - column "a" distinct values: 1148 - has (a, b) as primary key; - has no additional columns; table_b1: ~40K rows; ~70 more columns; width=1788 (as reported by explain select *); Statistics for the involved columns for each table are attached in files (to preserve the spacing). They were taken after analyzing the relevant table (except for table_b2 where I added the "fiddled" statistics first and then remembered to analyze fresh, resulting in the "non_fiddled" version, which gives the same result as the fiddled one). The problem query is: prepare test_001(bigint) as SELECT tb.* FROM table_a ta JOIN table_b2 tb ON ta.b=tb.b WHERE ta.a = $1 ORDER BY ta.a, ta.b limit 10; Explain gives Plan 1 (see attached plans.txt) If I set enable_hashjoin=off and enable_mergejoin=off, I get Plan 2 (again, see plans.txt). The difference is a 30x improvement in the second case... (I actually forgot to account for cache effects, but later rerun the queries multiple times and the timings are proportional). Additionally, if I replace table_b2 with table_b1 in the query, I get Plan 3 (with reasonable execution time) with both enable_hashjoin and enable_mergejoin on. So there is something which makes table_b2 different from table_b1 for planning purposes, but I could not identify what that is... they have differences in statistics, but fiddling with the stats gave me no difference in the plan. Looking at Plan 2, it looks like the "limit" step is estimating wrongly it's cost. I guessed that it does that because it thinks the "b" values selected from table_a for a given "a" span a larger range than the "b" values in table_b2, because the "b" values in table_b2 are a (relatively small) subset of the "b" values in table_a. But this is not the case, the query only gets "a" values for which all the "b" values in table_a will be found in table_b2. Of course the planner has no way to know this, but then I think it is not the case, as I tried to copy the histogram statistics in pg_statistic for the column "b" from the entry for table_b1 (which contains the whole span of "b" values) to the entry for table_b2, with no change in the plan. Just for the record, this query is just a part of a more complex one, which joins in bigger tables, resulting in even worse performance, but I tracked it down to refusing the nested loop to be the problem. Is there anything I could do to convince the planner to use here the nested loop plan ? Thanks, Csaba.
attname | null_frac | avg_width | n_distinct | most_common_vals | most_common_freqs | histogram_bounds | correlation ---------+-----------+-----------+------------+---------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------+------------- a | 0 | 8 | 1148 | {31826743,31855101,31855343,31854918,31856328,31861573,31855122,31855130,31855189,31856426} | {0.005,0.00466667,0.00433333,0.004,0.004,0.004,0.00366667,0.00366667,0.00366667,0.00366667} | {31734956,31854364,31854732,31855162,31855660,31857144,31858109,31858965,31859762,31860576,31861566} | 0.999608 b | 0 | 8 | -1 | | | {63977,36878147,42247866,42548692,42812320,46992026,51444368,55977972,56607708,59496742,68530614} | 0.602959
attname | null_frac | avg_width | n_distinct | most_common_vals | most_common_freqs | histogram_bounds | correlation ---------+-----------+-----------+------------+------------------+-------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+------------- b | 0 | 8 | -1 | | | {21302091,26264614,32786793,35823704,42390431,43618589,47178753,48479147,49297942,50285891,51142864,52087189,52975318,53873562,54752183,55651893,56643636,57526056,58344883,59143872,60073970,60934277,61858188,62759965,63699252,64585260,65587784,66517871,67368007,68219640,69112491} | 0.227599
attname | null_frac | avg_width | n_distinct | most_common_vals | most_common_freqs | histogram_bounds | correlation ---------+-----------+-----------+------------+------------------+-------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+------------- b | 0 | 8 | -1 | | | {21310803,24281419,32805654,35860260,42573594,45902732,47896008,48922471,49784797,50658309,51556908,52439479,53316630,54186754,55044558,55945438,56836595,57618314,58587628,59478122,60223564,61063339,61995945,62876263,63769015,64717353,65577667,66475226,67420705,68286405,69034532} | 0.99966
Plan 1: db> explain analyze execute test_001(31855344); QUERY PLAN ---------------------------------------------------------------------------------------------------------------------------------------------------------------- Limit (cost=333067.77..333067.79 rows=10 width=1804) (actual time=652.739..652.741 rows=2 loops=1) -> Sort (cost=333067.77..333067.86 rows=36 width=1804) (actual time=652.736..652.736 rows=2 loops=1) Sort Key: ta.a, ta.b -> Hash Join (cost=3364.30..333066.84 rows=36 width=1804) (actual time=648.159..652.709 rows=2 loops=1) Hash Cond: (ta.b = tb.b) -> Index Scan using pk_table_a on table_a ta (cost=0.00..324786.18 rows=388532 width=16) (actual time=454.389..460.138 rows=2 loops=1) Index Cond: (a = $1) -> Hash (cost=524.58..524.58 rows=41658 width=1788) (actual time=187.192..187.192 rows=41671 loops=1) -> Seq Scan on table_b2 tb (cost=0.00..524.58 rows=41658 width=1788) (actual time=5.540..113.387 rows=41671 loops=1) Total runtime: 652.980 ms Plan 2: db> explain analyze execute test_001(31855344); QUERY PLAN ------------------------------------------------------------------------------------------------------------------------------------------------------ Limit (cost=0.00..498511.80 rows=10 width=1804) (actual time=17.729..21.672 rows=2 loops=1) -> Nested Loop (cost=0.00..1794642.48 rows=36 width=1804) (actual time=17.729..21.671 rows=2 loops=1) -> Index Scan using pk_table_a on table_a ta (cost=0.00..324880.88 rows=388638 width=16) (actual time=0.146..0.198 rows=2 loops=1) Index Cond: (a = $1) -> Index Scan using pk_table_b2 on table_b2 tb (cost=0.00..3.77 rows=1 width=1788) (actual time=10.729..10.731 rows=1 loops=2) Index Cond: (ta.b = tb.b) Total runtime: 21.876 ms Plan 3: db> explain analyze execute test_001(31855344); QUERY PLAN ------------------------------------------------------------------------------------------------------------------------------------------------------ Limit (cost=0.00..853.14 rows=10 width=325) (actual time=20.117..28.104 rows=2 loops=1) -> Nested Loop (cost=0.00..2024323.48 rows=23728 width=325) (actual time=20.116..28.101 rows=2 loops=1) -> Index Scan using pk_table_a on table_a ta (cost=0.00..327561.01 rows=388684 width=16) (actual time=0.023..0.027 rows=2 loops=1) Index Cond: (a = $1) -> Index Scan using pk_table_b1 on table_b1 tb (cost=0.00..4.35 rows=1 width=309) (actual time=14.032..14.034 rows=1 loops=2) Index Cond: (ta.b = tb.b) Total runtime: 28.200 ms
attname | null_frac | avg_width | n_distinct | most_common_vals | most_common_freqs | histogram_bounds | correlation ---------+-----------+-----------+------------+------------------+-------------------+------------------------------------------------------------------------------------------------------+------------- b | 0 | 8 | -1 | | | {53220192,53236707,53240765,53244670,53248556,53253116,53257333,53261450,53265687,56984462,68866286} | 0.997929
---------------------------(end of broadcast)--------------------------- TIP 2: Don't 'kill -9' the postmaster