Gustavo Rezende Montesino <gustavo.montesino@xxxxxxxxxxxx> writes:
> Being the client in question, I would like to make a little remark: What
> we thought could be optimized here at first is the row estimate of
> the index scan, which could take null_frac into account. To put things
> into perspective, our similar case in production has a table with 6
> million lines where only 9.5k aren't null for the join field, and the
> over-estimation is throwing away good plans (like ~150ms execution time)
> in favor of pretty bad ones (~80s execution time).

Please provide a concrete test case for that.  AFAIK the null fraction
should be accounted for in join size estimates.  Here's a little test
case showing that it is:

regression=# create table t1 as select generate_series(1,1000000) as f1;
SELECT 1000000
regression=# analyze t1;
ANALYZE
regression=# create table t2 as select generate_series(1,1000000) as f1;
SELECT 1000000
regression=# analyze t2;
ANALYZE
regression=# explain select * from t1,t2 where t1.f1=t2.f1;
                               QUERY PLAN
------------------------------------------------------------------------
 Hash Join  (cost=30832.00..70728.00 rows=1000000 width=8)
   Hash Cond: (t1.f1 = t2.f1)
   ->  Seq Scan on t1  (cost=0.00..14425.00 rows=1000000 width=4)
   ->  Hash  (cost=14425.00..14425.00 rows=1000000 width=4)
         ->  Seq Scan on t2  (cost=0.00..14425.00 rows=1000000 width=4)
(5 rows)

regression=# insert into t2 select null from generate_series(1,1000000);
INSERT 0 1000000
regression=# analyze t2;
ANALYZE
regression=# explain select * from t1,t2 where t1.f1=t2.f1;
                               QUERY PLAN
------------------------------------------------------------------------
 Hash Join  (cost=30832.00..95727.00 rows=1000000 width=8)
   Hash Cond: (t2.f1 = t1.f1)
   ->  Seq Scan on t2  (cost=0.00..27862.00 rows=2000000 width=4)
   ->  Hash  (cost=14425.00..14425.00 rows=1000000 width=4)
         ->  Seq Scan on t1  (cost=0.00..14425.00 rows=1000000 width=4)
(5 rows)

The join size estimate is still correct even though it knows there are
many more rows in t2.

As for inserting a not-null test at the scan level, I'm not exactly
convinced that it's a win:

regression=# \timing
Timing is on.
regression=# select count(*) from t1,t2 where t1.f1=t2.f1;
  count
---------
 1000000
(1 row)

Time: 562.914 ms
regression=# select count(*) from t1,t2 where t1.f1=t2.f1 and t2.f1 is not null;
  count
---------
 1000000
(1 row)

Time: 564.896 ms

[ ftr, these times are best-of-three-trials ]

It's possible that in the case where an explicit sort has to be
inserted, reducing the amount of data passing through the sort would be
worth doing; but in the general case that's unproven.

			regards, tom lane
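
A minimal sketch (not part of the original message) of how the
explicit-sort question at the end could be probed, assuming the same t1/t2
tables built in the transcript above, with no indexes on them so that a
merge join has to sort both inputs; the enable_* settings are only used to
steer the planner toward that plan shape:

-- Force a merge join with explicit sorts on the t1/t2 tables from above.
set enable_hashjoin = off;
set enable_nestloop = off;

-- Baseline: the sort below the join sees all 2,000,000 t2 rows, nulls included.
explain analyze select count(*) from t1, t2 where t1.f1 = t2.f1;

-- Variant: not-null filter applied at the scan, before the sort.
explain analyze select count(*) from t1, t2
  where t1.f1 = t2.f1 and t2.f1 is not null;

reset enable_hashjoin;
reset enable_nestloop;

Comparing the row counts fed into the Sort node on t2 and the total
execution times of the two EXPLAIN ANALYZE runs would show whether
filtering the nulls out below an explicit sort actually pays off in this
particular shape of plan.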